Fence

What It Is

I/O fencing, or simply fencing, is an active countermeasure taken by a cluster to prevent a presumed-dead or misbehaving cluster member from writing data to critical shared media. The act of cutting off this presumed-dead member prevents data corruption on that shared media.

Let me reinforce this point: there is an active component to I/O fencing. The cluster takes action to prevent a node from writing data.

While not strictly required for something to be considered fencing in the classical sense, linux-cluster adds an additional requirement to all supported fencing agents in order to protect your data: verification. If a fencing agent cannot verify that an action has completed, the action is presumed to have failed.
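
To make the contract concrete, here is a minimal sketch (the helper names are hypothetical, not the actual fence-agent API): an action that cannot be verified counts as a failure.

    # Minimal sketch of the verification rule; do_fence and is_fenced stand in
    # for whatever a particular fencing method actually does and checks.
    def fence_node(node, do_fence, is_fenced):
        do_fence(node)                   # the active countermeasure
        if not is_fenced(node):          # the result could not be confirmed
            # The cluster must keep treating the node as able to issue I/O.
            raise RuntimeError("could not verify that %s was fenced" % node)
        return True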

What It Is Not

Fencing is not synonymous with power cycling. A host may be fenced from shared storage without its power being cut. Power cycling is a form of fencing, but it is certainly not the only form of fencing.

Why You Need It

Fencing prevents data corruption and increases availability by reducing uncertainty in a cluster of computers.

Consider the classic case for why fencing is required:

  • node 1 takes a lock
  • node 1 hangs
  • node 2 thinks node 1 is dead
  • node 2 takes the same lock node 1 took
  • node 2 writes data
  • node 1 wakes up, still believing it has the lock
  • node 1 overwrites data that node 2 just wrote out

Now, add fencing:

  • node 1 takes a lock
  • node 1 hangs
  • node 2 thinks node 1 is dead
  • node 2 fences node 1
  • node 2 takes the same lock node 1 took
  • node 2 writes data
  • node 1 wakes up, still believing it has the lock
  • node 1 tries to write data, but can't since it has lost its I/O paths to the disk

In the second case, data corruption is prevented.
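
The ordering is the important part: the surviving node must fence the presumed-dead lock holder before it recovers the lock and writes. Here is a rough sketch of that recovery path, using placeholder functions rather than a real locking API:

    # Hypothetical recovery path for the sequence above; fence_peer, take_lock
    # and write_data are placeholder callables, not a real cluster API.
    def recover_from_peer_failure(dead_node, lock, fence_peer, take_lock, write_data):
        fence_peer(dead_node)   # must happen first: cuts the old holder's I/O paths
        take_lock(lock)         # now safe: even if the peer wakes up, it cannot write
        write_data(lock)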

Technologies Used by Cluster Software Today

Note that this list might not be exhaustive.

I/O Fencing Variants

  • power fencing - As the saying goes, "Dead Nodes corrupt no Data". If a host does not have power, it cannot issue I/O. This can be done using external power switches like those available from APC and WTI, or with integrated power management, such as iLO, IPMI, DRAC, RSA, etc. (see the sketch after this list).
  • fibre channel zoning - Typically done on fibre channel switches; the host's paths to a shared SAN are cut off. A reboot is required before restoring the node's connectivity to shared storage, in order to ensure any queued I/Os are discarded.
  • SCSI-2 reservations (old) - A reservation is revoked for a particular LUN on a shared SCSI disk. linux-cluster does not support, and has no intention of supporting, this model of fencing, as most implementations were device-specific. Like FC zoning, a reboot is required before letting the cluster node issue I/Os again.
  • SCSI-3 reservations (group) - A registration is taken by each node in the cluster for a given LUN or set of LUNs, and a single reservation is taken by the group. Fencing a node involves revoking its registration for the LUN(s) in question. Like FC zoning, a reboot is required prior to letting the cluster node issue I/Os again.
  • Network Disconnect (NAS only) - When used in conjunction with a NAS appliance (NFS, iSCSI, etc.), a managed switch can close off network ports to a given node, thereby preventing access. Like FC zoning, a reboot is required prior to letting the cluster node issue I/Os again.
  • ssh "reboot" - While a successful reboot clearly stops the node from issuing I/Os, this is not supported by linux-cluster because of the verification requirement: when you issue reboot -fn or call reboot(RB_AUTOBOOT), the ssh connection hangs and there is no easy way to verify that the node has actually rebooted. Additionally, if the node is hanging or misbehaving, this fencing method is very unreliable.
  • virtual machine destruction - Instruction is given to a hypervisor to destroy a given virtual machine. This is functionally equivalent to power fencing.
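
As a concrete example of the first variant combined with the verification requirement, here is a minimal sketch of power fencing over IPMI. It assumes ipmitool is installed; the BMC address and credentials are placeholders, and this is a sketch rather than the actual fence_ipmilan agent.

    # Power-fence a node via its BMC, then verify the result with
    # "ipmitool chassis power status".  Address and credentials are placeholders.
    import subprocess

    IPMI = ["ipmitool", "-I", "lanplus", "-H", "bmc.example.com",
            "-U", "admin", "-P", "secret"]

    def power_off_and_verify():
        subprocess.run(IPMI + ["chassis", "power", "off"], check=True)
        status = subprocess.run(IPMI + ["chassis", "power", "status"],
                                check=True, capture_output=True, text=True).stdout
        if "off" not in status.lower():
            # Unverified action: treat the fencing operation as failed.
            raise RuntimeError("power-off could not be verified")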

Fencing: The Masquerade

Various mechanisms are also used in some cluster solutions but do not qualify as I/O fencing, because no active countermeasure is taken by the surviving cluster members. Generally, a blind assumption is in place instead.

  • timeout - After some amount of time, just assume the node is dead and not coming back.
  • suicide - An extension of a timeout, but software on the host decides to reboot the node after it detects loss of cluster connectivity.
  • watchdog - An extension of a timeout, a watchdog is a hardware device which hard-reboots the system if an update from userland is not received in time. On Linux, we have the watchdog daemon, which also adds heuristics to a plain hardware watchdog, making it a combination of suicide and watchdog (see the sketch after this list).
  • manual fencing, meatware, or manual override - These are not actually fencing (again, the cluster isn't actually doing anything to prevent I/Os), but serve as a substitute in emergency situations. An administrator is expected to reset or power-off affected machines prior to issuing the override command.
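
For the watchdog item above, here is a minimal sketch of the userland side, assuming the machine exposes a watchdog device at /dev/watchdog. The health check is a placeholder; a real daemon such as watchdog(8) layers actual heuristics on top.

    # Opening /dev/watchdog arms the hardware timer; if userland stops writing
    # to it before the timeout expires, the machine is hard-rebooted.
    import time

    def node_is_healthy():
        return True   # placeholder for cluster-connectivity or other heuristics

    def watchdog_loop(interval=10):
        with open("/dev/watchdog", "wb", buffering=0) as wd:
            while node_is_healthy():
                wd.write(b"\0")        # "pet" the watchdog
                time.sleep(interval)
            # Once we stop petting it, the hardware eventually reboots the node.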

Generally speaking, if a solution relies on an assumption or a timeout, it is almost never fencing.

Also, some methods which are not fencing are safe from a data integrity standpoint. For example, manual override is safe as long as the administrator takes actions to ensure data integrity is preserved prior to issuing the override command.

Data Corruption Prevention in Split-Site Clusters

Split-site clusters introduce problems for clustering and traditional fencing. Effectively, if you are running a single cluster across two sites and the inter-site link is cut, there is no real way to take action to prevent I/O, nor is there an automatic way to confirm the death of the remote site.

In configurations like this, there is typically a replicated SAN or resource (e.g. DRBD). That is, there are two pieces of hardware in sync with one another across the inter-site link.

When the inter-site link fails, there is no way to reliably fence the remote site in the traditional sense: any configured fencing devices will likely be unavailable. Instead, data integrity is preserved in a fairly simplistic way: one site is chosen to win, by an administrator or by some technological means, while the other site loses.

At this point, the winning site's copy of the shared data becomes authoritative, and the losing site's copy is overwritten when the inter-site link returns.

Methodologies

  • administrator intervention - an administrator makes the call as to which site wins
  • third-site arbitration - a third site picks one site to survive and coordinates the outcome (Pacemaker 1.2); a rough sketch of the idea follows
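
The sketch below is hypothetical. It is not Pacemaker's (or booth's) actual protocol; the arbitrator address, port, and message format are invented for illustration.

    # Each site asks an arbitrator at a third site for a "ticket" when the
    # inter-site link fails; only the site that obtains it may keep writing.
    import socket

    ARBITRATOR = ("arbitrator.example.com", 9929)   # placeholder address/port

    def request_ticket(site_name, timeout=5):
        try:
            with socket.create_connection(ARBITRATOR, timeout=timeout) as s:
                s.sendall(("GRANT? %s\n" % site_name).encode())
                reply = s.recv(64).decode()
            return reply.startswith("GRANTED")
        except OSError:
            return False   # cannot reach the arbitrator: assume this site lost

    # A site that does not get the ticket must stop writing to the replicated data.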