wiki:FAQ/Fencing
Last modified on 05/26/11 14:52:17

Fencing Questions

What is fencing and why do I need it?

Fencing is the component of the cluster project that cuts off access to a resource (hard disk, etc.) from a node in your cluster if that node loses contact with the rest of the nodes in the cluster.

The most effective way to do this is commonly known as STONITH, an acronym for "Shoot The Other Node In The Head." In other words, it forces the system to power off or reboot. That might seem harsh to the uninitiated, but really it's a good thing: a node that is not cooperating with the rest of the cluster can seriously damage the data unless it's forced off. So by fencing an errant node, we're actually protecting the data.

Fencing is often accomplished with a network power switch, which is a power switch that can be controlled through the network. This is known as power fencing.

Fencing can also be accomplished by cutting off access to the resource, such as using SCSI reservations. This is known as fabric fencing.
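As a sketch only (the device name and attributes here are illustrative; check the fence_scsi documentation for your release), a fabric-fencing setup using SCSI reservations might look like this in cluster.conf:

<clusternode name="node-01" votes="1">
   <fence>
      <method name="1">
         <device name="scsi" node="node-01"/>
      </method>
   </fence>
</clusternode>
...
<fencedevices>
   <fencedevice agent="fence_scsi" name="scsi"/>
</fencedevices>

Here the agent revokes the node's SCSI registration on the shared storage rather than powering it off.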

What fencing devices are supported by Cluster Suite?

This is constantly changing. Manufacturers come out with new models and new microcode all the time, forcing us to change our fence agents. Your best bet is to look at the source code in git and see if your device is mentioned:

fence-agents.git

You may also want to check out the Hardware Configuration Guidelines listed here:

https://access.redhat.com/kb/docs/DOC-30004

We are looking into ways to improve this.

Can't I just use my own watchdog or manual fencing?

No. Fencing is absolutely required in all production environments. That's right. We do not support people using only watchdog timers anymore.

Manual fencing is absolutely not supported in any production environment, ever, under any circumstances.

Should I use power fencing or fabric fencing?

Both do the job. Both methods guarantee the victim can't write to the file system, thereby ensuring file system integrity.

However, we recommend that customers use power-cycle fencing for a number of reasons. There are cases where fabric-level fencing is useful. The common "Fabric Fencing" arguments go something like this:

"What if the node has a reproducible failure that keeps happening over and over if we reset it each time?" and "What if I have non-clustered, but mission-critical tasks running on the node, and it is evicted from the cluster but is not actually dead (say, the cluster software crashed)? Power-cycling the machine would kill the Mission Critical tasks running on it..."

However, once a node is fabric fenced, you need to reboot it before it can rejoin the cluster.

What should I do if fenced dies or is killed by accident?

Killing fenced, or having it otherwise exit while the node is using GFS, isn't good: if the node then fails while fenced isn't running, it won't be fenced. If fenced exits for some reason, simply restart it; that's what you should do if you find it has been killed. We can't really prevent it from being intentionally killed, though.

HP's iLo (Integrated Lights Out) fencing doesn't work for me. How do I debug it?

The first step is to try fencing it from a command line that looks something like this:

/sbin/fence_ilo -a myilo -l login -p passwd -o off -v

Second, check the version of RIBCL you are using. You may want to consider upgrading your firmware, and you may want to scan Bugzilla to see whether there are any known issues with your firmware level.

What are fence methods, fence devices and fence levels and what good are they?

A node can have multiple fence methods and each fence method can have multiple fence devices.

Multiple fence methods are set up for redundancy/insurance. For example, you may be using a baseboard management fencing method for a node in your cluster, such as IPMI, iLO, RSA, or DRAC. All of these depend on a network connection. If this connection fails, fencing cannot occur, so as a backup you could declare a second fence method that uses a power switch or the like to fence the node. If the first method fails to fence the node, the second fence method is employed.
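For example (the hostnames, device names, and addresses below are made up for illustration), a node with iLO as its primary fence method and an APC power switch as a backup might look like this:

<clusternode name="node-01" votes="1">
   <fence>
      <method name="1">
         <device name="ilo01"/>
      </method>
      <method name="2">
         <device name="pwr01" switch="1" port="1"/>
      </method>
   </fence>
</clusternode>
...
<fencedevices>
   <fencedevice agent="fence_ilo" ipaddr="192.168.0.50"
login="admin" name="ilo01" passwd="XXXXXXXXXXX"/>
   <fencedevice agent="fence_apc" ipaddr="192.168.0.101"
login="admin" name="pwr01" passwd="XXXXXXXXXXX"/>
</fencedevices>

Method "2" is only attempted if method "1" fails to fence the node.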

Multiple fence devices per method are used, for example, if a node has dual power supplies and power fencing is the fence method of choice. If only one power supply were fenced, the node would not reboot - as the other power supply would keep it up and running. In this case you would want two fence devices in one method: one for power supply A and one for power supply B.

All fence devices within a fence method must succeed in order for the method to succeed.

If someone refers to fence "levels", they are the same thing as methods. The term "method" originally distinguished "power" fencing from "fabric" fencing; the technology has outgrown that distinction, but the config file has not. So "fencing level" might be the more accurate term, but we still say "fencing method" because "method" is how you specify it in the config file.

Why does node X keep getting fenced?

There can be multiple causes for nodes that repeatedly get fenced, but the bottom line is that one of the nodes in your cluster isn't seeing enough "heartbeat" network messages from the node that's getting fenced.

Most of the time, these come down to flaky or faulty hardware, such as bad cables and bad ports on the network hub or switch.

Test your communications paths thoroughly without the cluster software running to make sure your hardware is okay.

Why does node X keep getting fenced at startup?

If your network is busy, your cluster may decide it's not getting enough heartbeat packets, but that may be due to other activities that happen when a node joins a cluster. You may have to increase the post_join_delay setting in your cluster.conf. It's basically a grace period to give the node more time to join the cluster. For example:

<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="600"/>

Is manual fencing supported?

No. No. A thousand times no. Oh sure, you can use it. But don't complain when a node needs to be fenced, the cluster locks up, and services don't fail over.

But I don't want to buy a network power switch. Why isn't manual fencing supported?

Because we can't be responsible when this happens:

  • A node stops sending heartbeats long enough to be dropped from the cluster but has not panicked and the hardware has not failed. There are a number of ways this could happen: a faulty network switch, a rogue application on the system that locks out other applications, someone trips over the network cable, or perhaps someone did a modprobe for a device driver that takes a very long time to download firmware and initialize the hardware, etc.
  • Fencing of the node is initiated in the cluster by one of the other members. Fence_manual is called, lock manager operations are put on hold until the fencing operation is complete. NOTE: Existing locks are still valid and I/O still continues for those activities not requiring additional lock requests.
  • System Administrator sees the fence_manual and immediately enters fence_ack_manual to get the cluster running again, prior to checking on the status of the failed node.
  • Journals for the fenced node are replayed and locks cleared for those entries so other operations can continue.
  • The fenced node continues to do read/write operations based on its last lock requests. File system is now corrupt.
  • Administrator now gets to the fenced node and resets it because operations ground to a halt due to no longer having status in the cluster.
  • File system corruption causes other nodes to panic.
  • System Administrator runs gfs_fsck for five days of lost production time trying to fix the corruption.
  • Administrator now complains that GFS is not stable and can't survive node failures. Ugly and untrue rumors start spreading about GFS corrupting data.

When will a node withdraw vs. getting fenced?

When a node can't talk to the rest of the cluster through its normal heartbeat packets, it will be fenced by another node.

If a GFS file system detects corruption due to an operation it has just performed, it will withdraw itself. Withdrawing from GFS is just slightly nicer than a kernel panic: it means the node feels it can no longer operate safely on that file system because it has discovered that one of its assumptions is wrong. Instead of panicking the kernel, it gives you an opportunity to reboot the node "nicely".

What's the right way to configure fencing when I have redundant power supplies?

You have to be careful when configuring fencing for redundant power supplies. If you configure it wrong, each power supply will be fenced separately and the other power supply will keep the system running, so the system won't really be fenced at all. What you want is for both power supplies to be shut off so the system is taken completely down: a set of two fencing devices inside a single fencing method.

If you're using dual power supplies, both of which are plugged into the same power switch, using ports 1 and 2, you can do something like this:

<clusternode name="node-01" votes="1">
   <fence>
      <method name="1">
         <device name="pwr01" action="off" switch="1" port="1"/>
         <device name="pwr01" action="off" switch="1" port="2"/>
         <device name="pwr01" action="on" switch="1" port="1"/>
         <device name="pwr01" action="on" switch="1" port="2"/>
      </method>
   </fence>
</clusternode>
...
<fencedevices>
   <fencedevice agent="fence_apc" ipaddr="192.168.0.101"
login="admin" name="pwr01" passwd="XXXXXXXXXXX"/>
</fencedevices>

The intrinsic problem with this, of course, is that if your UPS fails or needs to be swapped out, your system will lose power to both power supplies and you have down time. This is unacceptable in a High Availability (HA) cluster. To solve that problem, you'd really want redundant power switches and UPSes for the dual power supplies.

For example, let's say you have two APC network power switches (pwr01 and pwr02), each of which runs on its own separate UPS and has its own unique IP address. Let's assume that the first power supply of node 1 is plugged into port 1 of pwr01, and the second power supply is plugged into port 1 of pwr02. That way, port 1 on both switches is reserved for node 1, port 2 for node 2, etc. In your cluster.conf you can do something like this:

<clusternode name="node-01" votes="1">
   <fence>
      <method name="1">
         <device name="pwr01" action="off" switch="1" port="1"/>
         <device name="pwr02" action="off" switch="1" port="1"/>
         <device name="pwr01" action="on" switch="1" port="1"/>
         <device name="pwr02" action="on" switch="1" port="1"/>
      </method>
   </fence>
</clusternode>
...
<fencedevices>
   <fencedevice agent="fence_apc" ipaddr="192.168.0.101"
login="admin" name="pwr01" passwd="XXXXXXXXXXX"/>
   <fencedevice agent="fence_apc" ipaddr="192.168.1.101"
login="admin" name="pwr02" passwd="XXXXXXXXXXX"/>
</fencedevices>

Do you have any specific recommendations for configuring fencing devices?

We have some. For WTI please visit this link: http://lon.fedorapeople.org/wti_devices.html

Two Node: fencing gets stuck in JOIN_START_WAIT on boot, but the nodes can see each other. Why?

This has to do with how quorum and fencing work in two node clusters. Typically, fencing requires quorum in order to be allowed, which requires a majority of nodes to be online. In a two node cluster, this would mean both need to be online. So, the quorum issue is solved by allowing a quorum with one node in a two_node cluster. This, in turn, allows both nodes to fence.
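That one-node quorum exception is what the two_node flag in cluster.conf enables. For example:

<cman two_node="1" expected_votes="1"/>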

Now, when the two nodes boot disconnected from one another, they both become quorate and, therefore, both try to fence each other. If this operation fails, fencing is retried. All is fine.

A problem, however, arises if the two nodes later become connected but neither has been fenced. Each node aborts its fencing operation because the remote node has come "online". Unfortunately, because both nodes have state, the fencing operations should not be aborted. Or, put more bluntly, once the guns are out, someone must win the Showdown at the Cluster Corral.

Now, why does this matter? It turns out that spanning tree delays in intelligent (and often expensive) network switches can cause this behavior. Multicast packets are not forwarded immediately, causing openais/cman to form two one-node partitions. They try to fence each other, but fail because they cannot reach the fencing devices. Much (30-60 seconds) later, the partitions merge as the spanning tree algorithm completes. Fencing is averted, but no one won the showdown. Since fence partitions cannot merge by their very nature, the two nodes end up with the fence domain stuck in JOIN_START_WAIT.

You can fix this one of several ways at the moment:

  • Enable portfast or the equivalent behavior on your switch. This makes spanning tree algorithms run immediately, bypassing other phases of the discovery process.
  • Use a $19.99 ethernet switch from the local computer store instead of a $10,000 switch
  • Use a crossover cable temporarily
  • Use qdiskd in "dumb" (no heuristics) mode. This will ensure only one node forms quorum in a simultaneous boot state. If you also then disable the allow_kill and reboot options, doing this will not introduce any other significant behavioral changes in the cluster. The downside of this is that it requires shared storage.
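For the qdiskd option, a minimal "dumb" stanza (no heuristics) in cluster.conf might look like the following; the label is an assumption here and must match whatever you gave mkqdisk, and check the qdisk man page for your release before relying on these attributes:

<quorumd interval="1" tko="10" votes="1" label="myqdisk"
allow_kill="0" reboot="0"/>

With no heuristics defined, qdiskd simply acts as a tiebreaker: whichever node holds the quorum disk forms quorum first, so the two nodes never both believe they are quorate during a simultaneous boot.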