Last modified on 08/18/11 18:11:10

Resource Actions

RGManager loosely follows the OCF RA API 1.0 Draft with respect to resource actions.

There are only a few resource agent actions which rgmanager calls:

  • start - start the resource
  • stop - stop the resource
  • status - check the status of the resource
    • multiple depths are supported
    • per-resource overrides of the status actions are supported
  • metadata - report the OCF RA XML metadata

monitor vs. status

The monitor action is specified by the OCF RA API 1.0 Draft. It differs from the status operation in that its exit codes are different from the LSB-defined return codes for status. Rgmanager does not call monitor, and even if it did, it treats all nonzero return codes as not running, which means recovery is in order. More sophisticated resource managers, such as the Pacemaker CRM, use the monitor action to determine where resources currently exist cluster-wide so that they can avoid double-starts. Rgmanager instead uses a simpler (or more primitive, if you prefer) stop-before-start behavior to ensure the cluster is clear prior to starting any resources.

Depth, Intervals, and Timeouts

Rgmanager supports multiple depths for the status action. Generally speaking, a depth is a measure of how intensive the operation is supposed to be. For example, a file system agent might only check the presence of the mount at a depth of 0, while at a higher depth level of 10, the agent may attempt to read from the file system. At an even higher depth level of 20, the file system agent may attempt to write to the file system. The value of a depth has no meaning by itself and is resource-agent dependent. That said, given a resource agent which supports multiple depths, 0 must be the least intensive check, and the highest depth value provided must be the most intensive check. In our example, depths of 1, 2, and 3 (instead of 0, 10, 20, respectively) would have achieved the same result.
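As a rough illustration of the depth idea, here is a minimal Python sketch of a status handler for the file system example. It is not taken from any rgmanager agent: the function names, the CHECKS table, and the choice to make checks cumulative (a depth N status also runs every check at a lower level) are assumptions made for the example, and a plain directory test stands in for a real mount check so that the sketch runs anywhere.

```python
import os
import tempfile


def check_present(path):
    # Stand-in for "is the file system mounted?" (assumption: a directory test).
    return os.path.isdir(path)


def check_readable(path):
    # Depth 10 in the example: try to read from the file system.
    try:
        os.listdir(path)
        return True
    except OSError:
        return False


def check_writable(path):
    # Depth 20 in the example: try to write to the file system.
    try:
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


# Hypothetical depth table; only the ordering of the keys matters.
CHECKS = {0: check_present, 10: check_readable, 20: check_writable}


def status(path, depth):
    """Return 0 if every check at or below `depth` passes, 1 otherwise."""
    for level in sorted(CHECKS):
        if level > depth:
            break
        if not CHECKS[level](path):
            return 1
    return 0
```

Renumbering the keys to 1, 2, and 3 would leave the behavior unchanged, provided the requested depths are renumbered to match.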

Rgmanager supports per-depth status intervals. Generally speaking, a more intensive check should occur less frequently so as to not waste resources. For example, checking that a file system is mounted takes fewer resources than also writing to the file system, so the write check should be run less frequently than the mount check. Rgmanager handles multiple depths by periodically scanning the resource tree and looking for any resources which have not been checked at their desired interval. When a resource needs to be checked, the highest depth check which has expired is chosen. Continuing with our example, suppose our depth 0 check was every 30 seconds, our depth 10 check was every 60 seconds, and our depth 20 check was every 120 seconds.

Starting from a point in time t, here is what will happen given the above example:

  • Around t+30, a depth 0 check is performed (0 is expired)
  • Around t+60, a depth 10 check is performed (0 and 10 are expired)
  • Around t+90, a depth 0 check is performed (0 is expired)
  • Around t+120, a depth 20 check is performed (0, 10, and 20 are expired)
  • (repeat)
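The timeline above can be reproduced with a small simulation. This is a sketch of the selection rule only, not rgmanager's actual code: pick_depth, simulate, and the scan_period parameter are hypothetical names, and the sketch assumes that running a check at some depth also resets the timers of all lower depths.

```python
def pick_depth(intervals, last_run, now):
    """Return the highest depth whose interval has expired at `now`, or None."""
    expired = [d for d, interval in intervals.items()
               if now - last_run[d] >= interval]
    return max(expired) if expired else None


def simulate(intervals, scan_period, end):
    """Scan every `scan_period` seconds up to `end`; return (time, depth) pairs."""
    last_run = {d: 0 for d in intervals}  # t = 0 is the starting point
    timeline = []
    for now in range(scan_period, end + 1, scan_period):
        depth = pick_depth(intervals, last_run, now)
        if depth is None:
            continue
        # Assumption: a check at this depth also counts for all lower depths.
        for d in intervals:
            if d <= depth:
                last_run[d] = now
        timeline.append((now, depth))
    return timeline


if __name__ == "__main__":
    for when, depth in simulate({0: 30, 10: 60, 20: 120}, 10, 240):
        print("t+%d: depth %d" % (when, depth))
```

With intervals of 30, 60, and 120 seconds, the simulation yields depth 0 at t+30, depth 10 at t+60, depth 0 at t+90, and depth 20 at t+120, after which the pattern repeats.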

Notes:

  • If a status check has not completed, rgmanager will not queue up status checks for the resource.
  • It is recommended that the interval of each subsequent depth be a multiple of the previous depth's interval, but this is not required.

Timeouts are not enforced by rgmanager by default. Rgmanager has two methods of handling timeouts:

  • It passes the timeout value to the resource agent, which may then deal with it as it sees fit, and
  • It is possible to turn on timeout enforcement for a resource at its position in the resource tree in order to make rgmanager kill the resource agent if the timeout is exceeded. This is not a recommended practice. See the section below on enabling action timeouts.

Return Values

OCF has a wide range of return codes for the monitor operation, but since rgmanager calls status, it relies almost exclusively on SysV-style return codes.

  • 0 - success
    • stop after stop or stop when not running must return success
    • start after start or start when running must return success
  • nonzero - failure
    • if the stop operation ever returns a nonzero value, the service enters the failed state and the service must be recovered manually.
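The return-value rules above can be illustrated with a toy in-memory resource. This is only a sketch: ToyResource and its running flag are hypothetical stand-ins for a real agent, and 1 is used as an arbitrary nonzero failure code.

```python
OK = 0
NOT_RUNNING = 1  # any nonzero value is treated as failure


class ToyResource:
    """Hypothetical resource whose only state is a 'running' flag."""

    def __init__(self):
        self.running = False

    def start(self):
        # start when already running must still report success
        if not self.running:
            self.running = True
        return OK

    def stop(self):
        # stop when not running must still report success
        if self.running:
            self.running = False
        return OK

    def status(self):
        # nonzero tells the caller the resource is not running
        return OK if self.running else NOT_RUNNING
```

The point of the sketch is that start and stop are idempotent: calling either one twice in a row returns 0 both times, so a stop-before-start sequence never fails merely because the resource was already stopped.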

Customizing Actions

Resource actions may be customized in cluster.conf. To add or change a resource action, you can just add it as a child of the resource definition.

Typically, this is done from within the <resources> section:

<resources>
  <fs name="foo" ... >
    <action name="start" timeout="3" />
  </fs>
</resources>
<service name="bar">
  <fs ref="foo"/>
</service>

You may also override the action on resources defined inline in the tree:

<resources/>
<service name="bar">
  <fs name="foo" ... >
    <action name="start" timeout="3" />
  </fs>
</service>

You may not, however, redefine actions on references. The following will not work:

<resources>
  <fs name="foo" ... />
</resources>
<service name="bar">
  <fs ref="foo">
    <action name="start" timeout="3" />
  </fs>
</service>

Customizing Actions with Multiple Depths

Some agents implement multiple monitoring or status depths with different timeouts in order to perform different levels of checking. For example, a '0' depth for the status check of an IP address may be a simple existence check, a '10' depth might include ethernet link checking, and a '20' depth may include pinging an upstream router.

If you wish, you may override all depths of a given action by using an asterisk as the depth in cluster.conf. For example:

<ip ... >
  <action name="status" depth="*" interval="30" />
</ip>

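Note that doing this causes only the highest (most invasive) check to be performed: all depths then share the same interval and expire together, and rgmanager always chooses the highest expired depth.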

Using Action Timeouts

Timeouts are not normally checked or enforced by RGManager. In order for a timeout to be enforced, you must explicitly set the __enforce_timeouts special parameter to 1. For example:

<fs name="foo" __enforce_timeouts="1" ... />

Note that the action is considered failed if the timeout expires. If you set a timeout on the stop action and enable timeout enforcement, the service will enter the failed state if the timeout expires!

A better approach is to have your agent look at OCF_RESKEY_RGMANAGER_meta_timeout and implement its own timeout checking for any given action.
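A minimal sketch of agent-side timeout handling follows. It is written in Python purely for illustration, and everything apart from the OCF_RESKEY_RGMANAGER_meta_timeout variable name is an assumption: the helper names are hypothetical, the value is assumed to be a whole number of seconds, and 1 is used as an arbitrary nonzero failure code.

```python
import os
import subprocess
import sys

TIMEOUT_VAR = "OCF_RESKEY_RGMANAGER_meta_timeout"


def meta_timeout(env=None):
    """Return the timeout as a positive int, or None if unset or invalid."""
    env = os.environ if env is None else env
    try:
        value = int(env.get(TIMEOUT_VAR, ""))
    except ValueError:
        return None
    return value if value > 0 else None


def run_action(cmd, timeout=None):
    """Run `cmd`; return its exit code, or 1 if it exceeds `timeout` seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return 1


def main(cmd):
    # Hypothetical entry point: enforce the timeout rgmanager passed in, if any.
    sys.exit(run_action(cmd, meta_timeout()))
```

Handling the timeout inside the agent lets it clean up and choose its own return code, rather than being killed from outside by rgmanager.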