Changes between Version 30 and Version 31 of ResourceTrees

Timestamp: 05/22/11 17:17:19
Author: digimer
Comment: Fixed the last copy from the old wiki to work on the new Trac syntax.
[[TOC]]

== Resource Trees - Basics / Definitions ==
{{{
    <fs name="myfs" ...>
        <script name="script_child" .../>
    </fs>
}}}
 * ''fs:myfs'' is the '''parent''' of ''script:script_child''
 * ''script:script_child'' is the '''child''' of ''fs:myfs''
== Parent / Child Relationships, Dependencies & Start Ordering ==
The rules for parent/child relationships in the resource tree are fairly simple:
 * Parents are started before children
 * Children must all stop (cleanly) before a parent may be stopped
 * From these two, you could say that a ''child resource is dependent on its parent resource''
 * In order for a resource to be considered in good health, all of its dependent children must also be in good health
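To make these rules concrete, here is a minimal sketch reusing the names from the definitions above (the service wrapper and attribute values are illustrative):
{{{
    <service name="example">
        <fs name="myfs" ...>
            <script name="script_child" .../>
        </fs>
    </service>
}}}
On start, ''fs:myfs'' is brought up before ''script:script_child''; on stop, ''script:script_child'' must stop cleanly before ''fs:myfs'' may be stopped, and a failure of ''script:script_child'' also marks ''fs:myfs'' as unhealthy.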
== Sibling Start Ordering & Resource Child Ordering ==
[wiki:RGManager] allows specification of a start/stop ordering relationship for classes of child resources.  At the top level, we have the '''service''' resource - a special resource which acts as a link between rgmanager's service placement, dependency, and real resources themselves.  That is, the '''service''' resource is a '''container''' for other resources.  Let's examine the service resource's defined child ordering:
{{{
    <special tag="rgmanager">
        <attributes root="1" maxinstances="1"/>
        ...
        <child type="smb" start="8" stop="3"/>
        <child type="script" start="9" stop="1"/>
    </special>
}}}
The ''start'' attribute is the order (1..100) of that class of resources.  This means, as you have already guessed, that ''all'' lvm children are started first, followed by all fs children, followed by all script children, and so forth.  Ordering within a given resource type is preserved as it exists in cluster.conf.  For example, consider the following service:
{{{
    <service name="foo">
        <script name="1" .../>
        ...
        <fs name="1" .../>
        <lvm name="2" .../>
    </service>
}}}
 * The start ordering would be:
{{{
  lvm:1            # All lvms are started first...
  lvm:2            #
  fs:1             # then file systems...
  ip:10.1.1.1      # then ip addresses...
  script:1         # finally, scripts.
}}}
 * The stop ordering would be:
{{{
  script:1
  ip:10.1.1.1
  fs:1
  lvm:2
  lvm:1
}}}
With all the type-specified children, it's also important to note that all ''untyped children'' - children of a given resource node which do not have a <child> definition in the resource agent metadata - are started according to their order in cluster.conf and stopped in reverse order.  They are started after all type-specified children and stopped before any typed children.

For example:
{{{
    <service name="foo">
        <script name="1" .../>
        ...
        <fs name="1" .../>
        <lvm name="2" .../>
    </service>
}}}
 * The start ordering would be:
{{{
  lvm:1
  lvm:2
  ...
  script:1
  untypedresource:foo
  untypedresourcetwo:bar
}}}
 * The stop ordering would be:
{{{
  untypedresourcetwo:bar
  untypedresource:foo
  ...
  fs:1
  lvm:2
  lvm:1
}}}
== Inheritance, the <resources> Block, and Reusing Resources ==
Some resources benefit from inheriting values from a parent resource.  The most common, practical example I can give you is an NFS service.  Here's a typical NFS service configuration, set up for resource reuse and inheritance:
{{{
    <resources>
        <nfsclient name="bob" target="bob.test.com" options="rw,no_root_squash"/>
        ...
        </fs>
        <ip address="10.2.13.20"/>
    </service>
}}}
If we were to have a flat service (a service with no parent/child relationships), there are a couple of things that would be needed:
 * We'd need four nfsclient resources - one per file system (2) * one per target machine (2) = 4
 * We would have to specify export path & file system ID to each nfsclient, which introduces a greater chance for an error in the configuration.
With the above configuration, however, the resources named ''nfsclient:bob'' and ''nfsclient:jim'' are defined once, as is ''nfsexport:exports''.  Everything which needs to be known by those resources is inherited.  Because the inherited attributes are dynamic (and do not conflict with one another), it is possible to reuse these resources - which is why they are defined in the '''resources''' block.  Some resources cannot be used in multiple places (e.g. ''fs'' resources - mounting a file system on 2 nodes is a bad idea!), but there's no harm in defining them in the '''resources''' block if that is your preference.
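Resources declared in the '''resources''' block are pulled into a service by reference (the ''ref'' attribute names the resource's primary attribute) rather than being redefined.  A sketch of a second service reusing the same clients (the service and fs details here are illustrative):
{{{
    <service name="nfs_svc_two">
        <fs name="fs2" ...>
            <nfsexport ref="exports">
                <nfsclient ref="bob"/>
                <nfsclient ref="jim"/>
            </nfsexport>
        </fs>
    </service>
}}}
Each reference inherits its export path and file system ID from its parent ''fs'' resource, so the one ''nfsclient:bob'' definition serves both services.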
== Customizing Resource Actions ==
See ResourceActions

== Failure Recovery & Independent Subtrees ==
When a '''start''' operation fails for any resource, the operation immediately fails for the whole service and no additional resource '''start''' operations are attempted.

When a '''stop''' operation fails for any resource, rgmanager still attempts to stop all other remaining resources in the service, even if some are expected to fail to stop.  This is done in order to stop as many running resources as possible before placing the service into the ''failed'' state.

=== Independent Subtrees ===
When a '''status''' check fails for a resource, the normal course of action is to restart the whole service.  Suppose we have the following service:
{{{
    <service name="foo">
        <script name="script_one" ...>
            <script name="script_two" .../>
            <script name="script_three" .../>
        </script>
        <script name="script_four" .../>
    </service>
}}}
If any of the scripts defined in this service fail, the normal course of action is to restart (or relocate/disable, according to the service recovery policy) the service.  What if, however, we wanted parts of the service to be considered non-critical?  What if we wanted to restart only part of the service in place - before attempting normal recovery?  The solution is what we call the {{{__independent_subtree}}} attribute.  It's used in the following way:
{{{
    <service name="foo">
        <script name="script_one" __independent_subtree="1" ...>
            <script name="script_two" __independent_subtree="1" .../>
            <script name="script_three" .../>
        </script>
        <script name="script_four" .../>
    </service>
}}}
 * If ''script:script_one'' fails, we restart ''script:script_two, script:script_three'' and ''script:script_one''
 * If ''script:script_two'' fails, we restart just ''script:script_two''
 * If ''script:script_three'' fails, we restart ''script:script_one, script:script_two,'' and ''script:script_three''
 * If ''script:script_four'' fails, we restart the whole service

If an independent subtree is successfully restarted, rgmanager performs no other recovery actions.

Independent subtrees may also have per-subtree restart counters, similar to service restart counters.  They are declared by adding {{{__max_restarts}}} and {{{__restart_expire_time}}} to a given {{{__independent_subtree}}} declaration.  If a subtree's restart counters are exceeded, the service goes into recovery.  Otherwise, successful restarts of an independent subtree are not considered errors.
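For example, a sketch of an independent subtree with restart counters (the values shown are illustrative):
{{{
    <service name="foo">
        <script name="script_one" __independent_subtree="1"
                __max_restarts="3" __restart_expire_time="300" ...>
            <script name="script_two" .../>
        </script>
    </service>
}}}
Here the subtree rooted at ''script:script_one'' may be restarted in place up to 3 times within a 300-second window; exceeding that sends the whole service into recovery.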

=== Non-Critical Subtrees ===
What if you want a subset of resources to be considered non-critical?  For example, suppose ''script:script_one'' and its children are far less important than ''script:script_four'', and we want to keep ''script:script_four'' up even if the others fail.  You can tag a subtree not only as independent in rgmanager, but also '''non-critical'''.  This is done by setting the {{{__independent_subtree}}} attribute to 2:
{{{
    <service name="foo">
        <script name="script_one" __independent_subtree="2" ...>
            <script name="script_two" __independent_subtree="1" .../>
            <script name="script_three" .../>
        </script>
        <script name="script_four" .../>
    </service>
}}}
 * If ''script:script_one'' fails, we stop ''script:script_two, script:script_three'' and ''script:script_one''
 * If ''script:script_two'' fails, we restart just ''script:script_two'' (Notice how script_two is a normal independent subtree!)
 * If ''script:script_three'' fails, we stop ''script:script_one, script:script_two,'' and ''script:script_three''
 * If ''script:script_four'' fails, we restart the whole service

A non-critical subtree is immediately stopped if an error occurs at any level of the subtree and {{{__max_restarts}}} or {{{__restart_expire_time}}} are unset.

Whenever a non-critical subtree's maximum restart threshold is exceeded, the subtree is stopped, and the service gains a {{{P}}} flag (partial).  It is possible to restore a service to full operation by using the {{{clusvcadm -c}}} (convalesce) operation.
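For example, to convalesce a partial service (the service name ''foo'' is illustrative):
{{{
clusvcadm -c foo
}}}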

== Testing your Configuration ==
We provide a utility for debugging/testing services and resource ordering called ''rg_test''.  ''rg_test'' can:
 * Show you the resource rules it understands:
{{{
rg_test rules
}}}
 * Test your configuration (and /usr/share/cluster) for errors or redundant resource agents:
{{{
rg_test test /etc/cluster/cluster.conf
}}}
 * Show you the start/stop ordering of a given service:
{{{
rg_test noop /etc/cluster/cluster.conf start service <servicename>
rg_test noop /etc/cluster/cluster.conf stop service <servicename>
}}}
 * Explicitly start/stop a service (NOTE: Only do this on one node, and always disable the service in rgmanager first!).  This is useful for debugging configurations or looking for errors before putting a service into production:
{{{
rg_test test /etc/cluster/cluster.conf start service <servicename>
rg_test test /etc/cluster/cluster.conf stop service <servicename>
}}}
 * Explicitly start/stop a resource (NOTE: Only do this on one node, and always disable the parent service in rgmanager first!  Also, this does NOT start the rest of the service(s) which reference this resource; only the resource itself):
{{{
rg_test test /etc/cluster/cluster.conf start <resource_type> <primary_attribute>
rg_test test /etc/cluster/cluster.conf stop <resource_type> <primary_attribute>
}}}
 * Calculate and display the resource tree delta between two cluster.confs:
{{{
rg_test delta /etc/cluster/cluster.conf.bak /etc/cluster/cluster.conf
}}}