Last modified on 10/29/13 20:52:29
SANLOCK(8)                  System Manager's Manual                 SANLOCK(8)

       sanlock - shared storage lock manager

       sanlock [COMMAND] [ACTION] ...

       The sanlock daemon manages leases for applications running on a cluster
       of hosts with shared storage.  All lease management and coordination is
       done  through  reading  and  writing blocks on the shared storage.  Two
       types of leases are used, each based on a different algorithm:

       "delta leases" are slow to acquire and require regular  i/o  to  shared
       storage.   A delta lease exists in a single sector of storage.  Acquir‐
       ing a delta lease involves reads and writes to that sector separated by
       specific  delays.  Once acquired, a lease must be renewed by updating a
       timestamp in the sector regularly.  sanlock uses a delta  lease  inter‐
       nally  to  hold a lease on a host_id.  host_id leases prevent two hosts
       from using the same host_id and provide basic host liveness information
       based on the renewals.

       "paxos  leases"  are  generally  fast to acquire and sanlock makes them
       available to applications as general purpose resource leases.  A  paxos
       lease  exists in 1MB of shared storage (8MB for 4k sectors).  Acquiring
       a paxos lease involves reads and writes to max_hosts (2000) sectors  in
       a  specific  sequence  specified  by  the  Disk Paxos algorithm.  paxos
       leases use host_id's internally to indicate the owner of the lease, and
       the algorithm fails if different hosts use the same host_id.  So, delta
       leases provide the unique host_id's used in paxos leases.  paxos leases
       also refer to delta leases to check if a host_id is alive.

       Before  sanlock  can be used, the user must assign each host a host_id,
       which is a number between 1 and 2000.  Two hosts should  not  be  given
       the  same host_id (even though delta leases attempt to detect this mis‐
       take.)

       sanlock views a pool of storage as a "lockspace".  Each  distinct  pool
       of  storage, e.g. from different sources, would typically be defined as
       a separate lockspace, with a unique lockspace name.

       Part of this storage space must be reserved and initialized for sanlock
       to  store delta leases.  Each host that wants to use the lockspace must
       first acquire a delta lease on its host_id number within the lockspace.
       (See  the add_lockspace action/api.)  The space required for 2000 delta
       leases in the lockspace (for 2000 possible host_id's) is 1MB  (8MB  for
       4k  sectors).   (This  is  the  same  size  required for a single paxos
       lease.)

       More storage space must be reserved and initialized for  paxos  leases,
       according to the needs of the applications using sanlock.
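
       The space figures above can be checked with a little arithmetic.  The
       following Python sketch assumes that "1MB" means 1048576 bytes (the
       offset used in the examples on this page) and that the lease area for
       4k sectors is eight times that.  The function names are illustrative
       and are not part of sanlock or libsanlock.

```python
MIB = 1024 * 1024
MAX_HOSTS = 2000  # default number of host_ids per lockspace


def lease_area_size(sector_size):
    """Bytes reserved for one lease area (2000 delta leases or one paxos lease)."""
    if sector_size == 512:
        return 1 * MIB
    if sector_size == 4096:
        return 8 * MIB
    raise ValueError("unsupported sector size: %d" % sector_size)


def storage_needed(sector_size, num_resources):
    """One delta lease area for the lockspace plus one area per resource lease."""
    return lease_area_size(sector_size) * (1 + num_resources)


for sector_size in (512, 4096):
    used = MAX_HOSTS * sector_size       # one sector per possible host_id
    area = lease_area_size(sector_size)
    print(sector_size, used, area, used <= area)
```

       In both cases 2000 sectors fit inside the reserved area, with some
       space left unused at the end.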

       The  following  steps illustrate these concepts using the command line.
       Applications may choose to do these same steps through libsanlock.

       1. Create storage pools and reserve and initialize host_id leases
       two different LUNs on a SAN: /dev/sdb, /dev/sdc
       # vgcreate pool1 /dev/sdb
       # vgcreate pool2 /dev/sdc
       # lvcreate -n hostid_leases -L 1MB pool1
       # lvcreate -n hostid_leases -L 1MB pool2
       # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
       # sanlock direct init -s LS2:0:/dev/pool2/hostid_leases:0

       2. Start the sanlock daemon on each host
       # sanlock daemon

       3. Add each lockspace to be used
       host1:
       # sanlock client add_lockspace -s LS1:1:/dev/pool1/hostid_leases:0
       # sanlock client add_lockspace -s LS2:1:/dev/pool2/hostid_leases:0
       host2:
       # sanlock client add_lockspace -s LS1:2:/dev/pool1/hostid_leases:0
       # sanlock client add_lockspace -s LS2:2:/dev/pool2/hostid_leases:0

       4. Applications can now reserve/initialize space for  resource  leases,
       and then acquire the leases as they need to access the resources.

       The  resource  leases that are created and how they are used depends on
       the application.  For example, say application A, running on host1  and
       host2,   needs   to   synchronize   access   to   data   it  stores  on
       /dev/pool1/Adata.  A could use a resource lease as follows:

       5. Reserve and initialize a single resource lease for Adata
       # lvcreate -n Adata_lease -L 1MB pool1
       # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0

       6. Acquire the lease from the app using libsanlock (see
       sanlock_register, sanlock_acquire).  If the app is already running as
       pid 123, and has registered with the sanlock daemon, the lease can be
       added for it from the command line:
       # sanlock client acquire -r LS1:Adata:/dev/pool1/Adata_lease:0 -p 123


       offsets  must  be  1MB aligned for disks with 512 byte sectors, and 8MB
       aligned for disks with 4096 byte sectors.

       offsets may be used to place leases on  the  same  device  rather  than
       using  separate  devices  and offset 0 as shown in examples above, e.g.
       these commands above:
       # sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
       # sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0
       could be replaced by:
       # sanlock direct init -s LS1:0:/dev/pool1/leases:0
       # sanlock direct init -r LS1:Adata:/dev/pool1/leases:1048576
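
       When several leases share one device, each lease area starts at a
       multiple of the alignment.  The Python sketch below packs a lockspace
       and its resource leases onto one path and builds the corresponding
       option strings.  It assumes 1MB means 1048576 bytes; the helper names
       and the second resource name (Bdata) are hypothetical.

```python
# Required alignment in bytes, keyed by sector size.
ALIGN = {512: 1048576, 4096: 8 * 1048576}


def is_aligned(offset, sector_size=512):
    """True if offset is a valid lease offset for this sector size."""
    return offset % ALIGN[sector_size] == 0


def layout(lockspace, path, resources, sector_size=512):
    """Return (lockspace_string, [resource_strings]) packed from offset 0."""
    step = ALIGN[sector_size]
    # The host_id field is ignored by "direct init", so 0 is used here.
    ls = "%s:0:%s:0" % (lockspace, path)
    rs = ["%s:%s:%s:%d" % (lockspace, name, path, (i + 1) * step)
          for i, name in enumerate(resources)]
    return ls, rs


ls, rs = layout("LS1", "/dev/pool1/leases", ["Adata", "Bdata"])
print(ls)
for r in rs:
    print(r)
```

       The first two strings produced match the -s and -r arguments shown in
       the replacement commands above.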


       If a process holding resource leases fails or exits  without  releasing
       its leases, sanlock will release the leases for it automatically.

       If  the  sanlock daemon cannot renew a lockspace host_id for a specific
       period of time (usually because storage access is lost),  sanlock  will
       kill any process holding a resource lease within the lockspace.

       If  the  sanlock  daemon crashes or gets stuck, it will no longer renew
       the expiry time of its per-host_id connections to the wdmd daemon,  and
       the watchdog device will reset the host.


       sanlock  uses  the  wdmd(8) daemon to access /dev/watchdog.  A separate
       connection to wdmd is maintained for each host_id being renewed.  Each
       host_id connection has an expiry time some seconds in the future.
       After each successful host_id renewal, sanlock updates the
       associated  expiry time in wdmd.  If wdmd finds any connection expired,
       it will not pet /dev/watchdog.  After enough successive  expired/failed
       checks, the watchdog device will fire and reset the host.

       After a number of failed attempts to renew a host_id, sanlock kills any
       process using that lockspace.  Once all those  processes  have  exited,
       sanlock  will  unregister the associated wdmd connection.  wdmd will no
       longer find the expired connection, and will resume petting /dev/watch‐
       dog  (assuming  it finds no other failed/expired tests.)  If the killed
       processes did not exit quickly enough, the expired wdmd connection will
       not be unregistered, and /dev/watchdog will reset the host.
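
       The interaction between host_id renewals, wdmd connections and the
       watchdog can be pictured with a toy model.  The Python sketch below is
       not the wdmd implementation: the class, its method names and the time
       values are invented for illustration, and it treats a connection as
       expired once the current time reaches its expiry time.

```python
class WatchdogModel:
    """Toy model of per-host_id wdmd connections."""

    def __init__(self):
        self.expiry = {}  # connection name -> absolute expiry time (seconds)

    def register(self, name, expire_at):
        self.expiry[name] = expire_at

    def renew(self, name, expire_at):
        # Called after each successful host_id renewal.
        self.expiry[name] = expire_at

    def unregister(self, name):
        # Called once all processes using the lockspace have exited.
        self.expiry.pop(name, None)

    def should_pet(self, now):
        # The watchdog is petted only while no connection has expired.
        return all(now < t for t in self.expiry.values())


w = WatchdogModel()
w.register("LS1", expire_at=60)
w.register("LS2", expire_at=60)
w.renew("LS2", expire_at=80)     # LS2 renewed, LS1 lost storage access
before = w.should_pet(50)        # both connections still valid
expired = w.should_pet(70)       # LS1 expired, petting stops
w.unregister("LS1")              # processes in LS1 exited in time
resumed = w.should_pet(70)       # petting resumes
print(before, expired, resumed)
```

       If unregister is never called for the expired connection, should_pet
       keeps returning False, which corresponds to the watchdog eventually
       resetting the host.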

       Based on these known timeout values, sanlock on another host can calcu‐
       late, based on the last host_id renewal, when the failed host will have
       been reset by its watchdog (or killed all the necessary processes).

       If the sanlock daemon itself fails, crashes, or gets stuck, it will no
       longer update the expiry time for  its  host_id  connections  to  wdmd,
       which will also lead to the watchdog resetting the host.


       sanlock leases are meant to guarantee that two processes on two hosts are
       never allowed to hold the same resource lease at once.  If  they  were,
       the  resource being protected may be corrupted.  There are three levels
       of protection built into sanlock itself:

       1. The paxos leases and delta leases themselves.

       2. If the  leases  cannot  function  because  storage  access  is  lost
       (host_id's  cannot be renewed), the sanlock daemon kills any pids using
       resource leases in the lockspace.

       3. If the pids do not exit after being killed, or if the sanlock daemon
       fails, the watchdog device resets the host.

       COMMAND can be one of three primary top level choices

       sanlock daemon   start daemon
       sanlock client   send request to daemon (default command if none given)
       sanlock direct   access storage directly (no coordination with daemon)

       sanlock daemon [options]

       -D no fork and print all logging to stderr

       -Q 0|1 quiet error messages for common lock contention

       -R 0|1 renewal debugging, log debug info for each renewal

       -L pri write logging at priority level and up to logfile (-1 none)

       -S pri write logging at priority level and up to syslog (-1 none)

       -U uid user id

       -G gid group id

       -t num max worker threads

       -g sec seconds for graceful recovery

       -w 0|1 use watchdog through wdmd

       -h 0|1 use high priority (RR) scheduling

       -l num use mlockall (0 none, 1 current, 2 current and future)

       -a 0|1 use async i/o

       sanlock client action [options]

       sanlock client status

       Print processes, lockspaces, and resources being managed by the sanlock
       daemon.  Add -D to show extra internal  daemon  status  for  debugging.
       Add  -o  p  to  show  resources  by  pid,  or -o s to show resources by
       lockspace.

       sanlock client host_status

       Print state of host_id delta  leases  read  during  the  last  renewal.
       State  of  all  lockspaces  is shown (use -s to select one).  Add -D to
       show extra internal daemon status for debugging.

       sanlock client gets

       Print lockspaces being managed by the sanlock  daemon.   The  LOCKSPACE
       string  will  be  followed  by ADD or REM if the lockspace is currently
       being added or removed.  Add -h 1 to also show hosts in each lockspace.

       sanlock client log_dump

       Print the sanlock daemon internal debug log.

       sanlock client shutdown

       Ask the sanlock daemon to exit.  Without the force option (-f  0),  the
       command will be ignored if any lockspaces exist.  With the force option
       (-f 1), any registered processes will be killed, their resource  leases
       released, and lockspaces removed.

       sanlock client init -s LOCKSPACE

       Tell  the  sanlock  daemon  to  initialize a lockspace on disk.  The -o
       option can be used to specify the io  timeout  to  be  written  in  the
       host_id leases.  (Also see sanlock direct init.)

       sanlock client init -r RESOURCE

       Tell  the sanlock daemon to initialize a resource lease on disk.  (Also
       see sanlock direct init.)

       sanlock client read -s LOCKSPACE

       Tell the sanlock daemon to  read  a  lockspace  from  disk.   Only  the
       LOCKSPACE  path and offset are required.  If host_id is zero, the first
       record at offset (host_id 1) is used.  The complete  LOCKSPACE  and  io
       timeout are printed.

       sanlock client read -r RESOURCE

       Tell  the  sanlock daemon to read a resource lease from disk.  Only the
       RESOURCE path and  offset  are  required.   The  complete  RESOURCE  is
       printed.  (Also see sanlock direct read_leader.)

       sanlock client align -s LOCKSPACE

       Tell  the  sanlock  daemon to report the required lease alignment for a
       storage path.  Only path is used from the LOCKSPACE argument.

       sanlock client add_lockspace -s LOCKSPACE

       Tell the sanlock  daemon  to  acquire  the  specified  host_id  in  the
       lockspace.   This will allow resources to be acquired in the lockspace.
       The -o option can be used to specify the io timeout  of  the  acquiring
       host, and will be written in the host_id lease.

       sanlock client inq_lockspace -s LOCKSPACE

       Inquire about the state of the lockspace in the sanlock daemon, whether
       it is being added or removed, or is joined.

       sanlock client rem_lockspace -s LOCKSPACE

       Tell the sanlock  daemon  to  release  the  specified  host_id  in  the
       lockspace.   Any  processes  holding  resource leases in this lockspace
       will be killed, and the resource leases not released.

       sanlock client command -r RESOURCE -c path args

       Register with the sanlock daemon, acquire the specified resource lease,
       and  exec  the  command at path with args.  When the command exits, the
       sanlock daemon will release the lease.  -c must be the final option.

       sanlock client acquire -r RESOURCE -p pid
       sanlock client release -r RESOURCE -p pid

       Tell the sanlock daemon to acquire or release  the  specified  resource
       lease  for  the given pid.  The pid must be registered with the sanlock
       daemon.  acquire  can  optionally  take  a  versioned  RESOURCE  string
       RESOURCE:lver,  where  lver  is  the  version of the lease that must be
       acquired, or fail.

       sanlock client inquire -p pid

       Print the resource leases held by the given pid.  The format is a
       versioned RESOURCE string "RESOURCE:lver" where lver is the version of
       the lease held.

       sanlock client request -r RESOURCE -f force_mode

       Request that the owner of a resource take the action specified by
       force_mode.
       A  versioned  RESOURCE:lver  string must be used with a greater version
       than is presently held.  Zero lver and force_mode clears the request.

       sanlock client examine -r RESOURCE

       Examine the request record for the currently held  resource  lease  and
       carry out the action specified by the requested force_mode.

       sanlock client examine -s LOCKSPACE

       Examine  requests  for  all resource leases currently held in the named
       lockspace.  Only lockspace_name is used from the LOCKSPACE argument.

       sanlock direct action [options]

       -a 0|1 use async i/o

       -o sec io timeout in seconds

       sanlock direct init -s LOCKSPACE
       sanlock direct init -r RESOURCE

       Initialize storage for  2000  host_id  (delta)  leases  for  the  given
       lockspace,  or initialize storage for one resource (paxos) lease.  Both
       options require 1MB of space.  The host_id in the LOCKSPACE  string  is
       not  relevant to initialization, so the value is ignored.  (The default
       of 2000 host_ids  can  be  changed  for  special  cases  using  the  -n
       num_hosts  and -m max_hosts options.)  With -s, the -o option specifies
       the io timeout to be written in the host_id leases.

       sanlock direct read_leader -s LOCKSPACE
       sanlock direct read_leader -r RESOURCE

       Read a leader record from disk and print the fields.  The leader record
       is  the  single sector of a delta lease, or the first sector of a paxos
       lease.

       sanlock direct dump path[:offset]

       Read disk sectors and print leader records for delta or  paxos  leases.
       Add  -f  1  to  print  the  request record values for paxos leases, and
       host_ids set in delta lease bitmaps.

   LOCKSPACE option string
       -s lockspace_name:host_id:path:offset

       lockspace_name name of lockspace
       host_id local host identifier in lockspace
       path path to storage reserved for leases
       offset offset on path (bytes)

   RESOURCE option string
       -r lockspace_name:resource_name:path:offset

       lockspace_name name of lockspace
       resource_name name of resource
       path path to storage reserved for leases
       offset offset on path (bytes)

   RESOURCE option string with version
       -r lockspace_name:resource_name:path:offset:lver

       lver leader version or SH for shared lease
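
       The option strings above are plain colon-separated fields.  A minimal
       Python sketch of splitting them follows.  It assumes that no field
       contains a colon; the tuple and function names are illustrative and
       not part of libsanlock.

```python
from collections import namedtuple

Lockspace = namedtuple("Lockspace", "name host_id path offset")
Resource = namedtuple("Resource", "lockspace name path offset lver")


def parse_lockspace(s):
    """Split lockspace_name:host_id:path:offset."""
    name, host_id, path, offset = s.split(":")
    return Lockspace(name, int(host_id), path, int(offset))


def parse_resource(s):
    """Split lockspace_name:resource_name:path:offset[:lver]."""
    parts = s.split(":")
    if len(parts) not in (4, 5):
        raise ValueError("bad RESOURCE string: %r" % s)
    lver = None
    if len(parts) == 5:
        # lver is a leader version number, or SH for a shared lease.
        lver = "SH" if parts[4] == "SH" else int(parts[4])
    return Resource(parts[0], parts[1], parts[2], int(parts[3]), lver)


print(parse_lockspace("LS1:1:/dev/pool1/hostid_leases:0"))
print(parse_resource("LS1:Adata:/dev/pool1/Adata_lease:0"))
print(parse_resource("LS1:Adata:/dev/pool1/Adata_lease:0:SH"))
```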

       sanlock help shows the default values for the options above.

       sanlock version shows the build version.

       The first part of making a  request  for  a  resource  is  writing  the
       request  record  of  the  resource  (the  sector  following  the leader
       record).  To make a successful request:

       ·  RESOURCE:lver must be greater than the lver presently  held  by  the
          other host.  This implies the leader record must be read to discover
          the lver, prior to making a request.

       ·  RESOURCE:lver must be greater than or equal to  the  lver  presently
          written to the request record.  Two hosts may write a new request at
          the same time for the same lver, in which case both  would  succeed,
          but the force_mode from the last would win.

       ·  The force_mode must be greater than zero.

       ·  To  unconditionally  clear  the  request  record  (set both lver and
          force_mode to 0), make request with RESOURCE:0 and force_mode 0.
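
       The rules above can be collected into a single check.  The Python
       sketch below only restates them; it is not how sanlock itself
       validates a request, and the function and parameter names are
       illustrative.

```python
def check_request(req_lver, force_mode, held_lver, record_lver):
    """Return "clear", "ok", or a string describing why the request fails.

    held_lver   -- lver in the leader record (held by the current owner)
    record_lver -- lver presently written in the request record
    """
    if req_lver == 0 and force_mode == 0:
        return "clear"  # unconditionally clears the request record
    if force_mode <= 0:
        return "force_mode must be greater than zero"
    if req_lver <= held_lver:
        return "lver must be greater than the lver presently held"
    if req_lver < record_lver:
        return "lver must not be less than the lver in the request record"
    return "ok"


print(check_request(6, 1, held_lver=5, record_lver=0))
print(check_request(5, 1, held_lver=5, record_lver=0))
print(check_request(6, 2, held_lver=5, record_lver=6))
print(check_request(0, 0, held_lver=5, record_lver=6))
```

       The third call shows the case of two hosts writing a request for the
       same lver: both are accepted, and the force_mode written last wins.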

       The owner of the requested resource will not know of the request unless
       it  is  explicitly  told  to  examine  its  resources via the "examine"
       api/command, or otherwise notified.

       The second part of making a request is  notifying  the  resource  lease
       owner  that  it  should  examine  the  request  records of its resource
       leases.  The notification will cause the lease owner  to  automatically
       run  the  equivalent  of  "sanlock client examine -s LOCKSPACE" for the
       lockspace of the requested resource.

       The notification is made using a bitmap in each  host_id  delta  lease.
       Each bit represents one of the possible host_ids (1-2000).  If host A
       wants to notify host B to examine its resources, A sets the bit in  its
       own  bitmap  that  corresponds to the host_id of B.  When B next renews
       its delta lease, it reads the delta leases for  all  hosts  and  checks
       each  bitmap  to see if its own host_id has been set.  It finds the bit
       for its own host_id set  in  A's  bitmap,  and  examines  its  resource
       request  records.   (The bit remains set in A's bitmap for request_fin‐

       force_mode determines the action the resource lease owner should take:

       1 (FORCE): kill the process  holding  the  resource  lease.   When  the
       process  has  exited, the resource lease will be released, and can then
       be acquired by anyone.  The kill  signal  is  SIGKILL  (or  SIGTERM  if
       SIGKILL is restricted.)

       2  (GRACEFUL):  run  the program configured by sanlock_killpath against
       the process holding the resource lease.  If  no  killpath  is  defined,
       then FORCE is used.
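
       The choice between the two modes, including the fallback from
       GRACEFUL to FORCE, can be written as a small Python function.  This is
       a simplified sketch: it ignores the SIGTERM substitution mentioned
       above, and the function name and the example killpath are
       hypothetical.

```python
FORCE, GRACEFUL = 1, 2


def owner_action(force_mode, killpath=None):
    """Describe what the lease owner does for a given force_mode."""
    if force_mode == GRACEFUL and killpath:
        return ("run", killpath)
    if force_mode in (FORCE, GRACEFUL):
        # GRACEFUL without a configured killpath falls back to FORCE.
        return ("kill", "SIGKILL")
    raise ValueError("unknown force_mode: %r" % (force_mode,))


print(owner_action(FORCE))
print(owner_action(GRACEFUL))
print(owner_action(GRACEFUL, killpath="/usr/local/bin/example-killpath"))
```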

   Graceful recovery
       When  a  lockspace  host_id  cannot be renewed for a specific period of
       time, sanlock enters a recovery mode in which it attempts  to  forcibly
       release  any  resource leases in that lockspace.  If all the leases are
       not released within 60 seconds, the watchdog will fire,  resetting  the
       host.

       The  most  immediate way of releasing the resource leases in the failed
       lockspace is by sending SIGKILL to all pids  holding  the  leases,  and
       automatically  releasing  the  resource leases as the pids exit.  After
       all pids have exited, no resource leases are held in the lockspace, the
       watchdog  expiration  is  removed,  and the host can avoid the watchdog
       reset.

       A slightly more graceful approach is to send SIGTERM to  a  pid  before
       escalating  to  SIGKILL.   sanlock does this by sending SIGTERM to each
       pid, once a second, for the first N  seconds,  before  sending  SIGKILL
       once a second for the remaining M seconds (N/M can be tuned with the -g
       daemon option.)
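
       The SIGTERM-then-SIGKILL escalation amounts to a per-second schedule.
       The Python sketch below builds such a schedule without sending any
       signals.  The values N=40 and M=20 are hypothetical, chosen only so
       that they add up to the 60 second window mentioned above; they are not
       sanlock defaults.

```python
import signal


def kill_schedule(n_term, m_kill):
    """Return (second, signal) pairs: SIGTERM for n_term s, then SIGKILL."""
    sched = [(t, signal.SIGTERM) for t in range(n_term)]
    sched += [(t, signal.SIGKILL) for t in range(n_term, n_term + m_kill)]
    return sched


sched = kill_schedule(n_term=40, m_kill=20)
print(len(sched), sched[0][1].name, sched[39][1].name, sched[40][1].name)
```

       A real implementation would stop walking the schedule as soon as the
       pid has exited and its leases have been released.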

       An even more graceful approach is to configure a program for sanlock to
       run that will terminate or suspend each pid, and explicitly release the
       leases it held.  sanlock will run this program for each pid.  It has  N
       seconds  to  terminate  the pid or explicitly release its leases before
       sanlock escalates to SIGKILL for the remaining M seconds.


                                  2011-08-05                        SANLOCK(8)

