SANLOCK(8) System Manager's Manual SANLOCK(8)
sanlock - shared storage lock manager
sanlock [COMMAND] [ACTION] ...
sanlock is a lock manager built on shared storage. Hosts with access
to the storage can perform locking. An application running on the
hosts is given a small amount of space on the shared block device or
file, and uses sanlock for its own application-specific synchroniza‐
tion. Internally, the sanlock daemon manages locks using two disk-
based lease algorithms: delta leases and paxos leases.
· delta leases are slow to acquire and demand regular i/o to shared
storage. sanlock only uses them internally to hold a lease on its
"host_id" (an integer host identifier from 1-2000). They prevent two
hosts from using the same host identifier. The delta lease renewals
also indicate if a host is alive. ("Light-Weight Leases for Storage-
Centric Coordination", Chockler and Malkhi.)
· paxos leases are fast to acquire and sanlock makes them available to
applications as general purpose resource leases. The disk paxos
algorithm uses host_id's internally to represent different hosts, and
the owner of a paxos lease. delta leases provide unique host_id's
for implementing paxos leases, and delta lease renewals serve as a
proxy for paxos lease renewal. ("Disk Paxos", Eli Gafni and Leslie
Lamport.)
Externally, the sanlock daemon exposes a locking interface through lib‐
sanlock in terms of "lockspaces" and "resources". A lockspace is a
locking context that an application creates for itself on shared stor‐
age. When the application on each host is started, it "joins" the
lockspace. It can then create "resources" on the shared storage. Each
resource represents an application-specific entity. The application
can acquire and release leases on resources.
To use sanlock from an application:
· Allocate shared storage for an application, e.g. a shared LUN or LV
from a SAN, or files from NFS.
· Provide the storage to the application.
· The application uses this storage with libsanlock to create a
lockspace and resources for itself.
· The application joins the lockspace when it starts.
· The application acquires and releases leases on resources.
How lockspaces and resources translate to delta leases and paxos leases
· A lockspace is based on delta leases held by each host using the
lockspace.
· A lockspace is a series of 2000 delta leases on disk, and requires
1MB of storage.
· A lockspace can support up to 2000 concurrent hosts using it, each
using a different delta lease.
· Applications can i) create, ii) join and iii) leave a lockspace,
which corresponds to i) initializing the set of delta leases on disk,
ii) acquiring one of the delta leases and iii) releasing the delta
lease.
· When a lockspace is created, a unique lockspace name and disk
location are provided by the application.
· When a lockspace is created/initialized, sanlock formats the sequence
of 2000 on-disk delta lease structures on the file or disk, e.g.
/mnt/leasefile (NFS) or /dev/vg/lv (SAN).
· The 2000 individual delta leases in a lockspace are identified by
number: 1, 2, 3, ..., 2000.
· Each delta lease is a 512 byte sector in the 1MB lockspace, offset by
its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset
512, delta lease 2000 is offset 1023488.
· When an application joins a lockspace, it must specify the lockspace
name, the lockspace location on shared disk/file, and the local
host's host_id. sanlock then acquires the delta lease corresponding
to the host_id, e.g. joining the lockspace with host_id 1 acquires
delta lease 1.
· The terms delta lease, lockspace lease, and host_id lease are used
interchangeably.
· sanlock acquires a delta lease by writing the host's unique name to
the delta lease disk sector, reading it back after a delay, and veri‐
fying it is the same.
· If a unique host name is not specified, sanlock generates a uuid to
use as the host's name. The delta lease algorithm depends on hosts
using unique names.
· The application on each host should be configured with a unique
host_id, where the host_id is an integer 1-2000.
· If hosts are misconfigured and have the same host_id, the delta lease
algorithm is designed to detect this conflict, and only one host will
be able to acquire the delta lease for that host_id.
· A delta lease ensures that a lockspace host_id is being used by a
single host with the unique name specified in the delta lease.
· Resolving delta lease conflicts is slow, because the algorithm is
based on waiting and watching for some time for other hosts to write
to the same delta lease sector. If multiple hosts try to use the
same delta lease, the delay is increased substantially. So, it is
best to configure applications to use unique host_id's that will not
conflict.
· After sanlock acquires a delta lease, the lease must be renewed until
the application leaves the lockspace (which corresponds to releasing
the delta lease on the host_id.)
· sanlock renews delta leases every 20 seconds (by default) by writing
a new timestamp into the delta lease sector.
· When a host acquires a delta lease in a lockspace, it can be referred
to as "joining" the lockspace. Once it has joined the lockspace, it
can use resources associated with the lockspace.
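As a rough illustration of the offset rule above, the following shell sketch computes the byte offset of a delta lease from its host_id. It assumes 512 byte sectors, as in the bullets above. The function name delta_lease_offset is made up for this example and is not part of sanlock.

```shell
#!/bin/sh
# Hypothetical helper: byte offset of a host_id's delta lease inside a
# lockspace area, assuming one 512 byte sector per delta lease.
delta_lease_offset() {
    host_id=$1
    if [ "$host_id" -lt 1 ] || [ "$host_id" -gt 2000 ]; then
        echo "host_id must be in the range 1-2000" >&2
        return 1
    fi
    # Delta lease 1 starts at offset 0, so subtract one before scaling.
    echo $(( (host_id - 1) * 512 ))
}

delta_lease_offset 1      # prints 0
delta_lease_offset 2      # prints 512
delta_lease_offset 2000   # prints 1023488
```

The result is relative to the start of the lockspace area, so the lockspace's own offset on the device would have to be added to it.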
· A lockspace is a context for resources that can be locked and
unlocked by an application.
· sanlock uses paxos leases to implement leases on resources. The
terms paxos lease and resource lease are used interchangeably.
· A paxos lease exists on shared storage and requires 1MB of space. It
contains a unique resource name and the name of the lockspace.
· An application assigns its own meaning to a sanlock resource and the
leases on it. A sanlock resource could represent some shared object
like a file, or some unique role among the hosts.
· Resource leases are associated with a specific lockspace and can only
be used by hosts that have joined that lockspace (they are holding a
delta lease on a host_id in that lockspace.)
· An application must keep track of the disk locations of its
lockspaces and resources. sanlock does not maintain any persistent
index or directory of lockspaces or resources that have been created
by applications, so applications need to remember where they have
placed their own leases (which files or disks and offsets).
· sanlock does not renew paxos leases directly (although it could).
Instead, the renewal of a host's delta lease represents the renewal
of all that host's paxos leases in the associated lockspace. In
effect, many paxos lease renewals are factored out into one delta
lease renewal. This reduces i/o when many paxos leases are used.
· The disk paxos algorithm allows multiple hosts to all attempt to
acquire the same paxos lease at once, and will produce a single win‐
ner/owner of the resource lease. (Shared resource leases are also
possible in addition to the default exclusive leases.)
· The disk paxos algorithm involves a specific sequence of reading and
writing the sectors of the paxos lease disk area. Each host has a
dedicated 512 byte sector in the paxos lease disk area where it
writes its own "ballot", and each host reads the entire disk area to
see the ballots of other hosts. The first sector of the disk area is
the "leader record" that holds the result of the last paxos ballot.
The winner of the paxos ballot writes the result of the ballot to the
leader record (the winner of the ballot may have selected another
contending host as the owner of the paxos lease.)
· After a paxos lease is acquired, no further i/o is done in the paxos
lease disk area.
· Releasing the paxos lease involves writing a single sector to clear
the current owner in the leader record.
· If a host holding a paxos lease fails, the disk area of the paxos
lease still indicates that the paxos lease is owned by the failed
host. If another host attempts to acquire the paxos lease, and finds
the lease is held by another host_id, it will check the delta lease
of that host_id. If the delta lease of the host_id is being renewed,
then the paxos lease is owned and cannot be acquired. If the delta
lease of the owner's host_id has expired, then the paxos lease is
expired and can be taken (by going through the paxos lease algorithm.)
· The "interaction" or "awareness" between hosts of each other is lim‐
ited to the case where they attempt to acquire the same paxos lease,
and need to check if the referenced delta lease has expired or not.
· When hosts do not attempt to lock the same resources concurrently,
there is no host interaction or awareness. The state or actions of
one host have no effect on others.
· To speed up checking delta lease expiration (in the case of a paxos
lease conflict), sanlock keeps track of past renewals of other delta
leases in the lockspace.
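The takeover check described above (a paxos lease held by another host_id can be taken only when that host's delta lease has expired) can be sketched as a simple time comparison. This is a simplified model and not sanlock's actual timing logic. The function name and the expire_secs threshold are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical model of the takeover decision for a paxos lease that is
# currently owned by another host_id.
#   $1  current time (seconds)
#   $2  time of the owner's last observed delta lease renewal (seconds)
#   $3  expire_secs: assumed age after which a delta lease counts as expired
paxos_lease_state() {
    now=$1
    owner_last_renewal=$2
    expire_secs=$3
    if [ $(( now - owner_last_renewal )) -ge "$expire_secs" ]; then
        echo "expired"   # owner's delta lease is stale: lease can be taken
    else
        echo "owned"     # owner is still renewing: lease cannot be acquired
    fi
}

paxos_lease_state 1000 990 80   # prints owned
paxos_lease_state 1000 900 80   # prints expired
```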
· If a host fails to renew its delta lease, e.g. it loses access to
the storage, its delta lease will eventually expire and another host
will be able to take over any resource leases held by the host. san‐
lock must ensure that the application on two different hosts is not
holding and using the same lease concurrently.
· When sanlock has failed to renew a delta lease for a period of time,
it will begin taking measures to stop local processes (applications)
from using any resource leases associated with the expiring lockspace
delta lease. sanlock enters this "recovery mode" well ahead of the
time when another host could take over the locally owned leases.
sanlock must have sufficient time to stop all local processes that
are using the expiring leases.
· sanlock uses three methods to stop local processes that are using
expiring leases:
1. Graceful shutdown. sanlock will execute a "graceful shutdown"
program that the application previously specified for this case. The
shutdown program tells the application to shut down because its
leases are expiring. The application must respond by stopping its
activities and releasing its leases (or exit). If an application
does not specify a graceful shutdown program, sanlock sends SIGTERM
to the process instead. The process must release its leases or exit
in a prescribed amount of time (see -g), or sanlock proceeds to the
next method of stopping.
2. Forced shutdown. sanlock will send SIGKILL to processes using the
expiring leases. The processes have a fixed amount of time to exit
after receiving SIGKILL. If any do not exit in this time, sanlock
will proceed to the next method.
3. Host reset. sanlock will trigger the host's watchdog device to
forcibly reset it. sanlock carefully manages the timing of the
watchdog device so that it fires shortly before any other host could
take over the resource leases held by local processes.
If a process holding resource leases fails or exits without releasing
its leases, sanlock will release the leases for it automatically
(unless persistent resource leases were used.)
If the sanlock daemon cannot renew a lockspace delta lease for a spe‐
cific period of time (see Expiration), sanlock will enter "recovery
mode" where it attempts to stop and/or kill any processes holding
resource leases in the expiring lockspace. If the processes do not
exit in time, sanlock will force the host to be reset using the local
watchdog device.
If the sanlock daemon crashes or hangs, it will not renew the expiry
time of the per-lockspace connections it had to the wdmd daemon. This
will lead to the expiration of the local watchdog device, and the host
will be reset.
sanlock uses the wdmd(8) daemon to access /dev/watchdog. wdmd multi‐
plexes multiple timeouts onto the single watchdog timer. This is
required because delta leases for each lockspace are renewed and expire
independently.
sanlock maintains a wdmd connection for each lockspace delta lease
being renewed. Each connection has an expiry time for some seconds in
the future. After each successful delta lease renewal, the expiry time
is renewed for the associated wdmd connection. If wdmd finds any con‐
nection expired, it will not renew the /dev/watchdog timer. Given
enough successive failed renewals, the watchdog device will fire and
reset the host. (Given the multiplexing nature of wdmd, shorter
overlapping renewal failures from multiple lockspaces could cause spurious
watchdog firing.)
The direct link between delta lease renewals and watchdog renewals pro‐
vides a predictable watchdog firing time based on delta lease renewal
timestamps that are visible from other hosts. sanlock knows the time
the watchdog on another host has fired based on the delta lease time.
Furthermore, if the watchdog device on another host fails to fire when
it should, the continuation of delta lease renewals from the other host
will make this evident and prevent leases from being taken from the
failed host.
If sanlock is able to stop/kill all processes using an expiring
lockspace, the associated wdmd connection for that lockspace is
removed. The expired wdmd connection will no longer block /dev/watch‐
dog renewals, and the host should avoid being reset.
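The multiplexing rule described above (the watchdog timer is renewed only while no per-lockspace connection has expired) can be modelled in a few lines of shell. This is only a sketch of the decision, not wdmd code. The function name wdmd_should_renew and the use of plain integer timestamps are assumptions.

```shell
#!/bin/sh
# Hypothetical model of the renewal decision.
#   $1      current time (seconds)
#   $2...   expiry time of each per-lockspace connection (seconds)
# Returns 0 if the watchdog timer may be renewed, 1 if any connection
# has already expired.
wdmd_should_renew() {
    now=$1
    shift
    for expiry in "$@"; do
        if [ "$expiry" -le "$now" ]; then
            return 1
        fi
    done
    return 0
}

# Two lockspaces, both renewed recently: the watchdog is renewed.
wdmd_should_renew 100 130 125 && echo "renew watchdog"
# One lockspace missed its renewals: the watchdog is left to run down.
wdmd_should_renew 100 130 95 || echo "skip renewal"
```

Removing a connection from the argument list corresponds to the case above where all processes using an expiring lockspace were stopped in time.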
On devices with 512 byte sectors, lockspaces and resources are 1MB in
size. On devices with 4096 byte sectors, lockspaces and resources are
8MB in size. sanlock uses 512 byte sectors when shared files are used
in place of shared block devices. Offsets of leases or resources must
be multiples of 1MB/8MB according to the sector size.
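The size and alignment rule above can be written down as a small shell sketch that maps a sector size to the lease area size and derives valid offsets from it. The function names are hypothetical. On a real system, sanlock client align reports the required alignment for a storage path.

```shell
#!/bin/sh
# Hypothetical helper: size (and required alignment) of one lockspace or
# resource area in bytes for a given sector size.
lease_align() {
    case "$1" in
        512)  echo 1048576 ;;   # 1MB
        4096) echo 8388608 ;;   # 8MB
        *)    echo "unsupported sector size: $1" >&2; return 1 ;;
    esac
}

# Byte offset of the Nth lease area (counting from 0) on a device.
lease_offset() {
    index=$1
    align=$(lease_align "$2") || return 1
    echo $(( index * align ))
}

# Succeeds if offset $1 is a multiple of the alignment for sector size $2.
offset_is_aligned() {
    align=$(lease_align "$2") || return 1
    [ $(( $1 % align )) -eq 0 ]
}

lease_offset 1 512    # prints 1048576
lease_offset 2 512    # prints 2097152
lease_offset 1 4096   # prints 8388608
```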
Using sanlock on shared block devices that do host based mirroring or
replication is not likely to work correctly. When using sanlock on
shared files, all sanlock io should go to one file server.
This is an example of creating and using lockspaces and resources from
the command line. (Most applications would use sanlock through libsan‐
lock rather than through the command line.)
1. Allocate shared storage for sanlock leases.
This example assumes 512 byte sectors on the device, in which case
the lockspace needs 1MB and each resource needs 1MB.
# vgcreate vg /dev/sdb
# lvcreate -n leases -L 1GB vg
2. Start sanlock on all hosts.
The -w 0 disables use of the watchdog for testing.
# sanlock daemon -w 0
3. Start a dummy application on all hosts.
This sanlock command registers with sanlock, then execs the sleep
command which inherits the registered fd. The sleep process acts
as the dummy application. Because the sleep process is registered
with sanlock, leases can be acquired for it.
# sanlock client command -c /bin/sleep 600 &
4. Create a lockspace for the application (from one host).
The lockspace is named "test".
# sanlock client init -s test:0:/dev/vg/leases:0
5. Join the lockspace for the application.
Use a unique host_id on each host.
# sanlock client add_lockspace -s test:1:/dev/vg/leases:0
# sanlock client add_lockspace -s test:2:/dev/vg/leases:0
6. Create two resources for the application (from one host).
The resources are named "RA" and "RB". Offsets are used on the
same device as the lockspace. Different LVs or files could also be
used.
# sanlock client init -r test:RA:/dev/vg/leases:1048576
# sanlock client init -r test:RB:/dev/vg/leases:2097152
7. Acquire resource leases for the application on host1.
Acquire an exclusive lease (the default) on the first resource, and
a shared lease (SH) on the second resource.
# export P=`pidof sleep`
# sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
# sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P
8. Acquire resource leases for the application on host2.
Acquiring the exclusive lease on the first resource will fail
because it is held by host1. Acquiring the shared lease on the
second resource will succeed.
# export P=`pidof sleep`
# sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P
# sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P
9. Release resource leases for the application on both hosts.
The sleep pid could also be killed, which will result in the san‐
lock daemon releasing its leases when it exits.
# sanlock client release -r test:RA:/dev/vg/leases:1048576 -p $P
# sanlock client release -r test:RB:/dev/vg/leases:2097152 -p $P
10. Leave the lockspace for the application.
# sanlock client rem_lockspace -s test:1:/dev/vg/leases:0
# sanlock client rem_lockspace -s test:2:/dev/vg/leases:0
11. Stop sanlock on all hosts.
# sanlock shutdown
COMMAND can be one of three primary top level choices
sanlock daemon    start daemon
sanlock client    send request to daemon (default command if none given)
sanlock direct    access storage directly (no coordination with daemon)
sanlock daemon [options]
-D no fork and print all logging to stderr
-Q 0|1 quiet error messages for common lock contention
-R 0|1 renewal debugging, log debug info for each renewal
-L pri write logging at priority level and up to logfile (-1 none)
-S pri write logging at priority level and up to syslog (-1 none)
-U uid user id
-G gid group id
-t num max worker threads
-g sec seconds for graceful recovery
-w 0|1 use watchdog through wdmd
-h 0|1 use high priority (RR) scheduling
-l num use mlockall (0 none, 1 current, 2 current and future)
-b sec seconds a host id bit will remain set in delta lease bitmap
-e str local host name used in delta leases
sanlock client action [options]
sanlock client status
Print processes, lockspaces, and resources being managed by the sanlock
daemon. Add -D to show extra internal daemon status for debugging.
Add -o p to show resources by pid, or -o s to show resources by
lockspace.
sanlock client host_status
Print state of host_id delta leases read during the last renewal.
State of all lockspaces is shown (use -s to select one). Add -D to
show extra internal daemon status for debugging.
sanlock client gets
Print lockspaces being managed by the sanlock daemon. The LOCKSPACE
string will be followed by ADD or REM if the lockspace is currently
being added or removed. Add -h 1 to also show hosts in each lockspace.
sanlock client log_dump
Print the sanlock daemon internal debug log.
sanlock client shutdown
Ask the sanlock daemon to exit. Without the force option (-f 0), the
command will be ignored if any lockspaces exist. With the force option
(-f 1), any registered processes will be killed, their resource leases
released, and lockspaces removed. With the wait option (-w 1), the
command will wait for a result from the daemon indicating that it has
shut down and is exiting, or cannot shut down because lockspaces exist.
sanlock client init -s LOCKSPACE
Tell the sanlock daemon to initialize a lockspace on disk. The -o
option can be used to specify the io timeout to be written in the
host_id leases. (Also see sanlock direct init.)
sanlock client init -r RESOURCE
Tell the sanlock daemon to initialize a resource lease on disk. (Also
see sanlock direct init.)
sanlock client read -s LOCKSPACE
Tell the sanlock daemon to read a lockspace from disk. Only the
LOCKSPACE path and offset are required. If host_id is zero, the first
record at offset (host_id 1) is used. The complete LOCKSPACE and io
timeout are printed.
sanlock client read -r RESOURCE
Tell the sanlock daemon to read a resource lease from disk. Only the
RESOURCE path and offset are required. The complete RESOURCE is
printed. (Also see sanlock direct read_leader.)
sanlock client align -s LOCKSPACE
Tell the sanlock daemon to report the required lease alignment for a
storage path. Only path is used from the LOCKSPACE argument.
sanlock client add_lockspace -s LOCKSPACE
Tell the sanlock daemon to acquire the specified host_id in the
lockspace. This will allow resources to be acquired in the lockspace.
The -o option can be used to specify the io timeout of the acquiring
host, and will be written in the host_id lease.
sanlock client inq_lockspace -s LOCKSPACE
Inquire about the state of the lockspace in the sanlock daemon, whether
it is being added or removed, or is joined.
sanlock client rem_lockspace -s LOCKSPACE
Tell the sanlock daemon to release the specified host_id in the
lockspace. Any processes holding resource leases in this lockspace
will be killed, and the resource leases not released.
sanlock client command -r RESOURCE -c path args
Register with the sanlock daemon, acquire the specified resource lease,
and exec the command at path with args. When the command exits, the
sanlock daemon will release the lease. -c must be the final option.
sanlock client acquire -r RESOURCE -p pid
sanlock client release -r RESOURCE -p pid
Tell the sanlock daemon to acquire or release the specified resource
lease for the given pid. The pid must be registered with the sanlock
daemon. acquire can optionally take a versioned RESOURCE string
RESOURCE:lver, where lver is the version of the lease that must be
acquired, or fail.
sanlock client convert -r RESOURCE -p pid
Tell the sanlock daemon to convert the mode of the specified resource
lease for the given pid. If the existing mode is exclusive (default),
the mode of the lease can be converted to shared with RESOURCE:SH. If
the existing mode is shared, the mode of the lease can be converted to
exclusive with RESOURCE (no :SH suffix).
sanlock client inquire -p pid
Print the resource leases held by the given pid. The format is a ver‐
sioned RESOURCE string "RESOURCE:lver" where lver is the version of the
lease held by the given pid.
sanlock client request -r RESOURCE -f force_mode
Request the owner of a resource do something specified by force_mode.
A versioned RESOURCE:lver string must be used with a greater version
than is presently held. Zero lver and force_mode clears the request.
sanlock client examine -r RESOURCE
Examine the request record for the currently held resource lease and
carry out the action specified by the requested force_mode.
sanlock client examine -s LOCKSPACE
Examine requests for all resource leases currently held in the named
lockspace. Only lockspace_name is used from the LOCKSPACE argument.
sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num
Set an event for another host. When the sanlock daemon next renews its
delta lease for the lockspace it will: set the bit for the host_id in
its bitmap, and set the generation, event and data values in its own
delta lease. An application that has registered for events from this
lockspace on the destination host will get the event that has been set
when the destination sees the event during its next delta lease
renewal.
sanlock client set_config -s LOCKSPACE
Set a configuration value for a lockspace. Only lockspace_name is used
from the LOCKSPACE argument. The USED flag has the same effect on a
lockspace as a process holding a resource lease that will not exit.
The USED_BY_ORPHANS flag means that an orphan resource lease will have
the same effect as the USED flag.
-u 0|1 Set (1) or clear (0) the USED flag.
-O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag.
sanlock direct action [options]
-o sec io timeout in seconds
sanlock direct init -s LOCKSPACE
sanlock direct init -r RESOURCE
Initialize storage for 2000 host_id (delta) leases for the given
lockspace, or initialize storage for one resource (paxos) lease. Both
options require 1MB of space. The host_id in the LOCKSPACE string is
not relevant to initialization, so the value is ignored. (The default
of 2000 host_ids can be changed for special cases using the -n
num_hosts and -m max_hosts options.) With -s, the -o option specifies
the io timeout to be written in the host_id leases.
sanlock direct read_leader -s LOCKSPACE
sanlock direct read_leader -r RESOURCE
Read a leader record from disk and print the fields. The leader record
is the single sector of a delta lease, or the first sector of a paxos
lease.
sanlock direct dump path[:offset]
Read disk sectors and print leader records for delta or paxos leases.
Add -f 1 to print the request record values for paxos leases, and
host_ids set in delta lease bitmaps.
LOCKSPACE option string
-s lockspace_name:host_id:path:offset
lockspace_name name of lockspace
host_id local host identifier in lockspace
path path to storage reserved for leases
offset offset on path (bytes)
RESOURCE option string
-r lockspace_name:resource_name:path:offset
lockspace_name name of lockspace
resource_name name of resource
path path to storage reserved for leases
offset offset on path (bytes)
RESOURCE option string with suffix
-r lockspace_name:resource_name:path:offset:lver
lver leader version
-r lockspace_name:resource_name:path:offset:SH
SH indicates shared mode
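For scripts that build or check these option strings, the colon-separated fields listed above can be split with plain shell word splitting. The sketch below is not part of sanlock and its function names are made up. It assumes that the path contains no colons and that a fifth RESOURCE field is either SH or an lver.

```shell
#!/bin/sh
# Hypothetical parser for a LOCKSPACE string such as test:1:/dev/vg/leases:0
parse_lockspace() {
    old_ifs=$IFS
    IFS=:
    set -- $1          # split the string on ':' into $1..$4
    IFS=$old_ifs
    if [ $# -ne 4 ]; then
        echo "expected 4 fields, got $#" >&2
        return 1
    fi
    echo "name=$1 host_id=$2 path=$3 offset=$4"
}

# Hypothetical parser for a RESOURCE string with an optional suffix.
parse_resource() {
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    if [ $# -lt 4 ] || [ $# -gt 5 ]; then
        echo "expected 4 or 5 fields, got $#" >&2
        return 1
    fi
    out="lockspace=$1 resource=$2 path=$3 offset=$4"
    case "${5:-}" in
        "") ;;                           # no suffix: default exclusive mode
        SH) out="$out mode=shared" ;;    # :SH suffix
        *)  out="$out lver=$5" ;;        # anything else is treated as lver
    esac
    echo "$out"
}

parse_lockspace "test:1:/dev/vg/leases:0"
parse_resource "test:RB:/dev/vg/leases:2097152:SH"
```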
sanlock help shows the default values for the options above.
sanlock version shows the build version.
The first part of making a request for a resource is writing the
request record of the resource (the sector following the leader
record). To make a successful request:
· RESOURCE:lver must be greater than the lver presently held by the
other host. This implies the leader record must be read to discover
the lver, prior to making a request.
· RESOURCE:lver must be greater than or equal to the lver presently
written to the request record. Two hosts may write a new request at
the same time for the same lver, in which case both would succeed,
but the force_mode from the last would win.
· The force_mode must be greater than zero.
· To unconditionally clear the request record (set both lver and
force_mode to 0), make request with RESOURCE:0 and force_mode 0.
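The rules in the list above can be condensed into one predicate. The following shell sketch only restates those rules for illustration. The function name and argument order are assumptions, and the real checks are performed by sanlock against the on-disk leader and request records.

```shell
#!/bin/sh
# Hypothetical check of the request rules.
#   $1  lver given in RESOURCE:lver
#   $2  lver presently held by the other host (from the leader record)
#   $3  lver presently written in the request record
#   $4  force_mode
request_is_valid() {
    new_lver=$1
    held_lver=$2
    record_lver=$3
    force_mode=$4
    # RESOURCE:0 with force_mode 0 unconditionally clears the request record.
    if [ "$new_lver" -eq 0 ] && [ "$force_mode" -eq 0 ]; then
        return 0
    fi
    [ "$new_lver" -gt "$held_lver" ] &&
        [ "$new_lver" -ge "$record_lver" ] &&
        [ "$force_mode" -gt 0 ]
}

request_is_valid 6 5 6 1 && echo "request accepted"
request_is_valid 5 5 0 1 || echo "rejected: lver not greater than held lver"
request_is_valid 0 9 9 0 && echo "request record cleared"
```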
The owner of the requested resource will not know of the request unless
it is explicitly told to examine its resources via the "examine"
api/command, or otherwise notified.
The second part of making a request is notifying the resource lease
owner that it should examine the request records of its resource
leases. The notification will cause the lease owner to automatically
run the equivalent of "sanlock client examine -s LOCKSPACE" for the
lockspace of the requested resource.
The notification is made using a bitmap in each host_id delta lease.
Each bit represents each of the possible host_ids (1-2000). If host A
wants to notify host B to examine its resources, A sets the bit in its
own bitmap that corresponds to the host_id of B. When B next renews
its delta lease, it reads the delta leases for all hosts and checks
each bitmap to see if its own host_id has been set. It finds the bit
for its own host_id set in A's bitmap, and examines its resource
request records. (The bit remains set in A's bitmap for set_bit‐
force_mode determines the action the resource lease owner should take:
· FORCE (1): kill the process holding the resource lease. When the
process has exited, the resource lease will be released, and can then
be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if
SIGKILL is restricted.)
· GRACEFUL (2): run the program configured by sanlock_killpath against
the process holding the resource lease. If no killpath is defined,
then FORCE is used.
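The choice between the two force_mode values, including the fallback from GRACEFUL to FORCE when no killpath is defined, can be sketched as follows. The function only prints the action it would select and does not signal any process. Its name and the example killpath /usr/local/bin/stop-app are hypothetical.

```shell
#!/bin/sh
# Hypothetical selection of the action for a requested force_mode.
#   $1  force_mode (1 = FORCE, 2 = GRACEFUL)
#   $2  killpath program configured by the lease owner, empty if none
request_action() {
    force_mode=$1
    killpath=${2:-}
    case "$force_mode" in
        1) echo "kill" ;;
        2) if [ -n "$killpath" ]; then
               echo "run $killpath"
           else
               echo "kill"      # no killpath defined: fall back to FORCE
           fi ;;
        *) echo "unknown force_mode: $force_mode" >&2; return 1 ;;
    esac
}

request_action 1                            # prints kill
request_action 2 /usr/local/bin/stop-app    # prints run /usr/local/bin/stop-app
request_action 2                            # prints kill
```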
Persistent and orphan resource leases
A resource lease can be acquired with the PERSISTENT flag (-P 1). If
the process holding the lease exits, the lease will not be released,
but kept on an orphan list. Another local process can acquire an
orphan lease using the ORPHAN flag (-O 1), or release the orphan lease
using the ORPHAN flag (-O 1). All orphan leases can be released by
setting the lockspace name (-s lockspace_name) with no resource name.
Using status/debug commands