Last modified on 05/26/11 04:15:06
                 GFS1 Fast Statfs() Implementation


Installation and Run Script
Changes are checked into CVS, archived at:

The usage is a little awkward - the "gfs_tool" command has to be run 
on each and every node after mount (to turn this option on). Upon any 
unclean umount (e.g. a node crash) and re-mount, the procedure has to 
be repeated (i.e. run gfs_tool on each and every node again).

shell> gfs_tool settune <mount point> statfs_fast 1

The old behavior can be dynamically brought back any time by:

shell> gfs_tool settune <mount point> statfs_fast 0

A quick test on a quiet cluster gave the following results:

dhcp145 (1-CPU HP): old df took 0.875 seconds, new df 0.008 seconds.
dhcp146 (4-CPU Dell): old df took 0.808 seconds, new df 0.006 seconds.

The Problem and GFS2's Approach
GFS disk blocks are managed by its Resource Groups (RG) in a distributed 
manner. The filesystem is divided into 256MB-per-RG sections. Each RG 
manages its own disk blocks and stores the block usage statistics in its 
own control structures. The "statfs" system call (a frequently invoked
function used by popular commands such as "df") goes through all RGs 
to add the usage counts together. This implies that in a 1-TB 
filesystem, each statfs() call needs to scan a total of 4096 
(1024*1024/256) RGs to obtain the required data. The most troublesome 
aspect of this implementation is that it has to obtain 4096 shared (read) 
RG locks across the cluster before the call can complete.
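The cost of the old path can be illustrated with a small sketch. The names and structures below are illustrative only (the real logic lives in the GFS kernel module), but the arithmetic matches the description above:

```python
# Toy model of the old statfs path: every Resource Group must be
# visited (under a cluster-wide shared lock) and its counters summed.
# Names here are illustrative, not the actual GFS symbols.

RG_SIZE_MB = 256

def rg_count(fs_size_mb):
    """Number of Resource Groups for a filesystem of the given size."""
    return fs_size_mb // RG_SIZE_MB

def old_statfs(rgs):
    """Sum per-RG usage stats; each loop iteration models one shared
    RG glock acquisition in the real implementation."""
    total = free = 0
    for rg in rgs:                 # one cluster-wide shared lock per RG
        total += rg["total_blocks"]
        free += rg["free_blocks"]
    return total, free

# A 1-TB filesystem: 1024*1024/256 = 4096 RG scans per statfs() call.
print(rg_count(1024 * 1024))       # 4096
```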

GFS2 alleviates the issue by writing the local (per-node) statfs changes 
into a per-node file upon disk block changes. Every "gt_statfs_quantum" 
seconds (a tunable, defaulting to 30), the "quotad" daemon adds the local 
changes into a cluster-wide master file (one per filesystem) and 
subsequently zeros out its local copy. The original author commented:

"The end effect is that a df can be completed without any network access 
(just a local spinlock) without affecting [de]allocation performance.  
What you give up is the ability to see statfs changes that have happened 
very recently on other nodes in the cluster.  I believe it will be good 
enough for most uses."
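The scheme, and the trade-off the author describes, can be sketched as follows (a toy model with assumed names, not the actual GFS2 code):

```python
# Toy model of GFS2's scheme: each node keeps its statfs delta locally
# (in GFS2, a per-node file); every gt_statfs_quantum seconds quotad
# folds the delta into the cluster-wide master file and zeros it.
# All names are illustrative.

class NodeStatfs:
    def __init__(self, master):
        self.master = master     # shared master copy, e.g. {"free": N}
        self.delta = 0           # this node's not-yet-synced change

    def alloc_blocks(self, n):
        self.delta -= n          # allocation touches only the local copy

    def quotad_sync(self):
        # What quotad does each quantum: fold delta into master, zero it.
        self.master["free"] += self.delta
        self.delta = 0

    def statfs_free(self):
        # df: master snapshot adjusted by the local delta -- no network
        # access, at the cost of missing other nodes' recent changes.
        return self.master["free"] + self.delta
```

For example, after node A allocates blocks but before its quotad sync, node B's statfs still reports the old count - exactly the staleness the author judged acceptable.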

GFS1 Port
A few compromises were made while porting the GFS2 approach over (to 
GFS1), mostly to avoid on-disk structure changes. Note that GFS2 
allocates (number-of-nodes + 1) physical files on disk at mkfs time, 
but GFS1 only has one extra space (the unused license file) for 
this purpose. We deviate from the GFS2 implementation by writing the 
local per-node changes into a memory buffer. This, in turn, creates 
a recovery issue - upon unclean shutdown (say, one node crashes before 
it can sync its changes into the master file), the local in-memory 
changes will be lost. There are a few possible approaches currently on 
the table to handle this. One of them is adding an on-disk version 
number (as part of the master file contents). Upon an unclean umount, 
right after journal recovery, the on-disk version number is bumped up 
by one and the master copy is updated with the statfs data obtained 
via the old method. Whenever a node is ready to flush its changes and 
sees that the on-disk version number is higher than its local (saved) 
version number, instead of adding its local changes into the master 
file, it zeros out its local copy and bumps up its local version number, 
on the assumption that the local changes have already been incorporated 
into the data obtained via the old method. The side effect of this 
approach is that after each unclean umount, the statistics will be off 
(hopefully by a negligible amount) from that point on. This "negligible" 
side effect is debatable, but one could argue that given GFS's 
distributed nature (no centralized metadata server), the statistics are 
always an approximation no matter what we do, even with the current 
performance-plagued old method (where each lock is released 
asynchronously as soon as its RG data is read).
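The proposed version-number scheme can be sketched as below. The function and field names are assumptions for illustration, not the actual GFS symbols:

```python
# Toy model of the version-number recovery idea.  The master file
# carries a version; recovery after an unclean umount bumps it and
# rebuilds the master via the old full-RG-scan method.  A node whose
# saved version lags the on-disk one drops its local delta instead of
# applying it.  All names here are hypothetical.

def recover_master(master, rescanned_free):
    """Run right after journal recovery from an unclean umount."""
    master["version"] += 1
    master["free"] = rescanned_free      # old method's full-scan count

def flush_local(master, node):
    """A node flushing its in-memory delta into the master file."""
    if master["version"] > node["saved_version"]:
        # Assume the full rescan already covered our delta: discard it
        # and catch the local version number up.
        node["saved_version"] = master["version"]
    else:
        master["free"] += node["delta"]
    node["delta"] = 0
```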

Nevertheless, the current code works as follows:

1. Upon each mount, the local copy is zeroed but the fast statfs logic 
is not triggered.
2. Fast statfs is started by issuing "gfs_tool settune" command on each 
and every node after mount.

shell> gfs_tool settune <mount point> statfs_fast 1

Changes made by nodes on which fast statfs has not been started will 
not be collected (seen) by the fast statfs system call on any node in 
the cluster. The start call:

2.1 Invokes the old method to obtain the "almost-correct" statfs info.
2.2 Obtains the master file exclusive glock.
2.3 Writes the statfs data (from 2.1) into the master file.
2.4 If everything goes well, sets the gt_statfs_fast flag to 1.
2.5 Local changes start to get picked up (based on the gt_statfs_fast 
flag).
2.6 The local change (delta) is synced to disk whenever the quota 
daemon is woken up (governed by a tunable, defaulting to 5 seconds). It 
is then subsequently zeroed out.
2.7 Repeat from step 2.5 as long as gt_statfs_fast is non-zero.

3. Whenever the statfs() system call is invoked and gt_statfs_fast is 
on, the call returns the most recently read-in master file contents, 
adjusted with the node's local (delta) changes. If gt_statfs_fast is 
zero, the old method is invoked.
4. Upon node recovery (after an unclean shutdown), "gfs_tool settune" 
can be invoked on each and every node to resume statfs activities. If 
this is not done on a relatively quiet cluster (one with negligible 
write activity), the statfs data could be off to an unspecified degree. 
Note that each call into "gfs_tool settune" restarts the statistics 
collection by repeating the steps described in step 2.
5. Fast statfs can be turned off dynamically (at any time) by using the 
gfs_tool command to set gt_statfs_fast back to zero on each node.

shell> gfs_tool settune <mount point> statfs_fast 0

6. On and off can be mixed and repeated across mounts. However, the 
user is expected to understand how step 2 works in order to fully 
interpret the fast statfs statistics.
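The start sequence (steps 2.1-2.4) and the resulting statfs path (step 3) can be sketched together. The class and function names below are toy stand-ins for the incore GFS structures, not the real kernel code:

```python
from contextlib import contextmanager

class FakeFS:
    """Toy stand-in for the incore GFS structures (illustrative only)."""
    def __init__(self, free_blocks):
        self.disk_free = free_blocks    # what a full RG scan would find
        self.master = 0                 # last read-in master file contents
        self.delta = 0                  # local not-yet-synced change
        self.gt_statfs_fast = 0

    def old_statfs(self):
        return self.disk_free           # models the full RG scan

    @contextmanager
    def master_glock_exclusive(self):   # models the cluster-wide glock
        yield

def statfs_fast_start(fs):
    stats = fs.old_statfs()             # 2.1: almost-correct snapshot
    with fs.master_glock_exclusive():   # 2.2: exclusive master glock
        fs.master = stats               # 2.3: write into the master file
    fs.gt_statfs_fast = 1               # 2.4: enable the fast path

def statfs_free(fs):
    if fs.gt_statfs_fast:               # step 3: master + local delta
        return fs.master + fs.delta
    return fs.old_statfs()              # flag off: old method
```

With the flag off, every call falls back to the full scan; after `statfs_fast_start`, the call is answered from the master snapshot plus the local delta.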

To Do Items
Research ways to implement a cman-based (or other) command to start 
and stop cluster-wide fast statfs from a single node. This should 
greatly reduce the awkward usage of this implementation.

-------------- end of write-up