                        Glock Trimming Patch 

                         wcheng@redhat.com
                           Jan. 25, 2007


Installation and Run Script
===========================
Install: Test RPMs are available on request. For a quick test to 
         see whether this patch solves your issue:

shell> umount /mnt/your_gfs_partition
shell> rmmod gfs
shell> insmod /this_new_ko/gfs.ko
shell> mount /mnt/your_gfs_partition
 
Tunable setup: There are two tunables to play around with:

1. glock_purge

   After gfs.ko is loaded and the filesystem is mounted, issue:
   shell> gfs_tool settune <mount_point> glock_purge <percentage>
   (e.g. "gfs_tool settune /mnt/gfs1 glock_purge 50")

   This will tell GFS to trim roughly 50% of its unused glocks every 5 
   seconds. The default is 0 percent (no trimming). The operation 
   can be turned off dynamically by explicitly setting the percentage 
   back to 0.

2. demote_secs

   This tunable is already in RHEL4 gfs.

   shell> gfs_tool settune <mount_point> demote_secs <seconds>
   (e.g. "gfs_tool settune /mnt/gfs1 demote_secs 200")

   This will demote GFS write locks into less restricted states and 
   subsequently flush the cached data to disk. A shorter demote interval 
   can be used to keep GFS from accumulating so much cached data that it 
   results in burst-mode flushing activity or prolongs other nodes' lock 
   access. The default is 300 seconds. This command can be issued 
   dynamically, but has to be done after mount time. A rough sketch of 
   how the two tunables could drive the periodic scan follows this list.
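
To make the two knobs concrete, here is a minimal userspace sketch (not 
the kernel code) of how a percentage-based trim target could be computed 
on each 5-second pass; the struct and function names below are purely 
illustrative.

#include <stdio.h>

/* Hypothetical per-mount tunables mirroring the gfs_tool settune knobs. */
struct tunables {
    unsigned int glock_purge;   /* percent of unused glocks to trim, 0 = off */
    unsigned int demote_secs;   /* idle seconds before a held glock is demoted */
};

/* How many unused glocks one 5-second scan pass would try to trim. */
static unsigned int purge_target(const struct tunables *t,
                                 unsigned int unused_glocks)
{
    if (t->glock_purge == 0)    /* 0 percent means trimming is disabled */
        return 0;
    return unused_glocks * t->glock_purge / 100;
}

int main(void)
{
    struct tunables t = { .glock_purge = 50, .demote_secs = 200 };
    unsigned int unused = 120000;   /* e.g. glocks piled up by a large crawl */

    printf("per-pass trim target: %u of %u unused glocks\n",
           purge_target(&t, unused), unused);
    printf("idle glocks are demoted after %u seconds\n", t.demote_secs);
    return 0;
}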


The following are some gory details, if you care to read on.

The Original Base Kernel Patch
==============================
Other than relying on VM flush daemons and/or application-specific APIs 
or commands, GFS also flushes its data to storage during glock state 
transitions - that is, whenever an inode glock is moved from an 
exclusive state (write) into a less restricted state (e.g. shared 
state), the memory-cached write data is synced to disk based on a set of 
criteria. As disk write operations are generally expensive, a few 
policies are implemented to retain glocks in their current states as 
long as possible.
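
As an illustration of the sync-on-demotion behavior described above, 
here is a minimal userspace sketch; the state names and the 
demote_glock()/sync_cached_data() helpers are stand-ins of my own 
naming, not actual GFS functions.

#include <stdio.h>

enum glock_state { GL_UNLOCKED, GL_SHARED, GL_EXCLUSIVE };

/* Stands in for the expensive write-back GFS performs on demotion. */
static void sync_cached_data(const char *inode_name)
{
    printf("syncing dirty pages of %s to disk\n", inode_name);
}

/* Demote an inode glock; flush only when leaving the exclusive state. */
static void demote_glock(const char *inode_name,
                         enum glock_state *cur, enum glock_state target)
{
    if (*cur == GL_EXCLUSIVE && target != GL_EXCLUSIVE)
        sync_cached_data(inode_name);
    *cur = target;
}

int main(void)
{
    enum glock_state st = GL_EXCLUSIVE;

    demote_glock("inode 1234", &st, GL_SHARED);   /* triggers the sync */
    demote_glock("inode 1234", &st, GL_UNLOCKED); /* already flushed, no sync */
    return 0;
}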

As reported via bugzilla 214239 (and several others), we've found GFS 
needs to fine-tune its current retention policy to meet the requirements 
of latency-sensitive applications. Two particular issues we've found via 
the profiling data (collected from several customers' run-time 
environments) are:

* Glocks stay in the "exclusive" state for so long that they end up 
causing burst-mode flushing activity (and other memory/io issues) that 
can subsequently push file access times out of bounds for 
latency-sensitive applications.
* The system can easily spend half of its CPU cycles in lock hash search 
calls due to the large number of accumulated glocks
   (ref: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=214239#c1).

We have been passing a few VM tuning tips, together with a shorter 
(tunable) demote_secs, to customers and find they do relieve problem#1's 
symptoms greatly. Note that "demote_secs" is the time interval used by 
the existing glock scan daemon to move locks into less restricted states 
if unused. This implies that on an idle system, all locks will 
eventually be moved into the "unlocked" state. Unfortunately, "unlocked" 
does not imply these glocks will be removed from the system. In fact, 
they'll stay there forever until:

1. the inode (file) is explicitly deleted (on-disk deletion), or
2. VM issues prune_icache() call due to memory pressure, or
3. Umount command kicks in, or
4. Lock manager issues an LM_CB_DROPLOCKS callback.

When problem#2 first popped up in the RHEL3 time frame, we naturally 
went through the above four routes to look for a solution. I forget 
under what conditions the lock manager could issue the DROPLOCKS 
callback. However, in reality, (3) and (4) share one and the same 
exported VFS-layer call to do their core job - "invalidate_inodes()". 
This VFS call walks through four (global VFS) inode lists to find the 
entries that belong to the particular filesystem; each entry found is 
removed. The operation, interestingly, overlaps with (2) (the VM 
prune_icache() call). The difference is that prune_icache() scans only 
one list (inode_unused) and selectively purges inodes, instead of all of 
them.

As the in-memory inodes are purged, the GFS logic embedded in the inode 
deallocation code removes the corresponding glocks accordingly. It is 
only then that a glock can disappear.

So here came the original base kernel patch. Since this was a latency 
issue, we didn't want to disturb the painstaking glock-retention efforts 
made by GFS's original author(s). We ended up exporting a modified 
prune_icache() that allowed it to function like the invalidate_inodes() 
logic *if asked*. It walked through the inode_unused list looking for 
the matching mount point and purged a fixed percentage of inodes from 
that list if the entry belonged to the subject mount point. In short, we 
created a new call with the logic needed for glock-trimming purposes 
without massively cut-and-pasting the code from the existing 
prune_icache() base kernel call.
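
Below is a simplified userspace sketch of that idea: walk a single 
unused-inode list and purge up to a percentage of it, touching only 
entries that belong to one mount point. The fake_inode structure and the 
prune_unused_for_sb() helper are illustrative stand-ins, not the real 
inode_unused/prune_icache() code.

#include <stdio.h>
#include <stdlib.h>

struct fake_inode {
    int superblock_id;              /* which filesystem this inode belongs to */
    struct fake_inode *next;
};

/* Purge up to 'percent' of the unused list, touching only inodes whose
 * superblock matches 'sb_id'; returns the number purged.  Freeing the
 * inode stands in for the deallocation path that drops the glock. */
static unsigned int prune_unused_for_sb(struct fake_inode **head,
                                        int sb_id, unsigned int percent)
{
    unsigned int total = 0, budget, purged = 0;
    struct fake_inode *p, **pp;

    for (p = *head; p; p = p->next)
        total++;
    budget = total * percent / 100;

    for (pp = head; *pp && purged < budget; ) {
        if ((*pp)->superblock_id == sb_id) {
            struct fake_inode *victim = *pp;
            *pp = victim->next;         /* unlink and "purge" the inode */
            free(victim);
            purged++;
        } else {
            pp = &(*pp)->next;          /* skip inodes of other filesystems */
        }
    }
    return purged;
}

int main(void)
{
    struct fake_inode *head = NULL;

    /* Build a toy unused list: 10 inodes spread over two filesystems. */
    for (int i = 0; i < 10; i++) {
        struct fake_inode *n = malloc(sizeof(*n));
        n->superblock_id = i % 2;
        n->next = head;
        head = n;
    }

    printf("purged %u inodes of mount 1\n", prune_unused_for_sb(&head, 1, 50));

    while (head) {                      /* free whatever is left */
        struct fake_inode *n = head->next;
        free(head);
        head = n;
    }
    return 0;
}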

GFS-only Patch
==============
GFS already has a glock scan daemon that wakes up on a tunable interval 
to do glock demote work. It scans the glock hash table, examining the 
entries one by one. If the reference count and several other criteria 
meet the requirements, it demotes the lock into a less restricted state. 
Removable glocks are transferred onto a reclaim list, and another daemon 
(reclaimd) eventually purges them from the system. One of the criteria 
identifying a removable glock is a "zero" inode reference count. 
Unfortunately, as long as a glock is tied to a VFS inode, the reference 
count never goes down unless the VFS inode is purged (and it never is 
unless the VM thinks it is under memory pressure).
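
For illustration only, here is a rough userspace model of that existing 
two-stage mechanism (scan/demote, then reclaim); the types and fields 
are simplified stand-ins, and the point is just to show why a nonzero 
inode reference count keeps a glock off the reclaim list.

#include <stdio.h>

enum glock_state { GL_UNLOCKED, GL_SHARED, GL_EXCLUSIVE };

struct fake_glock {
    enum glock_state state;
    unsigned int inode_refcount;    /* stays > 0 while the VFS inode lives */
    int on_reclaim_list;
};

/* One scan pass: demote every glock in this toy model (the real daemon
 * applies more criteria), then mark zero-refcount glocks as removable. */
static void scan_pass(struct fake_glock *g, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        g[i].state = GL_UNLOCKED;
        if (g[i].inode_refcount == 0)
            g[i].on_reclaim_list = 1;   /* hand over to the reclaim pass */
    }
}

/* Reclaim pass: count (i.e. "purge") whatever the scan pass marked. */
static unsigned int reclaim_pass(struct fake_glock *g, unsigned int n)
{
    unsigned int freed = 0;
    for (unsigned int i = 0; i < n; i++)
        if (g[i].on_reclaim_list)
            freed++;
    return freed;
}

int main(void)
{
    /* Two glocks still pinned by live VFS inodes, one truly unused. */
    struct fake_glock g[3] = {
        { GL_EXCLUSIVE, 1, 0 },
        { GL_SHARED,    1, 0 },
        { GL_UNLOCKED,  0, 0 },
    };

    scan_pass(g, 3);
    printf("reclaimable glocks: %u of 3\n", reclaim_pass(g, 3));
    return 0;
}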

For lock-trimming purposes, it took several tries to get the GFS-only 
patch to work. The following is the logic that seems to work at the moment:

Each VFS inode is tied to a pair of glocks - an iopen glock 
(LM_TYPE_IOPEN) and an inode glock (LM_TYPE_INODE). The inode glock 
normally has frequent state transitions, depending on how and when the 
file is accessed (read, write, delete, etc.), but the iopen glock stays 
mostly in the SHARED state during its life cycle until either:

1. The GFS inode is removed (gfs_inode_destroy), or
2. Some logic (that didn't exist before this patch) kicks off 
gfs_iopen_go_callback() to explicitly change its state (presumably 
driven by the lock manager).

Since these two glocks have been the major contributors to the glock 
accumulation issues, they are the glocks we target for trimming. 
Without disturbing the existing GFS code, we piggy-back the logic onto 
the gfs_scand daemon, which wakes up on a 5-second interval to scan the 
glock hash table. If an iopen glock is found, we follow the pointer to 
obtain the inode glock's state. If it is in the unlocked state, we 
demote the iopen glock (from shared to unlocked). This triggers the 
gfs_try_toss_vnode() logic, which prunes the associated dentries and 
subsequently deletes the VFS inode; it then follows the very same 
purging logic as the base kernel approach. If the inode glock is found 
first (I haven't implemented this yet), we check its lock state; if 
unlocked, we follow the pointer to find its iopen glock and demote that 
instead, which triggers the same gfs_try_toss_vnode() clean-up sequence 
described above.
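
Here is a simplified userspace sketch of that trimming pass, assuming a 
flat glock table with paired iopen/inode glocks; trim_scan() and 
toss_vnode() merely stand in for the gfs_scand piggy-back and the 
gfs_try_toss_vnode() clean-up, respectively.

#include <stdio.h>

enum glock_state { GL_UNLOCKED, GL_SHARED, GL_EXCLUSIVE };
enum glock_type  { LM_TYPE_INODE, LM_TYPE_IOPEN };

struct fake_glock {
    enum glock_type  type;
    enum glock_state state;
    struct fake_glock *paired;  /* iopen <-> inode glock of the same file */
};

/* Stands in for pruning the dentries and deleting the VFS inode, which
 * is what ultimately lets both glocks be freed. */
static void toss_vnode(struct fake_glock *iopen)
{
    printf("tossing vnode behind iopen glock %p\n", (void *)iopen);
}

/* The piggy-backed pass: for every shared iopen glock whose paired
 * inode glock is already unlocked, demote the iopen glock and kick off
 * the vnode teardown. */
static void trim_scan(struct fake_glock *table, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        struct fake_glock *g = &table[i];

        if (g->type != LM_TYPE_IOPEN || g->state != GL_SHARED)
            continue;
        if (g->paired && g->paired->state == GL_UNLOCKED) {
            g->state = GL_UNLOCKED;     /* demote shared -> unlocked */
            toss_vnode(g);
        }
    }
}

int main(void)
{
    struct fake_glock table[2];

    /* One file: its inode glock has gone idle, its iopen glock is still
     * shared - exactly the pattern the scan targets. */
    table[0] = (struct fake_glock){ LM_TYPE_INODE, GL_UNLOCKED, &table[1] };
    table[1] = (struct fake_glock){ LM_TYPE_IOPEN, GL_SHARED,   &table[0] };

    trim_scan(table, 2);
    return 0;
}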

A few to-do items
=================
1. The current CVS check-in only looks for iopen glocks. We should add 
the inode-glock path described above to shorten the search process.
2. Have another version of the patch that trims a lock if it has been in 
the idle (unlocked) state longer than a tunable timeout value. The 
current CVS check-in is based on a tunable percentage count; the 
trimming action stops when either the max count is reached or we reach 
the end of the table.
3. Glocks are now trimmed (and a GFS lock dump shows the correct result), 
but I'm not sure how the DLM side makes these locks disappear from its 
hash table (?).

=== End of write-up