GFS and GFS2 Questions
- What is GFS and why do I need it?
- What hardware do I need to run GFS in a cluster?
- Can I use GFS to take two off-the-shelf PCs and cluster their storage?
- What is the maximum size of a GFS file system?
- That's theoretical...So what is the biggest GFS file system you've actually seen in production?
- Does GFS or GFS2 support millisecond timestamps?
- Why doesn't gfs_tool setflag inherit_directio affect new_files_directio?
- I heard that GFS NFS failover prevents data loss. Is that true?
- I just did mkfs on my file system, so why do I get 'permission denied' mounting it?
- Is there an easy way to see which nodes in my cluster have my GFS fs mounted?
- gfs_tool df shows 100% of my inodes are used. Is that a problem?
- How much overhead do my files take in GFS?
- Can I use striping, hardware RAID, mirroring, etc., with GFS?
- Why do I get errors (such as implicit mutex_lock) when I try to compile GFS?
- Can I boot a diskless client off of a SAN with GFS?
- Is GFS certified to work with Oracle Real Application Clusters (RAC)?
- Are there any known bugs when running Oracle RAC on top of GFS?
- Is it possible for more than one node to work on the same file without locking each other?
- If GFS doesn't protect my files from multiple data writers, then why use it? What good is it?
- Why is GFS unable to allow a lock when the "group execute" bit has been set?
- Can I control journal size and placement in GFS?
- Is GFS 100% Posix compliant?
- Can I shrink my GFS or GFS2 file system through lvreduce?
- Why is GFS slow doing things like 'ls -lr *' whereas 'ls -r *' is fast?
- Is it possible to create GFS on MD multi-device?
- How can I mount a GFS partition at bootup time?
- What improvements will GFS2 have over GFS(1)?
- What is the expected performance increase of GFS2 over GFS(1)?
- Why is one of my GFS nodes faster or slower to access my file when they're identical hardware?
- Does GFS and GFS2 work with SELinux security?
- How long will my gfs_fsck or gfs2_fsck take?
- Does GFS support the use of sparse files?
- I want to use GFS for MySQL. Is that okay?
- I want to use MySQL in a cluster WITHOUT GFS. Is that okay?
- I want to use GFS for PostgreSQL. Is that okay?
- I want to use GFS for Samba (smb) file serving. Is that okay?
- Why does GFS lock up/freeze when a node gets fenced?
- GFS gave me an error: fatal: filesystem consistency error. What is it and what can I do about it?
- I've got several GFS file systems but when I try to mount more than one I get mount: File exists. What am I doing wrong?
- How does GFS compare to ocfs2? Which is better?
- How can I performance-tune GFS or make it any faster?
- How can I convert a file system from gfs1 to gfs2?
- Why is access to a GFS file system slower right after it's mounted?
- After a node is fenced GFS hangs for X seconds on my other nodes. Can I reduce that time?
- Will my application run properly on a GFS file system?
- What does it mean when a GFS file system is withdrawn?
- After a withdraw, can you simply remount your GFS file system?
- The files in my GFS file system are corrupt; Why did it happen and what should I do?
- I have concurrency/caching issues when I work with a filesystem that is mounted locally and remotely via ATA over Ethernet
- Help! GFS can only be mounted by one node due to SCSI reservation conflicts!
What is GFS and why do I need it?
GFS is the file system that runs on each of the nodes in the cluster. Like all file systems, it is basically a kernel module that runs on top of the VFS (virtual file system) layer of the kernel. It controls how and where the data is stored on a block device or logical volume. To make a cluster of computers ("nodes") cooperatively share the data on a SAN, you need GFS's ability to coordinate with a cluster locking protocol. One such cluster locking protocol is DLM, the distributed lock manager, which is also a kernel module. Its job is to ensure that the nodes in the cluster that share the data on the SAN don't corrupt each other's data.
Many other file systems, such as ext3, are not cluster-aware; without this coordination, data kept on a volume shared between multiple computers would quickly become corrupt.
What hardware do I need to run GFS in a cluster?
You need some form of shared storage - Fibre Channel and iSCSI are typical. If you don't have Fibre Channel or iSCSI, look at [gnbd GNBD] instead. Also, you need two or more computers and a network connection between them.
Can I use GFS to take two off-the-shelf PCs and cluster their storage?
No. GFS only allows PCs that share storage, such as a SAN behind a Fibre Channel switch, to work together cooperatively on that storage. Off-the-shelf PCs don't have shared storage. However, you might want to look at the DRBD_Cookbook for more information on how to approximate this.
What is the maximum size of a GFS file system?
GFS 6.1 (on RHEL 4) supports 16TB when any node in the cluster is running 32 bit RHEL. If all nodes in the cluster are 64-bit RHEL (x86-64, ia64) then the theoretical maximum is 8 EB (exabytes). We have field reports of 45 and 50 TB file systems. Testing these configurations is difficult due to our lack of access to very large array systems.
That's theoretical...So what is the biggest GFS file system you've actually seen in production?
I've seen more than one 45TB GFS file system. If you know of a bigger one, I'd love to hear from you.
(We (GIS Center of Excellence, South Dakota State University) have a 153TB GFS file system with 102 partitions spanned across multiple FC RAID5 arrays, accessed by 6 application server nodes.)
Does GFS or GFS2 support millisecond timestamps?
Currently, gfs and gfs2 do not use milliseconds for files. They use seconds. This is to maintain compatibility with the underlying vfs layer of the kernel. If the kernel changes to milliseconds, we will also change.
People don't normally care about millisecond timestamps; they mostly matter to computers doing things like NFS file serving, for example to see whether another computer has changed the data on disk since the time of the last known request. For GFS2, we're planning to implement inode generation numbers to keep track of these things more accurately than a timestamp can.
Why doesn't gfs_tool setflag inherit_directio affect new_files_directio?
If I do:
[root@node-01#] gfs_tool setflag inherit_directio my_directory
[root@node-01#] gfs_tool gettune my_directory
It displays: TBD Here's what's going on: inherit_directio and new_files_directio are two separate things. If you look at the man page, inherit_directio operates on a single directory whereas new_files_directio is a filesystem-wide "settune" value. If you do:
gfs_tool setflag inherit_directio my_directory
You're telling the fs that ONLY your directory and all new files within that directory should have this attribute, which is why your tests are acting as expected, as long as you're within that directory. It basically sets an attribute on an in-memory inode for the directory. If instead you were to do:
gfs_tool settune mount-point new_files_directio 1
The new_files_directio value would change for the whole mount point, not just that directory. Of course, you're seeing what
gfs_tool gettune my_directory
is reporting for the global flag.
I heard that GFS NFS failover prevents data loss. Is that true?
No, it's not true. What it prevents is data corruption as a result of the failed node waking up and erroneously issuing writes to the disk when it shouldn't.
The simple fact is that no one can guarantee against loss of data when a computer goes down. If a client goes down in the middle of a write, its cached data will be lost. If a server goes down in the middle of a write, cached data will be lost unless the filesystem is mounted with "sync" option. Unfortunately, the "sync" option has a performance penalty. GFS's journaling should minimize and/or guard against this loss.
With NFS failover, if a server goes down in the middle of an NFS request (which is far more likely), the failed NFS service should be failed over to another GFS server in the cluster. The NFS client should get a timeout on its write request, and that will cause it to retry the request, which should go to the server that has taken over the responsibilities of the failed NFS server. And GFS will ensure the original server having the problem will not corrupt the data.
I just did mkfs on my file system, so why do I get 'permission denied' mounting it?
You probably mistyped the cluster name on mkfs. Use the 'dmesg' command to see what GFS is complaining about. If that's the problem, you can use gfs_tool, or another mkfs, to fix it.
Even if this is not your problem, if you have a problem mounting, always use dmesg to view complaints from the kernel.
Is there an easy way to see which nodes in my cluster have my GFS fs mounted?
It depends on whether you're using GULM or DLM locking. If you're using DLM, use this command from a node that has it mounted:
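One command that can report this on DLM clusters is `cman_tool services` (the exact command name here is an assumption based on the standard cman tools; check the tools shipped with your release). Each mounted GFS file system shows up as a lockspace, with the IDs of the member nodes that have it mounted:

```shell
# Run on any cluster member; the GFS lockspace lines list the
# node IDs that currently have each file system mounted.
cman_tool services
```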
If you're using GULM, or aren't on a node that has it mounted, here's another way to do it:
for i in `grep "<clusternode name" /etc/cluster/cluster.conf | cut -d '"' -f2` ; do ssh $i "mount | grep gfs" ; done
gfs_tool df shows 100% of my inodes are used. Is that a problem?
Unlike ext3, GFS will dynamically allocate inodes as it needs them. Therefore, it's not a problem.
How much overhead do my files take in GFS?
It depends on file size and file system block size. Assuming the file system block size is a standard 4K, let's do the math: A GFS inode is 232 bytes (0xe8) in length. Therefore, the most data you can fit along with an inode is 4096 - 232 = 3864 bytes. By the way, in this case we say the file "height" is 0.
Slightly bigger and the file needs a single level of indirection, also known as height 1. The inode's 3864 bytes will be used to hold a group of block pointers. These pointers are 64 bits (8 bytes) each, so you can fit exactly 483 of them in the block after the disk inode. With all 483 pointers to 4K blocks, you have at most 1.88MB.
If your file gets over 1.88MB, it will need a second level of indirection (height 2). Each indirect block has a 24-byte (0x18) header and 64 bytes of reserved space, leaving room for 501 block pointers. So your inode will have at most 483 pointers to 4K indirect blocks, each of which can hold 501 block pointers. That's 483*501 = 241983 data blocks, or 991162368 bytes of data (945MB).
If your file is bigger than 945MB, you'll need a third level of indirection (height 3), which means your file can grow to have 945MB of pointers, which is enough for 121233483 pointers. The file can grow to 496572346368 bytes, or 473568MB, also known as 462GB.
Still bigger, at height 4, we get a max file size of 248782745530368 bytes, also known as 231696GB or 226TB.
If your file is bigger than 226TB, (egads!) height 5, max file size is 124640155510714368 bytes, also known as 113359TB.
To summarize the overhead:
- 0 to 3864 bytes: one 4K block, holding both inode and data.
- 3865 bytes to 1.88MB: one 4K inode block, plus the data blocks.
- 1.88MB to 945MB: one 4K inode block, plus roughly one indirect block for every 501 data blocks, plus the data.
- 945MB to 462GB: as above, plus a second level of indirect blocks.
- 462GB to 226TB: as above, plus a third level of indirect blocks.
Also, extended attributes like ACLs, if used, take up more blocks.
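The arithmetic above can be checked with a short shell sketch. The block size and header sizes are the ones described in this answer; the script just reproduces the multiplication for each height:

```shell
#!/bin/bash
bs=4096                               # file system block size
inode_ptrs=$(( (bs - 232) / 8 ))      # 483 pointers fit after the 232-byte inode
ind_ptrs=$(( (bs - 24 - 64) / 8 ))    # 501 pointers per indirect block
blocks=$inode_ptrs
for height in 1 2 3 4 5; do
  echo "height $height: up to $(( blocks * bs )) bytes"
  blocks=$(( blocks * ind_ptrs ))
done
```

Height 2 works out to 991162368 bytes (945MB) and height 3 to 496572346368 bytes (462GB), matching the figures above.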
Can I use striping, hardware RAID, mirroring, etc., with GFS?
Yes you can. Since GFS can manage the contents of a block device (SCSI, logical volume, etc), there is still the underlying logical volume manager, LVM2, that takes care of things like spanning physical volumes, striping, hardware RAID, mirroring and such. For clusters, there is a special version of LVM2 called CLVM that is needed, but not much changes other than the locking protocol specified in /etc/lvm/lvm.conf.
Note that GFS won't work properly in a cluster with software RAID (the MD driver). At the time of this writing, software RAID is not cluster-aware. Since software RAID can only be running on one node in the cluster, the other nodes will not be able to see the data properly, or will likely destroy each other's data. However, if GFS is used as a stand-alone file system on a single-node, software RAID should be okay.
Why do I get errors (such as implicit mutex_lock) when I try to compile GFS?
Sometime after 2.6.15, the upstream kernel changed from using the semaphores (i_sem) within the VFS layer to using mutexes (i_mutex). If your Linux distribution is running an older kernel, you may not be able to compile GFS.
Your choices are:
- upgrade your kernel to a newer one, or
- downgrade your GFS or change the source code so that it uses semaphores like before. Older versions are available from CVS.
Because this is an open-source project, it's constantly evolving, as is the Linux kernel. Compile problems are to be expected (and usually easily overcome) unless you are compiling against the exact same kernel the developers happen to be using at the time.
Can I boot a diskless client off of a SAN with GFS?
Surprisingly, yes. The ATIX corporation has a SourceForge project called "Open-Sharedroot" for this purpose.
Visit http://www.open-sharedroot.org/ for more information.
There's a quick how-to at: http://www.open-sharedroot.org/documentation/the-opensharedroot-mini-howto. Mark Hlawatschek from Atix gave a presentation about this at the 2006 Red Hat Summit. His slides can be seen here: http://www.atix.de/downloads/vortrage-und-workshops/ATIX_Shared-Root-Cluster.pdf.
Is GFS certified to work with Oracle Real Application Clusters (RAC)?
Yes, with the following caveats:
RHEL 4 / GFS 6.1 is only certified to work with the GULM locking protocol. GULM lock-server nodes need to be external to the RAC/GFS cluster; that is, RAC/GFS nodes are not allowed to be GULM nodes (the nodes that run the lock manager).
See the following for more information:
- Red Hat GFS: Installing and Configuring Oracle9i RAC with GFS: http://www.redhat.com/docs/manuals/csgfs/oracle-guide/
- RAC Technologies Compatibility Matrix for Linux Clusters: http://www.oracle.com/technology/products/database/clustering/certify/tech_generic_linux.html
- RAC Technologies Compatibility Matrix for Linux x86 Clusters: http://www.oracle.com/technology/products/database/clustering/certify/tech_linux_x86.html
- RAC Technologies Compatibility Matrix for Linux x86-64 (AMD64/EM64T) Clusters: http://www.oracle.com/technology/products/database/clustering/certify/tech_linux_x86_64.html
- Oracle Certification Environment Program: http://www.oracle.com/technology/software/oce/oce_fact_sheet.htm
Are there any known bugs when running Oracle RAC on top of GFS?
Not currently. However, playing this song at high volume in your data center has been rumored to introduce entropy into the GFS+RAC configuration. Please consider Mozart or Chopin instead.
Yes, that's a joke, ha ha...
Is it possible for more than one node to work on the same file without locking each other?
Yes and no. Yes, it's possible, and one application will not block the other. No, because only one node can cache the content of the inode in question at a particular time, so the performance may be poor. The application should use some kind of locking (for example, byte-range locking via fcntl) to protect the data.
However, GFS does not excuse the application from locking to protect the data. Two processes trying to write data to the same file can still clobber each other's data unless proper locking is in place to prevent it.
Here's a good way to think about it: GFS will make two or more processes on two or more different nodes be treated the same as two or more processes on a single node. So if two processes can share data harmoniously on a single machine, then GFS will ensure they share data harmoniously on two nodes. But if two processes would collide on a single machine, then GFS can't protect you against their lack of locking.
If GFS doesn't protect my files from multiple data writers, then why use it? What good is it?
If you have shared storage that you need to mount read/write, then you still need it. Perhaps it's best to explain why with an example.
Suppose you had a fibre-channel linked SAN storage device attached to two computers, and suppose they were running in a cluster, but using EXT3 instead of GFS to access the data. Immediately after they mount, both systems would be able to see the data on the SAN. Everything would be fine as long as the file system was mounted as read-only. But without GFS, as soon as one node writes data, the other node's file system doesn't know what's happened.
Suppose node A creates a file, assigns inode number 4351 to it, and writes 16K of data to it in blocks 3120 and 2240. As far as node B is concerned, there is no inode 4351, and blocks 3120 and 2240 are free. So node B is free to try to create its own inode 4351 and write data to block 2240, still believing block 3120 is free. The file system's maps of which data areas are used and unused would soon overlap, as would the inode numbers. It wouldn't take long before the whole file system was hopelessly corrupt, along with the files inside it.
With GFS, when node A assigns inode 4351, node B automatically knows about the change, and the data is kept harmonious on disk. When one data area is allocated, all nodes in the cluster are aware of the allocation, and they don't bump into one another. If node B needs to create another inode, it won't choose 4351, and the file system is not corrupted.
However, even with GFS, if nodes A and B both decide to operate on a file X, even though they both agree on where the data is located, they can still overwrite the data within the file unless the program doing the writing uses some kind of locking scheme to prevent it.
Why is GFS unable to allow a lock when the "group execute" bit has been set?
If you have set-group-ID on and then turn off group-execute, you mark a file for mandatory locking, which GFS does not support. A file marked for mandatory locking has set-group-ID on but group-execute off, which shows up as a capital 'S' in 'ls' (the result of a chmod 2660):
-rw-rwS--- 1 tangzx2 idev 347785 Jan 17 10:22 temp.txt
Can I control journal size and placement in GFS?
Not really. The gfs_mkfs command decides exactly where everything should go and you have no choice in the matter. The volume is carved into logical "sections." The first and last sections are for multiple resource groups, based roughly on the rg size specified on the gfs_mkfs commandline. The journals are always placed between the first and last section. Specifying a different number of journals will force gfs_mkfs to carve the section size smaller, thus changing where your journals will end up.
Is GFS 100% Posix compliant?
Only insofar as Linux is. Linux isn't 100% POSIX compliant, but GFS is as compliant as any other file system can be under Linux.
Can I shrink my GFS or GFS2 file system through lvreduce?
No. GFS and GFS2 do not currently have the ability to shrink. Therefore, you cannot reduce the size of your volume.
Why is GFS slow doing things like 'ls -lr *' whereas 'ls -r *' is fast?
Mostly due to design constraints. An ls -r * can simply traverse the directory structures, which is very fast. An ls -lr * has to traverse the directory, but also has to stat each file to get more details for the ls. That means it has to acquire and release a cluster lock on each file, which can be slow. We've tried to address these problems with the new GFS2 file system.
Is it possible to create GFS on MD multi-device?
It is possible to create GFS on an MD device as long as you are only using it for multipath. Software RAID is not cluster-aware and therefore not supported with GFS. The preferred solution is to use device mapper (DM) multipathing rather than md in these configurations.
How can I mount a GFS partition at bootup time?
Put it in /etc/fstab.
During startup, the "service gfs start" script (/etc/rc.d/init.d/gfs) gets called by init. The script checks /etc/fstab to see if there are any gfs file systems to be mounted. If so, it loads the gfs device driver and appropriate locking module, assuming the rest of the cluster infrastructure has been started.
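For example, an /etc/fstab entry for a GFS file system might look like this (the device path and mount point are hypothetical):

```
/dev/cluster_vg/gfs_lv   /mnt/gfs   gfs   defaults   0 0
```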
What improvements will GFS2 have over GFS(1)?
GFS2 will address some of the shortcomings of GFS1:
- Faster fstat calls, making operations like 'ls -l' and 'df' faster.
- Faster directory scans.
- More throughput and better latency.
- Ability to add journals without extending the file system.
- Supports "data=ordered" mode (similar to ext3).
- Truncates appear to be atomic even in the presence of machine failure.
- Unlinked inodes / quota changes / statfs changes recovered without remounting that journal.
- Quotas are turned on and off by a mount option "quota=[on|off|account]"
- No distinction between data and metadata blocks. That means "gfs2_tool reclaim" is no longer necessary/present.
- More efficient allocation of inodes and bitmaps.
- Smaller journals allow for less overhead and more available storage to the users.
- Improved gfs2_edit tool for examining, extracting and recovering file system data.
- Numerous internal improvements, such as:
- Uses Linux standard disk inode mode values.
- Uses Linux standard directory entry types.
- Faster NFS filehandle lookup.
- No glock dependencies (fixes Postmark)
- No metadata generation numbers. Allocating metadata doesn't require reads.
- Copies of metadata blocks in multiple journals are managed by revoking blocks from the journal before lock release.
- Ability to do goal-based allocation in a primary/secondary setup.
- No RG LVBs anymore.
- The inode numbers seen from userspace are no longer linked to disk addresses.
- No cluster quiesce needed to replay journals.
- Much simpler log manager. It knows nothing about unlinked inodes or quota changes.
- A deallocation from an area won't contend with an allocation from that area.
What is the expected performance increase of GFS2 over GFS(1)?
I don't think anyone has speculated about this, and it's still too early for performance comparisons.
Why is one of my GFS nodes faster or slower to access my file when they're identical hardware?
With GFS, the first node to access a file becomes its lock master. Therefore, access to that file will be faster on that node than on the others.
Does GFS and GFS2 work with SELinux security?
In the RHEL4 and STABLE branches of the code in CVS, SELinux is not currently supported. In the development version (HEAD) and in upcoming releases, this support is built in.
How long will my gfs_fsck or gfs2_fsck take?
That depends highly on the type of hardware that it's running on. File system check (fsck) operations take a long time regardless of the file system, and we'd rather do a thorough job than a fast one.
Running it in verbose mode (-v) will also slow it down considerably.
We recently had a report of a 45TB GFS file system on a dual Opteron 275 (4GB RAM) with 4Gb Fibre Channel to six SATA RAIDs. The 4GB of RAM was not enough to do the fsck; it required about 15GB of memory to do the job, so a large swap drive was added. It took 48 hours for gfs_fsck to run to completion without verbose mode.
Does GFS support the use of sparse files?
Yes it does.
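As a quick illustration (this works on most local file systems too; the path is hypothetical), a sparse file can be created with dd by seeking past the end without writing any data:

```shell
# Create a 1GB file whose data blocks are not allocated until written.
dd if=/dev/zero of=/tmp/sparse.img bs=1 count=0 seek=1G
# The apparent size is 1GB, but 'ls -ls' shows few allocated blocks.
ls -ls /tmp/sparse.img
```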
I want to use GFS for MySQL. Is that okay?
Yes, but you need to be careful.
If you only want one MySQL server running, (Active-Passive) there's no problem. You can use rgmanager to manage a smooth failover to redundant MySQL servers if your MySQL server goes down. However, you should be aware that in some releases, the mysql init script has an easily-fixed problem where it doesn't return the proper return code. TBD See here for more info... That can result in rgmanager problems with starting the service.
If you want multiple MySQL services running on the cluster (Active-Active), that's where things get tricky. You can still use rgmanager to manage your MySQL services for High Availability. However, you need to configure MySQL so that:
- Only the MyISAM storage engine is used.
- Each mysqld service must start with the external-locking parameter on.
- Each mysqld service must have the query cache parameter off (other cache mechanisms remain on, since they are automatically invalidated by external locking).
If you don't follow these rules, the multiple mysqld servers will not play nice in the cluster and your database will likely be corrupted.
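A minimal sketch of those settings in my.cnf (option names are from standard MySQL documentation; treat the exact values as illustrative):

```
[mysqld]
default-storage-engine = MyISAM
external-locking
query_cache_size = 0
```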
For information on configuring MySQL, visit the mysql web site: http://www.mysql.com
MySQL also sells a clustered version of MySQL called "MySQL Cluster", but that does its own method of clustering, and is completely separate from Cluster Suite and GFS. I'm not sure how it would interact with our cluster software. For more information, see: http://www.mysql.com/products/database/cluster/
I want to use MySQL in a cluster WITHOUT GFS. Is that okay?
It depends on where you keep your databases.
If you keep your databases on shared storage, such as a SAN or iSCSI, you should use a cluster-aware file system like GFS to keep the file system sane with the multiple nodes trying to access the data at the same time. You can easily use rgmanager to manage the servers, since all the nodes will be seeing the same data. Without a cluster file system like GFS, there's likely to be corruption on your shared storage.
If your databases are on storage that is local to the individual nodes (i.e. local hard drives) then there are no data corruption issues, since the nodes won't have access to the storage on other nodes where the data is kept. However, if you plan to use the rgmanager to provide High Availability (Active-Passive) for each of your database servers, you will probably want to make copies of the database on each of the nodes so that it can also serve the database from any node that fails. You may have to do it often, too, or your backup database may quickly get out of sync with the original it is trying to provide backup service for. It may be tricky to copy these databases between the nodes, so you may need to follow special instructions on the MySQL web site: http://www.mysql.com
I want to use GFS for PostgreSQL. Is that okay?
Yes it is, for high-availability only (like MySQL, PostgreSQL is not yet cluster-aware). We even have a RG Manager resource agent for PostgreSQL 8 (only) which we plan to release in RHEL4 update 5. There is a bugzilla to track this work:
I want to use GFS for Samba (smb) file serving. Is that okay?
It depends on what you want to do with it.
You can serve samba from a single node without a problem.
If you want to use samba to serve the same shared file system from multiple nodes (clustered samba aka samba in active/active mode), you'll have to wait: there are still issues being worked out regarding clustered samba.
If you want to use samba with failover to other nodes (active/passive) it will work but if failover occurs, active connections to samba are severed, so the clients will have to reconnect. Locking states are also lost. Other than that, it works just fine.
Why does GFS lock up/freeze when a node gets fenced?
When a node fails, cman detects the missing heartbeat and begins the process of fencing the node. Cman and the lock manager (e.g. lock_dlm) prevent any new locks from being acquired until the failed node is successfully fenced. That has to be done to ensure the integrity of the file system, in case the failed node (now out of communication with the rest of the cluster) tries to write to the file system after the failure is detected by the other nodes.
The fence is considered successful after the fence script completes with a good return code. After the fence completes, the lock manager coordinates the reclaiming of the locks held by the node that had failed. Then the lock manager allows new locks and the GFS file system continues on its way.
If the fence is not successful or does not complete for some reason, new locks will continue to be prevented and therefore the GFS file system will freeze for the nodes that have it mounted and try to get locks. Processes that have already acquired locks will continue to run unimpeded until they try to get another lock.
There may be several reasons why a fence operation is not successful. For example, if there's a communication problem with a network power switch.
There may be several reasons why a fence operation does not complete. For example, if you were foolish enough to use manual fencing and forgot to run the script that informs the cluster that you manually fenced the node.
GFS gave me an error: fatal: filesystem consistency error. What is it and what can I do about it?
That pretty much means your file system is corrupt. There are a number of ways that this can happen that can't be blamed on GFS:
- Using fence_manual and manually acknowledging a node is fenced before it really is.
- A faulty or flaky SAN.
- A faulty or flaky Host Bus Adapter in any of the nodes.
- Someone running gfs_fsck while the GFS file system is still mounted on a node.
- Someone doing a mkfs or other operation on the GFS partition from a node that can see the SAN but is outside the cluster.
- Someone modifying the GFS file system while bypassing the GFS kernel module, such as doing some kind of lvm operation with locking_type = 1 in /etc/lvm/lvm.conf.
I've got several GFS file systems but when I try to mount more than one I get mount: File exists. What am I doing wrong?
I'm guessing that maybe you gave them the same locking table on gfs_mkfs, and they're supposed to be different. When you did mkfs, did you use the same -t cluster:fsname for more than one? You can find this out by doing:
gfs_tool sb <device> table
for each device and see if the same value appears. You can change it after the mkfs has already been done with this command:
gfs_tool sb <device> table cluster_name:new_name
How does GFS compare to ocfs2? Which is better?
GFS and OCFS2 have comparable features, depending on the version of the code. Both now have extended attributes, quotas, and POSIX ACLs.
How can I performance-tune GFS or make it any faster?
You shouldn't expect GFS to perform as fast as non-clustered file systems because it needs to do inter-node locking and file system coordination. That said, there are some things you can do to improve GFS performance.
- Turn on the "fast statfs" feature.
In most recent releases of GFS, there is a "fast statfs" feature that speeds up the statfs calls. This works by keeping track of changes to the statfs information locally on each node and periodically re-syncing the values. That means it's much faster, but not guaranteed to be 100% accurate at any given time. There can be several seconds of time during which a node can be unaware of statfs changes. To turn on "fast statfs" do:
gfs_tool settune <mount point> statfs_fast 1
The "fast statfs" feature has limitations. First, it's not persistent. You have to turn the feature on each time you reboot. Many people add gfs_tool commands (as above) to their "gfs" init script (/etc/init.d/gfs) so they don't have to remember to turn it on every time they reboot. Second, you need to have all the nodes in your cluster agree: they should all be using fast statfs, or else none of them should be using it. If some do and some don't, it will be unpredictable. Third, it has been known to lose its place. For example, if you use "gfs_grow" to extend your GFS file system, the fast statfs feature will not be informed of the new file system size. For normal, day-to-day file system operations, this is not a problem. (This is also not a problem for GFS2). For more information on GFS's "fast statfs" please see: readme.gfs_fast_statfs.R4
- Use -r 2048 on gfs_mkfs and mkfs.gfs2 for large file systems.
The issue has to do with the size of the GFS resource groups (RGs), an internal GFS structure for managing the file system data (not to be confused with rgmanager's Resource Groups). Some file system slowdown can be blamed on having a large number of RGs. The bigger your file system, the more RGs you need. By default, gfs_mkfs carves your file system into 256MB RGs, but it allows you to specify a preferred RG size. The default, 256MB, is good for average-size file systems, but you can increase performance on a bigger file system by using a bigger RG size.

For example, my 40TB file system needs 156438 RGs of 256MB each, and whenever GFS has to run that linked list, it takes a long time. The same 40TB file system can be created with bigger, 2048MB RGs, requiring only 19555 of them. The time savings is dramatic: it took nearly 23 minutes for my system to read in all 156438 RG structures with 256MB RGs, but only 4 minutes to read in the 19555 RG structures for my 2048MB RGs. The time to do an operation like df on an empty file system dropped from 24 seconds with 256MB RGs to under a second with 2048MB RGs. I'm sure that increasing the size of the RGs would help gfs_fsck's performance as well. Recent versions of gfs_mkfs and mkfs.gfs2 dynamically choose an RG size to reduce the RG overhead.
Note that this can be a delicate balancing act. If you have tens of thousands of RGs, GFS may waste a lot of time searching for the one it needs, as stated above. However, if you have too few RGs, GFS may waste a lot of time searching through the much-bigger bitmaps when allocating blocks. A lot depends on how your applications will be using the GFS file system and whether they mostly allocate big files or little files, and how often they do block allocations. We have a new, improved, faster "bitfit" algorithm that improves upon this situation, but it can only go so far, and it's not generally available today, except in source code form. Hopefully it will be available for RHEL5, Centos5, and similar, in the 5.3 time-frame.
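As a hedged example of the -r option discussed above (the cluster name, lock table name, journal count, and device below are placeholders, not values from this document):

```shell
# Create a GFS file system with 2048MB resource groups instead of the
# default 256MB. "-t clustername:fsname" and the device are examples.
gfs_mkfs -p lock_dlm -t mycluster:mygfs -j 3 -r 2048 /dev/my_vg/lvol0

# The GFS2 equivalent:
mkfs.gfs2 -p lock_dlm -t mycluster:mygfs -j 3 -r 2048 /dev/my_vg/lvol0
```

As the next paragraph explains, bigger is not always better; benchmark with your own workload before settling on an RG size.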
- Preallocate files.
You can squeeze a little more speed out of GFS if you pre-allocate files. When blocks of data are allocated to a file, GFS takes a certain amount of time to walk the RG list (see the previous bullet) and coordinate the allocation with the other nodes in the cluster. This all takes time. If you can pre-allocate your files, for example by using "dd if=/dev/zero of=/your/gfs/filesystem/somefile bs=1M count=X", the application won't have to take time to do that allocation later. It's not actually saving you any time; it's just managing the time better.
- Break file systems up when huge numbers of files are involved.
There's a certain amount of overhead when dealing with lots (millions) of files. If you have a lot of files, you'll get better performance from GFS if you reorganize your data into several smaller file systems with fewer files each. For example, some people who use GFS for a mail server will break the file system up into groups, rather than having all email on one huge file system.
- Disable GFS quota support if you don't need it.
If you don't need quotas enforced in GFS, you can make your file system a little faster by disabling quotas. To disable quotas, mount with the "-o noquota" option.
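A sketch of the noquota option in practice, with a placeholder device and mount point:

```shell
# One-off mount with GFS quota support disabled:
mount -t gfs -o noquota /dev/my_vg/lvol0 /mnt/gfs

# Or the equivalent /etc/fstab entry, so it applies on every mount:
# /dev/my_vg/lvol0  /mnt/gfs  gfs  noquota  0 0
```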
- Make sure your hardware isn't slowing you down
Since GFS coordinates locks through the network, you might be able to speed up locking by using faster Ethernet hardware. For example, gigabit Ethernet is faster than 100Mbit Ethernet. You can also put your cluster on its own network switch or hub to reduce slowdowns due to excessive Ethernet traffic and collisions. In some cases, slowdowns may also be blamed on faulty or flaky network switches or ports, cables, and Host Bus Adapters (HBAs).
- Try increasing or decreasing the number of GFS locks cached.
GFS caches a certain number of locks in memory, but keeping a good amount in cache is a delicate balancing act. Too many locks cached and your performance may be hurt by the overhead to manage them all. Too few locks cached and your performance may be hurt by the overhead of constantly needing to recreate locks that were previously cached. Depending on your applications running on GFS and the types of files and locks they use, it might run faster if you keep more locks cached in memory. Try to bring the number of cached locks closer in line with the number of locks you'll actually have in use. To change this value, do a command like this:
echo "200000" > /proc/cluster/lock_dlm/drop_count
- Increase "statfs_slots"
When statting a file system with asynchronous internode locking, GFS fills in stat data as the locks become available. It normally allocates 64 locks for this task. Increasing the number of locks can often make it go faster because the nodes in the cluster can all work on more lock coordination asynchronously. Use this command:
gfs_tool settune /mnt/gfs statfs_slots 128
This causes a bit more traffic among the nodes but can sustain a larger number of files. This value is not persistent, so it won't survive a reboot. If you want to make it persistent, you can add it to the gfs init script, /etc/init.d/gfs, after your file systems are mounted.
- Adjust how often the GFS daemons run.
There are six GFS daemons that perform tasks under the covers. By default, each of these daemons wakes up every so often, and that may affect performance. Slight tweaks to these numbers may help. Here's a list of GFS daemons and how often they run:
|Daemon||Purpose||Interval||Tunable|
|gfs_glockd||reclaim unused glock structures||as needed||(not tunable)|
|gfs_inoded||reclaim unlinked inodes||15 secs||inoded_secs|
|gfs_logd||journal maintenance||1 sec||logd_secs|
|gfs_quotad||write cached quota changes to disk||5 secs||quotad_secs|
|gfs_scand||look for cached glocks and inodes to toss from memory||5 secs||scand_secs|
|gfs_recoverd||recover dead machines' journals||60 secs||recoverd_secs|
To change the frequency that a daemon runs, use a command like this:
gfs_tool settune /mnt/bob3 inoded_secs 30
These values are not persistent so they won't survive a reboot. If you want to make them persistent, you can add them to the gfs init script, /etc/init.d/gfs, after your file systems are mounted.
- Make sure updatedb doesn't run on your GFS mounts.
By default, the updatedb job runs every day, and it has been known to slow GFS down to a crawl. We've even had some customers with multi-terabyte GFS file systems whose updatedb doesn't finish overnight, so multiple updatedb processes from several days queue up, all competing for GFS resources. Fortunately, you can tell updatedb not to scan the GFS mount points by editing its config file, /etc/updatedb.conf. See the "updatedb.conf" man page for more information on how to do that.
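A sketch of what that exclusion might look like in /etc/updatedb.conf; the exact variable names can differ between locate implementations, and the paths below are examples:

```shell
# /etc/updatedb.conf (excerpt)
# PRUNEFS skips whole file system types; PRUNEPATHS skips specific mounts.
PRUNEFS="gfs gfs2 nfs iso9660"
PRUNEPATHS="/tmp /var/spool /mnt/gfs"
```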
- Design your applications and/or environment with the DLM in mind.
It's important to understand at least the basics of how the DLM (Distributed Lock Manager) works and how to use that to optimize GFS performance. With the DLM, the first node who opens any given file becomes the "lock master" for that file. Once a lock master has been established, if another node opens or locks that same file, it needs to coordinate the locking with the lock master. That means talking over the network, which is much slower than talking to the disk drives. But if the lock master needs to access the files it's "mastered" it may not have to ask anyone's permission, and therefore file access is much faster. Therefore, you might be able to design your environment or application such that the lock master will do the majority of accesses to its own files.
For example, suppose you design a three-node cluster for NFS serving with 30,000 files to serve. You may get better performance if each of the three servers handles its own set of 10,000 files, rather than using LVS/Piranha to distribute the load across all of them. Again, this is something you should experiment with.
- Use "Glock Trimming"
This concept is similar to the previous bullet. The idea behind "glock trimming" is that you periodically tell GFS to drop all of its unused glocks, which translates at some level into DLM locks. When you trim the glocks, you are doing two things: First, you are redistributing the DLM locks around so that the "lock master" for any given file is moved to wherever it is needed most. Second, you are trimming the list of glocks kept in memory so there are fewer to search through.
For more information on glock trimming, please see:
Other GFS performance considerations:
- Fine-tune your hardware
Before you blame GFS for the slowdown, you should check your hardware. Before you mount your GFS file system for the first time, you can check the performance of your hard drives using some raw block transfers. For example, you can time how long a 'dd' command takes to dump the first 100GB of storage to a file, using the raw device without GFS in the picture. You can test both reads and writes this way, but write tests are destructive and should only be done before the hardware is used as a file system. Also, check whether you've got the right hardware for the job:
- SANs run faster than iSCSI or AoE, but they're more expensive.
- 1000Mbit Ethernet runs faster than 100Mbit for your networking.
- Striped volumes run faster than linear volumes.
- Having a dedicated NIC for your cluster traffic is faster than if you use one NIC for both internal and external Ethernet traffic.
- We've had customers complain about bad performance that ended up being caused by faulty hardware. For example, bad or nearly-bad hard drives will do a ton of retries on bad sectors, which slows things down to a crawl. Another example: I once heard about a customer who had a bad port on their network switch, and that was slowing them down. Their speed increased a lot after they started using a different Ethernet port on the switch.
- Make sure you're not being slowed down by things outside of GFS:
For example, software RAID (i.e. md raid, which is a single-node-only technology and should never be used in a cluster) is much slower than hardware RAID. Also, if you're exporting your GFS device through something else like NFS or GNBD, your slowdown might be the fault of NFS or GNBD rather than GFS.
- GFS2 versus GFS1
GFS2 runs faster than GFS, although GFS2 is not production-ready for a clustered environment yet. However, you can still do two things: (a) plan for and/or wait for GFS2 to be released for general availability, or (b) note that Red Hat is already supporting some customers with GFS2 in single-node-only mode, which was done because of special circumstances. If you're a Red Hat customer and your application is single-node, you might be able to talk Red Hat into supporting you by escalating your case. Of course, this is only available in RHEL5.2 and above. Red Hat has put a lot of effort into GFS2 performance and is determined to make it even better before it is released.
How can I convert a file system from gfs1 to gfs2?
There's a tool called gfs2_convert whose job is to convert a file system from gfs1 to gfs2. At this time, gfs2_convert will only convert file systems created with the default 4K block size. I recommend following this procedure:
- Unmount the file system from all nodes.
- gfs_fsck /your/file/system (make sure there's no corruption to confuse the tool)
- IMPORTANT: Make a backup of your file system in case something goes wrong
- gfs2_convert /your/file/system
WARNING: At this time, gfs2 is still being worked on, so you should not use it for a production cluster.
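The procedure above can be sketched as follows; the device, backup path, and mount point are placeholders, and the conversion should be run on one node with the file system unmounted everywhere:

```shell
# GFS1-to-GFS2 conversion sketch (example device /dev/my_vg/lvol0).
umount /mnt/gfs                                     # repeat on EVERY node
gfs_fsck /dev/my_vg/lvol0                           # clear any corruption first
dd if=/dev/my_vg/lvol0 of=/backup/gfs1.img bs=1M    # backup in case it goes wrong
gfs2_convert /dev/my_vg/lvol0
```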
Why is access to a GFS file system slower right after it's mounted?
The first access after a GFS mount will be slower because GFS needs to read in the resource group index and resource groups (internal GFS data structures) from disk. Once they're in memory, subsequent access to the file system will be faster. This should only happen right after the file system is mounted.
It also takes additional time to read in from disk: (1) the inodes for the root directory, (2) the journal index, (3) the root directory entries and other internal data structures.
You should be aware of this when performance testing GFS. For example, if you want to test the performance of the "df" command, the first "df" after a mount will be a lot slower than subsequent "df" commands.
After a node is fenced GFS hangs for X seconds on my other nodes. Can I reduce that time?
After a node fails, there is a certain amount of time during which cman waits for a heartbeat. When it doesn't get a heartbeat, it performs fencing and has to wait for the fencing agent to return a good return code, verifying that the node has indeed been fenced. While the node is being fenced, GFS is prevented from taking out new locks (existing locks, however, remain valid, so some I/O activity may still take place). After the fence is successful, DLM has to do lock recovery (to reclaim the locks held by the fenced node) and GFS has to replay the fenced node's journals. There's an additional configuration setting called post_fail_delay that can delay things further. So GFS is delayed by three things:
- Time for the fence agent to perform fencing. This varies widely based on the type of fencing you're using. Some network power switches are fast; other agents, like iLO, are slower.
- Time for DLM to recover the locks. This varies based on how much activity was happening on the file system. For example, if your application had thousands of locks taken, it will take longer to recover those locks than if your node were idle before the failure.
- Time for GFS to replay the journals. Again, this varies based on the activity of the file system before the fence. If there was lots of writing, there might be lots of journal entries to recover, which would take longer than for an idle node.
There's not much you can do about the time taken, other than to reduce post_fail_delay to 0 or buy a faster power switch.
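If you do reduce post_fail_delay, it is set as an attribute of the fence_daemon tag in /etc/cluster/cluster.conf. A sketch (the post_join_delay value shown is just an example, not a recommendation):

```xml
<!-- /etc/cluster/cluster.conf (excerpt): fence as soon as the node
     is declared dead (post_fail_delay="0") -->
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
```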
Will my application run properly on a GFS file system?
The GFS file system is like most other file systems with regard to applications, with one exception: It makes an application running on multiple nodes work as if they are multiple instances of the application running on a single node. GFS will maintain file system integrity when multiple nodes are accessing data on the same shared storage. However, the application is free to corrupt data within its own files unless it is cluster-aware.
For example, if you were to run multiple copies of the regular MySQL database on a single computer, you're going to get into trouble. That's because right now, MySQL doesn't do record-level locking on its database, and therefore a second instance would overwrite data from the first instance. Of course, there are safeguards within MySQL to prevent you from running two instances on a single computer. But if you ran MySQL from two clustered nodes on a GFS file system, it would be just like both instances are running on the same computer, except that there are no safeguards: Data corruption is likely. (Note, however, that there is a special version of MySQL that is more cluster friendly.)
The same holds true for other applications. If you can safely run multiple instances on the same computer, then you should be able to run multiple instances within your cluster safely on GFS.
What does it mean when a GFS file system is withdrawn?
If a GFS file system detects corruption due to an operation it has just performed, it will withdraw itself. Withdrawing from GFS is just slightly nicer than a kernel panic: it means the node feels it can no longer operate safely on that file system because it found out that one of its assumptions is wrong. Instead of panicking the kernel, it gives you an opportunity to reboot the node "nicely".
After a withdraw, can you simply remount your GFS file system?
No. The withdrawn node should be rebooted.
The files in my GFS file system are corrupt; Why did it happen and what should I do?
Corruption in GFS is extremely rare and almost always indicates a hardware problem with your storage or SAN, although even that's not guaranteed. The problem might be in the SAN itself, or in the motherboards, Fibre Channel cards (HBAs), or memory of the nodes. Many things can cause data corruption, such as rogue machines that access the SAN without your knowledge.
I recommend you:
- Verify the hardware is working properly in all respects.
One way you can do this is to make a backup of the raw data to another device and verify the copy against the original without GFS or any of the cluster software in the mix. For example, unmount the file system from all nodes in the cluster, then do something like:
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/mnt/backup/sanbackup
(assuming of course that /dev/my_vg/lvol0 is the logical volume you have your GFS partition on, and /mnt/backup/ is some scratch area big enough to hold that much data.) The idea here is simply to test that reading from the SAN gives you the same data twice. If that works successfully on one node, try it on the other nodes. You may want to do a similar test, only writing random data to the SAN, then reading it back and verifying the results. Obviously this will destroy the data on your SAN unless you are careful, so if this is a production machine, please take measures to protect the data before trying anything like this. This example only verifies the first 4GB of data:
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/tmp/sanbackup2 bs=1M count=4096
[root@node-01#] dd if=/dev/urandom of=/tmp/randomjunk bs=1M count=4096
[root@node-01#] dd if=/tmp/randomjunk of=/dev/my_vg/lvol0 bs=1M count=4096
[root@node-01#] dd if=/dev/my_vg/lvol0 of=/tmp/junkverify bs=1M count=4096
[root@node-01#] diff /tmp/randomjunk /tmp/junkverify
The two diffed files should be identical, or else you're having a hardware problem.
- Once you verify the hardware is working properly, run gfs_fsck on it. The latest version of gfs_fsck (RHEL4 or newer) can repair most file system corruption. If the file system is fixed okay, you should back it up. If you can read and write to the SAN reliably from all the nodes without GFS, then try using it again with GFS and see if the problem comes back.
- Perhaps someone else (the SAN manufacturer?) can recommend hardware tests you can run to verify the data integrity.
I realize these kinds of tests take a long time to do, but if it's a hardware problem, you really need to know. If you know it's not hardware and can recreate this kind of corruption with some kind of test using GFS, please let us know how and open a bugzilla.
I have concurrency/caching issues when I work with a filesystem that is mounted locally and remotely via ATA over Ethernet
AoE does not allow a machine to discover drives exported from itself, so the machine that hosts the drive must mount the block device directly. Nodes can then get out of sync, because the local mount goes straight to the block device while the remote mounts go through AoE and AoE's cache.
This is also filed as a bug.
Help! GFS can only be mounted by one node due to SCSI reservation conflicts!
WARNING: Only do this if you are not using fence_scsi
- Install sg3_utils, if not already installed
- Obtain the key holding the reservation:
sg_persist -d $dev -i -k
- Release the reservation on the node holding it:
sg_persist -d $dev -o -G -K $key -S 0
- Run the following on the other nodes:
partprobe
service clvmd restart