Tuesday, 12 March 2013

GPFS introduction


GPFS, short for General Parallel File System, is a concurrent file system from IBM. It is a high-performance shared-disk file system that provides fast data access from all nodes in a homogeneous or heterogeneous cluster of IBM UNIX servers running either the AIX or the Linux operating system.

All nodes in a GPFS cluster have the same GPFS journaled filesystem mounted, allowing multiple nodes to be active at the same time on the same data.

A specific use for GPFS is RAC, Oracle's Real Application Clusters. In a RAC cluster multiple instances are active (sharing the workload) and provide a near "Always-On" database operation. The Oracle RAC software relies on IBM's HACMP software to achieve high availability for the hardware and the AIX operating system. For storage it uses the concurrent filesystem GPFS.

Data availability

GPFS is fault tolerant and can be configured for continued access to data even if cluster nodes or storage systems fail. This is accomplished through robust clustering features and support for data replication. GPFS continuously monitors the health of the file system components. When failures are detected, appropriate recovery action is taken automatically. Extensive logging and recovery capabilities are provided, which maintain metadata consistency when application nodes holding locks or performing services fail. Data replication is available for journal logs, metadata and data. Replication allows for continuous operation even if a path to a disk or a disk itself fails. GPFS Version 3.2 further enhances clustering robustness with connection retries. If the LAN connection to a node fails, GPFS will automatically try to reestablish the connection before making the node unavailable. This provides for better uptime in environments experiencing network issues. Using these features along with a high availability infrastructure ensures a reliable enterprise storage solution.

GPFS interaction with AIX

GPFS provides a journaled filesystem that can be mounted on multiple nodes simultaneously. GPFS stripes the data across all disks that belong to that file system. GPFS deals with AIX volume groups and disks somewhat differently from what we are used to, and mirroring is also done in a different way.

A standard AIX setup has a device relationship that follows these rules: A volume group is created that holds one or more physical disks. A disk contains one or more logical volumes, and a logical volume may span multiple disks. There is a one-to-one relation between a logical volume and the filesystem it contains. With LVM mirroring, each logical partition of a logical volume is placed on two separate disks. This typical setup is shown in the figure below:
The original AIX filesystem structure.

In a SAN environment, the picture looks like this:
The AIX filesystem structure in a SAN environment.

In a GPFS setup the rules are different. Each GPFS volume group contains only one physical disk. Each disk contains only one logical volume. Each filesystem contains multiple logical volumes (one for each disk). LVM mirroring is not supported, because there is only one disk in a volume group. This translates into the following picture:
The GPFS filesystem structure in a SAN environment.

In GPFS 2.3 the GPFS volumes are called Network Shared Disks (NSDs), each of which contains only one physical disk. No volume groups or logical volumes are created in this GPFS version. In clusters migrated from GPFS 2.2 to GPFS 2.3 you will still see volume groups and logical volumes, but only for the "old" disks. New disks and filesystems will be created without them.

We now change the picture into a more "stack"-like representation. Here you see one GPFS filesystem that is made up of four separate disks. The AIX multipath software has created the hdisk and vpath devices.

On the AIX level GPFS creates a separate volume group for each disk, so 4 volume groups in total. GPFS fills each disk with one logical volume, so 4 logical volumes in total. These logical volumes are represented as disks in the GPFS configuration. These GPFS disks are used in the filesystem. A file stored in the filesystem is striped across the four disks (in 8 KB blocks). The command used to create the GPFS disks is mmcrlv.
The stacked GPFS filesystem structure.

For performance reasons, small LUNs of 17.5 GB are usually used instead of big LUNs (of 400 GB).

Mirroring versus replication

Traditional AIX mirroring on the logical volume level cannot be done in a typical GPFS device setup. The volume group holds only one disk, which is completely filled with one logical volume, so there is no possible destination for the second copy of the logical volume's logical partitions. GPFS provides replication as the alternative.

GPFS provides a structure called replication as a means of surviving a disk failure. On the file level you can specify how many copies of that file must be present in the filesystem (one or two). When you specify two copies, GPFS will duplicate the file across two "failure groups". Setting replication on the file level is error-prone, because it can easily be forgotten. It is also possible to specify this globally on the filesystem level. Set "Default number of replicas = 2" and "Maximum number of replicas = 2" on each GPFS filesystem, so that every file in every GPFS filesystem is automatically replicated. Keep in mind that replication stores two copies of each file in the same filesystem. Each file will use twice the amount of space, so the free space in the filesystem will shrink twice as fast. An example: the free space in the filesystem is 15 MB and you want to save a file of 10 MB. The result is a "file system full" error, because you need at least 20 MB of free space to hold both copies of the file!
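
The space arithmetic of the example can be written down as a small Python sketch. This is only an illustration of the reasoning above, not GPFS code: the function names are made up for this post, and it ignores metadata and block rounding.

```python
def space_needed_mb(file_size_mb, replicas=2):
    """Space one file consumes when it is stored with `replicas` copies."""
    return file_size_mb * replicas


def fits(file_size_mb, free_mb, replicas=2):
    """True if all copies of the file fit in the remaining free space."""
    return space_needed_mb(file_size_mb, replicas) <= free_mb


# The example from the text: a 10 MB file with two replicas.
print(space_needed_mb(10))   # 20
print(fits(10, free_mb=15))  # False: this is the "file system full" case
print(fits(10, free_mb=20))  # True
```

With replication switched off (replicas=1) the same 10 MB file would fit in the 15 MB of free space.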

Failure groups

GPFS groups disks into "failure groups". A failure group is a collection of disks that share a single point of failure (SPOF). In a SAN setup there is usually only one SPOF for a disk. All disks are usually multipathed, so a single Host Bus Adapter (HBA) failure is no problem. All systems can be connected to two separate SAN fabrics, so a fabric failure is also no problem. However, each disk is hosted by one ESS. When the ESS fails, all disks in that ESS will fail. If you have a second ESS, you can survive this failure by using failure groups. GPFS uses these failure groups to prevent both replicas of a file from failing at the same time. It does this by writing the two copies of a file to disks in separate failure groups.
Each file copy in a separate failure group.

In the example above you see that the file is written twice in the filesystem. One copy is striped across LUNs 1 and 2, and the other copy is striped across LUNs 3 and 4. When ESS1 fails, the second copy of the file is still completely usable on ESS2.
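
To make the placement rule concrete, here is a minimal Python model of the example. It is a sketch under assumptions: the disk names, the failure group numbers (1 for ESS1, 2 for ESS2) and both functions are invented for illustration, and real GPFS places replicas per block rather than per file.

```python
from collections import defaultdict

# Hypothetical layout matching the example: four LUNs, two per ESS.
DISK_FAILURE_GROUP = {"lun1": 1, "lun2": 1, "lun3": 2, "lun4": 2}


def place_replicas(disk_failure_group, replicas=2):
    """Return one list of disks per copy, each copy in its own failure group."""
    by_group = defaultdict(list)
    for disk, group in disk_failure_group.items():
        by_group[group].append(disk)
    if len(by_group) < replicas:
        raise ValueError("need at least one failure group per replica")
    chosen = sorted(by_group)[:replicas]
    return [sorted(by_group[group]) for group in chosen]


def surviving_copies(placement, disk_failure_group, failed_group):
    """Copies that stay fully readable after all disks in failed_group are lost."""
    return [
        copy for copy in placement
        if all(disk_failure_group[disk] != failed_group for disk in copy)
    ]


placement = place_replicas(DISK_FAILURE_GROUP)
print(placement)                                           # [['lun1', 'lun2'], ['lun3', 'lun4']]
print(surviving_copies(placement, DISK_FAILURE_GROUP, 1))  # [['lun3', 'lun4']]
```

If all four LUNs were put in the same failure group, place_replicas would raise an error, which mirrors the point that replication only protects you when the copies do not share a SPOF.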

Striping

Large files in GPFS are divided into equal-sized blocks, and consecutive blocks are placed on different disks in a round-robin fashion. To minimize seek overhead, the block size is large (typically 256K). Large blocks have the advantage that they allow a large amount of data to be retrieved in a single I/O from each disk. GPFS stores small files (and the end of large files) in smaller units called sub-blocks, which are as small as 1/32 of the size of a full block. Striping works best when disks have equal size and performance. This is why you should use a single disk size for data storage in a filesystem; do not mix and match large and small LUNs.
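
The round-robin placement and the sub-block handling of the file tail can be sketched in a few lines of Python. This is a simplified model, not the real allocator: it assumes the 256K block size mentioned above, always starts at the first disk, and the function names are hypothetical.

```python
BLOCK_SIZE = 256 * 1024            # typical block size named in the text
SUBBLOCK_SIZE = BLOCK_SIZE // 32   # smallest unit: 1/32 of a full block (8 KB here)


def stripe(file_size, disks):
    """Map every full block of a file to a disk, round-robin."""
    full_blocks = file_size // BLOCK_SIZE
    return [(block, disks[block % len(disks)]) for block in range(full_blocks)]


def tail_subblocks(file_size):
    """Number of sub-blocks needed for the data after the last full block."""
    tail = file_size % BLOCK_SIZE
    return -(-tail // SUBBLOCK_SIZE)  # ceiling division


disks = ["disk1", "disk2", "disk3", "disk4"]
size = 5 * BLOCK_SIZE + 100           # five full blocks plus a 100-byte tail
for block, disk in stripe(size, disks):
    print(block, disk)                # block 4 wraps around to disk1 again
print(tail_subblocks(size))           # 1
```

The wrap-around in the output is the reason equal-sized disks matter: every disk receives the same share of blocks, so a smaller disk would fill up first.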

GPFS transaction log

Just like JFS, GPFS is a journaled filesystem. GPFS records all metadata updates that affect file system consistency in a journal log. Each node has a separate log for each file system it mounts, stored in that file system. Because this log can be read by all other nodes, any node can perform recovery on behalf of a failed node. It is not necessary to wait for the failed node to come back to life. After a failure, file system consistency is restored quickly by simply re-applying all updates recorded in the failed node's log. Once the updates described by a log record have been written back to disk, the log record is no longer needed and can be discarded. Thus, logs can be fixed size, because space in the log can be freed up at any time by flushing "dirty" metadata back to disk in the background.
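
The idea of a fixed-size log that another node can replay is easy to show with a toy model in Python. Everything here is an assumption made for illustration (the class, the dictionary standing in for the disk, the record format); it says nothing about the actual GPFS on-disk log layout.

```python
class MetadataLog:
    """Toy fixed-size journal: records are dropped once their update is on disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []              # pending (key, value) metadata updates

    def append(self, key, value, disk):
        if len(self.records) >= self.capacity:
            self.flush(disk)           # free log space by writing dirty metadata back
        self.records.append((key, value))

    def flush(self, disk):
        for key, value in self.records:
            disk[key] = value
        self.records.clear()


def recover(log, disk):
    """Replay a failed node's log; any surviving node could run this."""
    for key, value in log.records:
        disk[key] = value              # re-applying an update twice is harmless
    log.records.clear()


disk = {}                              # stands in for the shared on-disk metadata
log = MetadataLog(capacity=2)
log.append("inode7.size", 4096, disk)
log.append("inode7.mtime", 1363046400, disk)
log.append("inode9.size", 512, disk)   # log was full, so the first two were flushed
recover(log, disk)                     # the node "fails"; another node replays its log
print(sorted(disk))                    # ['inode7.mtime', 'inode7.size', 'inode9.size']
```

Because the log lives in the shared filesystem, the recover step does not need the failed node at all, which is the point made above.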

GPFS data and metadata

The GPFS filesystem contains two types of content: data and metadata. "Data" means the actual files you want to store in the filesystem. This is the usable storage space. "Metadata" refers to all sorts of information used by GPFS internally. For each GPFS disk you can specify what it will contain: DataAndMetadata, MetadataOnly, DataOnly or DescOnly. "DataAndMetadata" is used for normal disks, so most disks in the system will have this designation. "DescOnly" is used for "quorum busters", disks that hold only a copy of the filesystem descriptor.

GPFS filesystem descriptor quorum

There is a structure in GPFS called the Filesystem Descriptor (FSDesc) that is originally written to every disk in the filesystem, but is updated only on a subset of the disks as changes to the filesystem occur, such as adding or deleting disks. This subset usually consists of three or five disks, depending on how many disks and failure groups are in the filesystem. The disks that make up this subset can be found by reading any one of the FSDesc copies on any disk. The FSDesc may point to other disks where more up-to-date copies of the FSDesc are located.

To determine the correct filesystem configuration, a quorum of the subset of disks must be online so that the most up-to-date FSDesc can be found. If there are three special disks, then two of the three must be available. GPFS distributes the copies of the FSDesc across the failure groups. If there are only two failure groups, one failure group has two copies and the other failure group has one copy. In a scenario where one entire failure group disappears all at once, everything stays up if the lost failure group holds only the single FSDesc copy, because the two remaining copies still form a quorum. On the other hand, if the downed failure group contains the majority of the quorum, the FSDesc cannot be updated and the filesystem must be force unmounted. If the disks fail one at a time, the FSDesc is moved to a new subset of disks by updating the other two copies and a new disk copy. However, if two of the three disks fail simultaneously, the FSDesc copies cannot be updated to show the new quorum configuration. In this case, the filesystem must be unmounted to preserve existing data integrity. To survive a single ESS failure in a dual ESS configuration, there must be a third failure group on an independent disk outside both ESSs (the so-called TieBreaker node, which holds one disk per filesystem containing the third FSDesc).
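
The quorum rule boils down to a majority count, which the following Python sketch illustrates. The failure group names and the function are hypothetical, and the model only covers the "whole failure group disappears at once" scenario described above.

```python
def has_quorum(desc_copies_per_group, failed_groups):
    """True if a strict majority of the FSDesc copies is still reachable."""
    total = sum(desc_copies_per_group.values())
    alive = sum(
        copies for group, copies in desc_copies_per_group.items()
        if group not in failed_groups
    )
    return alive > total // 2


# Two failure groups: one ESS necessarily holds two of the three copies.
two_groups = {"ESS1": 2, "ESS2": 1}
print(has_quorum(two_groups, {"ESS2"}))   # True: two of three copies remain
print(has_quorum(two_groups, {"ESS1"}))   # False: filesystem must be unmounted

# Third failure group on the TieBreaker node: one copy in each group.
with_tiebreaker = {"ESS1": 1, "ESS2": 1, "TieBreaker": 1}
print(has_quorum(with_tiebreaker, {"ESS1"}))  # True
print(has_quorum(with_tiebreaker, {"ESS2"}))  # True
```

With only two failure groups, survival depends on which ESS fails. With the third group, the loss of either ESS leaves two of three copies available.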

The final picture will be:
Final GPFS on SAN picture.

Taking everything mentioned above into account, the final solution for a GPFS filesystem is:

All files in the filesystem are replicated across two failure groups on two nodes (preferably in two sites). This is controlled by the filesystem setting "default number of replicas = 2". The number of disks that hold data is the same at each of the two sites. The number of disks used for data has no practical limit, so you will probably create multiple filesystems for reasons other than the disk limit. These disks also hold a copy of the metadata.

There is a third site with one disk used as quorum buster on the TieBreaker node. This disk holds no data or metadata, only a single filesystem descriptor (FSDesc).

GPFS software

For GPFS 2.2 the following filesets are installed on each node of the GPFS cluster:
  • mmfs.base.cmds
  • mmfs.base.rte
  • mmfs.gpfs.rte
  • mmfs.gpfsdocs.data
  • mmfs.msg.en_US
For GPFS 2.3 the following filesets are installed on each node of the GPFS cluster:
  • gpfs.base
  • gpfs.msg.en_US
  • gpfs.docs.data
For Oracle RAC using GPFS 2.3, installation of HACMP 5.2 (and RSCT) is required. This is a requirement of Oracle RAC, not of GPFS.
