Steve Pate : ext2 & ext3
The first filesystem developed as part of Linux was a clone of the Minix
filesystem. At that time, the Minix filesystem stored its block addresses in
16-bit integers, which restricted the size of the filesystem to 64MB. Also,
directory entries were fixed in size, so filenames were limited to 14
characters. Minix
filesystem support was replaced in 1992 by the ext filesystem, which supported
filesystem sizes up to 2GB and filename sizes up to 255 characters. However, ext
inodes did not have separate access, modification, and creation time stamps, and
linked lists were used to manage free blocks and inodes, resulting in
fragmentation and less-than-ideal performance.
These inadequacies were addressed by both the Xia filesystem and the ext2
filesystem (which was modelled on the BSD Fast File System), both of which
provided a number of enhancements, including a better on-disk layout for
managing filesystem resources. The improvements in ext2 far
outweighed those of Xia, and ext2 became the de facto standard on Linux.
The following sections first describe the ext2 filesystem, followed by a
description of how the filesystem has evolved over time to produce the ext3
filesystem which supports journaling and therefore fast recovery.
Features of the ext2 Filesystem
Shown below are the main features supported by ext2:
4TB filesystems. This required changes within the VFS layer. Note that the
maximum file and filesystem size are properties of the underlying filesystem
and the kernel implementation.
255-byte filenames. Directory entries are variable in length with a maximum
size of 255 bytes.
Selectable file semantics. With a mount option, the administrator can choose
whether to have BSD or SVR4 file semantics. This has an effect on the group
ID chosen when a file is created. With BSD semantics, files are created with
the same group ID as the parent directory. For System V semantics, if a
directory has the set group ID bit set, new files inherit the group ID bit of the
parent directory and subdirectories inherit the group ID and set group ID bit;
otherwise, files and directories inherit the primary group ID of the calling
process.
Multiple filesystem block sizes. Block sizes of 1024, 2048, and 4096 bytes can
be specified as an option to mkfs.
Reserved space. Up to 5 percent of the filesystem can be reserved for root-only
files, allowing some recovery in the case of a full filesystem.
Per-file attributes. Attributes can be set on a file or directory to affect
subsequent file access. This is described in detail in the next section.
BSD-like synchronous updates. A mount option ensures that all meta-data
(inodes, bitmaps, indirects and directories) are written to disk synchronously
when modified. This increases filesystem integrity, albeit at the expense
of performance.
Periodic filesystem checks. To enforce filesystem integrity, ext2 has two ways
of ensuring that a full fsck is invoked on the filesystem. A count is kept of
how many times the filesystem is mounted read/write. When it reaches a
specified count, a full fsck is invoked. Alternatively, a time-based system
can be used to ensure that the filesystem is cleaned on a regular basis.
Fast symbolic links. As with VxFS, symbolic links are stored in the inode itself
rather than in a separate allocated block.
The following sections describe some of these features in more detail.
Per-File Attributes
In addition to the features listed in the last section, there is a set of per-file
attributes which can be set using the chattr command and displayed using the
lsattr command. The supported attributes are:
EXT2_SECRM_FL. With this attribute set, whenever a file is truncated the
data blocks are first overwritten with random data. This ensures that once a
file is deleted, it is not possible for the file data to resurface at a later stage in
another file.
EXT2_UNRM_FL. This attribute is used to allow a file to be undeleted.
EXT2_SYNC_FL. With this attribute, file meta-data, including indirect blocks,
is always written synchronously to disk following an update. Note, though,
that this does not apply to regular file data.
EXT2_COMPR_FL. The file is compressed. All subsequent access must use
compression and decompression.
EXT2_APPEND_FL. With this attribute set, a file can only be opened in
append mode (O_APPEND) for writing. The file cannot be deleted by
anyone.
EXT2_IMMUTABLE_FL. If this attribute is set, the file can only be read and
cannot be deleted by anyone.
Attributes can be set on both regular files and directories. Attributes that are set
on directories are inherited by files created within the directory.
The following example shows how the immutable attribute can be set on a file.
The passwd file is first copied into the current directory and is shown to be
writable by root. The chattr command is called to set the attribute, which can
then be displayed by calling lsattr. The two operations following show that it is
then no longer possible to remove the file or extend it:
# cp /etc/passwd .
# ls -l passwd
-rw-r--r-- 1 root root 960 Jan 28 17:35 passwd
# chattr +i passwd
# lsattr passwd
---i-------- passwd
# rm passwd
rm: cannot unlink 'passwd': Operation not permitted
# cat >> passwd
bash: passwd: Permission denied
Note that at the time of writing, not all of the file attributes are implemented.
The ext2 Disk Layout
The layout of structures on disk is shown in Figure 9.6. Aside from the boot
block, the filesystem is divided into a number of fixed size block groups. Each
block group manages a fixed set of inodes and data blocks and contains a copy of
the superblock that is shown as follows. Note that the first block group starts at
an offset of 1024 bytes from the start of the disk slice or volume.
struct ext2_super_block {
unsigned long s_inodes_count; /* Inodes count (in use)*/
unsigned long s_blocks_count; /* Blocks count (in use) */
unsigned long s_r_blocks_count; /* Reserved blocks count */
unsigned long s_free_blocks_count; /* Free blocks count */
unsigned long s_free_inodes_count; /* Free inodes count */
unsigned long s_first_data_block; /* First Data Block */
unsigned long s_log_block_size; /* Block size */
long s_log_frag_size; /* Fragment size */
unsigned long s_blocks_per_group; /* # Blocks per group */
unsigned long s_frags_per_group; /* # Fragments per group */
unsigned long s_inodes_per_group; /* # Inodes per group */
unsigned long s_mtime; /* Mount time */
unsigned long s_wtime; /* Write time */
unsigned short s_mnt_count; /* Mount count */
short s_max_mnt_count; /* Maximal mount count */
unsigned short s_magic; /* Magic signature */
unsigned short s_state; /* File system state */
unsigned short s_errors; /* Error handling */
unsigned long s_lastcheck; /* time of last check */
unsigned long s_checkinterval; /* max. time between checks */
};
Many of the fields shown here are self-explanatory and describe the usage of
inodes and data blocks within the block group. The magic number for ext2 is
0xEF53. The fields toward the end of the superblock are used to determine when
a full fsck should be invoked (either based on the number of read/write mounts
or a specified time).
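Note also that the block size is not stored directly: s_log_block_size records
a power-of-two shift from a base of 1024 bytes. A minimal sketch of the
arithmetic (the helper name is ours, not kernel code):

```c
#include <assert.h>

/* ext2 stores the block size as a power-of-two shift from a base of
 * 1024 bytes, so s_log_block_size of 0 -> 1024, 1 -> 2048, 2 -> 4096.
 * Illustrative helper only; this is not the kernel's code. */
static unsigned long ext2_block_size(unsigned long s_log_block_size)
{
    return 1024UL << s_log_block_size;
}
```

The floppy filesystem built later in this section reports "Block size=1024
(log=0)", which corresponds to a shift value of zero.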
When writing sequentially to a file, ext2 tries to preallocate space in units of 8
contiguous blocks. Unused preallocation is released when the file is closed, so no
space is wasted. This is used to help prevent fragmentation, a situation under
which the majority of the blocks in the file are spread throughout the disk because
contiguous blocks may be unavailable. Contiguous blocks are also good for
performance because when files are accessed sequentially there is minimal disk
head movement.
It is said that ext2 does not need defragmentation under normal load as long as
there is 5 percent of free space on a disk. However, over time continuous addition
and removal of files of various size will undoubtedly result in fragmentation to
some degree. There is a defragmentation tool for ext2 called defrag, but users
are cautioned about its use: if a power outage occurs while defrag is running,
the filesystem can be damaged.
The block group is described by the following structure:
struct ext2_group_desc {
unsigned long bg_block_bitmap; /* Blocks bitmap block */
unsigned long bg_inode_bitmap; /* Inodes bitmap block */
unsigned long bg_inode_table; /* Inodes table block */
unsigned short bg_free_blocks_count; /* Free blocks count */
unsigned short bg_free_inodes_count; /* Free inodes count */
unsigned short bg_used_dirs_count; /* Directories count */
};
This structure basically points to other components of the block group, with the
first three fields referencing specific block numbers on disk. By allocating inodes
and disk blocks within the same block group, it is possible to improve
performance because disk head movement may be reduced. The
bg_used_dirs_count field records the number of inodes in the group that are
used for directories. This count is used as part of the scheme to balance
directories across the different block groups and to help locate files and their
parent directories within the same block group.
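The mapping from an inode number to its block group follows directly from the
superblock's s_inodes_per_group value. The sketch below shows the arithmetic
(the function names are ours, not the kernel's); inode numbers start at 1:

```c
#include <assert.h>

/* Locating an inode: compute its block group, then its slot within
 * that group's inode table. Inode numbers start at 1, hence "- 1".
 * Illustrative sketch only. */
static unsigned long ino_to_group(unsigned long ino,
                                  unsigned long inodes_per_group)
{
    return (ino - 1) / inodes_per_group;
}

static unsigned long ino_to_index(unsigned long ino,
                                  unsigned long inodes_per_group)
{
    return (ino - 1) % inodes_per_group;
}
```

With the 184 inodes per group of the floppy filesystem built below, inode 12
lives in group 0 at index 11 of the inode table.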
To better see how the block group structures are used in practice, the following
example, using a small ext2 filesystem, shows how structures are set up when a
file is allocated. Firstly, a filesystem is made on a floppy disk as follows:
# mkfs /dev/fd0
mke2fs 1.24a (02-Sep-2001)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
184 inodes, 1440 blocks
72 blocks (5.00%) reserved for the super user
First data block=1
1 block group
8192 blocks per group, 8192 fragments per group
184 inodes per group
Writing inode tables: 0/1done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 35 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
Analysis of the on-disk structures can be achieved using the debugfs command.
The show_super_stats displays the superblock and the disk group structures.
With the -h option, only the superblock is displayed:
# debugfs /dev/fd0
debugfs 1.24a (02-Sep-2001)
debugfs: show_super_stats -h
Filesystem volume name:
Last mounted on:
Filesystem UUID: e4e5f20a-f5f3-4499-8fe0-183d9f87a5ba
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: filetype sparse_super
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 184
Block count: 1440
Reserved block count: 72
Free blocks: 1399
Free inodes: 173
First block: 1
Block size: 1024
Fragment size: 1024
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 184
Inode blocks per group: 23
Last mount time: Wed Dec 31 16:00:00 1969
Last write time: Fri Feb 8 16:11:59 2002
Mount count: 0
Maximum mount count: 35
Last checked: Fri Feb 8 16:11:58 2002
Check interval: 15552000 (6 months)
Next check after: Wed Aug 7 17:11:58 2002
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Group 0: block bitmap at 3, inode bitmap at 4, inode table at 5
1399 free blocks, 173 free inodes, 2 used directories
The block group information is shown separate from the superblock. It shows the
block numbers where the various structural information is held. For example, the
inode bitmap for this block group is stored at block 4. Recall from the information
displayed when the filesystem was made that the block size is 1024 bytes. This is
stored in the s_log_block_size field in the superblock.
Further information about the block group can be displayed with the
dumpe2fs command as follows:
# dumpe2fs /dev/fd0
dumpe2fs 1.24a (02-Sep-2001)
...
Group 0: (Blocks 1-1439)
Primary Superblock at 1, Group Descriptors at 2-2
Block bitmap at 3 (+2), Inode bitmap at 4 (+3)
Inode table at 5-27 (+4)
1399 free blocks, 173 free inodes, 2 directories
Free blocks: 41-1439
Free inodes: 12-184
There are 184 inodes per group in the example here. The first nonreserved
inode, inode 11, is occupied by the lost+found directory. Thus, the first inode
available for general users is inode 12. The following example shows how all
inodes can be used but without all of the space being consumed:
# cd /mnt
# i=12
# while [ $i -lt 188 ] ; do > $i ; i=`expr $i + 1` ; done
bash: 185: No space left on device
bash: 186: No space left on device
bash: 187: No space left on device
# df -k
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda3 19111092 1844084 17267008 10% /
/dev/hda1 21929 3615 17182 18% /boot
shmfs 127780 0 127780 0% /dev/shm
/dev/fd0 1412 15 1325 2% /mnt
So, although the filesystem is only 2 percent full, all of the inodes have been
allocated. This represents one of the difficulties that filesystems have faced over
the years, where the number of inodes is statically allocated when the filesystem
is made.
The following example shows the statistics of an allocated file:
# cp /etc/passwd /mnt ; umount /mnt
# debugfs /dev/fd0
debugfs 1.24a (02-Sep-2001)
debugfs: ls -l /
2 40755 0 0 1024 13-Feb-2002 20:20 .
2 40755 0 0 1024 13-Feb-2002 20:20 ..
11 40755 0 0 12288 13-Feb-2002 20:18 lost+found
12 100644 0 0 2064 13-Feb-2002 20:20 passwd
debugfs: stat <12>
Inode: 12 Type: regular Mode: 0644 Flags: 0x0 Generation: 59537
User: 0 Group: 0 Size: 2064
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 6
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x3c6b3af9 -Wed Feb 13 20:20:09 2002
atime: 0x3c6b3af8 -Wed Feb 13 20:20:08 2002
mtime: 0x3c6b3af8 -Wed Feb 13 20:20:08 2002
BLOCKS:
(0-2):41-43
TOTAL: 3
In this case, the file is displayed by inode number. The size of the file is 2064 bytes
which results in three blocks being allocated: blocks 41 to 43. Recall from
displaying the block group information shown previously that the first data block
started at block 41.
ext2 On-Disk Inodes
The ext2 on-disk inode structure is defined by the ext2_inode structure as
follows:
struct ext2_inode {
__u16 i_mode; /* File mode */
__u16 i_uid; /* Low 16 bits of Owner Uid */
__u32 i_size; /* Size in bytes */
__u32 i_atime; /* Access time */
__u32 i_ctime; /* Creation time */
__u32 i_mtime; /* Modification time */
__u32 i_dtime; /* Deletion Time */
__u16 i_gid; /* Low 16 bits of Group Id */
__u16 i_links_count; /* Links count */
__u32 i_blocks; /* Blocks count */
__u32 i_flags; /* File flags */
__u32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
__u32 i_generation; /* File version (for NFS) */
__u32 i_file_acl; /* File ACL */
__u32 i_dir_acl; /* Directory ACL */
__u32 i_faddr; /* Fragment address */
struct {
__u8 l_i_frag; /* Fragment number */
__u8 l_i_fsize; /* Fragment size */
} linux2;
};
The first several fields are self-explanatory. The i_blocks field records the
number of blocks that the file has allocated. This value is in 512-byte units.
These blocks are stored as either direct data blocks in i_block[] or are
referenced through indirect blocks within the same array. For example, consider
the passwd file copied to an ext2 filesystem as shown above. Because the file is
2064 bytes in size, three 1024 byte blocks are required. The actual block count
shown is 6 (512 byte blocks).
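The arithmetic behind that count of 6 can be sketched as follows (a
simplification: for larger files, i_blocks also counts the indirect blocks
themselves, which this sketch ignores):

```c
#include <assert.h>

/* Why a 2064-byte file on a 1024-byte-block filesystem shows a block
 * count of 6: i_blocks is measured in 512-byte units. This sketch
 * ignores indirect blocks, which i_blocks also counts for large files. */
static unsigned long fs_blocks_needed(unsigned long size_bytes,
                                      unsigned long block_size)
{
    return (size_bytes + block_size - 1) / block_size;   /* round up */
}

static unsigned long i_blocks_for(unsigned long size_bytes,
                                  unsigned long block_size)
{
    return fs_blocks_needed(size_bytes, block_size) * (block_size / 512);
}
```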
The inode i_block[] array has EXT2_N_BLOCKS (15) pointers to blocks of
data. The first EXT2_NDIR_BLOCKS (12) entries in the array are direct pointers to
data blocks. The i_block[12] element points to an indirect block of pointers to
data blocks. The i_block[13] element points to a double indirect block for
which each element points to an indirect block. The i_block[14] element
points to a triple indirect block of pointers to double indirects.
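With 4-byte block pointers, the number of data blocks reachable through
i_block[] for a given block size can be computed as below (a sketch; it counts
data blocks only and ignores the 32-bit i_size limit):

```c
#include <assert.h>

/* Data blocks addressable through i_block[]: 12 direct pointers plus
 * one single, one double, and one triple indirect tree. Each block of
 * pointers holds block_size / 4 four-byte entries. Illustrative only. */
static unsigned long long max_data_blocks(unsigned long block_size)
{
    unsigned long long p = block_size / 4;   /* pointers per block */
    return 12 + p + p * p + p * p * p;
}
```

For 1024-byte blocks this gives 16,843,020 addressable blocks, or roughly 16GB
of file data.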
Various inode numbers are reserved which explains why the first inode
allocated has an inode number of 12 (lost+found is 11). Some reserved inodes
are:
EXT2_BAD_INO (1). This file contains a list of bad blocks on the file system.
EXT2_ROOT_INO (2). This is the root directory of the file system.
EXT2_ACL_IDX_INO (3). ACL inode.
EXT2_ACL_DATA_INO (4). ACL inode.
EXT2_BOOT_LOADER_INO (5). The file contains the boot loader.
EXT2_UNDEL_DIR_INO (6). This file is used for file undelete.
EXT2_FIRST_INO (11). This is the first inode that does not have a special
meaning and can be used for other purposes.
There are many different inode flags that can be stored in i_flags. These map
to the file attributes that can be set with chattr.
The i_faddr field is used in the case where the fragment size and block size
are not equal. If the file does not require an exact number of filesystem-sized
blocks, the last portion of the file data is stored in a fragment. The location of the
fragment is stored in this field.
Repairing Damaged ext2 Filesystems
The e2fsck program is used to repair filesystem inconsistencies that can occur
following a system crash. The process is divided into five separate passes,
which are listed below. The information shown here is based on material that
appears in the Linux System Administrators Guide [WIRZ95]:
Pass 1. This phase takes the longest time to execute, because all of the inodes
have to be read into memory and checked.
In this phase, e2fsck checks each inode in the filesystem to ensure the
file mode is valid and that all of the blocks in the inode are valid block
numbers. During pass 1, bitmaps indicating which blocks and inodes are in
use are compiled, to be used later.
If e2fsck notices data blocks that are mapped by more than one inode, it
can either clone the duplicated blocks so that each inode has its own copy, or
remove the blocks from one or more of the inodes.
To reduce the I/O time necessary in future passes, critical filesystem
information is cached in memory, including the location on disk of all of the
directory blocks on the filesystem. This removes the need to re-read the
directory inodes during pass 2.
Pass 2. In this phase directories are validated. Because directory entries do not
span disk blocks, each directory block can be checked individually without
reference to other directory blocks. The directory blocks are checked to
make sure that the directory entries are valid and contain references to
inode numbers that are in use (as determined by pass 1).
For the first directory block in each directory inode, the "." and ".."
entries are checked to make sure they exist, and that the inode number for
the "." entry matches the current directory.
Pass 2 also caches information concerning the parent directory in which
each directory is linked. If a directory is referenced by more than one
directory, the second reference of the directory is treated as an illegal hard
link and is removed.
Note that at the end of pass 2, nearly all disk I/O that e2fsck needs to
perform is complete. Information required by passes 3, 4, and 5 are cached in
memory; hence, the remaining passes of e2fsck are largely CPU bound and
typically account for only 5 to 10 percent of the total running time.
Pass 3. In this phase, the directory connectivity is checked by tracing the path of
each directory back to the root using information that was cached during
pass 2. At this time, the ".." entry for each directory is also checked to make
sure it is valid. Any directories that cannot be traced back to the root are
linked to the lost+found directory.
Pass 4. In this phase, e2fsck checks the reference counts for all inodes by
iterating over all the inodes and comparing the link counts (which were
cached in pass 1) against internal counters calculated during passes 2 and 3.
Any undeleted files with a zero link count are placed in lost+found during
this pass.
Pass 5. In this last phase e2fsck checks the validity of the filesystem summary
information. It compares the block and inode bitmaps which were
constructed during the previous passes against the actual bitmaps on the
filesystem and corrects the on-disk copies if necessary.
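The pass-4 comparison can be illustrated with a toy sketch (this is not
e2fsck's code; a real repair would also reconnect orphans via lost+found):

```c
#include <assert.h>

/* Toy version of the pass-4 check: compare each inode's recorded link
 * count (cached during pass 1) against the number of directory
 * references counted during passes 2 and 3. Returns how many inodes
 * would need their link counts repaired. Not e2fsck's actual code. */
static int count_link_mismatches(const unsigned short *recorded,
                                 const unsigned short *counted,
                                 int ninodes)
{
    int mismatches = 0;
    for (int i = 0; i < ninodes; i++)
        if (recorded[i] != counted[i])
            mismatches++;
    return mismatches;
}
```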
The e2fsck program is designed to run as quickly as possible. Because
filesystem checking programs tend to be disk-bound, this was done by optimizing
the algorithms used by e2fsck so that filesystem structures are not repeatedly
accessed from the disk. In addition, the order in which inodes and directories
are checked is sorted by block number to reduce the amount of time spent in
disk seeks.
Tuning an ext2 Filesystem
The tune2fs program can be used to change the various tunable parameters of
an ext2 filesystem. Some of the different tunables that can be changed are:
-c max-mount-counts. This option adjusts the maximum number of read/write
mounts between two filesystem checks.
-e error-behavior. When errors are detected, the behavior of the ext2
kernel code can be altered with this option. The value of error-behavior
can be continue, in which case the kernel continues with normal execution;
remount-ro, which forces the kernel to remount the filesystem read-only;
or panic, in which case the kernel will panic.
-u user. This option sets the user who can benefit from the reserved blocks
when the filesystem becomes full. The value of user can be a numerical user
ID or a user name.
For further information on tune2fs see the tune2fs(8) manual page.
Resizing ext2 Filesystems
The resize2fs command can be used to increase or decrease the size of an ext2
filesystem. Note that the filesystem must be unmounted before the resize can
take place. The resize2fs program does not manipulate the size of the underlying
partition. To increase the size of a filesystem, the partition must be increased first
using fdisk. Similarly, to decrease the size of an ext2 filesystem, the partition
must be resized with fdisk following the call to resize2fs.
If an ext2 filesystem resides on an LVM (Logical Volume Manager) volume, the
e2fsadm command can be used to resize both the filesystem and the underlying
logical volume.
The ext3 Filesystem
The ext3 filesystem was introduced to solve one specific problem, namely the
amount of time it takes to perform a filesystem check following a system crash.
As described in the section VxFS Journaling, earlier in this chapter, these times
can be significant, measured in many hours, if the filesystem is very large in size.
Note that large in this case is actually a property of the amount of structural data
(inodes) and not specifically the size of the filesystem.
Another goal behind ext3 was to make as few changes to the underlying ext2
code base as possible because ext2 is small in size, easy to maintain, robust, and
well understood.
The use of ext3 was positioned in such a way that it is easy to transition
between ext2 and ext3 filesystems and vice versa.
The actual journaling layer is separate from ext3. The filesystem understands
the concept of a transaction (when one starts, when it finishes), but it is not
actually responsible for the journaling.
How to Use an ext3 Filesystem
A new ext3 filesystem can be created by mkfs or by converting an existing ext2
filesystem. To create a new ext3 filesystem, mkfs is called as follows:
# mkfs -j /dev/sda5
To convert an existing ext2 filesystem to an ext3 filesystem, the tune2fs
command can be invoked as follows:
# tune2fs -j /dev/sda5
Note that the command can be invoked on either a mounted or unmounted
filesystem. If invoked on a mounted filesystem, the journal will appear as a
visible file (.journal). If invoked on an unmounted filesystem or if mkfs -j is
run when making the filesystem, the journal will not be visible.
To actually mount the filesystem, the ext3 filesystem type must be specified:
# mount -t ext3 /dev/sda5 /mnt1
Conversion back to ext2 can be achieved by using the tune2fs command as
follows:
# tune2fs -O ^has_journal /dev/sda5
or by replaying the log to make the filesystem clean and then simply
mounting it as an ext2 filesystem.
Data Integrity Models in ext3
As with VxFS, there is a set of choices about the type and level of journaling to be
performed. Users can choose among the following options, which are passed to
mount.
data=writeback. This option limits data integrity guarantees so that file
data itself is not journaled. The filesystem is, however, guaranteed to be
structurally sound at all times.
data=ordered. This mode, which is the default, ensures that data is
consistent at all times. The data is actually written to the file before the
transaction is logged. This ensures that there is no stale data in any
filesystem block after a crash.
data=journal. This option writes all file data through the journal. This
means that the data is actually written to disk twice. This option provides the
best guarantees in terms of filesystem integrity but because data is written
through the journal, performance can be significantly impacted and the time
for recovery after a crash can be much greater.
How Does ext3 Work?
The design of ext3 was presented in [TWEE98]. To provide a transaction
mechanism, all meta-data-related data blocks must be logged in the journal. There
are three distinct types of blocks in question:
Journal blocks. An update to an inode, for example, will write the entire
filesystem block to which the inode belongs into the journal. In [TWEE98],
Stephen Tweedie claims that this is a relatively cheap method due to the
sequential nature in which data is written to the journal, and that by
following this simple approach, there is little complexity in the kernel and
therefore less CPU overhead.
Descriptor blocks. These blocks describe other journal blocks and are written
to the journal before the journal blocks are written. Because the journal
blocks are the actual meta-data blocks that must be written, the descriptor
blocks are used to record information about the journal blocks, such as the
disk block on which they reside.
Header blocks. The header blocks are written throughout the journal. They
record the start and end of the journal together with a sequence number that
is used during recovery to locate the order in which the blocks were written.
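For illustration, the journal's metadata blocks begin with a small common tag
in the journaling (JBD) layer used by ext3. The sketch below follows the JBD
header layout, though the exact field names should be treated as an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Common header beginning each journal metadata block (descriptor,
 * commit, and journal superblock blocks). Fields are stored big-endian
 * on disk. Sketch based on the JBD layer's on-disk header; treat the
 * names as illustrative rather than definitive. */
struct journal_header {
    uint32_t h_magic;      /* marks the block as journal metadata */
    uint32_t h_blocktype;  /* descriptor, commit, and so on */
    uint32_t h_sequence;   /* transaction sequence, used in recovery */
};
```

The sequence number is what allows recovery to establish the order in which
journal blocks were written, as described above.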
As with VxFS, transactions are delayed in memory to aid performance. With ext3,
a set of transactions is batched into a compound transaction and committed to
the journal on disk; this operation is called a commit. While a commit is in
progress, a new compound transaction is started that will record any further
changes to the filesystem while the previous compound transaction is being
written to disk.
Crash recovery is performed by walking through the journal and writing any
journal blocks to their correct location on disk. Because this is an idempotent
operation, a crash in the middle of recovery does not matter because the process
can be repeated any number of times with exactly the same effect.
Summary
There are many different UNIX filesystems and to scratch the surface on all of
them would easily fill a book of this size. The three filesystems described in the
chapter represent a good cross section of filesystems from the UNIX and Linux
operating systems and cover the commercial filesystem market (VxFS), the most
widely documented and ported filesystem (UFS), and the most popular open
source filesystems (ext2 and ext3).
Only a few other filesystems have been documented in any detail. [HANC01]
describes the AdvFS filesystem developed by Digital, which is the main
filesystem of their Tru64 operating system. [KELL96] describes IBM's JFS
filesystem.
To understand filesystem internals it is always best to start with one of the
simple filesystems such as the original System V filesystem as documented in
[LION96]. If studying Linux, the ext2 filesystem on one of the earlier kernels is a
good place to start before looking at the more elaborate, and therefore more
complex, filesystems.
Pseudo Filesystems
When people think of filesystems, they tend to think of a file hierarchy of files and
directories that are all stored on disk somewhere. However, there are a number of
filesystem types that provide a host of useful information but which have no
physical backing store (disk storage). The most well known pseudo filesystem is
/proc, which is used by the ps command as well as various debuggers.
This chapter describes some of the more well known pseudo filesystem types
and provides a basic implementation of the ps command using the Solaris /proc
filesystem.
The /proc Filesystem
The /proc filesystem was first introduced in 8th Edition UNIX and was described
in Tom Killian's 1984 USENIX paper "Processes as Files" [KILL84].
The /proc filesystem was designed to replace the ptrace() system call, with
the advantage that the full process address space is visible and can be
manipulated with read() and write() system calls. This contrasts with the
interface offered by ptrace(), the system call traditionally used by
debuggers, which provides only a word-at-a-time interface.
Roger Faulkner and Ron Gomes ported the research version of /proc to SVR4
and presented their work in another USENIX paper: "The Process File System
and Process Model in UNIX System V" [FAUL91]. At that time, Faulkner was
with Sun Microsystems and Gomes with AT&T Bell Laboratories. As described in
the paper, future work was intended to restructure /proc from a flat file system
into a directory hierarchy describing a process. That work was undertaken at
both Sun and USL and will be described later.
In the early /proc implementation, whose name is derived from the directory
on which it is mounted, there is an entry in the directory for each process in the
system. The name of the file displayed corresponds to the process ID, while the
size of the file represents the size of the process address space. The file
permissions correspond to the user who owns the process.
Figure 11.1 shows at a high level how the /proc filesystem is implemented.
Standard file-related system calls such as open(), read(), and write() are
handled at the filesystem-independent layer in the same manner as for other
filesystem types. Much of the information about a process is held in the process
table (traditionally in the array proc[]). To open a specific process file, the /proc
filesystem must scan the process table looking for an entry whose p_pid field
matches the pathname component passed.
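The lookup just described can be sketched as follows (a toy stand-in: the
structure and function names are ours, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of the early /proc pathname lookup: scan the process
 * table for an entry whose p_pid matches the process ID parsed from
 * the pathname component. The structure here is a stand-in for the
 * real kernel proc structure. */
struct proc_entry { int p_pid; };

static struct proc_entry *proc_lookup(struct proc_entry *table,
                                      int nentries, int pid)
{
    for (int i = 0; i < nentries; i++)
        if (table[i].p_pid == pid)
            return &table[i];
    return NULL;   /* no such process: the open() fails */
}
```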
One of the most widely used commands that access /proc is ps. Its role is to
open each file in the /proc directory and then access the process status through
an ioctl() system call. This was originally represented by the prstatus
structure, which could be obtained by opening the file and issuing the
PIOCSTATUS ioctl command. With the SVR4 implementation of /proc, there
were over 40 different ioctl commands that could be issued, many of which
dealt with debugging.
Note that the /proc filesystem does not have to be mounted on the /proc
directory. It can in fact be mounted multiple times, which allows it to be used in
chroot() environments.
The Solaris /proc Implementation
With the introduction of user-level threads of execution, the notion of /proc
changed substantially from the single threaded process-based model of previous
versions of UNIX. Each entry in /proc is a directory under which all of the
information about a specific process is collected.
As an example, consider the following process, which is run in the
background. Using the process ID that is returned, the contents of the
/proc/3707 are displayed:
$ sleep 10000&
[1] 3707
$ cd /proc/3707
$ ls -l
total 1618
-rw------- 1 spate fcf 1630208 May 28 21:24 as
-r-------- 1 spate fcf 152 May 28 21:24 auxv
-r-------- 1 spate fcf 36 May 28 21:24 cred
--w------- 1 spate fcf 0 May 28 21:24 ctl
lr-x------ 1 spate fcf 0 May 28 21:24 cwd ->
dr-x------ 2 spate fcf 8208 May 28 21:24 fd
-r--r--r-- 1 spate fcf 120 May 28 21:24 lpsinfo
-r-------- 1 spate fcf 912 May 28 21:24 lstatus
-r--r--r-- 1 spate fcf 536 May 28 21:24 lusage
dr-xr-xr-x 3 spate fcf 48 May 28 21:24 lwp
-r-------- 1 spate fcf 1728 May 28 21:24 map
dr-x------ 2 spate fcf 544 May 28 21:24 object
-r-------- 1 spate fcf 2048 May 28 21:24 pagedata
-r--r--r-- 1 spate fcf 336 May 28 21:24 psinfo
-r-------- 1 spate fcf 1728 May 28 21:24 rmap
lr-x------ 1 spate fcf 0 May 28 21:24 root ->
-r-------- 1 spate fcf 1440 May 28 21:24 sigact
-r-------- 1 spate fcf 1232 May 28 21:24 status
-r--r--r-- 1 spate fcf 256 May 28 21:24 usage
-r-------- 1 spate fcf 0 May 28 21:24 watch
-r-------- 1 spate fcf 2736 May 28 21:24 xmap
The contents of some of these files are C structures. For each of the structures that
can be accessed, the procfs.h header file can be referenced for further
information. Where structures are described, the file can be opened and the
structure read directly from offset 0 within the file. A primitive ps example,
shown in the section Accessing Files in the Solaris /proc Filesystem, later in this
chapter, demonstrates how this is achieved.
Some of the files make reference to an LWP, a lightweight process. The LWP
model is used to provide support for multiple threads of control within a process.
Grouping threads into an LWP alters the scheduling properties of the different
threads.
The various files contained within /proc on a per-process basis are:
as. Opening this file gives access to the address space of the process. This
allows the caller to find a specific address using lseek() and then either
read from or write to the address using read() and write().
auxv. This file contains dynamic linker information.
cred. The process credentials, defined by the pcred structure, can be found
here. This includes information such as the real and effective user IDs, real
and effective group IDs, group, and supplementary group information.
ctl. This write-only file is used for process control and accounting. A request
may be made to stop or start a process or enable process event tracing.
cwd. This file is a symbolic link to the process's current working directory.
fd. This directory contains files that correspond to the files that the process
has open. There is one entry per open file.
lpsinfo, lstatus, lusage. These files give information about each of
the process LWPs. Note that there can be multiple LWPs per process; each
contains one or more threads.
map. This file contains an array of pmap structures, each of which describes a
segment within the virtual address range of the process.
object. Each address space segment maps an underlying file. This directory
contains read-only files that are referenced by the map and pagedata files.
Opening one of these files gives a file descriptor for the specific mapped file.
pagedata. Opening this file allows the caller to track address space
references and modifications on a per-page basis.
psinfo. This file gives general information about the state of the process that
is used by the ps command. The psinfo structure, defined in procfs.h,
can simply be read from this file.
rmap. Similar to the map file, this file contains an array of prmap structures.
These segments are reserved by the operating system for structures such as
the stack.
root. This file is a symbolic link to the process's root directory.
sigact. This file contains an array of sigaction structures which define
the disposition of signals associated with the traced process.
status. The information stored in this file, underpinned by the pstatus
structure, gives a fairly detailed account about the state of the process. This
includes a set of flags that indicate whether the process is runnable,
stopped, being single-stepped, and so on. Process group and session
information, memory size, and tracing data are some of the other types of
information that can be found in this file.
usage. This file, underpinned by the prusage structure, gives a wealth of
timing-related information about the process.
watch. This file contains an array of pwatch structures, which enable a
process to be debugged. The controlling process can set breakpoints in the
process by writing a PCWATCH message through the ctl file.
The lwp directory contains further information about each lightweight process.
Accessing Files in the Solaris /proc Filesystem
To demonstrate how to access files within /proc, the following simple program
gives an idea of how the ps program is implemented. Much of the information
that is displayed by ps can be accessed through the psinfo file. Reading from
this file returns data underpinned by the psinfo structure. The following
program takes a process ID as an argument and reads the corresponding psinfo
for that process. It then displays some of the information.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <procfs.h>

int
main(int argc, char *argv[])
{
    struct psinfo ps;
    char fname[256];
    int fd;

    sprintf(fname, "/proc/%s/psinfo", argv[1]);
    fd = open(fname, O_RDONLY);
    if (fd < 0) {
        perror(fname);
        return 1;
    }
    read(fd, (char *)&ps, sizeof(struct psinfo));
    printf("UID\tPID\tPPID\tCMD\n");
    printf("%d\t%d\t%d\t%s\n",
        (int)ps.pr_uid, (int)ps.pr_pid, (int)ps.pr_ppid, ps.pr_psargs);
    return 0;
}
Shown below is a simple run of the program, which displays information about
the sleep process shown earlier:
$ ./mps 3707
UID PID PPID CMD
824 3707 1 sleep 100000
The psinfo file for each /proc entry is readable by anyone. Thus, it is possible
for any user to write a more elaborate version of the preceding program that
displays entries for all processes.
Tracing and Debugging with /proc
The ctl file allows one process to control another process through a rich set of
functions provided by the /proc filesystem. Although all of these functions won't
be described here, the aim is to highlight the type of features available and show
how a process can be traced or debugged.
Access to the ctl file, which is write only, is achieved by writing an operational
code to the file together with any additional data required for the operation in
question. The controlling process tracks three different types of events, namely:
Signals. A stop based on a signal is handled in all cases where the signal is
detected, whether on return from a system call or trap, or during process
wakeup.
System calls. The process is stopped either when the kernel is entered to
process a system call or is just about to exit from the kernel back to user
space after the system call has been processed.
Faults. There are a number of different fault types that can be managed, some
of which depend on the type of architecture on which the operating system
is running. Fault types include illegal instructions, breakpoints, memory
access, and trace traps (used for single stepping).
The truss command is a prime example of a utility that controls another
process. Its role is to display the system calls made by another process including
the system call arguments and return values. The PCSENTRY and PCSEXIT
control functions determine whether a process stops on entry to or exit from a
system call. The system calls to be traced are held in the sysset_t structure,
which is passed along with the PCSENTRY and PCSEXIT control functions. The
prfillset() function can be used to build the complete set of system calls,
because truss will monitor all system calls. For a more controlled trace, the set
of system calls monitored can be altered using the praddset() and
prdelset() library functions.
There are a number of different control messages that both stop and start a
process. As an example of those functions that are relevant to truss, the PCSTOP
function directs the process to stop on an event of interest and waits for it to stop.
An event of interest is defined by invoking PCSTRACE (signals to be traced),
PCSFAULT (faults to be traced), PCSENTRY (system call entry), or PCSEXIT
(system call exit). The PCRUN control function makes the process runnable again.
The following pseudo code gives a high-level view of how the truss utility
can be implemented:
prfillset(&syscalls)
PCSENTRY(syscalls)
PCSEXIT(syscalls)
do {
        PCSTOP()
        extract system call arguments
        PCRUN()
        PCSTOP()
        extract system call return value
        display system call type, arguments and return value
        PCRUN()
} while (syscall type != exit);
Although this is a simplification, it demonstrates the power of the control
functions implemented by the /proc filesystem.
There are a large number of control functions that make a debugger writer's
life much easier. If the debugger is interested in fault types, the following are
relevant:
FLTBPT. A breakpoint trap.
FLTTRACE. A trace trap (used for single stepping).
FLTWATCH. A watchpoint trap (used to trap on memory access).
The PCSFAULT control function can be used to set the faults to be traced. To put a
breakpoint on a specific memory access, the PCWATCH function can be used to
specify the address to be watched and whether an event should be triggered for
read, write, or execute access. This can be used in conjunction with the stop and
start control functions.
Anyone wishing to study how a real debugger makes use of /proc should look
at the Solaris implementation of gdb, the GNU debugger whose source is freely
available.
The Specfs Filesystem
Devices, whether block or character, are represented by special files in the
filesystem. As the number of UNIX filesystem types increased, it was found that
each filesystem was duplicating effort when managing access to the devices
themselves.
Having multiple special files in the namespace caused an additional problem in
that there could be multiple buffers in the buffer cache corresponding to the same
block on disk. Considering how files are accessed, returning a filesystem vnode
for a device file is incorrect. For example, consider the case where the device file
resides on a UFS filesystem. Returning a vnode that has the v_op field of the
vnode set to the list of UFS vnode operations will lead to problems. First, the open
vnode operation on UFS or any other filesystem really has no function to perform
for regular files. Second, many of the operations that are applicable to regular files
are not applicable to device files. To make matters worse, if the vnode goes
inactive, the filesystem may attempt to close the device even though it is open
through access to another special file that references the same device.
All of these problems can be solved by adding additional logic inside the
filesystem. However, consideration must be given on how to handle device access
for each vnode operation. Furthermore, reference counting to determine when the
last close on a device occurs is left up to the device driver. All in all, this leads to a
situation that has a lot of duplication and is prone to errors.
To solve these problems, a new filesystem type, specfs, was introduced in SVR4.
The specfs filesystem is not visible to users in that it cannot be mounted or seen
from within the namespace.
During a VOP_LOOKUP() operation, instead of returning a vnode which
corresponds to the special file, the filesystem makes a call to specvp() which
returns a new specfs vnode, that the filesystem must return from the lookup
operation. This vnode points to a specfs node (snode), a private specfs data
structure that references the real vnode of the filesystem.
In the case where one device has more than one entry in the namespace, the
snode also points to a common specfs vnode. It is through this common vnode
that device access actually takes place.
The following example shows the linkage between two device special files and
the common specfs vnode that represents both. This is also shown in Figure 11.2.
First of all consider the following simple program, which simply opens a file and
pauses awaiting a signal:
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int fd;

    fd = open(argv[1], O_RDONLY);
    pause();
    return 0;
}
As shown below, a new special file is created with the same major and minor
number as /dev/null:
# ls -l /dev/null
crw-r--r-- 1 root other 13, 2 May 30 09:17 /dev/null
# mknod mynull c 13 2
# ls -l mynull
crw-r--r-- 1 root other 13, 2 May 30 09:17 mynull
and the program is run as follows:
# ./dopen /dev/null &
[1] 3715
# ./dopen mynull &
[2] 3719
Using crash, it is possible to trace through the list of file related structures
starting out at the file descriptor for each process, to see which underlying
vnodes they actually reference. First, the process table slots are located where the
two processes reside:
# crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> p ! grep dopen
336 s 3719 3713 3719 3713 0 46 dopen load
363 s 3715 3713 3715 3713 0 46 dopen load
Starting with the process that is accessing the mynull special file, the user area is
displayed to locate the open files:
> user 336
...
OPEN FILES, POFILE FLAGS, AND THREAD REFCNT:
[0]: F 300106fc690, 0, 0 [1]: F 300106fc690, 0, 0
[2]: F 300106fc690, 0, 0 [3]: F 300106fca10, 0, 0
...
The file structure and its corresponding vnode are then displayed as shown:
> file 300106fca10
ADDRESS RCNT TYPE/ADDR OFFSET FLAGS
300106fca10 1 SPEC/300180a1bd0 0 read
> vnode 300180a1bd0
VCNT VFSMNTED VFSP STREAMP VTYPE RDEV VDATA VFILOCKS VFLAG
1 0 300222d8578 0 c 13,2 300180a1bc8 0 -
> snode 300180a1bc8
SNODE TABLE SIZE = 256
HASH-SLOT MAJ/MIN REALVP COMMONVP NEXTR SIZE COUNT FLAGS
- 13,2 3001bdcdf50 30001b5d5b0 0 0 0
The REALVP field references the vnode for the special file within the filesystem
that references mynull.
For the process that opens the /dev/null special file, the same sequence of
operations is followed as shown:
> user 363
...
Note that for the snode displayed here, the COMMONVP field is identical to the
COMMONVP field shown for the process that referenced mynull.
To some readers, much of what has been described may sound like overkill.
However, device access has changed substantially since the inception of specfs.
By consolidating all device access, only specfs needs to be changed. Filesystems
still make the same specvp() call that they were making 15 years ago and
therefore have not had to make any changes as device access has evolved.
The BSD Memory-Based Filesystem (MFS)
The BSD team developed an unusual but interesting approach to memory-based
filesystems as documented in [MCKU90]. Their goals were to improve upon the
various RAM disk-based filesystems that had traditionally been used.
A RAM disk is typically a contiguous section of memory that has been set
aside to emulate a disk slice. A RAM disk-based device driver is the interface
between this area of memory and the rest of the kernel. Filesystems access the
RAM disk just as they would any other physical device. The main difference is
that the driver employs memory to memory copies rather than copying between
memory and disk.
The paper describes the problems inherent with RAM disk-based filesystems.
First of all, they occupy dedicated memory. A large RAM disk therefore locks
down memory that could be used for other purposes. If many of the files in the
RAM disk are not being used, this is particularly wasteful of memory. One of the
other negative properties of RAM disks, which the BSD team did not initially
attempt to solve, was the triple copies of data. When a file is read, it is copied
from the file's location on the RAM disk into a buffer cache buffer and then out to
the user's buffer. Although this is faster than accessing the data on disk, it is
incredibly wasteful of memory.
The BSD MFS Architecture
Figure 11.3 shows the overall architecture of the BSD MFS filesystem. To create
and mount the filesystem, the following steps are taken:
1. A call to newfs is made indicating that the filesystem will be memory-based.
2. The newfs process allocates an area of memory within its own address space
in which to store the filesystem. This area of memory is then initialized with
the new filesystem structure.
3. The newfs command then calls into the kernel to mount the filesystem.
This is handled by the mfs filesystem type that creates a device vnode to
reference the RAM disk together with the process ID of the caller.
4. The UFS mount entry point is called, which performs standard UFS mount
time processing. However, instead of calling spec_strategy() to access
the device, as it would for a disk-based filesystem, it calls
mfs_strategy(), which interfaces with the memory-based RAM disk.
One unusual aspect of the design is that the newfs process does not exit. Instead,
it stays in the kernel acting as an intermediary between UFS and the RAM disk.
As requests for read and write operations enter the kernel, UFS is invoked as
with any other disk-based UFS filesystem. The difference appears at the
filesystem/driver interface. As highlighted above, UFS calls mfs_strategy()
in place of the typical spec_strategy(). This involves waking up the newfs
process, which performs a copy between the appropriate area of the RAM disk
and the I/O buffer in the kernel. After I/O is completed, the newfs process goes
back to sleep in the kernel awaiting the next request.
When the filesystem is unmounted, the device close routine is invoked. After
flushing any pending I/O requests, the mfs_mount() call returns, causing the
newfs process to exit and the RAM disk to be discarded.
Performance and Observations
Analysis showed MFS to perform at about twice the speed of a filesystem on disk
for raw read and write operations and multiple times better for meta-data
operations (file creates, etc). The benefit over the traditional RAM disk approach
is that because the data within the RAM disk is part of the process address space,
it is pageable just like any other process data. This ensures that if data within the
RAM disk isn't being used, it can be paged to the swap device.
There is a disadvantage with this approach; a large RAM disk will consume a
large amount of swap space and therefore could reduce the overall amount of
memory available to other processes. However, swap space can be increased, so
MFS still offers advantages over the traditional RAM disk-based approach.
The Sun tmpfs Filesystem
Sun developed a memory-based filesystem that used the facilities offered by the
virtual memory subsystem [SNYD90]. This differs from RAM disk-based
filesystems in which the RAM disk simply mirrors a copy of a disk slice. The goal
of the design was to increase performance for file reads and writes, allow
dynamic resizing of the filesystem, and avoid adversely affecting overall system
performance.
To the user, the tmpfs filesystem looks like any other UNIX filesystem in that it
provides full UNIX file semantics.
Chapter 7 described the SVR4 filesystem architecture on which tmpfs is based.
In particular, the section An Overview of the SVR4 VM Subsystem in Chapter 7,
described the SVR4/Solaris VM architecture. Familiarity with these sections is
essential to understanding how tmpfs is implemented. Because tmpfs is heavily
tied to the VM subsystem, it is not portable between different versions of UNIX.
However, this does not preclude development of a similar filesystem on the other
architectures.
Architecture of the tmpfs Filesystem
In SVR4, files accessed through the read() and write() system calls go
through the seg_map kernel segment driver, which maintains a cache of recently
accessed pages of file data. Memory-mapped files are backed by a seg_vn kernel
segment that references the underlying vnode for the file. In the case where there
is no backing file, the SVR4 kernel provides anonymous memory that is backed by
swap space. This is described in the section Anonymous Memory in Chapter 7.
Tmpfs uses anonymous memory to store file data and therefore competes with
memory used by all processes in the system (for example, for stack and data
segments). Because anonymous memory can be paged to a swap device, tmpfs
data is also susceptible to paging.
Figure 11.4 shows how the tmpfs filesystem is implemented. The vnode
representing the open tmpfs file references a tmpfs tmpnode structure, which is
similar to an inode in other filesystems. Information within this structure
indicates whether the file is a regular file, directory, or symbolic link. In the case of
a regular file, the tmpnode references an anonymous memory header that
contains the data backing the file.
File Access through tmpfs
Reads and writes through tmpfs function in a very similar manner to other
filesystems. File data is read and written through the seg_map driver. When a
write occurs to a tmpfs file that has no data yet allocated, an anon structure is
allocated, which references the actual pages of the file. When a file grows the
anon structure is extended.
Mapped files are handled in the same way as files in a regular filesystem. Each
mapping is underpinned by a segment vnode.
Performance and Other Observations
Testing performance of tmpfs is highly dependent on the type of data being
measured. Many file operations that manipulate data may show only a marginal
improvement in performance, because meta-data is typically cached in memory.
For structural changes to the filesystem, such as file and directory creations, tmpfs
shows a great improvement in performance since no disk access is performed.
[SNYD90] also shows a test under which the UNIX kernel was recompiled. The
overall time for a UFS filesystem was 32 minutes and for tmpfs, 27 minutes.
Filesystems such as VxFS, which provide a temporary filesystem mode under which
nearly all transactions are delayed in memory, could close this gap significantly.
One aspect that is difficult to measure occurs when tmpfs file data competes for
virtual memory with the applications that are running on the system. The amount
of memory on the system available for applications is a combination of physical
memory and swap space. Because tmpfs file data uses the same memory, the
overall memory available for applications can be largely reduced.
Overall, the deployment of tmpfs is highly dependent on the type of workload
that is running on a machine together with the amount of memory available.
262 UNIX Filesystems: Evolution, Design, and Implementation
CHAPTER 13 Clustered and Distributed Filesystems
With the advent of computer networks, which appeared in the 1970s and became
more widespread in the 1980s, the need to share files between machines became
essential. Initially, UNIX tools such as uucp and ftp were used to transfer files
from one machine to another. However, disks were still relatively expensive; this
resulted in a need to access files across the network without local storage.
The 1980s saw a number of different distributed filesystems make an
appearance in the UNIX community, including Sun's Network Filesystem (NFS),
AT&T's Remote File Sharing (RFS), and CMU's Andrew File System (AFS), which
evolved into the DCE Distributed File Service (DFS). Some of the distributed
filesystems faded as quickly as they appeared. By far, NFS has been the most
successful, being used on tens of thousands of UNIX and non-UNIX operating
systems throughout the world.
Distributed filesystems operate around a client/server model, with one of the
machines owning an underlying disk-based filesystem and serving files to clients
through some well-defined protocol. The protocol and means of transferring files
to the user is transparent, with UNIX file semantics being a key goal.
Clustered filesystems by contrast treat a collection of machines and disks as a
single entity and provide a fully coherent view of the filesystem from any of the
nodes. To the user, clustered filesystems present a single coherent view of the
filesystem and may or may not offer full UNIX file semantics. Clustered
filesystems as well as local filesystems can be exported for use with NFS.
Distributed Filesystems
Unlike local filesystems where the storage is physically attached and only
accessible by processes that reside on the same host machine, distributed
filesystems allow access to files on a remote machine through use of a
well-defined protocol. Distributed filesystems employ a client/server model
where a single filesystem server can serve files to multiple clients.
Regardless of the type of distributed filesystem, one goal that is absolutely
essential to all of these filesystems is the need to provide UNIX file semantics
when accessing remote files from the client.
There have been numerous distributed filesystems developed for UNIX over
the last 20 years. Many of them have come and gone. The most successful
distributed filesystem by far is the Sun Network Filesystem (NFS) which appeared
in the mid 1980s. Although not as feature-rich as filesystems such as the DCE
Distributed File Service (DFS), the simplicity of the NFS protocol, together with the
fact that the NFS protocol is in public domain, resulted in it being ported to many
different operating systems, UNIX and non-UNIX alike.
The following sections describe some of the main UNIX distributed
filesystems with particular emphasis on NFS.
The Network File System (NFS)
With the advent of networks providing connectivity between computers, it
became feasible to provide interfaces through which user programs could access
files across a network using the same mechanisms by which they accessed files
on a local machine.
Hard disks were still relatively expensive in the late 1970s and early 1980s. By
providing a client/server file protocol such as NFS, hardware designers were free
to build diskless workstations or at least workstations with a minimal amount of
local storage.
NFS Background and History
The Network File System (NFS) was initially developed by Sun Microsystems in
the early to mid 1980s and has been a huge success, with ports to just about every
operating system, UNIX and non-UNIX alike. In the paper that described the first
two versions of NFS [SAND85], the goals of NFS were to:
Provide machine and OS independence. This goal was to ensure that NFS
could work on UNIX and non-UNIX operating systems. The client/server
protocols were to be simple enough that they could be implemented on
PCs; at the time, a DOS-based environment.
Provide resilience to server crashes. If the server crashes, the clients who are
currently accessing the server must be able to recover. Furthermore, the
client should be unable to tell the difference between a server that crashed
and restarted and one that was just slow in responding.
Provide transparent file access. In order for the filesystem to succeed, it was
important that applications could access files through NFS in the same
manner in which they could access files on a local disk.
Maintain UNIX semantics on the client. To satisfy the above goal in UNIX
environments, it was imperative that NFS provide UNIX file semantics to the
applications on the client.
Have acceptable performance. As stated in [SAND85], "People will not want to
use NFS if it is no faster than the existing network utilities, such as rcp, even
if it is easier to use." The performance targets were set at 80 percent of local
disk access.
There were three pieces that comprised NFS: the protocol, the client, and the
server. All three will be described throughout the following sections.
The original NFS implementation, as described in [SAND85], encompassed
both version 1 and 2 of the protocol when it first became available to the public in
SunOS 2.0 in 1985. The first version of the protocol was only used internally
within Sun.
At the time of writing, NFS implementations adhering to version 3 of the
protocol have been in common use for several years and version 4
implementations are starting to appear. NFS is very well understood by tens of
thousands of people throughout the world, which is a great testament to its
success and an indicator as to why it will remain one of the most dominant
distributed filesystems for many years to come.
The Version 1 and 2 NFS Protocols
As described in The Sun VFS/Vnode Architecture in Chapter 7, the SunOS
filesystem architecture was redesigned and implemented to accommodate
multiple filesystem types and provide support for NFS. This was not the only
feature of the kernel that NFS depended on; it also relied on the Sun
Remote Procedure Call (RPC) infrastructure, which provided a synchronous,
cross-network mechanism for one process (or the kernel) to call another process on a
different machine. This allowed the NFS protocol to be broken down into a set of
procedures specifying their arguments, the results, and the effects.
To communicate across the network, NFS used the User Datagram Protocol
(UDP) on top of the Internet Protocol (IP). Because the protocol was designed to be
independent of machine architectures and operating systems, the encoding of the
protocol messages and their responses was sensitive to issues such as endianness
(the order in which bytes are packed into a machine word). This resulted in the
use of the External Data Representation (XDR) specification.
The use of RPC and XDR are described in the following section. Before
describing how they are used, it first helps to describe the actual procedure calls
NFS introduced for communication across the network. The version 2 protocol
was documented in [RFC1094], which describes the procedure calls shown in
Table 13.1. Most of the operations are self-explanatory. The null procedure is
used to ping the server and can also be used as a means of measuring the
round-trip time. The statfs procedure returns information that can be displayed
when making a call to the df command.
The file referenced by these procedures is identified by a file handle. This is an opaque
data structure provided by the server in response to a lookup request. The client
should never try to interpret the contents of the file handle. File handles are
constructed using a combination of operating system-specific information and
information provided by the filesystem. For the latter, the information must
provide a means to locate the file, so the file handle is typically a combination of
filesystem specific information together with the inode number of the file and its
generation count.
Many of the procedures deal with file attributes. Not surprisingly, the
attributes correspond to the various fields of the stat structure as described in
the section Basic File Properties in Chapter 2.
The NFS server is stateless in that there is no information kept on the server
about past requests. This avoids any complicated crash recovery mechanism. If
the client does not receive a response from the server within a specific period of
time, the request is retried until it succeeds. This tolerates a server crash and
reboot, ensuring that the client cannot detect the difference between a server
crash and a server that is simply slow in responding.
The version 2 protocol also requires that any file writes are synchronous. This
meets the objective of achieving UNIX file semantics.
Within the file handle, the inode generation count is typically encoded.
Consider the case when a file is opened and a file handle is returned. If the file is
removed on the server and the inode is reused later for a new file, the file handle
is no longer valid because it refers to the old file. To distinguish between the old
and new files, UNIX filesystems contain a generation count that is incremented
each time the inode is reused. By using the generation count within the file
handle, the stale file handle will be detected by the server when the deleted file is
referenced.
One of the main goals of NFS was to make it portable across different
operating systems. [ROSE86] demonstrated early ports to both an SVR2.2-based
version of UNIX and the Sequent Dynix operating system, a System V/BSD
hybrid. There were a number of different PC (DOS-based) implementations of
NFS. [CALL00] describes the various issues encountered with porting NFS to
these platforms.
NFS Client/Server Communications
NFS relies on both the Remote Procedure Call (RPC) [RFC1057] and eXternal Data
Representation (XDR) [RFC1014] specifications as a means of communicating
between client and server. XDR allows for the description and encoding of
different data types. Its goal is to allow communication between machines with
different underlying architectures. RPC provides an infrastructure for creating
client/server applications whereby an application can call a function provided by
the server just as it would call a function within its own address space.
The XDR format assumes that bytes (8-bit) are portable in that the encoding of
bytes does not change from one architecture or hardware device to another.
Building on top of bytes, the XDR specification requires data types to be
constructed from multiples of four bytes of data. If data types require a number of
bytes that is not exactly divisible by 4, any unused bytes will be zero-filled. The
ordering of the bytes is such that, if read from a byte stream, the high order byte is
always read first. This is called big-endian (the biggest digit in a number comes
first). XDR also uses a standard representation for floating point numbers
(following the IEEE standard).
To give a simple example of how XDR is used, consider the encoding of an
integer that is defined as a 32-bit data type. The integer is encoded as four
bytes with the most significant byte first, so if this data type were being read
from a regular file as a series of bytes, byte 0, the most significant byte,
would be read first.
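The two core XDR rules described above, big-endian byte order and padding to a four-byte boundary, can be sketched with Python's standard `struct` module. The function names are invented for the example:

```python
import struct

# Minimal sketch of XDR encoding rules: big-endian byte order, with
# every item padded to a multiple of four bytes.
def xdr_int(value):
    # 32-bit signed integer, most significant byte first
    return struct.pack(">i", value)

def xdr_string(s):
    # XDR strings: a 4-byte length, then the bytes, zero-padded
    # out to a 4-byte boundary
    data = s.encode("ascii")
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

print(xdr_int(1).hex())           # -> 00000001 (byte 0 is most significant)
print(xdr_string("abcde").hex())  # -> 000000056162636465000000
```

The five-character string occupies twelve bytes on the wire: four for the length, five for the data, and three zero-fill bytes to reach the next four-byte boundary.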
The XDR specification defines a number of primitive data types including
signed and unsigned integers, booleans, hyper integers (64 bits), fixed and
variable length opaque data types, and strings. It also defines a wide range of
structured data types including arrays, unions, linked lists, and structures. In
basic terms, any data type that can be defined in the most popular high-level
languages can be encoded within XDR.
The RPC mechanism used for NFS was derived from Sun RPC, a simple way
to provide a remote procedure call mechanism between two processes on
different machines. To the caller of such a procedure, there is no difference
between calling an RPC function and calling a local function.
The RPC protocol [RFC1057] can be implemented on top of several different
transport protocols. In the case of the early NFS implementations, this was based
on UDP/IP. Within the last ten years, a move to using TCP/IP has been made in
many environments (typically to avoid packet loss when going through routers).
Description of the RPC protocol is beyond the scope of this book. The NFS
specification [RFC1094] defines the NFS protocol as an RPC program and
[CALL00] provides a more detailed description of the protocol itself.
The overall architecture of NFS in a VFS/vnode architecture is shown in
Figure 13.1. This shows the placement of NFS, XDR, and RPC within the kernel.
To the rest of the kernel, NFS appears as any other filesystem type.
Exporting, Mounting, and Accessing NFS Filesystems
Table 13.1 shows the different NFS protocol procedures. One thing that is missing
from this list of functions is the means by which the very first file handle is
obtained, the handle of the root directory from which subsequent lookup
operations and other procedures can be performed. This is achieved through a
separate mount protocol that returns the file handle for the root directory.
There were two reasons for separating the mount protocol from the NFS
protocol itself (note that both are described in [RFC1094]). First, the means by
which access checking is performed on the server is typically implemented in
user space, which can make use of a number of different security mechanisms.
Because the NFS protocol is implemented within the kernel for performance
reasons, it was felt that it was easiest to allow for this separation. The second
reason that the protocols were separated was that the designers thought a single
pathname to file handle procedure would tie the implementation of NFS too
closely with UNIX.
It was envisaged that the pathname to file handle translation
may be implemented by a protocol that differs from the mount protocol.
The initial mount protocol consisted of six different procedures:
MOUNTPROC_NULL. This procedure performs no specific function. It is used to
ping the server to measure the round-trip time, which can be used to
determine how to time out NFS procedure calls to the server.
MOUNTPROC_MNT. This procedure takes a pathname and returns a file handle
that corresponds to the pathname.
MOUNTPROC_DUMP. This function returns a list of clients and the exported
filesystems that they have mounted. This is used by the UNIX commands
showmount and dfmounts that list the clients that have NFS mounted
filesystems together with the filesystems that they have mounted.
MOUNTPROC_UMNT. This procedure is used to inform the server that the NFS
filesystem on the client has been unmounted.
MOUNTPROC_UMNTALL. This procedure is sent by a client following a reboot
(or after a crash). This prevents the server from maintaining stale mount
entries in the event that the client crashed before sending corresponding
MOUNTPROC_UMNT messages.
MOUNTPROC_EXPORT. This procedure returns a list of exported filesystems.
The mount protocol remained unchanged for a number of years. Since then, only
one additional procedure has been added, MOUNTPROC_PATHCONF, which
retrieves additional information from the server allowing NFS filesystems to
comply with the POSIX pathconf() system call.
Using NFS
Although the NFS implementations differ from one platform to the next in the
way in which filesystems are exported, using NFS is trivial when the appropriate
client and server software is running. NFS daemons/threads are typically started
when the system bootstraps and enters multiuser mode.
As an example, consider the case in Solaris where a server called srv wishes to
export a filesystem mounted on /fs1 to any client. The easiest way to achieve
this is to place an entry in /etc/dfs/dfstab that specifies the filesystem to be
shared and/or exported. This ensures that the filesystem will be exported for use
on each reboot. If no options are needed for this filesystem, the following line is
sufficient:
share -F nfs /fs1
On the client side, a mount call can then be made to mount the filesystem as
follows:
# mount -F nfs srv:/fs1 /mnt
# mount | grep mnt
/mnt on srv:/fs1 remote/read/write/setuid/dev=2fc004
on Wed Jun 19 07:25:18 2002
Once mounted, the filesystem is then usable just as any local filesystem.
The Version 3 NFS Protocol
Despite its huge success, the NFS version 2 protocol was not without problems,
leading to the introduction of the NFS version 3 protocol [RFC1813]. The two
main problems with the version 2 protocol are highlighted below:
Only files up to 4GB in size could be accessed. This limitation was exposed
early on when running NFS on large machines but was rapidly becoming a
problem in most environments.
Because all writes were required to be written synchronously, write
intensive applications suffered a performance bottleneck. There were
numerous workarounds for this problem, including some that violated
the NFS protocol by performing writes asynchronously.
The goals of those involved in designing NFS version 3 were to solve the two
problems described above, tidy up the existing version 2 protocol, and add some
minor features, but at the same time limit the scope of the version 3 protocol to
avoid it becoming too bloated to be implemented in a timely manner.
[PAWL94] provides an overview of the process the designers of the version 3
protocol went through, the problems inherent in the version 2 protocol, the goals
behind the version 3 protocol, the changes introduced with the version 3
protocol, and various implementation and performance-related information. In
this paper, they identify twelve main areas in which the protocol was enhanced:
All arguments within the protocol, such as file sizes and file offsets, were
widened from 32 to 64 bits. This solved the 4GB file restriction.
The write model was changed to introduce a write/commit phase that
allowed for asynchronous writes. (This will be described further).
A new ACCESS procedure was introduced to solve permission checking
problems when mapping the ID of the superuser. This procedure works in
the presence of ACLs (Access Control Lists).
In the original protocol, some of the procedures required a subsequent call
in order to retrieve file attributes. In the new protocol, all procedures
returned file attributes.
In the original protocol, writes were limited to 8KB per procedure call. This
restriction was relaxed in the new protocol.
The READDIRPLUS procedure was introduced that returned both a file
handle and attributes. This eliminated some lookup calls when scanning a
directory.
The file handle size in the version 2 protocol was a fixed, 32-byte opaque
data type. In version 3, it was changed to be of variable size up to a
maximum of 64 bytes.
The CREATE procedure was modified to allow for exclusive file creates.
This solved a workaround in the version 2 protocol whereby a LOOKUP
was followed by a CREATE, which left a window in which another client
could create the file.
The version 2 protocol limited the size of filenames and pathnames to 255
and 1024 characters respectively. In version 3, this was replaced by variable
length strings which could be agreed on between the client and server.
The version 3 protocol tightened the errors that could be returned from the
server. All error values are enumerated, and no errors outside of the list are
permitted.
For the set of file attributes, the blocksize field was removed. The
blocks field was changed to used, which records the total number of bytes
used by the file.
A new error type, NFS3ERR_JUKEBOX, was introduced. In a Hierarchical
Storage Management (HSM) environment, a request may be made to the
server to read a file that has been migrated to tape. The time to read the
data back in from tape could be quite large. This error informs the client
that the operation is in progress and that the call should be retried. It
also allows the client to display a message on the user's console if
applicable.
Writes in UNIX are asynchronous by default unless the O_SYNC or O_DSYNC
flags are passed to open(). Forcing asynchronous writes to be synchronous
inevitably affects performance. With NFS version 3, the client can send a number
of asynchronous WRITE requests that it can then commit to disk on the server at
a later date by issuing a COMMIT request. Once it receives a COMMIT request,
the server cannot return until all data has been flushed to disk. In some regards,
the COMMIT request is similar to calling fsync(). The most noticeable
difference is that the COMMIT request does not necessarily cover the data for the
whole file. It does however allow the client to flush all data when a file is closed
or to break up a large synchronous write request into a number of smaller writes,
all of which are performed asynchronously but followed by a COMMIT request.
This itself is an important enhancement because it allows the filesystem on the
server or the disk driver to coalesce a number of writes in a single large write,
which is more efficient. The use of asynchronous writes should not affect the
crash/recovery properties of NFS because the client is required to keep a copy of
all data to be written to the file until a COMMIT is issued.
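The write/commit model described above can be sketched in miniature. This is an illustrative model only: the class names are invented, and a real client detects a server reboot through a changed write verifier before deciding to resend, rather than resending unconditionally as this sketch does:

```python
# Sketch of the NFS version 3 write/commit model. The client keeps a
# copy of every unstable write until a COMMIT succeeds, so a server
# crash before the COMMIT loses nothing.
class Server:
    def __init__(self):
        self.memory = {}             # unstable writes, lost on crash
        self.disk = {}               # stable storage

    def write_unstable(self, offset, data):
        self.memory[offset] = data

    def commit(self):
        # The server may not reply to COMMIT until all data is stable.
        self.disk.update(self.memory)
        self.memory.clear()

    def crash(self):
        self.memory.clear()          # unstable data is lost

class Client:
    def __init__(self, server):
        self.server = server
        self.pending = []            # copies kept until COMMIT succeeds

    def write(self, offset, data):
        self.server.write_unstable(offset, data)
        self.pending.append((offset, data))

    def commit(self):
        # Sketch simplification: always replay pending writes first. A
        # real client replays only after noticing a changed verifier.
        for offset, data in self.pending:
            self.server.write_unstable(offset, data)
        self.server.commit()
        self.pending.clear()

srv = Server()
cli = Client(srv)
cli.write(0, b"hello")
srv.crash()                          # unstable data lost before COMMIT
cli.commit()                         # client replays from its copies
print(srv.disk[0])                   # -> b'hello'
```

The key property is visible in the sketch: because the client holds its own copies until the COMMIT returns, asynchronous writes do not weaken the crash/recovery guarantees.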
The READDIRPLUS procedure, while it can be extremely effective, also
presents problems. The procedure was introduced to minimize the number of
over-the-wire LOOKUP requests once a READDIR procedure had been invoked.
This would typically be the case when issuing an ls -F request on a directory.
Because the implementation of READDIRPLUS is significantly more
expensive than READDIR, it should be used with caution. Typically, the
operation should be performed only when first accessing the directory in order to
populate the DNLC (or other name cache, depending on the underlying OS). The
operation should then be performed again only in the case where the cache was
invalidated for the directory due to a directory modification.
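The caching policy just described amounts to a simple decision, sketched below with invented names; real clients key this off the DNLC or an equivalent name cache:

```python
# Sketch of the policy above: pay for READDIRPLUS only on first access
# to a directory or after its cache entry has been invalidated.
def choose_readdir_op(name_cache, directory):
    entry = name_cache.get(directory)
    if entry is None or entry["invalidated"]:
        return "READDIRPLUS"   # populate the name cache in one call
    return "READDIR"           # cache is warm; the cheaper call suffices

cache = {}
print(choose_readdir_op(cache, "/src"))   # -> READDIRPLUS
cache["/src"] = {"invalidated": False}
print(choose_readdir_op(cache, "/src"))   # -> READDIR
```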
Because many of the goals of NFS version 3 were to improve performance, the
proof of its success was therefore dependent on how well it performed.
[PAWL94] documented a number of different performance-related tests that
showed that the version 3 protocol did in fact meet its objectives.
The NFS Lock Manager Protocol
One decision that the NFS team made when designing the NFS protocol was to
omit file locking. One of the main reasons for this was that to support record
locking, state would need to be maintained on the server, which would
dramatically increase the complexity of NFS implementations.
However, file locking was not something that could be easily overlooked and
was therefore implemented in SunOS as the Network Lock Manager (NLM).
Various iterations of the NLM protocol appeared, with NLM version 3 being the
version that was most widely used with NFS version 2. Unfortunately, due to the
complexity of implementing support for locking, the NLM protocol was not
widely implemented. With the introduction of NFS version 3 [RFC1813], the
definition of NLM (version 4) was included with the NFS specification but was
still a separate protocol. The NLM protocol also relied on the Network Status
Monitor protocol that was required in order to notify clients and servers of a crash
such that lock state could be recovered.
Crash/recovery involves coordination between both clients and server as
shown here:
Server crash. When locks are handed to clients, the server maintains a list of
clients and the locks that they own. If the server crashes, this information is
lost. When the server reboots, a status monitor runs and sends a message to
all known clients. The lock manager on each client is notified and is given an
opportunity to reclaim all locks it owns for files on the server. There is a fixed
amount of time (or grace period) in which the clients can respond. Note
however, that this is not an ideal situation as notification of clients may be
delayed, because the status monitor typically informs all clients in a serial
manner. This window may be reduced by multithreading the status monitor.
Client crash. If the client crashes, any locks that the client holds on the server
must be cleaned up. When the client resumes, a message is sent to the server
to clean up its locks. Through use of a client state number, which is
incremented on reboot, the server is able to detect that the client has been
rebooted and removes any of the locks that were held by the client before it
crashed/rebooted.
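The client state number mechanism can be sketched as follows. The structure is illustrative only, but it shows how a bare monotonic counter is enough for the server to notice a reboot and discard the dead client's locks:

```python
# Sketch of client crash recovery via a state number that the client
# increments on every reboot. Names are invented for the example.
class LockServer:
    def __init__(self):
        self.client_state = {}   # client -> last seen state number
        self.locks = {}          # client -> set of locked files

    def notify(self, client, state_number):
        last = self.client_state.get(client)
        if last is not None and state_number > last:
            # State number increased: the client rebooted, so any
            # locks it held before the crash are discarded.
            self.locks.pop(client, None)
        self.client_state[client] = state_number

    def lock(self, client, path):
        self.locks.setdefault(client, set()).add(path)

srv = LockServer()
srv.notify("clientA", 1)
srv.lock("clientA", "/fs1/file")
srv.notify("clientA", 2)         # clientA rebooted
print(srv.locks.get("clientA"))  # -> None
```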
Since the NLM was not widely adopted, version 4 of the NFS protocol has been
extended to include file locking. This is described in the next section.
The Version 4 NFS Protocol and the Future of NFS
NFS was designed for local area networks and was put in place before the
widespread adoption of the World Wide Web. As time goes by there is more of a
need to use NFS in wide area networks. This raises questions on security and
further highlights the need for NFS to address cross-platform issues. Although
one of the goals of the original NFS implementations was to support non-UNIX
platforms, the protocol was still heavily geared towards UNIX environments.
The goals of the version 4 protocol are to address the problems highlighted
above and to provide additional features that were omitted in the version 3
protocol (version 3 changes were kept to a minimum to ensure that it could be
designed and implemented in a timely manner). Although the version 4 protocol
involves some substantial changes, the goals are to allow small incremental
changes that do not require a complete overhaul of the protocol. The time
between the version 2 and 3 protocols was approximately 8 years, which is similar
to the time between the version 3 and 4 protocols. A minor revision to the version
4 protocol allows new features to be added to the version 4 protocol in a much
more timely manner, say a 1 to 2 year timeframe.
The version 4 protocol is described in [RFC3010]. Following are the main
changes that are part of the new protocol. Note that the changes introduced with
the version 4 protocol are substantial and only covered briefly here.
Compound procedures. Many file-related operations over NFS require a large
number of procedure calls. In a local area network this is not such a great
issue. However, when operating in a wide area network the effect on
performance is much more noticeable. By combining a number of procedure
calls into a single, compound procedure, the amount of over-the-wire
communications can be reduced considerably resulting in much better
performance.
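The effect of compounding can be sketched with a toy evaluator. The operation names (PUTROOTFH, LOOKUP, READ) follow the version 4 protocol; everything else here is invented for illustration:

```python
# Sketch of a version 4 compound: several operations travel in one
# request and are evaluated in order on the server, sharing a
# "current file handle" as they go.
def run_compound(server_fs, ops):
    current_fh = None
    results = []
    for op, arg in ops:
        if op == "PUTROOTFH":
            current_fh = "/"                      # start at the root
        elif op == "LOOKUP":
            current_fh = current_fh.rstrip("/") + "/" + arg
        elif op == "READ":
            results.append(server_fs[current_fh])
    return results

fs = {"/home/spate/notes": b"data"}
# One round trip resolves the whole path and reads the file, where the
# version 2 and 3 protocols would need a call per component.
print(run_compound(fs, [("PUTROOTFH", None),
                        ("LOOKUP", "home"),
                        ("LOOKUP", "spate"),
                        ("LOOKUP", "notes"),
                        ("READ", None)]))         # -> [b'data']
```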
Internationalization. In previous versions of the protocol, file names were
handled as an opaque byte stream. Although they were typically limited to
a 7-bit US ASCII representation, they were commonly encoded in 8-bit
ISO-Latin-1. Problems occurred because there was no way to specify the
type of encoding within XDR. This limited the use of NFS in environments
where there may be mixed character sets. To provide better support for
internationalization, file and directory names will be encoded with UTF-8.
Volatile file handles. The NFS version 2 and version 3 protocols provided a
single file handle type with one set of semantics. This file handle is defined
as having a value that is fixed for the lifetime of the filesystem to which it
refers. As an example, a file handle on UNIX comprises, amongst other
things, the inode number and generation count. Because inodes can be freed
and reallocated, the generation count of the inode is increased when reused
to ensure that a client file handle that refers to the old file cannot now refer
to the new file even though the inode number stays the same. There have
also been some implementations that are unable to correctly implement the
traditional file handle which inhibits the adoption of NFS on some
platforms.
The NFS version 4 protocol divides file handles into both persistent file
handles, which describe the traditional file handle, and volatile file handles. In
the case of volatile file handles, the server may not always be able to
determine whether the file handle is still valid. If it detects that a file handle
is in fact invalid, it returns an NFS4ERR_STALE error message. If however it
is unable to determine whether the file handle is valid, it can return an
NFS4ERR_FHEXPIRED error message. Clients are able to detect whether a
server can handle persistent and volatile file handles and therefore act
accordingly.
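The distinction between the two errors can be sketched from the server's side. The error numbers follow RFC 3010; the server structure itself is invented for the example:

```python
# Sketch of the server-side check: NFS4ERR_STALE when the handle is
# known to be dead, NFS4ERR_FHEXPIRED when a volatile handle's
# validity can no longer be determined.
NFS4ERR_STALE = 70          # error numbers as defined in RFC 3010
NFS4ERR_FHEXPIRED = 10014

def check_handle(server, fh):
    entry = server["handles"].get(fh)
    if entry == "dead":
        return NFS4ERR_STALE        # known to refer to a removed file
    if entry is None:
        # A volatile handle the server no longer remembers: it cannot
        # tell whether the handle is still valid.
        return NFS4ERR_FHEXPIRED
    return entry                    # still valid: the file it names

server = {"handles": {1: "fileA", 2: "dead"}}
print(check_handle(server, 1))      # -> fileA
print(check_handle(server, 2))      # -> 70
print(check_handle(server, 3))      # -> 10014
```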
Attribute classes. The set of attributes that were passed over the wire with
earlier versions of the protocol were very UNIX-centric in that the
information returned by the server was sufficient to respond to a stat()
call on the client. In some environments, these attributes are meaningless,
and in some cases, servers are unable to provide valid values.
In NFS version 4, the set of file attributes is divided into three different
classes, namely mandatory, recommended, and named attributes.
The mandatory set of attributes contain information such as the file type
and size, information about file handle expiration times, whether hard links
and symbolic links are supported, and whether the file has named data
streams/attributes.
The set of recommended attributes contain information such as the type of
ACLs (Access Control Lists) that the filesystem supports, the ACLs
themselves, information about the owner and group, access timestamps, and
quota attributes. It also contains information about the filesystem such as
free space, total number of files, files available for use, and filesystem limits
such as the maximum filename length and maximum number of links.
Named attributes, also called named data streams, allow a single file to have
multiple streams of opaque bytes unlike the traditional UNIX model of
supporting a single stream of bytes per file. To access named data streams
over NFS version 4, the OPENATTR procedure can retrieve a virtual attribute
directory under which READDIR and LOOKUP procedures can be used to view
and access the named attributes.
Better namespace handling. Both NFS version 2 and 3 servers export a set of
independent pieces of their overall namespace and do not allow NFS clients
to cross mountpoints on the server, because NFS expects all lookup
operations to stay within a single filesystem. In NFS v4, the server provides a
single root file handle through which clients can obtain file handles for any of
the accessible exports.
NFS v4 servers can be made browsable by bridging exported subtrees of
the namespace with a pseudo filesystem and allowing clients to cross server
mountpoints. The tree constructed by the server is a logical view of all the
different exports.
File locking. As mentioned earlier, locking is not part of the NFS version 2 or
version 3 protocols which rely instead on the Network Lock Manager (NLM)
protocol, described in the section The NFS Lock Manager Protocol, earlier in
the chapter. The NLM protocol was not, however, widely adopted.
NFS version 4 provides for both UNIX file-level locking functions and
Windows-based share locking functions. NFS version 4 supports both record
and byte range locking functions.
Client side caching. Most NFS clients cache both file data and attributes as
much as possible. When moving more towards wide area networks, the cost
of a cache miss can be significant. The problem with the version 2 and
version 3 protocols is that NFS does not provide a means to support cache
coherency between multiple clients, which can sometimes lead to invalid file
data being read.
NFS version 4 does not provide cache coherency between clients but
defines a limited set of caching guarantees to allow locks and share
reservations to be used without destructive interference from client-side
caching. NFS v4 also provides a delegation scheme that allows clients to make
decisions that were traditionally made by the server.
The delegation mechanism is an important feature in terms of
performance because it limits the number of procedure calls that would
typically go between client and server when accessing a file. When another
client attempts to access a file for which a delegation has been granted, the
server invokes an RPC to the client holding the delegation. The client is then
responsible for flushing any file information, including data, that has
changed before responding to the recall notice. Only after the first client has
responded to the revoke request will the second client be allowed to access
the file.
The NFS version 4 specification provides many details for the different
types of delegations that can be granted and therefore the type of caching
that can be performed on the client.
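The recall sequence described above can be sketched as follows. The classes and paths are invented for the example; a real server issues the recall as a callback RPC and the client flushes dirty data over NFS WRITE/COMMIT:

```python
# Sketch of delegation recall: when a second client opens a delegated
# file, the server recalls the delegation and the first client must
# flush its changed data before the recall completes.
class DelegServer:
    def __init__(self):
        self.delegations = {}        # path -> client holding a delegation

    def open(self, client, path):
        holder = self.delegations.get(path)
        if holder is not None and holder is not client:
            holder.recall(path)      # callback RPC to the first client
            del self.delegations[path]
        self.delegations.setdefault(path, client)

class DelegClient:
    def __init__(self):
        self.dirty = {"/f": b"cached"}   # locally cached modifications
        self.flushed = {}

    def recall(self, path):
        # Flush changed data before responding to the recall.
        self.flushed[path] = self.dirty.pop(path)

srv = DelegServer()
a, b = DelegClient(), DelegClient()
srv.open(a, "/f")                    # a is granted a delegation
srv.open(b, "/f")                    # triggers a recall to a
print(a.flushed["/f"])               # -> b'cached'
```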
Built-in security. NFS has relied on the UNIX-centric user ID mechanisms to
provide security. This has generally not been a problem because NFS has
largely been used within private networks. However, because one of the
goals of the version 4 protocol is to widen the use of NFS to wide area
networks, this level of security is insufficient. The basic NFS security
mechanisms are being extended through use of the RPCSEC_GSS
framework. The RPCSEC_GSS mechanisms are implemented at the RPC
layer and are capable of providing both private key schemes, such as
Kerberos version 5, and public key schemes.
The NFS version 4 protocol is a significant rework of the version 3 protocol. It
provides a wide range of features aimed at continuing its success as it becomes
more widely adopted in wide area networks, and it provides better support for
building a distributed filesystem for heterogeneous operating systems.
There was a huge amount of investment in NFS prior to version 4. Because
NFS version 4 attempts to address many of the prior limitations of the earlier
versions, including more attention to non-UNIX operating systems, NFS is likely
to grow in popularity.
The NFS Automounter
One feature that is part of some distributed filesystems is the ability to provide a
unified namespace across a number of different clients. For example, to see
/home/spate on several different clients would require exporting the
filesystem from the server on which the filesystem resides and NFS mounting it
on all of the required clients. If the mount point is permanent, the appropriate
entries must be placed in the vfstab/fstab table. Obviously this model does not
scale well when dealing with hundreds of filesystems and a very large number of
clients and servers.
This problem was resolved by introduction of the automounter. This aids in
creation of a unified namespace while keeping the number of mounted
filesystems to only those filesystems that are actually in use. The automounter is
simple in nature. When a user attempts to access a file that crosses a mount point
within the boundaries of the automounter, the NFS filesystem is first mounted
prior to allowing the access to proceed. This is shown in Figure 13.2.
The first automounters were implemented as user space daemons, which
typically mount themselves on those directories that require automounter
services and masquerade as NFS servers. When an attempt is made to access a file
within one of these filesystems, the kernel sends an NFS LOOKUP call to the server,
in this case the automounter. The automounter then NFS mounts the real
filesystem onto a directory somewhere within its own mount space. The real
filesystem is then referenced through symbolic links. For example, in Figure 13.2,
the filesystem to be mounted on /mnt/fs2 may be mounted on /auto/fs2, and
/mnt/fs2 will be a symbolic link to this directory.
In many environments it is usual to see a combination of standard NFS
mounted filesystems and automounted filesystems. The automounter should be
used for filesystems that are not accessed frequently, such as manual pages,
source code repositories, and so on. User directories and bin directories are
examples of directories that are typically mounted through standard NFS means.
Another common use of the automounter is to use it in conjunction with the
Network Information Service (NIS) in an environment where user home directories
are distributed throughout the network. In this way, NIS centrally manages all of
the NFS mounts from one of the servers. Although each user's home directory
physically resides on only one server, the same server is configured to export the
home directory to all hosts on the network. Each host on the network runs the
automounter as an NFS client and can therefore mount a user's home directory.
This allows the user to log in to any host and have access to his/her home
directory. In this environment, file access is enhanced transparently while the use
of the automounter avoids the overhead caused by dozens of active but unused
NFS mounted filesystems.
Automounter Problems and the Autofs Filesystem
[CALL93] highlighted some of the problems inherent with using the
automounter and provided details about autofs, a new automounter that solved
the problems described. The type of problems that the original automounter
exhibited are as follows:
Symbolic links. The preceding section described how the automounter
actually NFS mounts the filesystem on a temporary directory and refers to it
through a symbolic link. Because the goal of the automounter is only to
mount filesystems when required, it periodically unmounts the filesystem if
there is no activity for a predetermined amount of time.
However, if a process issues a getcwd() system call, the real path may
be cached which references the temporary directory structure, that is, where
the filesystem is actually mounted. If the path is used later, there is no
guarantee that the filesystem is still mounted and the automounter is unable
to detect that access is being requested. The user process may therefore see
the local directory structure and thus unpredictable results.
Adding new mountpoints. The list of filesystems that the automounter
manages is consulted only when the automounter first starts. A
workaround is to terminate and restart the automounter, which is obviously
not an ideal solution.
Performance. The method of sending NFS requests to the automounter when
crossing its mount point, together with the management of symbolic links,
is more time consuming than accessing an NFS filesystem directly.
Single threading. Because the automounter is single threaded it can only
handle one request at a time. Therefore, when in the process of mounting an
NFS filesystem, all subsequent access is blocked.
The autofs filesystem replaced the user-level automounter daemon with an
in-kernel filesystem type. The automounter daemon is still retained. However,
when it starts, it mounts autofs in place for each of the filesystems that is to be
managed. In the previous example, the autofs filesystem would be mounted on
/mnt. When access is detected to, say /mnt/fs2, autofs invokes an RPC request
to communicate with the automounter daemon that NFS mounts the filesystem
on /mnt/fs2. Once this is achieved, the autofs filesystem does not intercept any
further operations. This eradicates the symbolic link problem and therefore
increases the overall performance.
The Remote File Sharing Service (RFS)
At the time Sun was designing NFS, AT&T was working on the development of
another distributed filesystem, Remote File Sharing (RFS) [RIFK86]. The design
goals were quite different for RFS in that they wanted complete UNIX file
semantics. This included access to remote devices as well as providing
UNIX-level file and record locking. Coverage of different operating systems was
not a goal for RFS and thus its implementation was heavily restricted to System V
UNIX environments.
The RFS Architecture
In a manner similar to NFS, the RFS client is able to mount a directory that is
exported by the server. Exporting of filesystems involved advertising a resource
name with a name server. The RFS client could receive details of available resources
from the name server and mount a filesystem without prior knowledge of the
server which owned the filesystem. RFS required a separate name service
protocol in order to manage all resources. Servers would issue an advertise
procedure to the name server, which then registered the resource in the name
server database. When the client requested information about a specific resource,
the name server would return the name of the server on which the resource was
located. Communication could then take place between client and server to
actually mount the filesystem.
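The advertise/query flow described above can be sketched as a tiny name service. The class and resource names are invented for the example:

```python
# Sketch of the RFS name-service flow: servers advertise resources,
# clients query the name server, then mount from the owning server.
class NameServer:
    def __init__(self):
        self.db = {}                  # resource name -> owning server

    def advertise(self, server, resource):
        self.db[resource] = server    # register the resource

    def query(self, resource):
        # The client learns which server to contact for the mount,
        # without prior knowledge of where the resource lives.
        return self.db.get(resource)

ns = NameServer()
ns.advertise("srvA", "SRC_TREE")
print(ns.query("SRC_TREE"))           # -> srvA
```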
RFS also relied on an RPC protocol that provided a procedure on the server for
every file-related system call. XDR was used to encode data types but only where
the machine architecture of the client and server differed. On the server side, the
goal was to emulate the environment of the client and provide a context similar to
one on the caller to handle the remote procedure calls. This was used to provide
management of the process user and group IDs, umask, and so on. This emulation
was a little awkward for some operations. For example, when performing a
lookup on a pathname, if an RFS mount point was to be crossed, the remainder of
the pathname was sent to the server to be resolved. If a series of ".." components
took the pathname out of the RFS mounted filesystem, the operation had to be
aborted and completed on the client.
To provide for full UNIX semantics including file and record locking, RFS was
required to provide a stateful server. This required the server to maintain
reference counts for every open call from every client, file and record lock
information on the server, and information about the state of named pipes. RFS
also maintained a list of all client mounts. If either the server or one of the clients
crashed, there was a significant amount of crash/recovery to be performed. If a
client crashed, the server was required to remove all traces of the client. Amongst
other things, this included decrementing reference counts, releasing any locks
that client held, and so on.
Server side failure resulted in the ENOLINK error message being returned when
any attempts were made to access files on the server. All inodes/vnodes on the
client that accessed RFS files were marked to indicate the failure such that further
attempts at access would return ENOLINK without any attempt to communicate
with the server.
The overall architecture of RFS is shown in Figure 13.3. Unlike NFS, RFS
requires a reliable, virtual circuit transport service. In the figure this is shown as
TCP/IP. A virtual circuit is established during the mount and remains in existence
for the duration of the mount. For each client/server pair, the virtual circuit is
shared if a client mounts more than one RFS filesystem.
Another big difference between RFS and NFS is the support for client-side
caching. RFS implements a write-through cache on each client; that is, writes are
always sent to the server at the time the write occurs, but the data is cached on
the client for subsequent access. Obviously this presents a challenge with respect
to cache coherency when a write occurs to a file whose data is cached on one or more clients.
RFS must invalidate cached copies of the data on clients other than the one from
which the write is issued. The client-side caching of file data is subsequently
disabled either until the process that issued the write closes the file or a
predetermined amount of time has passed since the last write to the file.
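The write-through scheme and its invalidation rules can be modeled with a short
sketch. This is an illustrative reconstruction, not RFS source; the structure and
function names are hypothetical.

```c
#include <string.h>

#define MAX_CLIENTS 8

/* Per-client cache state for a single RFS file (hypothetical model). */
struct rfs_file {
    int cached[MAX_CLIENTS];         /* client holds a cached copy      */
    int cache_disabled[MAX_CLIENTS]; /* caching suspended after a write */
};

/* A read on 'client' populates its cache unless caching is disabled. */
void rfs_read(struct rfs_file *f, int client)
{
    if (!f->cache_disabled[client])
        f->cached[client] = 1;
}

/* A write goes straight to the server (write-through). Cached copies on
 * all other clients are invalidated, and their caching of this file is
 * disabled until the writing process closes the file. */
void rfs_write(struct rfs_file *f, int writer)
{
    for (int c = 0; c < MAX_CLIENTS; c++) {
        if (c == writer)
            continue;
        f->cached[c] = 0;
        f->cache_disabled[c] = 1;
    }
}

/* Closing the file re-enables caching on the other clients. */
void rfs_close(struct rfs_file *f)
{
    memset(f->cache_disabled, 0, sizeof f->cache_disabled);
}
```

The model ignores the timeout path mentioned above; in RFS, caching could also
resume after a predetermined interval with no further writes.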
Differences between RFS and NFS
When RFS was introduced, there were a number of differences between RFS and
NFS as defined by the version 2 protocol. Some of these differences were:
NFS is stateless, whereas RFS is stateful and requires a primary name server
that coordinates RFS activity.
RFS can map user and group IDs from client to server based on the presence of
a mapping table. NFS by contrast requires that the IDs are the same on both
client and server. More specifically, NFS implementations assume the use
of NIS to maintain a consistent user database across the network.
RFS allows access to device files across the network while devices are not
accessible across NFS.
RFS names resources (the directories that are advertised), and these names
are communicated to the primary server. This is not required in NFS.
RFS requires a connection-mode virtual circuit environment, while NFS
runs in a connectionless state.
RFS provides support for mandatory file and record locking. This is not
defined as part of the NFS protocol.
NFS can run in heterogeneous environments, while RFS is restricted to
UNIX environments and in particular System V UNIX.
RFS guarantees that when files are opened in append mode (O_APPEND) the
write is appended to the file. This is not guaranteed in NFS.
In an NFS environment, the administrator must know the machine
name from which the filesystem is being exported. This is alleviated
with RFS through use of the primary server.
When reading through this list, it appears that RFS has more features to offer and
would therefore be a better offering in the distributed filesystem arena than NFS.
However, the goals of both projects differed in that RFS supported full UNIX
semantics, whereas for NFS the protocol provided semantics that were close
enough for most of the environments in which it was used.
The fact that NFS was widely publicized and the specification was publicly
open, together with the simplicity of its design and the fact that it was designed to
be portable across operating systems, resulted in its success and the rather quick
death of RFS, which was replaced by NFS in SVR4.
RFS was never open to the public in the same way that NFS was. Because it was
part of the UNIX operating system and required a license from AT&T, it stayed
within the SVR3 area and had little widespread usage. It would be a surprise if
there were still RFS implementations in use today.
The Andrew File System (AFS)
The Andrew File System (AFS) [MORR86] was developed in the early to mid 1980s
at Carnegie Mellon University (CMU) as part of Project Andrew, a joint project
between CMU and IBM to develop an educational-based computing
infrastructure. There were a number of goals for the AFS filesystem. First, UNIX
binaries had to run on clients without modification, which required that the
filesystem be implemented in the kernel. AFS also had to provide a single,
unified namespace such that users would be able to access their files wherever they
resided in the network. To help performance, aggressive client-side caching
would be used. AFS also allowed groups of files to be migrated from one server to
another without loss of service, to help load balancing.
The AFS Architecture
An AFS network, shown in Figure 13.4, consists of a group of cells that all reside
under /afs. Issuing a call to ls /afs will display the list of AFS cells. A cell is a
collection of servers that are grouped together and administered as a whole. In the
academic environment, each university may be a single cell. Even though each
cell may be local or remote, all users will see exactly the same file hierarchy
regardless of where they are accessing the filesystem.
Within a cell, there are a number of servers and clients. Servers manage a set of
volumes that are held in the Volume Location Database (VLDB). The VLDB is
replicated on each of the servers. Volumes can be replicated over a number of
different servers. They can also be migrated to enable load balancing or to move
a user's files from one location to another based on need. All of this can be done
without interrupting access to the volume. The migration of volumes is achieved
by cloning the volume, which creates a stable snapshot. To migrate the volume,
the clone is moved first while access is still allowed to the original volume. After
the clone has moved, any writes to the original volume are replayed to the clone
volume.
Client-Side Caching of AFS File Data
Clients each require a local disk in order to cache files. The caching is controlled
by a local cache manager. In earlier AFS implementations, whenever a file was
opened, it was first copied in its entirety to the local disk on the client. This
quickly became problematic as file sizes increased, so later AFS versions
performed the copying in 64KB chunks. Note that, in addition to file
data, the cache manager also caches file meta-data, directory information, and
symbolic links.
When retrieving data from the server, the client obtains a callback. If another
client is modifying the data, the server must inform all clients that their cached
data may be invalid. If only one client holds a callback, it can operate on the file
without supervision of the server until a time comes for the client to notify the
server of changes, for example, when the file is closed. The callback is broken if
another client attempts to modify the file. With this mechanism, there is a
potential for callbacks to go astray. To help alleviate this problem, clients with
callbacks send probe messages to the server on a regular basis. If a callback is
missed, the client and server work together to restore cache coherency.
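The server-side callback bookkeeping described above can be sketched as
follows; this is a simplified illustration with hypothetical names, not AFS
source.

```c
#define MAX_CLIENTS 8

/* Server-side callback state for one file (illustrative model only). */
struct afs_file {
    int callback[MAX_CLIENTS]; /* client holds a callback promise */
};

/* Fetching file data grants the fetching client a callback. */
void afs_fetch(struct afs_file *f, int client)
{
    f->callback[client] = 1;
}

/* When a client stores changes back, the server breaks the callbacks
 * held by every other client, telling them their cached copy may now
 * be stale. Returns the number of callbacks broken. */
int afs_store(struct afs_file *f, int writer)
{
    int broken = 0;
    for (int c = 0; c < MAX_CLIENTS; c++) {
        if (c != writer && f->callback[c]) {
            f->callback[c] = 0;
            broken++;
        }
    }
    return broken;
}
```

A client whose callback has been broken must refetch the data (and a new
callback) from the server before trusting its cache again.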
AFS does not provide fully coherent client side caches. A client typically makes
changes locally until the file is closed at which point the changes are
communicated with the server. Thus, if multiple clients are modifying the same
file, the client that closes the file last will write back its changes, which may
overwrite another client's changes even with the callback mechanism in place.
Where Is AFS Now?
A number of the original designers of AFS formed their own company, Transarc,
which went on to produce commercial implementations of AFS for a number of
different platforms. The technology developed for AFS also became the basis of
DCE DFS, the subject of the next section. Transarc was later acquired by IBM and,
at the time of this writing, the history of AFS is looking rather unclear, at least
from a commercial perspective.
The DCE Distributed File Service (DFS)
The Open Software Foundation started a project in the late 1980s to define a secure,
robust distributed environment for enterprise computing. The overall project was
called the Distributed Computing Environment (DCE). The goal behind DCE was to
draw together the best of breed technologies into one integrated solution, produce
the Application Environment Specification (AES), and to release source code as an
example implementation of the standard. In 1989, OSF put out a Request For
Technology, an invitation to the computing industry asking them to bid
technologies in each of the identified areas. For the distributed filesystem
component, Transarc won the bid, having persuaded OSF of the value of their
AFS-based technology.
The resulting Distributed File Service (DFS) technology bore a close resemblance
to the AFS architecture. The RPC mechanisms of AFS were replaced with DCE
RPC and the virtual filesystem architecture was replaced with VFS+ that allowed
local filesystems to be used within a DFS framework, and Transarc produced the
Episode filesystem that provided a wide number of features.
DCE / DFS Architecture
The cell nature of AFS was retained, with a DFS cell comprising a number of
servers and clients. DFS servers run services that make data available and
monitor and control other services. The DFS server model differed from the
original AFS model, with some servers performing one of a number of different
functions:
File server. The server that runs the services necessary for storing and
exporting data. This server holds the physical filesystems that comprise the
DFS namespace.
System control server. This server is responsible for updating other servers
with replicas of system configuration files.
Fileset database server. The Fileset Location Database (FLDB) master and
replicas are stored here. The FLDB is similar to the volume database in AFS.
The FLDB holds information about both system and user filesets.
Backup database server. This holds the master and replicas of the backup
database which holds information used to backup and restore system and
user files.
Note that a DFS server can perform one or more of these tasks.
The fileset location database stores information about the locations of filesets.
Each readable/writeable fileset has an entry in the FLDB that includes
information about the fileset's replicas and clones (snapshots).
DFS Local Filesystems
A DFS local filesystem manages an aggregate, which can hold one or more filesets
and is physically equivalent to a filesystem stored within a standard disk
partition. The goal behind the fileset concept was to make it smaller than a disk
partition and therefore more manageable. As an example, a single filesystem is
typically used to store a number of user home directories. With DFS, the
aggregate may hold one fileset per user.
Aggregates also support fileset operations not found on standard UNIX
partitions, including the ability to move a fileset from one DFS aggregate to
another or from one server to another for load balancing across servers. This is
comparable to the migration performed by AFS.
UNIX partitions and filesystems can also be made visible in the DFS
namespace if they adhere to the VFS+ specification, a modification to the native
VFS/vnode architecture with additional interfaces to support DFS. Note
however that these partitions can store only a single fileset (filesystem) regardless
of the amount of data actually stored in the fileset.
DFS Cache Management
DFS enhanced the client-side caching of AFS by providing fully coherent client
side caches. Whenever a process writes to a file, clients should not see stale data.
To provide this level of cache coherency, DFS introduced a token manager that
keeps a reference of all clients that are accessing a specific file.
When a client wishes to access a file, it requests a token for the type of
operation it is about to perform, for example, a read or write token. In some
circumstances, tokens of the same class allow shared access to a file; two clients
reading the same file would thus obtain the same class of token. However, some
tokens are incompatible with tokens of the same class, a write token being the
obvious example. If a client wishes to obtain a write token for a file on which a
write token has already been issued, the server is required to revoke the first
client's write token, allowing the second write to proceed. When a client receives a
request to revoke a token, it must first flush all modified data before responding
to the server.
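A minimal sketch of the token manager's logic might look like the following;
the names and the two-token simplification are assumptions made for
illustration, not the DCE DFS interfaces.

```c
#define MAX_CLIENTS 8

enum dfs_token { TOKEN_NONE, TOKEN_READ, TOKEN_WRITE };

/* Read tokens on the same file are compatible with each other; a write
 * token is incompatible with every other token (simplified rule). */
int tokens_compatible(enum dfs_token held, enum dfs_token wanted)
{
    if (held == TOKEN_NONE)
        return 1;
    return held == TOKEN_READ && wanted == TOKEN_READ;
}

struct dfs_file {
    enum dfs_token token[MAX_CLIENTS];
    int dirty[MAX_CLIENTS]; /* client caches modified data */
    int flushes;            /* revocations that forced a flush */
};

/* Grant 'wanted' to 'client', first revoking any incompatible tokens.
 * A client losing a write token must flush its modified data before
 * the server can respond to the requester. */
void dfs_request_token(struct dfs_file *f, int client, enum dfs_token wanted)
{
    for (int c = 0; c < MAX_CLIENTS; c++) {
        if (c == client || tokens_compatible(f->token[c], wanted))
            continue;
        if (f->token[c] == TOKEN_WRITE && f->dirty[c]) {
            f->dirty[c] = 0; /* flush modified data to the server */
            f->flushes++;
        }
        f->token[c] = TOKEN_NONE;
    }
    f->token[client] = wanted;
}
```

Because the flush happens before the grant, a reader that obtains its token
after a writer never sees stale data, which is precisely the coherency
guarantee DFS set out to provide.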
The Future of DCE / DFS
The overall DCE framework and particularly the infrastructure required to
support DFS was incredibly complex, which made many OS vendors question the
benefits of supporting DFS. As such, the number of implementations of DFS were
small and adoption of DFS equally limited. The overall DCE program came to a
halt in the late 1990s, leaving a small number of operating systems supporting
their existing DCE efforts. As NFS evolves and new, distributed filesystem
paradigms come into play, the number of DFS installations is likely to decline
further.
Clustered Filesystems
With distributed filesystems, there is a single point of failure in that if the server
(that owns the underlying storage) crashes, service is interrupted until the server
reboots. In the event that the server is unable to reboot immediately, the delay in
service can be significant.
With most critical business functions now heavily reliant on computer-based
technology, this downtime is unacceptable. In some business disciplines, seconds
of downtime can cost a company significant amounts of money.
By making hardware and software more reliable, clusters provide the means by
which downtime can be minimized, if not removed altogether. In addition to
increasing the reliability of the system, by pooling together a network of
interconnected servers, the potential for improvements in both performance and
manageability make cluster-based computing an essential part of any large
enterprise.
The following sections describe the clustering components, both software and
hardware, that are required in order to provide a clustered filesystem (CFS). There
are typically a large number of components that are needed in addition to
filesystem enhancements in order to provide a fully clustered filesystem. After
describing the basic components of clustered environments and filesystems, the
VERITAS clustered filesystem technology is used as a concrete example of how a
clustered filesystem is constructed.
Later sections describe some of the other clustered filesystems that are
available today.
The following sections only scratch the surface of clustered filesystem
technology. For a more in depth look at clustered filesystems, you can refer to
Dilip Ranade's book Shared Data Clusters [RANA02].
What Is a Clustered Filesystem?
In simple terms, a clustered filesystem is a collection of servers (also
called nodes) that work together to provide a single, unified view of the same
filesystem. A process running on any of these nodes sees exactly the same view
of the filesystem as a process on any other node. Any changes by any of the
nodes are immediately reflected on all of the other nodes.
Clustered filesystem technology is complementary to distributed filesystems.
Any of the nodes in the cluster can export the filesystem, which can then be
viewed across the network using NFS or another distributed filesystem
technology. In fact, each node can export the filesystem, which could be mounted
on several clients.
Although not all clustered filesystems provide identical functionality, the goals
of clustered filesystems are usually stricter than distributed filesystems in that a
single unified view of the filesystem together with full cache coherency and
UNIX semantics, should be a property of all nodes within the cluster. In essence,
each of the nodes in the cluster should give the appearance of a local filesystem.
There are a number of properties of clusters and clustered filesystems that
enhance the capabilities of a traditional computer environment, namely:
Resilience to server failure. Unlike a distributed filesystem environment
where a single server crash results in loss of access, failure of one of the servers
in a clustered filesystem environment does not impact access to the cluster
as a whole. One of the other servers in the cluster can take over
responsibility for any work that the failed server was doing.
Resilience to hardware failure. A cluster is also resilient to a number of
different hardware failures, such as loss of part of the network or of the disks.
Because access to the cluster is typically through one of a number of
different routes, requests can be rerouted as and when necessary
independently of what has failed. Access to disks is also typically through a
shared network.
Application failover. Failure of one of the servers can result in loss of service
to one or more applications. However, by having the same application set in
a hot standby mode on one of the other servers, a detected problem can result
in a failover to one of the other nodes in the cluster. A failover results in one
machine taking the place of the failed machine. Because a single server
failure does not prevent access to the cluster filesystem on another node, the
application downtime is kept to a minimum; the only work to perform is to
restart the applications. Any form of system restart is largely taken out of the
picture.
Increased scalability. Performance can typically be increased by simply adding
another node to the cluster. In many clustered environments, this may be
achieved without bringing down the cluster.
Better management. Managing a set of distributed filesystems involves
managing each of the servers that export filesystems. A cluster and clustered
filesystem can typically be managed as a whole, reducing the overall cost of
management.
As clusters become more widespread, this increases the choice of underlying
hardware. If much of the reliability and enhanced scalability can be derived from
software, the hardware base of the cluster can be moved from more traditional,
high-end servers to low cost, PC-based solutions.
Clustered Filesystem Components
To achieve the levels of service and manageability described in the previous
section, there are several components that must work together to provide a
clustered filesystem. The following sections describe the various components that
are generic to clusters and cluster filesystems. Later sections put all these
components together to show how complete clustering solutions can be
constructed.
Hardware Solutions for Clustering
When building clusters, one of the first considerations is the type of hardware that
is available. The typical computer environment comprises a set of clients
communicating with servers across Ethernet. Servers typically have local storage
connected via standards such as SCSI or proprietary based I/O protocols.
While Ethernet and communication protocols such as TCP/IP are unlikely to
be replaced as the communication medium between one machine and the next,
the host-based storage model has been evolving over the last few years. Although
SCSI attached storage will remain a strong player in a number of environments,
the choice for storage subsystems has grown rapidly. Fibre channel, which allows
the underlying storage to be physically separate from the server through use of a
fibre channel adaptor in the server and a fibre switch, enables construction of
storage area networks or SANs.
Figure 13.5 shows the contrast between traditional host-based storage and
shared storage through use of a SAN.
Cluster Management
Because all nodes within the cluster are presented as a whole, there must be a
means by which the clusters are grouped and managed together. This includes the
ability to add and remove nodes to or from the cluster. It is also imperative that
any failures within the cluster are communicated as soon as possible, allowing
applications and system services to recover.
These types of services are required by all components within the cluster
including filesystem, volume management, and lock management.
Failure detection is typically achieved through some type of heartbeat
mechanism for which there are a number of methods. For example, a single
master node can be responsible for pinging slave nodes that must respond
within a predefined amount of time to indicate that all is well. If a slave does not
respond before this time or a specific number of heartbeats have not been
acknowledged, the slave may have failed; this then triggers recovery
mechanisms.
Employing a heartbeat mechanism is obviously prone to failure if the master
itself dies. This can however be solved by having multiple masters along with the
ability for a slave node to be promoted to a master node if one of the master
nodes fails.
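The master/slave heartbeat scheme can be sketched as a simple per-node miss
counter; this is an illustrative model, and the names and threshold are
hypothetical.

```c
#define MAX_NODES  8
#define MAX_MISSED 3 /* consecutive misses before a node is declared failed */

struct cluster {
    int missed[MAX_NODES]; /* consecutive unacknowledged heartbeats */
    int failed[MAX_NODES];
};

/* Called by the master once per heartbeat interval; acked[i] is nonzero
 * if slave i responded in time. A slave that misses MAX_MISSED
 * consecutive heartbeats is declared failed, triggering recovery. */
void heartbeat_tick(struct cluster *c, const int *acked, int nnodes)
{
    for (int i = 0; i < nnodes; i++) {
        if (acked[i])
            c->missed[i] = 0;
        else if (++c->missed[i] >= MAX_MISSED)
            c->failed[i] = 1; /* start recovery for node i */
    }
}
```

Requiring several consecutive misses, rather than declaring failure on the
first one, makes the mechanism tolerant of transient network delays.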
Cluster Volume Management
In larger server environments, disks are typically managed through use of a
Logical Volume Manager. Rather than exporting physical disk slices on which
filesystems can be made, the volume manager exports a set of logical volumes.
Volumes look very similar to standard disk slices in that they present a
contiguous set of blocks to the user. Underneath the covers, a volume may
comprise a number of physically disjoint portions of one or more disks.
Mirrored volumes (RAID-1) provide resilience to disk failure by providing one or
more identical copies of the logical volume. Each mirrored volume is stored on a
different disk.
In addition to these basic volume types, volumes can also be striped (RAID 0).
For a striped volume the volume must span at least two disks. The volume data is
then interleaved across these disks. Data is allocated in fixed-sized units called
stripes. For example, Figure 13.6 shows a logical volume where the data is striped
across three disks with a stripe size of 64KB.
The first 64KB of data is written to disk 1, the second 64KB of data is written to
disk 2, the third to disk 3, and so on. Because the data is spread across multiple
disks, this increases both read and write performance because data can be read
from or written to the disks concurrently.
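The mapping from a logical volume offset to a disk and an offset within that
disk follows directly from the description above; the function below is an
illustrative sketch of that arithmetic, not volume manager code, and numbers
the disks from 0 rather than from 1 as in Figure 13.6.

```c
/* Location of a logical byte within a striped volume. */
struct stripe_loc {
    int disk;                  /* which disk holds the byte    */
    unsigned long disk_offset; /* byte offset within that disk */
};

/* Map a logical offset, given the stripe size in bytes and the number
 * of disks the volume is striped across. Stripes are assigned to the
 * disks round-robin: stripe 0 to disk 0, stripe 1 to disk 1, and so on. */
struct stripe_loc stripe_map(unsigned long offset,
                             unsigned long stripe_size, int ndisks)
{
    unsigned long stripe = offset / stripe_size; /* global stripe index */
    struct stripe_loc loc;

    loc.disk = (int)(stripe % (unsigned long)ndisks);
    loc.disk_offset = (stripe / (unsigned long)ndisks) * stripe_size
                    + offset % stripe_size;
    return loc;
}
```

With a 64KB stripe size and three disks, offsets 0, 64KB, and 128KB land on
the first, second, and third disks respectively, and offset 192KB wraps back
to the first disk at its second stripe.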
Volume managers can also implement software RAID-5, whereby data is
protected by parity information computed from the corresponding stripes on
the other disks and distributed across the disks in the volume.
In a SAN-based environment where all servers have shared access to the
underlying storage devices, management of the storage and allocation of logical
volumes must be coordinated between the different servers. This requires a
clustered volume manager: a set of cooperating volume managers, one per server,
that communicate to present a single unified view of the storage. This prevents one
server from overwriting the configuration of another server.
A logical volume created on one node in the cluster is visible to all other
nodes in the cluster. This allows parallel applications to run across the cluster and
see the same underlying raw volumes. As an example, Oracle RAC (Real
Application Clusters), formerly Oracle Parallel Server (OPS), can run on each node in the
cluster and access the database through the clustered volume manager.
Clustered volume managers are resilient to a server crash. If one of the servers
crashes, there is no loss of configuration since the configuration information is
shared across the cluster. Applications running on other nodes in the cluster see
no loss of data access.
Cluster Filesystem Management
The goal of a clustered filesystem is to present an identical view of the same
filesystem from multiple nodes within the cluster. As shown in the previous
sections on distributed filesystems, providing cache coherency between these
different nodes is not an easy task. Another difficult issue concerns lock
management between different processes accessing the same file.
Clustered filesystems have additional problems in that they must share the
resources of the filesystem across all nodes in the system. Taking a read/write
lock in exclusive mode on one node is inadequate if another process on another
node can do the same thing at the same time. Node joins and node failures must
also be taken into consideration: what happens if one of the nodes in the
cluster fails? The recovery mechanisms
involved are substantially different from those found in the distributed filesystem
client/server model.
The local filesystem must be modified substantially to take these
considerations into account. Each operation that is provided by the filesystem
must be modified to become cluster aware. For example, take the case of mounting
a filesystem. One of the first operations is to read the superblock from disk, mark
it dirty, and write it back to disk. If the mount command is invoked again for this
filesystem, it will quickly complain that the filesystem is dirty and that fsck
needs to be run. In a cluster, the mount command must know how to respond to
the dirty bit in the superblock.
A transaction-based filesystem is essential for providing a robust, clustered
filesystem because if a node in the cluster fails and another node needs to take
ownership of the filesystem, recovery needs to be performed quickly to reduce
downtime. There are two models in which clustered filesystems can be
constructed, namely:
Single transaction server. In this model, only one of the servers in the cluster,
the primary node, performs transactions. Although any node in the cluster
can perform I/O, if any structural changes are needed to the filesystem, a
request must be sent from the secondary node to the primary node in order to
perform the transaction.
Multiple transaction servers. With this model, any node in the cluster can
perform transactions.
Both types of clustered filesystems have their advantages and disadvantages.
While the single transaction server model is easier to implement, the primary
node can quickly become a bottleneck in environments where there is a lot of
meta-data activity.
There are also two approaches to implementing clustered filesystems. Firstly, a
clustered view of the filesystem can be constructed by layering the cluster
components on top of a local filesystem. Although simpler to implement,
without knowledge of the underlying filesystem implementation, difficulties can
arise in supporting various filesystem features.
The second approach is for the local filesystem to be cluster aware. Any
features that are provided by the filesystem must also be made cluster aware. All
locks taken within the filesystem must be cluster aware and reconfiguration in the
event of a system crash must recover all cluster state.
The section The VERITAS SANPoint Foundation Suite describes the various
components of a clustered filesystem in more detail.
Cluster Lock Management
Filesystems, volume managers, and other system software require different lock
types to coordinate access to their data structures, as described in Chapter 10. This
obviously holds true in a cluster environment. Consider the case where two
processes are trying to write to the same file. The process which obtains the inode
read/write lock in exclusive mode is the process that gets to write to the file first.
The other process must wait until the first process relinquishes the lock.
In a clustered environment, these locks, which are still based on primitives
provided by the underlying operating system, must be enhanced to provide
distributed locks, such that they can be queried and acquired by any node in the
cluster. The infrastructure required to perform this service is provided by a
distributed or global lock manager (GLM).
Developing a Filesystem for the Linux Kernel
This chapter describes the implementation of uxfs, a simple filesystem written
to demonstrate how filesystems are developed for the Linux kernel. In addition
to the kernel code, the mkfs and fsdb commands used to create and examine uxfs
filesystems are described, along with debugging techniques based on printk(),
kdb, and gdb. The filesystem was developed for the 2.4.18 kernel. The source
for uxfs can be downloaded from the following Web site:
www.wiley.com/compbooks/pate
The uxfs filesystem is intentionally small and simple. It uses 512-byte
blocks, defined by UX_BSIZE in ux_fs.h, and supports a maximum of 470 data
blocks (UX_MAXBLOCKS) and 32 inodes (UX_MAXFILES). Inodes 0 and 1 are not
used; inode 2 is the root directory and inode 3 is lost+found, leaving 28
inodes for user files. The superblock is stored in block 0 and holds the free
inode and free data block maps in addition to basic filesystem state:
struct ux_superblock {
        __u32 s_magic;
        __u32 s_mod;
        __u32 s_nifree;
        __u32 s_inode[UX_MAXFILES];
        __u32 s_nbfree;
        __u32 s_block[UX_MAXBLOCKS];
};
Each inode records the file's mode, link count, timestamps, ownership, size,
and the addresses of its data blocks:
struct ux_inode {
        __u32 i_mode;
        __u32 i_nlink;
        __u32 i_atime;
        __u32 i_mtime;
        __u32 i_ctime;
        __s32 i_uid;
        __s32 i_gid;
        __u32 i_size;
        __u32 i_blocks;
        __u32 i_addr[UX_DIRECT_BLOCKS];
};
Inodes are stored one per block starting at block 8, so the root directory
inode resides in block 10 and the lost+found inode in block 11; the remaining
inodes occupy blocks 12 through 39. Data blocks start at block 50, with the
root directory's data held in block 50 and lost+found's in block 51. Each
inode holds 9 direct block addresses, so the maximum file size is (9 * 512) =
4608 bytes. Directory entries are fixed in size, with filenames limited to 28
characters, giving a 32-byte entry. Despite its simplicity, uxfs supports
hard links, symbolic links, and rename operations, and it interfaces to the
kernel through the four operations vectors: super_operations,
file_operations, address_space_operations, and inode_operations.
Linux Kernel Source
This section shows how to download the Linux kernel source and how to find
your way around the kernel source tree to locate files that are of most interest to
filesystem development. Later sections show how to configure the kernel to
match the hardware on your system, to compile it, and then install the newly
built kernel. Both the LILO and GRUB bootloaders are described.
The Linux kernel source can be retrieved from the following Web site:
www.kernel.org
The home page of www.kernel.org shows the latest versions of the kernel. For
example, the following line showed the latest stable version at the time of this
writing:
The latest stable version of the Linux kernel is: 2.4.18 2002-07-10 00:40
UTC F V VI Changelog
The Web site also describes the state of the different kernels including the latest
stable version. Click on the kernel version to download the latest kernel. Clicking
on Changelog will display all of the updates to the latest kernel.
All of the kernels since Linux's inception can be found at this site. Follow the
links through to the source repositories and locate the kernel of your choice. To
use the source in the book as is, you need the 2.4.18 kernel. Alternatively, as
described earlier, newer versions of the filesystem can be obtained from the
following Web site:
www.wiley.com/compbooks/pate
Also at the site is information about which Linux kernels and the various Linux
distributions that uxfs supports.
To locate the required kernel source, follow the various pointers. As an
example, from the home page follow the link to Linux repository, including kernel
source, then kernel and 2.4. This will take you to the following link:
www.kernel.org/pub/linux/kernel/v2.4/
The kernel source is a compressed tar archive. Once the file has been
downloaded, it should be uncompressed and untarred. The kernel source usually
resides under /usr/src, although this is not mandatory. One possibility is to
untar the archive in /usr/src and set a symlink to point to the directory. For
example, if the bzip2-compressed archive has been placed in /usr/src, perform
the following steps:
# bunzip2 linux-2.4.18.tar.bz2
# mv linux linux.orig
# tar xvf linux-2.4.18.tar
# mv linux linux-2.4.18
# ln -s linux-2.4.18 linux
Extracting the files from the tar archive will place them in the directory linux in
the current working directory by default. The command to move the old linux
directory aside may be replaced with something more suitable to your
environment. Alternatively, the source can be extracted in a separate directory
and then moved into /usr/src/linux-2.4.18. Be careful not to overwrite any
existing Linux kernel source trees.
What's in the Kernel Source Tree
There are many files and directories in the Linux kernel source tree. This section
provides an overview of how the kernel source tree is laid out to allow readers
to easily locate the various kernel subsystems or specific files.
arch. This directory contains a directory for each of the different machine
architectures that Linux supports including Intel, Sparc, MIPS, and IBM s390.
CREDITS. This file lists all of the major contributors to the kernel together with
information about their area of expertise or contribution.
Documentation. There is a whole host of documentation distributed with
the kernel source. The filesystems directory contains information about
some of the different Linux filesystems in addition to generic
filesystem-related information.
drivers. This directory contains all of the Linux device drivers.
fs. This is the directory that will be of most relevance to people interested in
filesystems together with the mm directory that contains much of the page
cache/data I/O management code. Files in the fs directory implement the
dcache, buffer cache, inode cache, and file-related system call handling. Also
within the fs directory is a directory for each of the Linux filesystems.
Within their respective directories are the filesystem source files themselves.
include. All of the kernel header files can be accessed within this directory.
This directory contains architecture-specific header files in addition to
header files that are common across all architectures. The common header
files can be found in the linux subdirectory. The fs.h header file is of
particular importance to filesystem writers. The dcache.h header file
defines the structures used by the Linux dcache.
init. This directory contains functions that are executed during kernel
bootstrap.
ipc. This directory contains source applicable to System V IPC (Inter Process
Communication) including semaphores, shared memory, and message
queues.
kdb. If the kdb patch is installed, this directory contains source for the kernel
debugger. Note that the kdb patch also changes other files throughout the
kernel.
kernel. This directory contains core kernel routines such as process
management, system call handling, module management, and so on.
lib. Some of the standard C library functions have counterparts in the kernel.
The source can be found in this directory.
MAINTAINERS. This file lists the people who are responsible for various parts
of the kernel.
mm. This directory contains all of the memory management code that is not
specific to one architecture or another. The Linux page cache management
routines can be found in this directory.
net. All of the networking protocols (TCP, UDP, IP, etc.) are stored in this
directory.
There are too many files and directories to describe here. However, for readers
interested in learning about filesystems, the include, fs, and mm directories are
where most of the filesystem-related structures and routines can be found. There
are also a few interesting files in the drivers/block directory for those
wishing to look at the filesystem/driver interfaces in more detail.
Configuring the Kernel
Before building the kernel, it is necessary to determine the kernel configuration.
There are many components that are part of the kernel source tree that you will
not need as part of your kernel. For example, there are numerous different device
drivers for the various SCSI adaptors. If you don't have a need for SCSI access,
building support into the kernel is unnecessary. Thus, you need to determine
what hardware configuration you have and therefore which kernel components
are required.
There are several different methods of defining the configuration. The Linux
kernel HOWTO should be consulted in addition to the notes described here.
There are multiple copies of the HOWTO available across the World Wide Web.
You can find it at the following Web site:
www.tldp.org/HOWTO/Kernel-HOWTO.html
One of the easiest ways to determine which components of the kernel are needed
is to install the kernel source when the Linux operating system is installed. This
will result in a configuration file for the installed kernel being available for
consultation. It is then possible to copy the configuration file from the installed
kernel source tree to the new kernel source tree as follows:
# cp /usr/src/linux-2.4.18-3/.config /usr/src/linux-2.4.18/.config
Care must be taken here. If the new kernel being installed has a substantially
different configuration from the installed kernel, some options may not be
available. However, this method should suffice in most cases.
One method of defining the configuration is to run the following command for
both the installed kernel and the new kernel source. For example, for Red Hat 7.3
run the following:
# cd /usr/src/linux-2.4.18-3
# make menuconfig
And for the new kernel do the following:
# cd /usr/src/linux-2.4.18
# make menuconfig
By having both windows side by side it is easy to see which components you need
to select for the new kernel by browsing through the configuration of the current
kernel. The alternative method is to fully understand what type of hardware you
have. When comparing the configurations side by side, it is a safe bet to select
everything for the new kernel that is selected in the current kernel.
Items are selected if noted by an asterisk. Loadable kernel modules are denoted
by the letter "M." Instructions are available at the top of the screen to indicate how
to select. Pressing Enter expands the menu to the next level. Pressing the Escape
key takes you back up a level.
Once you have completed changing the configuration, a series of Escape key
sequences will prompt you as to whether you wish to save and exit. Note that you
do not need to save the configuration for the current kernel. This is particularly
important if you have accidentally made any changes. After saving the
configuration and exiting the program, the following message appears:
Saving your kernel configuration..
*** End of Linux kernel configuration.
*** Check the top-level Makefile for additional configuration.
*** Next, you must run make dep
Follow the instructions by issuing the following commands:
# make dep
# make clean
The first step builds all of the necessary kernel dependencies based on the set of
options chosen during the kernel configuration process. The next step is to ensure
that the build environment is clean such that a subsequent kernel compilation will
not pick up any precompiled files that do not match the configuration chosen.
The next step, which is the longest, is to compile the kernel. This can be
achieved by typing the following:
# make bzImage
..
objcopy -O binary -R .note -R .comment -S compressed/bvmlinux
compressed/bvmlinux.out
tools/build -b bbootsect bsetup compressed/bvmlinux.out CURRENT > bzImage
Root device is (3, 2)
Boot sector 512 bytes.
Setup is 2536 bytes.
System is 1301 kB
warning: kernel is too big for standalone boot from floppy
make[1]: Leaving directory '/usr/src/linux-2.4.18/arch/i386/boot'
Once the process is complete, the compressed kernel, which is called bzImage,
will be placed in the directory arch/i386/boot. This should be copied to
/boot and given a unique name as follows:
# cp arch/i386/boot/bzImage /boot/linux.spate
Note the name of the file that the kernel was copied to. This should be given an
easy to remember name and should not overwrite any existing kernels that are
already in /boot. One exception to this rule is when you are building kernels
frequently and you know which kernels can be safely overwritten.
Because many of the kernel components were probably selected to be kernel
modules, they must be compiled and installed as follows:
# make modules
# make modules_install
The modules are compiled and installed under the /lib/modules directory.
There is one subdirectory for each kernel version. For example, in the case of the
kernel being used here, the modules will reside under:
/lib/modules/2.4.18
It is important to remember to compile and install the modules selected during
configuration, a task that is often easy to forget. Without the modules in place,
the kernel may not boot.
Installing and Booting the New Kernel
The next step is to configure the boot loader to recognize the new kernel. Most
Linux distributions either use LILO or GRUB as the boot loader. This section
describes how to use LILO, the most commonly used boot loader.
Consider the following lines taken from one specific /etc/lilo.conf file
that was created as part of Red Hat 7.3 installation:
image=/boot/vmlinuz-2.4.18-3
label=linux
initrd=/boot/initrd-2.4.18-3.img
read-only
root=/dev/hda2
The image field specifies the kernel to bootstrap. When lilo runs and displays
the list of bootable kernels, it displays the names found next to the label field, in
this case linux. The initrd field specifies an initial root disk (RAM disk) that
will be used prior to checking and mounting the real root filesystem. The root
field specifies where the root disk can be found.
In order to bootstrap the new kernel, copy these lines to the end of the file and
change both the image and label lines as follows:
image=/boot/linux.spate
label=linux.spate
initrd=/boot/initrd-2.4.18-3.img
read-only
root=/dev/hda2
This creates an entry for the new kernel and leaves the existing entry for the
default kernel unchanged. Note that it is important not to modify any of the
configuration information for the kernel installed as part of the Linux installation.
It is imperative to have a kernel that boots safely because there will be times when
building new kernels where device drivers are accidentally omitted. For example,
it is not uncommon when building a kernel for the first few times to omit vital
information such as the correct disk drivers, rendering the new kernel
unbootable.
The final step is to run lilo to install information about the new kernel in the
master boot record:
# lilo
A successful run of lilo should not display anything. Once completed, you will
see an entry corresponding to your kernel (the label field) next time the machine
is rebooted.
Using GRUB to Handle Bootstrap
Many Linux distributions are now using the GRUB (GRand Unified Bootloader)
boot loader. This is extremely rich in features but operates in a different manner to
LILO. However, adding a new kernel is not difficult. The /etc/grub.conf file
is used in a similar manner to /etc/lilo.conf. However, adding an entry to
this file is sufficient. GRUB does not need to be run to install the information in
the master boot record.
For further information on GRUB, see the grub manual page.
Booting the New Kernel
The next step is to reboot the machine. Once the machine boots, lilo displays the
list of kernels that it is able to bootstrap. The newly installed kernel should be
visible. This can be selected using the arrow keys and loaded by pressing Enter. If
all goes well, the new kernel will boot as expected.
To verify that the kernel requested is running, the uname command can be
used to display the kernel version as follows:
# uname -a
Linux x.y.com 2.4.18 #2 SMP Tue Jul 30 18:55:27 PDT 2002 i686 unknown
The kernel version is shown in bold. There will be times when you reboot the
machine and lilo automatically boots a kernel by default and you often wonder
which kernel is running when you return to the machine. It is typically a good
idea to have the default kernel set to the kernel that was installed when the Linux
operating system was installed.
Installing Debugging Support
Analyzing the filesystem source code is one way to learn about how filesystems
work. However, with this method alone it is extremely difficult to truly
understand the flow through the kernel and filesystem in response to certain
operations. There is no better method than installing and using one of the
different kernel debuggers allowing you to stop in specific functions, display
stack backtraces and function arguments, and print other useful information.
There are three main methods under which a filesystem or indeed any other
part of the kernel can be debugged. The first approach involves using the kernel
printk() command which is very similar to printf(). The second approach
involves using a standalone debugger such as kdb whereby flow can be stopped
by placing explicit breakpoints or by entering a special key sequence to enter the
debugger. The third approach involves the use of two machines connected
through a serial cable and over which gdb can be used for source level
debugging.
The following sections describe each of these approaches. The amount of work
to perform each task is considerably different with printk() being the simplest
approach while the gdb approach involves more time to set up and an additional
machine. For readers who wish to experiment and have access to all the available
resources it is recommended that you start with printk() first, then move to
kdb, and finally to gdb.
The following sections assume some familiarity with debugging concepts.
The printk Approach to Debugging
One of the oldest and easiest styles of debugging is the printf() method. By
placing printf() statements throughout the code it is possible to display
information about the running program. This is useful for development or
simply to follow the flow through the program.
Linux provides the printk() function for kernel/module writers to use.
With the exception of the name change, it can be used in the same manner in
which printf() can be called. One method employed when writing uxfs was to
place a printk() at the start of each entry point to the filesystem. When typing
various commands at the user prompt, it is then easy to see which functions in the
filesystem are called.
Because Linux supports loadable modules, and the time to recompile and
reload a module is on the order of seconds, this is the easiest way to watch how the
filesystem works in practice and should be the method initially followed by
anyone new to kernel development who wants to understand how the kernel
works. To get a better idea of how the filesystem-related kernel functions work,
printk() calls can be placed throughout the kernel, and various structures can
be displayed.
Using the SGI kdb Debugger
The kdb debugger is a built-in debugger. It must be compiled with the kernel in
order for it to be used. It can be used to set breakpoints, display memory,
disassemble instructions, and display machine configuration such as the register
set. The debugger operates around the kernel symbol table, and therefore
functions and structures can be accessed by name.
The source code for kdb, which was developed by engineers at SGI (Silicon
Graphics Inc), can be downloaded from the SGI Web site. The home page for kdb
is as follows:
http://oss.sgi.com/projects/kdb/
Note that when following the link to the download section, the directories
displayed are for the versions of kdb and not versions of the Linux kernel. For the
kernel used to develop uxfs (2.4.18), kdb version 2.1 must be used (later
versions did not support this kernel at the time of writing).
The README file in the download directory contains instructions on which files
to download. This file should be consulted prior to downloading. Note that there
may be several versions for the same kernel. The README file specifies how to
interpret the version numbers of the patches.
There are two patch files to download. The first is common across all different
machine architectures and the second is specific to the machine architecture on
which you're running. After downloading the patches, they can be applied as
follows:
# cd /usr/src/linux-2.4.18
# patch -p1 < ../kdb-v2.1-2.4.18-common-
patching file kernel/sysctl.c
patching file kernel/ksyms.c
patching file kernel/Makefile
patching file init/main.c
..
patching file Documentation/kdb/kdb_env.man
patching file Documentation/kdb/kdb.mm
patching file Documentation/kdb/kdb_bp.man
patching file Documentation/kdb/slides
# patch -p2 < ../kdb-v2.1-2.4.18-i386-
patching file include/asm-i386/hw_irq.h
patching file include/asm-i386/keyboard.h
patching file include/asm-i386/ptrace.h
patching file arch/i386/vmlinux.lds
..
patching file arch/i386/kdb/kdbasupport.c
patching file arch/i386/kdb/ansidecl.h
patching file arch/i386/kdb/bfd.h
patching file arch/i386/kdb/ChangeLog
Once the patch has been successfully applied, the kernel configuration must be
changed to incorporate kdb. Under the section marked Kernel hacking, select the
option Built-in Kernel Debugger support and select the KDB modules. The kernel
must then be built (make dep ; make bzImage) and reinstalled as described
in the section Configuring the Kernel earlier in the chapter.
Included with the kdb patch is documentation on how the debugger works,
the commands that are available, and so on. The debugger can be entered by
pressing the BREAK key. The kdb prompt is then displayed as follows:
Entering kdb (current=0xc03b0000, pid 0) on processor 0 due to Keyboard Entry
[0]kdb>
The ? command can be used to display the available commands. Shown below is
a summary of the more commonly used commands. Examples of how they are
used in practice will be shown throughout the chapter.
bp. Set or display a breakpoint.
bph. Set a hardware breakpoint.
bc. Clear a breakpoint.
bl. List the current breakpoints.
bt. Display the stack backtrace for the current process.
go. Exit the debugger and restart kernel execution.
id. Disassemble instructions.
md. Display the contents of the specified address.
mds. Display memory symbolically.
mm. Modify memory.
reboot. Reboot the machine immediately.
rd. Display the register contents.
ss. Single step (one instruction at a time).
ssb. Single step the CPU until a branch is reached.
The kdb(8) man page describes the other commands.
Source Level Debugging with gdb
The GNU debugger gdb has been available for many years, typically being used
to debug user-level programs. However, by connecting machines together over a
serial line in a host/target configuration, gdb can also be used to debug the Linux
kernel. This requires a patch to the kernel to include a kgdb driver through which
gdb on the host machine can communicate. Although this requires an extra
machine and some additional setup work, the ease of use of debugging the kernel
at source level is well worth the extra work. It is also easier to see how the kernel
works because not only can breakpoints be added to show the flow through the
kernel, but function arguments can be displayed along with the source code
corresponding to the position at which the breakpoint is hit.
There are multiple patches for kernel-level gdb debugging. The following Web
page:
http://kgdb.sourceforge.net/
is the homepage for kgdb. It references all of the patches and contains detailed
instructions on gdb setup. The following sections highlight some of the main
points. For complete details, refer to the kgdb homepage.
Connecting the Host and Target Machines
The first step for gdb debugging is to connect the two machines together and
verify that data can be passed through the link. The machines must be connected
through a standard null modem between the serial ports of the machines as
shown in Figure 14.2.
Serial ports support transmission rates from 110 baud up to 115,200 baud. The
default baud rate for a serial port is 9,600. This is generally adequate for simple
debugging although higher baud rates are preferred if a lot of information will be
transmitted over the wire. This will certainly be the case when displaying
multiple thread stacks.
Once the link is in place, the speed of the serial port on each machine must be
identical. This can be verified on each machine as follows:
# stty < /dev/ttyS0
speed 9600 baud; line = 0;
min = 0; time = 10;
-brkint -icrnl -imaxbel
-opost -onlcr
-isig -icanon -iexten -echo -echoe -echok -echoctl -echoke
The baud rate is shown here as 9,600. If the baud rate differs between the two
machines, the following call to the stty command can set the baud rate:
# stty ispeed 9600 ospeed 9600 < /dev/ttyS0
Assuming that the baud rate is the same on both machines and the cable is in
place, the link can be tested by simply echoing a string through the cable on one
end and reading it on another as follows:
Host Target
# cat /dev/ttyS0 # echo hello > /dev/ttyS0
hello
If any problems are encountered, review the troubleshooting guide on the kgdb
kernel Web site.
Downloading the kgdb Patch
The download section of the kgdb kernel Web site contains the kernel patches for
specific Linux kernels. Each patch is an ASCII file that contains a set of diffs. Once
downloaded, the patch to build kgdb into the kernel can be applied to the kernel
as follows:
# cd /usr/src/linux
# patch -p1 < ../linux-2.4.18-kgdb-1.5.patch
patching file Documentation/Configure.help
patching file Documentation/i386/gdb-serial.txt
patching file Makefile
patching file arch/i386/Makefile
patching file arch/i386/config.in
patching file arch/i386/kernel/Makefile
..
patching file kernel/ksyms.c
patching file kernel/sched.c
Once the patch has been applied, the kernel configuration must be updated to
include the kgdb options. Under the Kernel Debugging section, select the
following line:
KGDB: Remote (serial) kernel debugging with gdb (NEW)
and then select each of the kgdb suboptions. Note that the Verbose BUG() reporting
option should not be selected.
After saving the kernel configuration, run the following:
# make dep
# make clean
# make bzImage
to build the new kernel. As described in earlier sections, the kernel will be found
under the arch/i386/boot directory.
Installing the kgdb-Modified Kernel
To install the new kernel, the entry in lilo.conf must be changed to instruct the
kernel to wait, on bootstrap, for a connection from gdb on the host machine.
Shown below is an entry in lilo.conf for the new kernel:
image=/boot/linux.gdb
label=linux.gdb
initrd=/boot/initrd-2.4.18-3.img
read-only
root=/dev/hda2
append="gdb gdbttyS=0 gdbbaud=9600"
This instructs the kgdb stub which serial port to use (/dev/ttyS0) and the baud
rate that was established earlier during gdb configuration.
When the new kernel bootstraps, the following message is displayed:
Waiting for connection from remote gdb..
To connect to the target machine, gdb must be run on the host and the following
commands should be entered:
# gdb
GNU gdb Red Hat Linux (5.1.90CVS-5)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i386-redhat-linux"
(gdb) target remote /dev/ttyS0
Remote debugging using /dev/ttyS0
0xc011323d in ?? ()
(gdb) c
Continuing.
PCI: PCI BIOS revision 2.10 entry at 0xfbfee, last bus=
PCI: Using configuration type
..
The "target remote" command specifies the serial port to connect to in order
to communicate with the kernel. The c command then continues execution.
To break into the debugger and instruct it where to access the symbolic
debugging information, hit Control-C as follows:
Program received signal SIGTRAP, Trace/breakpoint trap.
0xc011323d in ?? ()
(gdb) symbol-file /usr/src/linux/vmlinux
Reading symbols from /usr/src/linux/vmlinux...done.
The debugger now has enough information to debug the kernel.
gdb and Module Interactions
Because uxfs is a loadable module, gdb knows nothing about the location of the
module in memory or where to locate the module's symbolic information.
The loadmodule script, also located on the kgdb Web site, must be used to
load the module. It is assumed that the module source and binary are located on
the host machine and that it is possible to rcp from the host to the target.
Before running loadmodule, the GDBSCRIPTS variable, located at the top of
the script, must be altered to point to a directory where it can install a script for
use with gdb. As an example:
GDBSCRIPTS=/home/spate/uxfs/tools/gdbscripts
The script can then be run as follows:
# loadmodule target-machine ../kern/uxfs
Copying ../kern/uxfs to linux
Loading module ../kern/uxfs
Generating script /home/spate/uxfs/tools/gdbscripts/loadlinuxuxfs
Once completed, the module should be loaded on the target machine and the
script generated is displayed. This should be run from within gdb. Control-C will
get you into gdb from which the script can be executed as follows:
Program received signal SIGTRAP, Trace/breakpoint trap.
breakpoint () at gdbstub.c:1177
1177
(gdb) so /home/spate/uxfs/tools/gdbscripts/loadlinuxuxfs
add symbol table from file "/home/spate/uxfs/kern/uxfs" at
.text_addr = 0xd0854060
.rodata_addr = 0xd0855c60
__ksymtab_addr = 0xd085618c
__archdata_addr = 0xd08562b0
__kallsyms_addr = 0xd08562b0
.data_addr = 0xd08568c0
.bss_addr = 0xd0856a60
The setup of gdb is now complete. Control-C can be invoked at any time the
debugger needs to be entered to add break points and so on. Use of gdb for
kernel-level debugging will be shown throughout the chapter.
Building the uxfs Filesystem
The source code for all of the files that are needed to build the uxfs filesystem for
the 2.4.18 kernel is included at the end of the chapter. This includes the source for
mkfs and fsdb, the kernel makefile, and the kernel source. The source tree
downloaded from www.wiley.com/compbooks/spate is a gzipped tar
archive. Download to any directory and issue the following commands:
# gunzip uxfs.tar.gz
# tar xvf uxfs.tar
# ls
uxfs.tar uxfs
# ls uxfs
cmds kern
Commands can be easily built. All that is required is for the ux_fs.h header file to
be located in the "../kern" directory. To build each of the commands, go to the
cmds directory and issue the following:
# make mkfs
cc mkfs.c -o mkfs
# make fsdb
cc fsdb.c -o fsdb
The commands can then be used.
The kernel makefile is relatively straightforward as follows:
KERNELDIR = /usr/src/linux
include $(KERNELDIR)/.config
FLAGS = -D__KERNEL__ -DMODULE $(VERCFLAGS)
GLOBAL_CFLAGS = -g -I$(KERNELDIR)/include $(FLAGS)
M_OBJS = ux_dir.o ux_alloc.o ux_file.o ux_inode.o
M_TARGET = uxfs
SRCS = $(M_OBJS:.o=.c)
CFLAGS = $(GLOBAL_CFLAGS) $(EXTRA_CFLAGS)
$(M_TARGET) : $(M_OBJS)
ld -r -o $@ $(M_OBJS)
$(M_OBJS) : %.o : %.c
$(CC) -c $(CFLAGS) -o $@ $<
all: uxfs
clean:
rm -f $(M_OBJS) $(M_TARGET)
To build the kernel source, the KERNELDIR variable at the top of the Makefile
must be changed to reference the kernel source directory. Figure 14.3 shows how
KERNELDIR is set to reference the 2.4.18 source tree.
Once this variable has been set, the kernel can be built as follows:
# make uxfs
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_dir.o ux_dir.c
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_alloc.o ux_alloc.c
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_file.o ux_file.c
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_inode.o ux_inode.c
ld -r -o uxfs ux_dir.o ux_alloc.o ux_file.o ux_inode.o
This produces the uxfs module that can then be loaded into the kernel. This is
shown later in the chapter.
Creating a uxfs Filesystem
The first step when developing a new filesystem is to write a mkfs command to
place the initial filesystem layout on disk. This includes the following tasks:
Create and initialize the filesystem superblock and write it to disk.
Create a root directory inode and lost+found directory inode. For each
of the inodes, ensure that the "." and ".." entries are in place and for the root
directory, add an entry for lost+found.
Account for allocation of the two directories within the inode map.
Account for allocation of two blocks used for the root and lost+found
directories.
The code for mkfs can be found on lines 104 to 262. For uxfs, it is a fairly simple
program. As with the kernel, it uses various structure definitions and
information from ux_fs.h including superblock structural information, inode
formats, directory entries and various filesystem boundaries such as the
maximum number of blocks and inodes.
Before the filesystem is implemented, it is important to verify the information
that mkfs writes to disk. Thus, the next program to write is fsdb, which can read
back and display various superblock and inode information.
The fsdb command (lines 264 to 393) is very simple. It accepts two commands
that allow the superblock or a specified inode to be displayed.
The first task is to read the superblock into memory (lines 365 to 369), validate
it, and keep it in memory for the duration of the program. From here, it can
access any information it needs to about inodes or data blocks.
The remainder of the main() loop involves reading commands and then
calling additional routines. For now, only the superblock or an inode can be
displayed. By entering 'q', the program will exit.
The following output from fsdb shows the two commands being run on a
newly created filesystem:
# ./mkfs /dev/fd0
# ./fsdb /dev/fd0
uxfsdb >
Superblock contents:
s_magic = 0x58494e55
s_mod = UX_FSCLEAN
s_nifree = 28
s_nbfree = 468
uxfsdb > i2
inode number 2
i_mode = 41ed
i_nlink =
i_atime = Wed Aug 21 09:55:16 2002
i_mtime = Wed Aug 21 09:55:16 2002
i_ctime = Wed Aug 21 09:55:16 2002
i_uid =
i_gid =
i_size = 512
i_blocks =
i_addr[ 0] = 50 i_addr[ 1] = 0 i_addr[ 2] = 0 i_addr[ 3] =
i_addr[ 4] = 0 i_addr[ 5] = 0 i_addr[ 6] = 0 i_addr[ 7] =
i_addr[ 8] = 0 i_addr[ 9] = 0 i_addr[10] = 0 i_addr[11] =
i_addr[12] = 0 i_addr[13] = 0 i_addr[14] = 0 i_addr[15] =
Directory entries:
inum[ 2],name[.]
inum[ 2],name[..]
inum[ 3],name[lost+found]
uxfsdb > q
There are many more features that could be added to fsdb. Some of these
changes will be imperative when completing the exercises at the end of the
chapter.
Module Initialization and Deinitialization
When writing a loadable kernel module, there are three different things that need
to be defined:
A declaration giving information about the type of module
A function to be called when the module is loaded. This can perform any
initialization functions including registering the filesystem type with the
kernel.
A function to be called when the module is unloaded. This can clean up
any remaining filesystem structures and unregister the filesystem.
The various components that are applicable to uxfs are shown in ux_inode.c
on lines 1304 to 1317. The module_init() call specifies the function to be run
when the module is loaded while the module_exit() function specifies the
function to be called when the module is unloaded. Both of these functions
perform little work other than registering and unregistering the filesystem driver
respectively. The DECLARE_FSTYPE_DEV() macro is shown below:
#define DECLARE_FSTYPE(var,type,read,flags) \
struct file_system_type var = { \
        name:           type, \
        read_super:     read, \
        fs_flags:       flags, \
        owner:          THIS_MODULE, \
}

#define DECLARE_FSTYPE_DEV(var,type,read) \
        DECLARE_FSTYPE(var,type,read,FS_REQUIRES_DEV)
The kernel maintains a list of all such structures, one per filesystem. The entry for
uxfs is added when calling register_filesystem(). When a mount system
call enters the kernel, the filesystem name passed to mount is compared with the
name field of each file_system_type structure. If a match is found, the
read_super function is called to mount the filesystem.
The rmmod command is used to remove a kernel module. If there are still
filesystems mounted, the removal will fail; otherwise the kernel calls the module
exit function, which in the case of uxfs, is the exit_uxfs_fs() function. The
only action to perform is to call unregister_filesystem().
Testing the New Filesystem
The following examples show how a uxfs filesystem is created, how the kernel
module is loaded, the filesystem is unmounted, and how the module is unloaded.
Modules are loaded and unloaded with the insmod and rmmod commands. Note
that by default, the insmod command will attempt to look under
/lib/modules/ to locate the requested module. For
example, if the pathname is not specified as shown below, insmod will fail even
though the requested module is in the current directory. For this reason "./uxfs"
must be specified.
# ./mkfs /dev/fd0
# insmod ./uxfs
# lsmod
Module Size Used by Not tainted
uxfs 8608 0 (unused)
ext3 71968 2 (autoclean)
jbd 66208 2 (autoclean) [ext3]
# mount -t uxfs /dev/fd0 /mnt
# mount
/dev/hda2 on / type ext3 (rw)
none on /proc type proc (rw)
/dev/hda1 on /boot type ext3 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/hda5 on /home type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/fd0 on /mnt type uxfs (rw)
# rmmod uxfs
uxfs: Device or resource busy
# umount /mnt
# rmmod uxfs
# lsmod
Module Size Used by Not tainted
ext3 71968 2 (autoclean)
jbd 66208 2 (autoclean) [ext3]
The sequence of commands here is merely to illustrate the basics of how to get a
uxfs filesystem mounted. The module displayed by lsmod is the name of the
actual binary and does not bear any resemblance to the source code.
Mounting and Unmounting the Filesystem
The ux_read_super() function is called to mount a uxfs filesystem. This
function is declared through the DECLARE_FSTYPE_DEV() macro and becomes
known to the Linux kernel when the filesystem is registered. The code for this
function can be found in ux_inode.c on lines 1240 to 1302.
The ux_read_super() function takes three arguments as shown in
ux_inode.c on line 1234 and repeated below:
ux_read_super(struct super_block *s, void *data, int silent)
There is one super_block structure per mounted filesystem. One of the tasks to
be performed by ux_read_super() is to initialize this structure by filling in the
following fields:
s_magic. This field holds the magic number of the filesystem, which for uxfs
is 0x58494e55. This field has little practical value.
s_blocksize. This field holds the filesystem block size, which in the case of
uxfs is 512 bytes (UX_BSIZE).
s_op. This field holds the super_operations vector, a set of functions that
either deal with the filesystem as a whole or allow inodes to be read,
written, and deleted.
s_root. This field is set to reference the dentry for the root inode. This is
described in more detail later.
The data argument is used by the kernel to pass any arguments that were
passed to mount. At this stage, uxfs does not accept any command line
arguments to mount, so this parameter is ignored. The silent argument, if set,
allows the filesystem writer to display more detailed information when the
filesystem is running; this is useful for debugging.
The ux_read_super() function must also perform the following tasks:
Call set_blocksize() to specify to the underlying driver layer the units
of I/O that will be passed through when accessing data through the buffer
cache. Note that all subsequent I/O must be in fixed-size chunks.
Allocate and initialize a root inode for the filesystem. This will be
explained in more detail later.
The following example shows how to set a breakpoint in gdb, display a stack
backtrace, and show how to display various structures. First of all, after the
module is loaded, but before a call is made to mount a filesystem, a breakpoint
is set in ux_read_super(). Hitting Control-C will enter gdb from which the
breakpoint can be set:
(gdb) b ux_read_super
Breakpoint 1 at 0xd08557ca: file ux_inode.c, line 237.
(gdb)
Continuing.
In response to mounting the filesystem, the breakpoint will be hit as follows:
# mount -f uxfs /dev/fd0 /mnt
Breakpoint 1, ux_read_super (s=0xcf15a400, data=0x0, silent=0)
at ux_inode.c:237
237 dev = s->s_dev;
(gdb) list
232 struct ux_fs *fs;
233 struct buffer_head *bh;
234 struct inode *inode;
235 kdev_t dev;
236
237 dev = s->s_dev;
238 set_blocksize(dev, UX_BSIZE);
239 s->s_blocksize = UX_BSIZE;
240 s->s_blocksize_bits = UX_BSIZE_BITS;
241
The list command displays the source code from the point at which the
breakpoint has been hit. The bt command can be used to display the current stack
backtrace as follows:
(gdb) bt
#0 ux_read_super (s=0xcf15a400, data=0x0, silent=0) at ux_inode.c:237
#1 0xc0143868 in get_sb_bdev (fs_type=0xd0856a44,
dev_name=0xccfe8000 "/dev/fd0", flags=0, data=0x0) at super.c:697
#2 0xc0143d2d in do_kern_mount (type=0xccfe9000 "uxfs", flags=0,
name=0xccfe8000 "/dev/fd0", data=0x0) at super.c:879
#3 0xc0156ff1 in do_add_mount (nd=0xcd011f5c, type=0xccfe9000 "uxfs",
flags=0, mnt_flags=0, name=0xccfe8000 "/dev/fd0", data=0x0)
at namespace.c:630
#4 0xc01572b7 in do_mount (dev_name=0xccfe8000 "/dev/fd0",
dir_name=0xcf80f000 "/mnt", type_page=0xccfe9000 "uxfs",
flags=3236757504, data_page=0x0) at namespace.c:746
#5 0xc015737f in sys_mount (dev_name=0x805b418 "/dev/fd0",
dir_name=0x805b428 "/mnt", type=0x805b438 "uxfs", flags=3236757504,
data=0x0) at namespace.c:779
#6 0xc010730b in system_call ()
The arguments to the function at the current position in the stack trace
(ux_read_super()) can be displayed with the print (p) command. Note that
gdb understands C constructs:
(gdb) print *(struct super_block *)0xcf15a400
$1 = {s_list = {next = 0xc0293840, prev = 0xcf6df400}, s_dev = 512,
s_blocksize = 0, s_blocksize_bits = 0 '\000', s_dirt = 0 '\000',
s_maxbytes = 2147483647, s_type = 0xd0856a44, s_op = 0x0, dq_op = 0x0,
s_flags = 0, s_magic = 0, s_root = 0x0, s_umount = {count = -65535,
wait_lock = {lock = 1}, wait_list = {next = 0xcf15a43c,
prev = 0xcf15a43c}}, s_lock = {count = {counter = 0}, sleepers = 0,
wait = {lock = {lock = 1}, task_list = {next = 0xcf15a450,
prev = 0xcf15a450}}}, s_count = 1073741824, s_active = {counter = 1}
s_dirty = 0,
..
Later examples show some of the other features of gdb.
Scanning for a Uxfs Filesystem
The first task to perform when mounting the filesystem is to read the superblock
from disk. This involves a call to sb_bread() to read block 0 of the device on
which the superblock resides. The sb_bread() function is merely a wrapper
around bread() that extracts the device from the s_dev field of the
super_blockstructure. Thus the following calls are equivalent:
bh = sb_bread(sb, block);
bh = bread(sb->s_dev, block, sb->s_blocksize);
On return from sb_bread(), a buffer_head structure will reference the data
read from the device. Note that each call to sb_bread() must be followed at
some stage by a call to brelse() to release the buffer. An attempt to reread the
same block from disk prior to calling brelse() will cause the filesystem to
block. The data read from disk can be referenced by accessing the b_data field.
Because the superblock is located at offset 0 within block 0, the ux_superblock
structure can be referenced as shown in line 1253:
usb = (struct ux_superblock *)bh->b_data;
The first check to perform is to validate that this is a uxfs filesystem. Verification
is achieved by checking for presence of the uxfs magic number. Assuming that
this is detected and the superblock is not marked UX_FSDIRTY, the filesystem
can be mounted. Because all of the inode and data block information is stored in
the uxfs superblock, it is imperative to keep the superblock in memory at all
times. A ux_fs structure is allocated to keep hold of the buffer_head used to
read the superblock. This makes it easy to access the ux_superblock structure
from either the Linux super_block structure or from a Linux inode. This is
shown in Figure 14.4. Note that the buffer is not released until the filesystem is
unmounted.
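The cast-and-validate step can be mimicked in user space. The sketch below uses a hypothetical two-field superblock layout (the real struct ux_superblock in ux_fs.h also holds the inode and data block maps); only the magic-number and clean/dirty checks described above are shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplification of the on-disk uxfs superblock; the real
 * layout lives in ux_fs.h. */
#define UX_MAGIC   0x58494e55   /* spells "UNIX" */
#define UX_FSCLEAN 0
#define UX_BSIZE   512

struct ux_superblock_sim {
    uint32_t s_magic;
    uint32_t s_mod;             /* UX_FSCLEAN or UX_FSDIRTY */
};

/* Mimics the checks ux_read_super() performs after sb_bread():
 * cast the raw block to the superblock structure and validate it. */
static int ux_validate_sb(const char *b_data)
{
    const struct ux_superblock_sim *usb =
        (const struct ux_superblock_sim *)b_data;

    return usb->s_magic == UX_MAGIC && usb->s_mod == UX_FSCLEAN;
}
```

A block whose first bytes do not carry the magic number is rejected, which is exactly how a non-uxfs device is detected at mount time.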
Access to the ux_fs structure can be achieved through either the Linux
super_block structure or indirectly from the Linux inode structure as follows:
struct super_block *sb = inode->i_sb;
struct ux_fs *fs = (struct ux_fs *)sb->s_private;
struct ux_superblock *usb = fs->u_sb;
Because all exported uxfs functions are passed through either the super_block
or an inode structure as an argument, it is always possible to get access to the
uxfs superblock.
Reading the Root Inode
The final step when mounting the filesystem is to read in the root inode and
instantiate it in the dcache. This is achieved through a call to iget() followed by
a call to d_alloc_root().
The call to iget() will involve a call back into the filesystem to actually read
the inode from disk. Subsequent calls to iget() for the same inode will find the
entry in the cache avoiding the need for further filesystem access. For details on
how uxfs reads inodes see the section Reading an Inode from Disk a little later in the
chapter. The Linux kernel calls find_inode() (fs/inode.c) to scan the inode
cache for the inode. If not found, a call to get_new_inode() is made.
The call to d_alloc_root() is a wrapper to d_instantiate() that
initializes the d_sb field of the dentry structure to reference the new
super_block structure. Note that accessing any further inodes will involve
access to dentries that already exist and that have been initialized by the kernel.
At this stage, the mount is complete. The super_block structure has been
initialized, the root directory is accessible through the Linux inode cache/dcache,
and the kernel has access to the array of functions exported by the root inode
through which subsequent operations can be performed.
As another example of how to use gdb, a breakpoint can be set on the
ux_read_inode()function as follows:
(gdb) b ux_read_inode
Breakpoint 2 at 0xd0855312: file ux_inode.c, line 54.
(gdb)
Continuing.
As with the gdb example earlier, the source code can be displayed at the point
where the breakpoint is hit:
Breakpoint 2, ux_read_inode (inode=0xcd235460) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) list
49 void
50 ux_read_inode(struct inode *inode)
51 {
52 struct buffer_head *bh;
53 struct ux_inode *di;
54 unsigned long ino = inode->i_ino;
55 int block;
56
57 if (ino < UX_ROOT_INO || ino > UX_MAXFILES) {
58 printk("uxfs: Bad inode number %lu\n", ino);
and the stack backtrace is displayed to locate the flow through the kernel from
function to function. In the stack backtrace below, you can see the call from
ux_read_super() to iget() to read the root inode. Notice the inode number
(2) passed to iget().
(gdb) bt
#0 ux_read_inode (inode=0xcd235460) at ux_inode.c:54
#1 0xc015411a in get_new_inode (sb=0xcf15a400, ino=2, head=0xcfda3820,
find_actor=0, opaque=0x0) at inode.c:871
#2 0xc015439a in iget4 (sb=0xcf15a400, ino=2, find_actor=0, opaque=0x0)
at inode.c:984
#3 0xd0855bfb in iget (sb=0xcf15a400, ino=2)
at /usr/src/linux/include/linux/fs.h:1328
#4 0xd08558c3 in ux_read_super (s=0xcf15a400, data=0x0, silent=0)
at ux_inode.c:272
#5 0xc0143868 in get_sb_bdev (fs_type=0xd0856a44,
dev_name=0xccf35000 "/dev/fd0", flags=0, data=0x0) at super.c:697
#6 0xc0143d2d in do_kern_mount (type=0xccf36000 "uxfs", flags=0,
..
Finally, the inode structure passed to ux_read_inode() can be displayed.
Because the inode has not been read from disk, the in-core inode is only partially
initialized. The i_ino field is correct, but some of the other fields are invalid at
this stage.
(gdb) print *(struct inode *)0xcd235460
$2 = {i_hash = {next = 0xce2c7400, prev = 0xcfda3820}, i_list = {
next = 0xcf7aeba8, prev = 0xc0293d84}, i_dentry = {next = 0xcd235470,
prev = 0xcd235470}, i_dirty_buffers = {next = 0xcd235478,
prev = 0xcd235478}, i_dirty_data_buffers = {next = 0xcd235480,
prev = 0xcd235480}, i_ino = 2, i_count = {counter = 1}, i_dev = 512,
i_mode = 49663, i_nlink = 1, i_uid = 0, i_gid = 0,
i_rdev = 512, i_size = 0,
Because the address of the inode structure is known, it may be displayed at any
time. Simply enter gdb and run the above command once more.
Writing the Superblock to Disk
The uxfs superblock contains information about which inodes and data blocks
have been allocated along with a summary of both pieces of information. The
superblock resides in a single UX_MAXBSIZE buffer, which is held throughout the
duration of the mount. The usual method of ensuring that dirty buffers are
flushed to disk is to mark the buffer dirty as follows:
mark_buffer_dirty(bh);
However, the uxfs superblock is not released until the filesystem is unmounted.
Each time the superblock is modified, the s_dirt field of the superblock is set to
1. This informs the kernel that the filesystem should be notified on a periodic
basis by the kupdate daemon, which runs at regular intervals to flush dirty
buffers to disk. The kupdate() routine can be found in the Linux kernel source
in fs/buffer.c. To follow the flow from kupdate() through to the filesystem,
the following commands are run:
# ./mkfs /dev/fd0
# mount -t uxfs /dev/fd0 /mnt
# touch /mnt/file
Because a new file is created, a new inode is allocated that requires information in
the superblock to be updated. As part of this processing, which will be described
in more detail later in the chapter, the s_dirt field of the in-core superblock is set
to 1 to indicate that the superblock has been modified.
The ux_write_super() function (lines 1218 to 1229) is called to write the
superblock to disk. Setting a breakpoint in ux_write_super() using kdb as
follows:
Entering kdb (current=0xcbe20000, pid 1320) on processor 0 due to
Keyboard Entry
[0]kdb> bp ux_write_super
Instruction(i) BP #1 at 0xd08ab788 ([uxfs]ux_write_super)
is enabled globally adjust
and creating the new file as shown will eventually result in the breakpoint being
hit, as follows:
Entering kdb (current=0xc1464000, pid 7) on processor 0 due to Breakpoint
@ 0xd08ab788
[0]kdb> bt
EBP EIP Function(args)
0xc1465fc4 0xd08ab788 [uxfs]ux_write_super (0xcc53b400, 0xc1464000)
uxfs .text 0xd08aa060 0xd08ab788 0xd08ab7c4
0xc014b242 sync_supers+0x142 (0x0, 0xc1464000)
kernel .text 0xc0100000 0xc014b100 0xc014b2c0
0xc1465fd4 0xc0149bd6 sync_old_buffers+0x66 (0xc1464000, 0x10f00,
0xcffe5f9c, 0xc0105000)
kernel .text 0xc0100000 0xc0149b70 0xc0149cf0
0xc1465fec 0xc014a223 kupdate+0x273
kernel .text 0xc0100000 0xc0149fb0 0xc014a230
0xc01057c6 kernel_thread+0x26
kernel .text 0xc0100000 0xc01057a0 0xc01057e0
Note the call from kupdate() to sync_old_buffers(). Following through,
the kernel code shows an inline function, write_super(), which actually calls
into the filesystem as follows:
if (sb->s_root && sb->s_dirt)
        if (sb->s_op && sb->s_op->write_super)
                sb->s_op->write_super(sb);
Thus, the write_super entry of the superblock_operations vector is
called. For uxfs, the buffer holding the superblock is simply marked dirty.
Although this doesn't flush the superblock to disk immediately, it will be written
shortly afterwards as part of kupdate() processing.
The only other task to perform by ux_write_super() is to set the s_dirt
field of the in-core superblock back to 0. If left at 1, ux_write_super() would
be called every time kupdate() runs and would, for all intents and purposes,
lock up the system.
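The s_dirt handshake can be illustrated with a small user-space analogue. The structure and function names below are hypothetical stand-ins, not the kernel code; the point is that write_super is invoked only when the flag is set, and must clear it.

```c
#include <assert.h>

/* User-space analogue (not the kernel code) of the s_dirt handshake:
 * modifying the superblock sets s_dirt; the periodic sync calls
 * write_super only when s_dirt is set, and write_super clears it. */
struct sim_super {
    int s_dirt;
    int writes;                 /* counts calls to sim_write_super() */
};

static void sim_write_super(struct sim_super *sb)
{
    sb->writes++;               /* the real ux_write_super() marks the
                                   superblock buffer dirty here */
    sb->s_dirt = 0;             /* forgetting this would cause a flush on
                                   every kupdate run */
}

static void sim_kupdate(struct sim_super *sb)
{
    if (sb->s_dirt)
        sim_write_super(sb);
}
```

Running sim_kupdate() twice after one modification results in a single write, which is the behavior the text describes; leaving s_dirt set would write on every pass.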
Unmounting the Filesystem
Dirty buffers and inodes are flushed to disk separately and are therefore not
really part of unmounting the filesystem. If the filesystem is busy when an
unmount command is issued, the kernel returns EBUSY to the user without
communicating with the filesystem.
If there are no open files on the system, dirty buffers and inodes are flushed to
disk and the kernel makes a call to the put_super function exported through
the superblock_operations vector. For uxfs, this function is
ux_put_super() (lines 1176 to 1188).
The path when entering ux_put_super() is as follows:
Breakpoint 4, ux_put_super (s=0xcede4c00) at ux_inode.c:167
167 struct ux_fs *fs = (struct ux_fs *)s->s_private;
(gdb) bt
#0 ux_put_super (s=0xcede4c00) at ux_inode.c:167
#1 0xc0143b32 in kill_super (sb=0xcede4c00) at super.c:800
#2 0xc01481db in path_release (nd=0xc9da1f80)
at /usr/src/linux-2.4.18/include/linux/mount.h:50
#3 0xc0156931 in sys_umount (name=0x8053d28 "/mnt", flags=0)
at namespace.c:395
#4 0xc015694e in sys_oldumount (name=0x8053d28 "/mnt")
at namespace.c:406
#5 0xc010730b in system_call ()
There are only two tasks to be performed by ux_put_super():
Mark the buffer holding the superblock dirty and release it.
Free the structure used to hold the ux_fs structure that was allocated
during ux_read_super().
If there are any inodes or buffers used by the filesystem that have not been freed,
the kernel will free them and display a message on the console about their
existence. There are places within uxfs where this will occur. See the exercises at
the end of the chapter for further information.
Directory Lookups and Pathname Resolution
There are three main entry points into the filesystem for dealing with pathname
resolution, namely ux_readdir(), ux_lookup(), and ux_read_inode().
One interesting way to see how these three functions work together is to consider
the interactions between the kernel and the filesystem in response to the user
issuing an ls command on the root directory. When the filesystem is mounted,
the kernel already has a handle on the root directory, which exports the following
operations:
struct inode_operations ux_dir_inops = {
create: ux_create,
lookup: ux_lookup,
mkdir: ux_mkdir,
rmdir: ux_rmdir,
link: ux_link,
unlink: ux_unlink,
};
struct file_operations ux_dir_operations = {
read: generic_read_dir,
readdir: ux_readdir,
fsync: file_fsync,
};
The kernel has two calls at a directory level for name resolution. The first is to call
ux_readdir() to obtain the names of all the directory entries. After the
filesystem is mounted, the only inode in memory is the root inode so this
operation can only be invoked on the root inode. Given a filename, the
ux_lookup() function can be called to look up a name relative to a directory.
This function is expected to return the inode for the name if found.
The following two sections describe each of these operations in more detail.
Reading Directory Entries
When issuing a call to ls, the ls command needs to know about all of the entries
in the specified directory or the current working directory if ls is typed without
any arguments. This involves calling the getdents() system call. The prototype
for getdents() is as follows:
int getdents(unsigned int fd, struct dirent *dirp, unsigned int count)
The dirp pointer references an area of memory whose size is specified in count.
The kernel will try to read as many directory entries as possible. The number of
bytes read is returned from getdents(). The dirent structure is shown below:
struct dirent
{
long d_ino; /* inode number */
off_t d_off; /* offset to next dirent */
unsigned short d_reclen; /* length of this dirent */
char d_name [NAME_MAX+1]; /* file name (null-terminated) */
};
To read all directory entries, ls may need to call getdents() multiple times
depending on the size of the buffer passed in relation to the number of entries in
the directory.
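At the C library level, the same loop is normally written with readdir(3), which wraps the getdents() system call and hands back one entry per call. A minimal sketch:

```c
#include <dirent.h>
#include <stddef.h>

/* Counts the entries in a directory the way ls ultimately does:
 * repeated reads that each return the next entry.  readdir(3) is the
 * portable wrapper over the getdents() system call described above. */
static int count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *dp;
    int n = 0;

    if (dir == NULL)
        return -1;
    while ((dp = readdir(dir)) != NULL)
        n++;                    /* each dp carries d_ino and d_name */
    closedir(dir);
    return n;
}
```

Each iteration corresponds to consuming one dirent from the buffer filled by the kernel; when the buffer is exhausted, the library issues another getdents() call transparently.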
To fill in the buffer passed to the kernel, multiple calls may be made into the
filesystem through the ux_readdir() function. The definition of this function
is as follows:
int
ux_readdir(struct file *filp, void *dirent, filldir_t filldir)
Each time the function is called, the current offset within the directory is
increased. The first step taken by ux_readdir() is to map the existing offset
into a block number as follows:
pos = filp->f_pos;
blk = (pos + 1) / UX_BSIZE;
blk = uip->i_addr[blk];
On first entry pos will be 0 and therefore the block to read will be i_addr[0].
The buffer corresponding to this block is read into memory and a search is made
to locate the required filename. Each block comprises
UX_DIRS_PER_BLOCK ux_dirent structures. Assuming that the entry in the
block at the appropriate offset is valid (d_ino is not 0), the filldir() routine, a
generic kernel function used by all filesystems, is called to copy the entry to the
user's address space.
For each directory entry found, or if a null directory entry is encountered, the
offset within the directory is incremented as follows:
filp->f_pos += sizeof(struct ux_dirent);
to record where to start the next read if ux_readdir() is called again.
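The offset arithmetic can be tried in user space. The sketch assumes a 32-byte ux_dirent, giving 16 entries per 512-byte block; for such aligned offsets the plain pos / UX_BSIZE division yields the same block index as the (pos + 1) / UX_BSIZE form shown in the listing.

```c
#define UX_BSIZE           512
#define UX_DIRENT_SIZE      32   /* assumed size of struct ux_dirent */
#define UX_DIRS_PER_BLOCK  (UX_BSIZE / UX_DIRENT_SIZE)

/* Maps a byte offset within a directory (filp->f_pos) to the index of
 * the direct block (the i_addr[] slot) and the entry within that block. */
static void dir_pos_to_slot(long pos, int *blk, int *entry)
{
    *blk   = (int)(pos / UX_BSIZE);
    *entry = (int)((pos % UX_BSIZE) / UX_DIRENT_SIZE);
}
```

On first entry pos is 0, which lands on entry 0 of i_addr[0]; after one full block of entries the offset reaches 512 and the scan moves to i_addr[1].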
Filename Lookup
From a filesystem perspective, pathname resolution is a fairly straightforward
affair. All that is needed is to provide the lookup() function of the
inode_operations vector that is passed a handle for the parent directory and a
name to search for. Recall from the ux_read_super() function described in the
section Reading the Root Inode earlier in the chapter, after the superblock has been
read into memory and the Linux super_block structure has been initialized, the
root inode must be read into memory and initialized. The uxfs
ux_inode_operations vector is assigned to the i_op field of the root inode.
From there, filenames may be searched for, and once those directories are brought
into memory, a subsequent search may be made.
The ux_lookup() function in ux_dir.c (lines 838 to 860) is called passing
the parent directory inode and a partially initialized dentry for the filename to
look up. The next section gives examples showing the arguments passed.
There are two cases that must be handled by ux_lookup():
The name does not exist in the specified directory. In this case an EACCES
error is returned and the kernel marks the dentry as
negative. If another search is requested for the same name, the kernel finds
the negative entry in the dcache and will return an error to the user. This
method is also used when creating new files and directories and will be
shown later in the chapter.
The name is located in the directory. In this case the filesystem should
call iget() to allocate a new Linux inode.
The main task performed by ux_lookup() is to call ux_find_entry() as
follows:
inum = ux_find_entry(dip, (char *)dentry->d_name.name);
Note that the d_name field of the dentry has already been initialized to reference
the filename. The ux_find_entry() function in ux_inode.c (lines 1031 to
1054) loops through all of the blocks in the directory (i_addr[]) making a call to
sb_bread() to read each appropriate block into memory.
For each block, there can be UX_DIRS_PER_BLOCK ux_dirent structures. If a
directory entry is not in use, the d_ino field will be set to 0. Figure 14.5 shows the
root directory inode and how entries are laid out within the inode data blocks. For
each entry read, a check is made that the inode number (d_ino) is not zero,
indicating that the directory entry is valid. If the entry is valid, a string
comparison is made between the name requested (stored in the dentry) and the
entry in the directory (d_name). If the names match, the inode number is
returned.
If there is no match in any of the directory entries, 0 is returned. Note that inode
0 is unused so callers can detect that the entry is not valid.
Once a valid entry is found, ux_lookup() makes a call to iget() to bring the
inode into memory, which will call back into the filesystem to actually read the
inode.
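The scan performed by ux_find_entry() can be sketched in user space over an in-memory copy of a directory block. The structure below is a simplified stand-in for struct ux_dirent, and the 28-byte name limit is an assumption for illustration:

```c
#include <string.h>

#define UX_NAMELEN 28           /* assumed uxfs filename limit */

/* Hypothetical in-memory copy of one directory block's entries. */
struct sim_dirent {
    unsigned int d_ino;         /* 0 means the slot is free */
    char         d_name[UX_NAMELEN];
};

/* Mirrors the core of ux_find_entry(): skip free slots, compare names,
 * and return the inode number, or 0 (inode 0 is never allocated). */
static unsigned int find_entry(const struct sim_dirent *dirs, int ndirs,
                               const char *name)
{
    int i;

    for (i = 0; i < ndirs; i++) {
        if (dirs[i].d_ino != 0 && strcmp(dirs[i].d_name, name) == 0)
            return dirs[i].d_ino;
    }
    return 0;
}
```

Because inode 0 is unused, the single return value distinguishes "found" from "not found" without a separate status code, just as the text notes.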
Filesystem/Kernel Interactions for Listing Directories
This section shows the kernel/filesystem interactions when running ls on the
root directory. The two main entry points into the filesystem for dealing with
name resolution, which were described in the last two sections, are
ux_lookup() and ux_readdir(). To obtain further information about a
filename, ux_read_inode() must be called to bring the inode into memory.
The following example sets a breakpoint on all three functions and then an ls is
issued on a filesystem that has just been mounted. The filesystem to be mounted
has the lost+found directory (inode 3) and a copy of the passwd file (inode 4).
There are no other files.
First, the breakpoints are set in gdb as follows:
(gdb) b ux_lookup
Breakpoint 8 at 0xd0854b32: file ux_dir.c, line 367.
(gdb) b ux_readdir
Breakpoint 9 at 0xd0854350
(gdb) b ux_read_inode
Breakpoint 10 at 0xd0855312: file ux_inode.c, line 54.
The filesystem is then mounted and the first breakpoint is hit as follows:
# mount -f uxfs /dev/fd0 /mnt
Breakpoint 10, ux_read_inode (inode=0xcd235280) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) p inode->i_ino
$19 = 2
This is a request to read inode number 2 and is called as part of the
ux_read_super() operation described in the section Mounting and Unmounting
the Filesystem earlier in the chapter. The print (p) command in gdb can be used
to display information about any of the parameters passed to the function.
Just to ensure that the kernel is still in the process of mounting the filesystem, a
portion of the stack trace is displayed as follows, which shows the call to
ux_read_super():
(gdb) bt
#0 ux_read_inode (inode=0xcd235280) at ux_inode.c:54
#1 0xc015411a in get_new_inode (sb=0xcf15a400, ino=2, head=0xcfda3820,
find_actor=0, opaque=0x0) at inode.c:871
#2 0xc015439a in iget4 (sb=0xcf15a400, ino=2, find_actor=0, opaque=0x0)
at inode.c:984
#3 0xd0855bfb in iget (sb=0xcf15a400, ino=2)
at /usr/src/linux/include/linux/fs.h:1328
#4 0xd08558c3 in ux_read_super (s=0xcf15a400, data=0x0, silent=0)
at ux_inode.c:272
..
The next step is to run ls /mnt, which will result in numerous calls into the
filesystem. The first such call is:
# ls /mnt
Breakpoint 9, 0xd0854350 in ux_readdir (filp=0xcd39cc60,
dirent=0xccf0dfa0, filldir=0xc014dab0 <filldir64>)
This is a request to read directory entries from the root directory. This can be
shown by displaying the inode number of the directory on which the operation is
taking place. Note how C-like constructs can be used within gdb:
(gdb) p ((struct inode *)(filp->f_dentry->d_inode))->i_ino
$20 = 2
Here is the stack backtrace:
(gdb) bt
#0 0xd0854350 in ux_readdir (filp=0xcd39cc60, dirent=0xccf0dfa0,
filldir=0xc014dab0 <filldir64>)
#1 0xc014d64e in vfs_readdir (file=0xcd39cc60, filler=0xc014dab0,
buf=0xccf0dfa0) at readdir.c:27
#2 0xc014dc2d in sys_getdents64 (fd=3, dirent=0x8058730, count=512)
at readdir.c:311
#3 0xc010730b in system_call ()
Although ls may make repeated calls to getdents(), the kernel records the last
offset within the directory after the previous call to readdir(). This can be used
by the filesystem to know which directory entry to read next. The ux_readdir()
routine obtains this offset as follows:
pos = filp->f_pos;
It can then read the directory at that offset or advance further into the directory if
the slot at that offset is unused. Either way, when a valid entry is found, it is
copied to the user buffer and the offset is advanced to point to the next entry.
Following this call to ux_readdir(), there are two subsequent calls. Without
looking too deeply, one can assume that ls will read all directory entries first.
The next breakpoint hit is a call to ux_lookup() as follows:
Breakpoint 8, ux_lookup (dip=0xcd235280, dentry=0xcd1e9ae0) at
ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *)
The dip argument is the root directory and the dentry is a partially initialized
entry in the dcache. The name to lookup can be found within the dentry
structure as follows:
(gdb) p dentry->d_name
$23 = {name = 0xcd1e9b3c "lost+found", len = 10, hash = 4225228667}
The section Filename Lookup earlier in the chapter showed how the name can be
found in the directory and, if found, ux_lookup() will call iget() to read the
inode into memory. Thus, the next breakpoint is as follows:
Breakpoint 10, ux_read_inode (inode=0xcf7aeba0) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) p inode->i_ino
$24 = 3
The inode number being looked up is inode number 3, which is the inode
number for the lost+found directory. The stack backtrace at this point is:
(gdb) bt
#0 ux_read_inode (inode=0xcf7aeba0) at ux_inode.c:54
#1 0xc015411a in get_new_inode (sb=0xcf15a400, ino=3, head=0xcfda3828,
find_actor=0, opaque=0x0) at inode.c:871
#2 0xc015439a in iget4 (sb=0xcf15a400, ino=3, find_actor=0, opaque=0x0)
at inode.c:984
#3 0xd0854e73 in iget (sb=0xcf15a400, ino=3)
at /usr/src/linux/include/linux/fs.h:1328
#4 0xd0854b93 in ux_lookup (dip=0xcd235280, dentry=0xcd1e9ae0)
at ux_dir.c:379
#5 0xc01482c0 in real_lookup (parent=0xcd1e9160,
name=0xccf0df5c, flags=0)
at namei.c:305
#6 0xc0148ba4 in link_path_walk (name=0xcf80f00f "", nd=0xccf0df98)
at namei.c:590
#7 0xc014943a in __user_walk (name=0x0, flags=8, nd=0xccf0df98)
at namei.c:841
#8 0xc0145877 in sys_lstat64 (filename=0xbffff950 "/mnt/lost+found",
statbuf=0x805597c, flags=1108542220) at stat.c:352
#9 0xc010730b in system_call ()
Thus, the ls command has obtained the lost+found directory entry through
calling readdir() and is now invoking a stat() system call on the file. To
obtain the information to fill in the stat structure, the kernel needs to bring the
inode into memory in which to obtain the appropriate information.
There are two more calls to ux_readdir() followed by the next breakpoint:
Breakpoint 8, ux_lookup (dip=0xcd235280, dentry=0xcd1e90e0) at ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *)
(gdb) p dentry->d_name
$26 = {name = 0xcd1e913c "passwd", len = 6, hash = 3467704878}
This is also invoked in response to the stat() system call. And the final
breakpoint hit is:
Breakpoint 10, ux_read_inode (inode=0xcd0c4c00) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) p inode->i_ino
$27 = 4
in order to read the inode and fill in the fields of the stat structure.
Although not shown here, another method to help understand the flow of
control when reading directory entries is either to modify the ls source code itself
to see the calls it is making or use the ls program (shown in Chapter 2).
Inode Manipulation
Previous sections have already highlighted some of the interactions between the
kernel, the inode cache, and the filesystem. When a lookup request is made into
the filesystem, uxfs locates the inode number and then calls iget() to read the
inode into memory. The following sections describe the inode cache/filesystem
interactions in more detail. Figure 14.6 can be consulted for a high-level view of
these interactions.
Reading an Inode from Disk
The ux_read_inode() function (lines 1061 to 1109) is called from the kernel
iget()function to read an inode into memory. This is typically called as a result
of the kernel calling ux_lookup(). A partially initialized inode structure is
passed to ux_read_inode()as follows:
void
ux_read_inode(struct inode *inode)
and the inode number of the inode can be found in inode->i_ino. The role of
ux_read_inode() is simply to read the inode into memory and copy relevant
fields of the disk-based inode into the inode structure
passed.
This is a relatively straightforward task in uxfs. The inode number must be
converted into a block number within the filesystem and then read through the
buffer cache into memory. This is achieved as follows:
block = UX_INODE_BLOCK + ino;
bh = sb_bread(inode->i_sb, block);
Recall that each uxfs inode is held in its own block on disk and inode 0 starts at
the block number defined by UX_INODE_BLOCK.
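This mapping is simple enough to check in isolation. The value of UX_INODE_BLOCK below is an assumption for illustration; the real value comes from ux_fs.h.

```c
#define UX_INODE_BLOCK 8        /* assumed start of the inode blocks */
#define UX_ROOT_INO    2

/* Each uxfs inode occupies its own 512-byte block, so the inode number
 * maps to a block number by a simple offset, as in ux_read_inode(). */
static int inode_to_block(unsigned long ino)
{
    return (int)(UX_INODE_BLOCK + ino);
}
```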
Once read into memory, a copy is made of the inode to the location within the
in-core inode defined by the i_private field. This address is at the end of the
in-core inode where the union of filesystem dependent information is stored. The
i_private field is defined in ux_fs.h as follows:
#define i_private u_generic_ip
Before freeing the buffer, the in-core inode fields are updated to reflect the on-disk
inode. Such information is used by the kernel for operations such as handling the
stat()system call.
One additional task to perform in ux_read_inode()is to initialize the i_op,
i_fop, and i_mapping fields of the inode structure with the operations
applicable to the file type. The set of operations that are applicable to a directory
are different to the set of operations that are applicable to regular files. The
initialization of both types of inodes can be found on lines 1088 to 1097 and
duplicated here:
if (di->i_mode & S_IFDIR) {
        inode->i_mode |= S_IFDIR;
        inode->i_op = &ux_dir_inops;
        inode->i_fop = &ux_dir_operations;
} else if (di->i_mode & S_IFREG) {
        inode->i_mode |= S_IFREG;
        inode->i_op = &ux_file_inops;
        inode->i_fop = &ux_file_operations;
        inode->i_mapping->a_ops = &ux_aops;
}
Operations such as reading directory entries are obviously not applicable to
regular files, while various I/O operations are not applicable to directories.
Allocating a New Inode
There is no operation exported to the kernel to allocate a new inode. However, in
response to requests to create a directory, regular file, and symbolic link, a new
inode needs to be allocated. Because uxfs does not support symbolic links, new
inodes are allocated when creating regular files or directories. In both cases, there
are several tasks to perform:
Call new_inode() to allocate a new in-core inode.
Call ux_ialloc() to allocate a new uxfs disk inode.
Initialize both the in-core and the disk inode.
Mark the superblock dirty; the free inode array and summary have been
modified.
Mark the inode dirty so that the new contents will be flushed to disk.
Information about the creation of regular files and directories is the subject of the
sections File Creation and Link Management and Creating and Removing Directories
later in the chapter. This section only describes the ux_ialloc() function that
can be found in the filesystem source code on lines 413 to 434.
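The allocation scan itself can be sketched in user space. The sketch assumes the superblock tracks inodes in a fixed-size in-use array plus a free-count summary, which is the model the text describes; the names and sizes are illustrative, not the uxfs source.

```c
#define UX_MAXFILES 32          /* assumed size of the inode table */

/* The uxfs superblock holds an array marking inodes free or allocated
 * plus a summary count.  This mirrors the scan a function like
 * ux_ialloc() performs: find a free slot, mark it in use, and update
 * the summary. */
struct sim_sb {
    int s_inode[UX_MAXFILES];   /* 1 = allocated, 0 = free */
    int s_nifree;               /* summary: number of free inodes */
};

static int sim_ialloc(struct sim_sb *sb)
{
    int i;

    if (sb->s_nifree == 0)
        return 0;               /* no inodes left; 0 is never valid */
    for (i = 0; i < UX_MAXFILES; i++) {
        if (sb->s_inode[i] == 0) {
            sb->s_inode[i] = 1;
            sb->s_nifree--;
            return i;           /* caller also marks the superblock dirty */
        }
    }
    return 0;
}
```

Because the array and summary both change, the caller must mark the superblock dirty afterwards, which is the fourth step in the list above.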
Writing an Inode to Disk
Each time an inode is modified, the inode must be written to disk before the
filesystem is unmounted. This includes allocating or removing blocks or
changing inode attributes such as timestamps.
Within uxfs itself, there are several places where the inode is modified. The
only thing that these functions need to perform is to mark the inode dirty as
follows:
mark_inode_dirty(inode);
The kernel will call the ux_write_inode() function to write the dirty inode to
disk. This function, which can be found on lines 1115 to 1141, is exported through
the superblock_operationsvector.
The following example uses kdb to set a breakpoint on ux_write_inode()
in order to see where the function is called from.
[0]kdb> bp ux_write_inode
The breakpoint can be easily hit by copying files into a uxfs filesystem. The stack
backtrace when the breakpoint is encountered is as follows:
Instruction(i) BP #0 at 0xd08cd4c8 ([uxfs]ux_write_inode)
is enabled globally adjust
Entering kdb (current=0xc1464000, pid 7) on processor 0 due to Breakpoint
@ 0xd08cd4c8
[0]kdb> bt
EBP EIP Function(args)
0xc1465fc8 0xd08cd4c8 [uxfs]ux_write_inode (0xc77f962c, 0x0, 0xcf9a8868,
0xcf9a8800, 0xc1465fd4)
uxfs .text 0xd08cc060 0xd08cd4c8 0xd08cd5c0
0xc015d738 sync_unlocked_inodes+0x1d8 (0xc1464000)
kernel .text 0xc0100000 0xc015d560
0xc015d8e0
0xc1465fd4 0xc0149bc8 sync_old_buffers+0x58 (0xc1464000, 0x10f00,
0xcffe5f9c, 0xc0105000)
kernel .text 0xc0100000 0xc0149b70
0xc0149cf0
0xc1465fec 0xc014a223 kupdate+0x273
kernel .text 0xc0100000 0xc0149fb0
0xc014a230
0xc01057c6 kernel_thread+0x26
kernel .text 0xc0100000 0xc01057a0
0xc01057e0
As with flushing the superblock when dirty, the kupdate daemon locates dirty
inodes and invokes ux_write_inode() to write them to disk.
The tasks to be performed by ux_write_inode() are fairly straightforward:
Locate the block number where the inode resides. This can be found by
adding the inode number to UX_INODE_BLOCK.
Read the inode block into memory by calling sb_bread().
Copy fields of interest from the in-core inode to the disk inode, then copy
the disk inode to the buffer.
Mark the buffer dirty and release it.
Because the buffer is marked dirty, the periodic run of kupdate will
write it to disk.
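Assuming the one-inode-per-block layout implied above, the block arithmetic in the first step can be sketched in user-space C. The value shown for UX_INODE_BLOCK is illustrative, not taken from ux_fs.h:

```c
#include <assert.h>

/* Illustrative layout constants; the real values live in ux_fs.h. */
#define UX_BSIZE        512   /* filesystem block size */
#define UX_INODE_BLOCK  8     /* first block of the inode table (assumed) */

/* With one uxfs inode per 512-byte block, locating the block that
 * holds an inode is simply a matter of adding the inode number to
 * the start of the inode table. */
static long ux_inode_block(long inum)
{
        return UX_INODE_BLOCK + inum;
}
```

The simplicity of this mapping is one reason uxfs can read and write inodes with a single sb_bread() call.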
Deleting Inodes
There are two cases where inodes need to be freed. The first case occurs when a
directory needs to be removed; this is described in the section Creating and
Removing Directories later in the chapter. The second case occurs when the inode
link count reaches zero.
Recall that a regular file is created with a link count of 1. The link count is
incremented each time a hard link is created. For example:
# touch A
# touch B
# ln A C
Files A and B are created with a link count of 1. The call to ln creates a directory
entry for file C and increments the link count of the inode to which A refers. The
following commands:
# rm A
# rm B
result in calls to the unlink() system call. Because B has a link count of 1, the
file will be removed. However, file A has a link count of 2; in this case, the link
count is decremented and the directory entry for A is removed, but the file still
remains and can be accessed through C.
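The link-count rules just described can be modeled in a few lines of user-space C; the toy_ names are illustrative and do not appear in the uxfs source:

```c
#include <assert.h>

/* Minimal model of an inode's link count, following the
 * touch/ln/rm sequence described in the text. */
struct toy_inode { int i_nlink; };

static void toy_create(struct toy_inode *ip) { ip->i_nlink = 1; }
static void toy_link(struct toy_inode *ip)   { ip->i_nlink++; }

/* Returns 1 if the last link was removed and the file should now
 * be freed (the case in which ux_delete_inode() is invoked). */
static int toy_unlink(struct toy_inode *ip)
{
        return --ip->i_nlink == 0;
}
```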
To show the simple case where a file is created and removed, a breakpoint on
ux_delete_inode() can be set in kdb as follows:
[0]kdb> bp ux_delete_inode
Instruction(i) BP #0 at 0xd08bc5c0 ([uxfs]ux_delete_inode)
is enabled globally adjust
[0]kdb> go
and the following commands are executed:
# touch /mnt/file
# rm /mnt/file
A regular file (file) is created with a link count of 1. As described in previous
chapters of the book, the rm command invokes the unlink() system call. For a
file that has a link count of 1, this will result in the file being removed as shown
below when the stack backtrace is displayed:
Entering kdb (current=0xcaae6000, pid 1398)
on processor 0 due to Breakpoint @ 0xd08bc5c0
[0]kdb> bt
EBP EIP Function(args)
0xcab81f34 0xd08bc5c0 [uxfs]ux_delete_inode (0xcaad2824, 0xcaad2824,
0xcac4d484, 0xcabc6e0c)
uxfs .text 0xd08bb060 0xd08bc5c0 0xd08bc6b4
0xc015f1f4 iput+0x114 (0xcaad2824, 0xcac4d4e0, 0xcab81f98,
0xcaad2824, 0xcac4d484)
kernel .text 0xc0100000 0xc015f0e0 0xc015f3a0
0xcab81f58 0xc015c466 d_delete+0xd6 (0xcac4d484, 0xcac4d56c, 0xcab81f98,
0x0, 0xcabc6e0c)
kernel .text 0xc0100000 0xc015c390 0xc015c590
0xcab81f80 0xc01537a8 vfs_unlink+0x1e8 (0xcabc6e0c, 0xcac4d484,
0xcac4d56c, 0xcffefcf8, 0xcea16005)
kernel .text 0xc0100000 0xc01535c0 0xc01537e0
0xcab81fbc 0xc0153878 sys_unlink+0x98 (0xbffffc50, 0x2, 0x0,
0xbffffc50, 0x0)
kernel .text 0xc0100000 0xc01537e0 0xc01538e0
0xc01077cb system_call+0x33
kernel .text 0xc0100000 0xc0107798 0xc01077d0
d_delete() is called first to update the dcache. If possible, the kernel
will attempt to make a negative dentry, which will simplify a future lookup
operation if the same name is requested. Inside iput(), if the link count of the
inode reaches zero, the kernel knows that there are no further references to the
file, so the filesystem is called to remove it.
The ux_delete_inode() function (lines 1148 to 1168) needs to perform the
following tasks:
Free any data blocks that the file references. This involves updating the
s_nbfree field and the s_block[] array of the superblock.
Free the inode by incrementing the s_nifree field and marking the slot in
the s_inode[] array of the superblock.
Mark the superblock dirty so it will be flushed to disk to reflect the
changes.
Call clear_inode() to free the in-core inode.
As with many functions that deal with inodes and data blocks in uxfs, the tasks
performed by ux_delete_inode() and others are greatly simplified because all
of the information is held in the superblock.
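As a sketch of the bookkeeping involved, the following user-space model mirrors the superblock updates described above. The array sizes and flag values are assumptions, not the real ux_fs.h definitions:

```c
#include <assert.h>

#define UX_MAXFILES    32
#define UX_MAXBLOCKS   470   /* assumed total block count */
#define UX_BLOCK_FREE  0
#define UX_BLOCK_INUSE 1
#define UX_INODE_FREE  0
#define UX_INODE_INUSE 1

/* In-memory model of the uxfs superblock summary fields that
 * ux_delete_inode() updates. */
struct toy_sb {
        int s_nifree;
        int s_inode[UX_MAXFILES];
        int s_nbfree;
        int s_block[UX_MAXBLOCKS];
};

/* Return a data block to the free pool. */
static void toy_free_block(struct toy_sb *sb, int blk)
{
        sb->s_block[blk] = UX_BLOCK_FREE;
        sb->s_nbfree++;
}

/* Return an inode to the free pool. */
static void toy_free_inode(struct toy_sb *sb, int inum)
{
        sb->s_inode[inum] = UX_INODE_FREE;
        sb->s_nifree++;
}
```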
File Creation and Link Management
Before creating a file, many UNIX utilities will invoke the stat() system call to
see if the file exists. This will involve the kernel calling the ux_lookup()
function. If the file name does not exist, the kernel will store a negative dentry in
the dcache. Thus, if there are additional calls to stat() for the same file, the
kernel can see that the file doesn't exist without an additional call to the
filesystem.
Shown below is the output from the strace command when using the cp
command to copy file to foo:
lstat64("foo", 0xbffff8a0) = -1 ENOENT (No such file or directory)
stat64("file", {st_mode=S_IFREG|0644, st_size=0, ...}) =
open("file", O_RDONLY|O_LARGEFILE) =
open("foo", O_WRONLY|O_CREAT|O_LARGEFILE, 0100644) = 4
The cp command invokes the stat() system call on both files before calling
open() to create the new file.
The following example shows the call to ux_lookup() in response to the cp
command calling the stat() system call:
Breakpoint 5, ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed3a0)
at ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *
(gdb) bt
#0 ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed3a0) at ux_dir.c:367
#1 0xc01482c0 in real_lookup (parent=0xcb5ed320, name=0xc97ebf5c,
flags=0)
at namei.c:305
#2 0xc0148ba4 in link_path_walk (name=0xcb0f700b "", nd=0xc97ebf98)
at namei.c:590
#3 0xc014943a in __user_walk
name=0xd0856920 "\220D\205,K\205AK\205< L\205"
flags=9, nd=0xc97ebf98)
at namei.c:841
#4 0xc0145807 in sys_stat64 (filename=0x8054788 "file"
statbuf=0xbffff720, flags=1108542220)
at stat.c:337
#5 0xc010730b in system_call (
The kernel allocates the dentry before calling ux_lookup(). Notice the address
of the dentry (0xcb5ed3a0) shown above.
Because the file does not exist, the cp command will then call open() to create
the file. This results in the kernel invoking the ux_create() function to create
the file as follows:
Breakpoint 6, 0xd0854494 in ux_create
(dip=0xcd73cba0, dentry=0xcb5ed3a0, mode=33188)
(gdb) bt
#0 0xd0854494 in ux_create (dip=0xcd73cba0, dentry=0xcb5ed3a0,
mode=33188)
#1 0xc014958f in vfs_create (dir=0xcd73cba0, dentry=0xcb5ed3a0,
mode=33188)
at namei.c:958
#2 0xc014973c in open_namei (pathname=0xcb0f7000 "foo"
flag=32834,
mode=33188, nd=0xc97ebf74) at namei.c:1034
#3 0xc013cd67 in filp_open (filename=0xcb0f7000 "foo"
flags=32833,
mode=33188) at open.c:644
#4 0xc013d0d0 in sys_open (filename=0x8054788 "foo"
flags=32833, mode=33188)
at open.c:788
#5 0xc010730b in system_call (
Note the address of the dentry passed to ux_create(). This is the same as the
address of the dentry passed to ux_lookup(). If the file is created successfully,
the dentry will be updated to reference the newly created inode.
The ux_create() function (lines 629 to 691) has several tasks to perform:
Call ux_find_entry() to check whether the file exists. If it does exist, an
error is returned.
Call the kernel new_inode() routine to allocate a new in-core inode.
Call ux_ialloc() to allocate a new uxfs inode. This will be described in
more detail later.
Call ux_diradd() to add the new filename to the parent directory. This is
passed to ux_create() as the first argument (dip).
Initialize the new inode and call mark_inode_dirty() for both the
new inode and the parent inode to ensure that they will be written to
disk.
The ux_ialloc() function (lines 413 to 434) is very straightforward, working on
fields of the uxfs superblock. After checking to make sure there are still inodes
available (s_nifree > 0), it walks through the s_inode[] array until it finds
a free entry. The entry is marked UX_INODE_INUSE, the s_nifree field is
decremented, and the inode number is returned.
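The scan can be sketched in user-space C as follows; the constants are assumed stand-ins for the ux_fs.h definitions:

```c
#include <assert.h>

#define UX_MAXFILES     32
#define UX_INODE_FREE   0
#define UX_INODE_INUSE  1

/* Sketch of the ux_ialloc()-style scan: find the first free slot
 * in the superblock inode array, mark it in use, decrement the
 * free count, and return the inode number. -1 models the
 * "no inodes left" error return. */
static int toy_ialloc(int s_inode[], int *s_nifree)
{
        int i;

        if (*s_nifree == 0)
                return -1;
        for (i = 0; i < UX_MAXFILES; i++) {
                if (s_inode[i] == UX_INODE_FREE) {
                        s_inode[i] = UX_INODE_INUSE;
                        (*s_nifree)--;
                        return i;
                }
        }
        return -1;
}
```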
The ux_diradd() function (lines 485 to 539) is called to add the new filename
to the parent directory. There are two cases that ux_diradd() must deal with:
There is space in one of the existing directory blocks. In this case, the name
of the new file and its inode number can be written in place. The buffer read
into memory, which will hold the new entry, must be marked dirty and
released.
There is no more space in any of the existing directory blocks. In this
case, a new block must be allocated to the directory in which to
store the name and inode number. This is achieved by calling the
ux_block_alloc() function (lines 441 to 469).
When reading through the existing set of directory entries to locate an empty slot,
each directory block must be read into memory. This involves cycling through the
data blocks in i_addr[], from 0 to i_blocks.
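The in-place case of ux_diradd() can be sketched in user-space C; the entry layout shown is illustrative and may differ from the real ux_dirent definition:

```c
#include <assert.h>
#include <string.h>

/* Illustrative directory-entry layout; d_ino == 0 marks a free
 * slot, as in the text. */
struct toy_dirent {
        unsigned int d_ino;
        char         d_name[28];
};

/* Scan the entries in one directory block for a free slot and
 * write the new name and inode number in place. Returns 0 on
 * success, -1 if the block is full (the caller would then
 * allocate a new directory block). */
static int toy_diradd(struct toy_dirent *block, int nentries,
                      unsigned int inum, const char *name)
{
        int i;

        for (i = 0; i < nentries; i++) {
                if (block[i].d_ino == 0) {
                        block[i].d_ino = inum;
                        strncpy(block[i].d_name, name,
                                sizeof(block[i].d_name) - 1);
                        return 0;
                }
        }
        return -1;
}
```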
Creating a hard link involves adding a new filename to the filesystem and
incrementing the link count of the inode to which it refers. In some respects, the
paths followed are very similar to ux_create() but without the creation of a
new uxfs inode.
The ln command will invoke the stat() system call to check whether both
filenames already exist. Because the name of the link does not exist, a negative
dentry will be created. The ln command then invokes the link() system call,
which will enter the filesystem through ux_link(). The prototype for
ux_link() is as follows and the source can be found on lines 866 to 887:
int
ux_link(struct dentry *old, struct inode *dip, struct dentry *new)
Thus when executing the following command:
$ ln filea fileb
the old dentry refers to filea while new is a negative dentry for fileb,
which will have been established on a prior call to ux_lookup().
These arguments can be analyzed by setting a breakpoint on ux_link() and
running the above ln command.
Breakpoint 11, ux_link (old=0xcf2fe740, dip=0xcf23a240, new=0xcf2fe7c0)
at ux_dir.c:395
395
(gdb) bt
#0 ux_link (old=0xcf2fe740, dip=0xcf23a240, new=0xcf2fe7c0)
at ux_dir.c:395
#1 0xc014adc4 in vfs_link (old_dentry=0xcf2fe740, dir=0xcf23a240,
new_dentry=0xcf2fe7c0) at namei.c:1613
#2 0xc014aef0 in sys_link (oldname=0xbffffc20 "filea"
newname=0xbffffc26 "fileb") at namei.c:1662
#3 0xc010730b in system_call (
The gdb command can be used to display the arguments passed to ux_link()
as follows:
(gdb) p new
$9 = (struct dentry *) 0xcf2fe7c0
(gdb) p *old
$10 = {d_count = {counter = 1}, d_flags = 0, d_inode = 0xcd138260,
  d_parent = 0xcb5ed920, d_hash = {next = 0xc2701750, prev = 0xcfde6168},
  d_lru = {next = 0xcf2fe758, prev = 0xcf2fe758}, d_child = {
    next = 0xcb5ed948, prev = 0xcf2fe7e0}, d_subdirs = {
    next = 0xcf2fe768, prev = 0xcf2fe768}, d_alias = {
    next = 0xcd138270, prev = 0xcd138270}, d_mounted = 0,
  d_name = {name = 0xcf2fe79c "filea", len = 5, hash = 291007618},
  d_time = 0, d_op = 0x0, d_sb = 0xcede4c00, d_vfs_flags = 8,
  d_fsdata = 0x0, d_iname = "filea\0g\0\0\0\0\0\0\0\0"}
(gdb) p old->d_name.name
$11 = (unsigned char *) 0xcf2fe79c "filea"
(gdb) p new->d_name.name
$12 = (unsigned char *) 0xcf2fe81c "fileb"
Thus the dentry for old is completely instantiated and references the inode for
filea. The name field of the dentry for new has been set but the dentry has
not been initialized further.
There is not a great deal of work for ux_link() to perform. In addition to
calling ux_diradd() to add the new name to the parent directory, it increments
the link count of the inode, calls d_instantiate() to map the negative
dentry to the inode, and marks it dirty.
The unlink() system call is managed by the ux_unlink() function (lines
893 to 902). All that this function needs to do is decrement the inode link count
and mark the inode dirty. If the link count reaches zero, the kernel will invoke
ux_delete_inode() to actually remove the inode from the filesystem.
Creating and Removing Directories
At this point, readers should be familiar with the mechanics of how the kernel
looks up a filename and creates a negative dentry before creating a file.
Directory creation is a little different in that the kernel performs the lookup rather
than the application calling stat() first. This is shown as follows:
Breakpoint 5, ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed420)
at ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *
(gdb) bt
#0 ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed420) at ux_dir.c:367
#1 0xc01492f2 in lookup_hash (name=0xc97ebf98, base=0xcb5ed320)
at namei.c:781
#2 0xc0149cd1 in lookup_create (nd=0xc97ebf90, is_dir=1)
at namei.c:1206
#3 0xc014a251 in sys_mkdir (pathname=0xbffffc1c "/mnt/dir", mode=511)
at namei.c:1332
#4 0xc010730b in system_call (
Because the filename won't be found (assuming it doesn't already exist), a
negative dentry is created and then passed into ux_mkdir() (lines 698 to 780) as
follows:
Breakpoint 7, 0xd08546d0 in ux_mkdir (dip=0xcd73cba0, dentry=0xcb5ed420,
mode=493)
(gdb) bt
#0 0xd08546d0 in ux_mkdir (dip=0xcd73cba0, dentry=0xcb5ed420, mode=493)
#1 0xc014a197 in vfs_mkdir (dir=0xcd73cba0, dentry=0xcb5ed420,
mode=493)
at namei.c:1307
#2 0xc014a282 in sys_mkdir (pathname=0xbffffc1c "/mnt/dir", mode=511)
at namei.c:1336
#3 0xc010730b in system_call (
Note that the dentry address is the same for both functions.
The initial steps performed by ux_mkdir() are very similar to the steps taken
by ux_create(), which was described earlier in the chapter, namely:
Call new_inode() to allocate a new in-core inode.
Call ux_ialloc() to allocate a new uxfs inode and call ux_diradd() to
add the new directory name to the parent directory.
Initialize the in-core inode and the uxfs disk inode.
One additional step that must be performed is to allocate a block to the new
directory in which to store the entries for "." and "..". The ux_block_alloc()
function is called, which returns the block number allocated. This must be stored
in i_addr[0], i_blocks must be set to 1, and the size of the inode (i_size) is
set to 512, which is the size of the data block.
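The directory-block initialization step can be sketched as follows; the entry layout is again illustrative:

```c
#include <assert.h>
#include <string.h>

/* Illustrative directory-entry layout, as before. */
struct toy_dirent {
        unsigned int d_ino;
        char         d_name[28];
};

/* Sketch of the extra ux_mkdir() step: the single block allocated
 * to a new directory starts with entries for "." and "..", after
 * which the remaining slots are free (d_ino == 0). */
static void toy_init_dirblock(struct toy_dirent *block, int nentries,
                              unsigned int self, unsigned int parent)
{
        memset(block, 0, nentries * sizeof(*block));
        block[0].d_ino = self;
        strcpy(block[0].d_name, ".");
        block[1].d_ino = parent;
        strcpy(block[1].d_name, "..");
}
```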
To remove a directory, the ux_rmdir() function (lines 786 to 831) is
called. The first step performed by ux_rmdir() is to check the link count of the
directory inode. If it is greater than 2, the directory is not empty and an error is
returned. Recall that a newly created directory has a link count of 2
(for both "." and "..").
The stack backtrace when entering ux_rmdir() is shown below:
Breakpoint 8, 0xd0854a0c in ux_rmdir (dip=0xcd73cba0, dentry=0xcb5ed420)
(gdb) bt
#0 0xd0854a0c in ux_rmdir (dip=0xcd73cba0, dentry=0xcb5ed420)
#1 0xc014a551 in vfs_rmdir (dir=0xcd73cba0, dentry=0xcb5ed420)
at namei.c:1397
#2 0xc014a696 in sys_rmdir (pathname=0xbffffc1c "/mnt/dir"
at namei.c:1443
#3 0xc010730b in system_call (
The dip argument is for the parent directory and the dentry argument is for the
directory to be removed.
The tasks to be performed by ux_rmdir() are as follows:
Call ux_dirdel() to remove the directory name from the parent
directory. This is described in more detail later.
Free all of the directory blocks.
Free the inode by incrementing the s_nifree field of the superblock
and marking the slot in s_inode[] to indicate that the inode is free.
The ux_dirdel() function (lines 545 to 576) walks through each of the directory
blocks comparing the d_name field of each ux_dirent structure found with the
name passed. If a match is found, the d_ino field is set to 0 to indicate that the
slot is free. This is not an ideal solution because if many files are created and
removed in the same directory, there will be a fair amount of unused space.
However, for the purpose of demonstrating a simple filesystem, it is the easiest
solution to implement.
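A user-space sketch of the deletion scan, using the same illustrative entry layout as before:

```c
#include <assert.h>
#include <string.h>

/* Illustrative directory-entry layout. */
struct toy_dirent {
        unsigned int d_ino;
        char         d_name[28];
};

/* Sketch of the ux_dirdel()-style scan over one directory block:
 * find the named entry and mark its slot free by zeroing d_ino.
 * Returns 0 on success, -1 if the name was not found. */
static int toy_dirdel(struct toy_dirent *block, int nentries,
                      const char *name)
{
        int i;

        for (i = 0; i < nentries; i++) {
                if (block[i].d_ino != 0 &&
                    strcmp(block[i].d_name, name) == 0) {
                        block[i].d_ino = 0;   /* slot is now free */
                        return 0;
                }
        }
        return -1;
}
```

Note that the name bytes are left in place; only the zeroed d_ino marks the slot as reusable, which is why deleted names can linger on disk.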
File I/O in uxfs
File I/O is typically one of the most difficult areas of a filesystem to implement.
To increase filesystem performance, this is one area where a considerable amount
of time is spent. In Linux, it is very easy to provide a fully working filesystem
while spending a minimal amount of time on the I/O paths. There are many
generic functions in Linux that the filesystem can call to handle all the
interactions with the page cache and buffer cache.
The section File I/O in the 2.4 Linux Kernel in Chapter 8 describes some of the
interactions with the page cache. Because this chapter presents a simplified view
of filesystem activity, the page cache internals won't be described. Instead, the
following sections show how the kernel interacts with the ux_get_block()
function exported by uxfs. This function can be used to read data from a file or
allocate new data blocks and write data.
First of all, consider the main entry points into the filesystem for file I/O.
These are exported through the file_operations structure as follows:
struct file_operations ux_file_operations = {
llseek: generic_file_llseek,
read: generic_file_read,
write: generic_file_write,
mmap: generic_file_mmap,
};
So for all of the main file I/O related operations, the filesystem defers to the
Linux generic file I/O routines. The same is true for operations on any of the
mapped file interactions, whether for user-level mappings or for handling
operations within the page cache. The address space related operations are:
struct address_space_operations ux_aops = {
readpage: ux_readpage,
writepage: ux_writepage,
sync_page: block_sync_page,
prepare_write: ux_prepare_write,
commit_write: generic_commit_write,
bmap: ux_bmap,
};
For all of the functions defined in this vector, uxfs also makes calls to generic
kernel routines. For example, consider the ux_readpage() function (lines 976 to
980), which is also shown here:
int
ux_readpage(struct file *file, struct page *page)
{
return block_read_full_page(page, ux_get_block);
}
For each of the uxfs routines exported, uxfs makes a call to a generic kernel
function and passes the ux_get_block() routine. Before showing the flow into
the filesystem for file I/O, the subject of the next three sections, it is first helpful to
show how ux_get_block() (lines 929 to 968) works:
int
ux_get_block(struct inode *inode, long block,
struct buffer_head *bh_result, int create)
The ux_get_block() function is called whenever the kernel needs to access part
of a file that is not already cached. The block argument is the logical block within
the file such that block 0 maps to file offset 0, block 1 maps to file offset 512, and
so on. The create argument indicates whether the kernel wants to read from or
write to the file. If create is 0, the kernel is reading from the file. If create is 1,
the filesystem will need to allocate storage at the offset referenced by block.
Taking the case where block is 0, the filesystem must fill in the appropriate
fields of the buffer_head as follows:
bh_result->b_dev = inode->i_dev;
bh_result->b_blocknr = uip->i_addr[block];
The kernel will then perform the actual read of the data. In the case where
create is 1, the filesystem must allocate a new data block by calling
ux_block_alloc() and set the appropriate i_addr[] slot to reference the new
block. Once allocated, the buffer_head structure must be initialized prior to the
kernel performing the I/O operation.
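The mapping logic can be sketched in user-space C; the toy_ functions and the i_addr[] size are assumptions standing in for the uxfs definitions:

```c
#include <assert.h>

#define UX_DIRECT_BLOCKS 9   /* assumed size of i_addr[] */

/* Model of the per-file block map kept in the uxfs disk inode. */
struct toy_inode {
        int i_blocks;
        int i_addr[UX_DIRECT_BLOCKS];
};

/* Toy allocator standing in for ux_block_alloc(): hands out
 * sequential disk block numbers starting at 100. */
static int next_free = 100;
static int toy_block_alloc(void) { return next_free++; }

/* Sketch of ux_get_block(): map a logical file block to a disk
 * block, allocating one when create != 0 and none is mapped yet.
 * Returns the disk block number, or -1 on a read past the end. */
static int toy_get_block(struct toy_inode *ip, int block, int create)
{
        if (block >= UX_DIRECT_BLOCKS)
                return -1;                    /* file too large */
        if (block < ip->i_blocks)
                return ip->i_addr[block];     /* already mapped */
        if (!create)
                return -1;                    /* read past EOF */
        ip->i_addr[block] = toy_block_alloc();
        ip->i_blocks = block + 1;
        return ip->i_addr[block];
}
```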
Reading from a Regular File
The filesystem does not do anything specific for reading from regular files. In
place of the read operation (file_operations vector), the filesystem specifies
the generic_file_read() function.
To show how the filesystem is entered, a breakpoint is set on
ux_get_block() and the passwd file is read from a uxfs filesystem by running
the cat program. Looking at the size of passwd:
# ls -l /mnt/passwd
-rw-r--r--    1 root     root         1203 Jul 24 07:51 /mnt/passwd
there will be three data blocks to access. When the first breakpoint is hit:
Breakpoint 1, ux_get_block (inode=0xcf23a420,
block=0, bh_result=0xc94f4740, create=0)
at ux_file.c:21
21 struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=0, bh_result=0xc94f4740,
create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1250fc0,
get_block=0xd0855094 <ux_get_block>) at buffer.c:1781
#2 0xd08551ba in ux_readpage (file=0xcd1c9360, page=0xc1250fc0)
at ux_file.c:67
#3 0xc012e773 in do_generic_file_read (filp=0xcd1c9360,
ppos=0xcd1c9380,
desc=0xc96d1f5c, actor=0xc012eaf0 <file_read_actor>)
at filemap.c:1401
#4 0xc012ec72 in generic_file_read (filp=0xcd1c9360, buf=0x804eb28 ""
count=4096, ppos=0xcd1c9380) at filemap.c:1594
#5 0xc013d7c8 in sys_read (fd=3, buf=0x804eb28 "", count=4096)
at read_write.c:162
#6 0xc010730b in system_call (
there are two uxfs entry points shown. The first is a call to ux_readpage(). This
is invoked to read a full page of data into the page cache. The routines for
manipulating the page cache can be found in mm/filemap.c. The second is the
call to ux_get_block(). Because file I/O is performed in multiples of the system page
size, the block_read_full_page() function is called to fill a page. In the case
of the file being read, there are only three blocks of 512 bytes, thus not enough to
fill a whole page (4KB). The kernel must therefore read in as much data as
possible, and then zero-fill the rest of the page.
The block argument passed to ux_get_block() is 0 so the filesystem will
initialize the buffer_headso that the first 512 bytes are read from the file.
The next time that the breakpoint is hit:
Breakpoint 1, ux_get_block (inode=0xcf23a420,
block=1, bh_result=0xc94f46e0, create=0)
at ux_file.c:21
21
struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=1,
bh_result=0xc94f46e0, create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1250fc0,
..
the kernel passes block 1 so the next 512 bytes will be read from the file. The final
call to ux_get_block() is shown below:
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=2,
bh_result=0xc94f4680, create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1250fc0,
get_block=0xd0855094 ) at buffer.c:1781
#2 0xd08551ba in ux_readpage (file=0xcd1c9360, page=0xc1250fc0)
at ux_file.c:67
The kernel passes block 2 so the final 512 bytes will be read from the file.
For uxfs, reading from files is extremely simple. Once the get_block()
function has been written, there is very little other work for the filesystem to do.
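The arithmetic behind the three-block read, and the zero-fill step performed by block_read_full_page(), can be sketched in user-space C:

```c
#include <assert.h>
#include <string.h>

#define UX_BSIZE   512
#define PAGE_SIZE_ 4096   /* trailing underscore avoids a clash
                           * with any system PAGE_SIZE macro */

/* Number of 512-byte blocks needed to hold a file of the given
 * size: a 1203-byte file occupies three blocks, as in the text. */
static long blocks_for(long size)
{
        return (size + UX_BSIZE - 1) / UX_BSIZE;
}

/* Sketch of the zero-fill step: after copying the valid bytes of
 * a short file, the rest of the page must read as zeroes. */
static void fill_page(char *page, const char *data, long size)
{
        memcpy(page, data, size);
        memset(page + size, 0, PAGE_SIZE_ - size);
}
```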
Writing to a Regular File
The mechanisms for writing to files are very similar to those used when reading
regular files. Consider the following commands, this time to copy the passwd file
to a uxfs filesystem:
# ls -l /etc/passwd
-rw-r--r--1 root root 1336 Jul 24 14:28 /etc/passwd
# cp /etc/passwd /mnt
Setting a breakpoint on ux_get_block() once more and running the above cp
command, the first breakpoint is hit as follows:
Breakpoint 1, ux_get_block (inode=0xcd710440,
block=0, bh_result=0xc96b72a0, create=1)
at ux_file.c:21
struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcd710440, block=0,
bh_result=0xc96b72a0, create=1)
at ux_file.c:21
#1 0xc014074b in __block_prepare_write (inode=0xcd710440,
page=0xc125e640, from=0, to=1024,
get_block=0xd0855094 <ux_get_block>)
at buffer.c:1641
#2 0xc0141071 in block_prepare_write (page=0xc125e640, from=0, to=1024,
get_block=0xd0855094 <ux_get_block>) at buffer.c:1960
#3 0xd08551dd in ux_prepare_write (file=0xcd1c9160, page=0xc125e640,
from=0, to=1024)
at ux_file.c:74
#4 0xc013085f in generic_file_write (file=0xcd1c9160,
buf=0xbffff160
"root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\ndaem
on:x:2:2:daemon:/sbin:/sbin/nologin\nadm:x:3:4:adm:/var/adm:/sbin/nologi
n\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\nsync:x:5:0:sync:/"...
count=1024, ppos=0xcd1c9180) at filemap.c:3001
#5 0xc013d8e8 in sys_write (fd=4,
buf=0xbffff160
"root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\ndaem
on:x:2:2:daemon:/sbin:/sbin/nologin\nadm:x:3:4:adm:/var/adm:/sbin/nologi
n\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\nsync:x:5:0:sync:/"...
count=1024) at read_write.c:188
#6 0xc010730b in system_call (
This time the create flag is set to 1, indicating that a block must be allocated to
the file. Once the block has been allocated, the buffer_head can be initialized
and the first 512 bytes of passwd can be copied to the buffer. Once the buffer and
inode are marked dirty, both will be flushed to disk.
The next breakpoint is hit, and this time the block argument is set to 1, which
will result in another block being allocated to cover the file range 512 to 1023.
Breakpoint 1, ux_get_block (inode=0xcd710440,
block=1, bh_result=0xc96b7240, create=1)
at ux_file.c:21
21 struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcd710440, block=1,
bh_result=0xc96b7240, create=1)
at ux_file.c:21
The final breakpoint is hit as follows:
Breakpoint 1, ux_get_block (inode=0xcd710440, block=2,
bh_result=0xc9665900, create=1)
at ux_file.c:21
21 struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcd710440, block=2,
bh_result=0xc9665900, create=1)
at ux_file.c:21
and this time the block argument is set to 2, indicating that the final block
needed should be allocated. As with reading from regular files, writing to
regular files is also easy for the filesystem to implement.
Memory-Mapped Files
Although this section won't describe the mechanics of how memory-mapped
files work in the Linux kernel, it is easy to show how the filesystem can support
mapped files through the same mechanisms used for reading from and writing to
regular files.
In place of the mmap function, exported through the file_operations
vector, uxfs requests that generic_file_mmap() be called. All that the
filesystem needs to provide is the get_block() interface.
To demonstrate how the filesystem is involved, a breakpoint is set in
ux_get_block() and a file is mapped for read-only access. The first address of
the mapping is then touched, which will create a page fault. The stack trace when
ux_get_block() is entered is as follows:
Breakpoint 1, ux_get_block (inode=0xcf23a420,
block=0, bh_result=0xc94bbba0, create=0)
at ux_file.c:21
struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=0,
bh_result=0xc94bbba0, create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1238340,
get_block=0xd0855094
at buffer.c:1781
#2 0xd08551ba in ux_readpage (file=0xcd1c97e0, page=0xc1238340)
at ux_file.c:67
#3 0xc012dd92 in page_cache_read (file=0xcd1c97e0, offset=3441203168)
at filemap.c:714
#4 0xc012ddef in read_cluster_nonblocking (file=0xcd1c97e0,
offset=3475219664, filesize=1)
at filemap.c:739
#5 0xc012f389 in filemap_nopage (area=0xc972a300, address=1073823744,
unused=0)
at filemap.c:1911
#6 0xc012b512 in do_no_page (mm=0xcf996d00, vma=0xc972a300,
address=1073823744, write_access=0, page_table=0xc91e60a0)
at memory.c:1249
#7 0xc012b76c in handle_mm_fault (mm=0xcf996d00, vma=0xc972a300,
address=1073823744, write_access=0)
at memory.c:1339
#8 0xc011754a in do_page_fault (regs=0xc952dfc4, error_code=4)
at fault.c:263
#9 0xc01073fc in error_code (
The kernel is entered, not through a system call, but in response to a fault. Because
there are no pages backing the mapped file in the user address space, when the
process attempts to access the file, a page fault occurs. The kernel establishes
where the page of memory is mapped to and must then fill in the page from the
appropriate file.
The ux_readpage() function is entered, which calls back into the memory
manager. To fill in the page of data, the kernel will make repeated calls into
ux_get_block() until either a page of data has been read or the end of the file
has been reached. If the latter occurs, the kernel must zero-fill the page so that, if
the process accesses within the same page but beyond the end of the file, it will
read zeroes.
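This zero-fill behavior is visible from user space on any filesystem. The following sketch maps a short file and checks that the bytes beyond end-of-file, but within the page, read as zeroes:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the page containing a 5-byte file and verify that accesses
 * beyond end-of-file within that page read as zeroes.
 * Returns 1 when the zero-fill behavior holds. */
static int check_zero_fill(void)
{
        char tmpl[] = "/tmp/uxfs_demo_XXXXXX";
        int fd = mkstemp(tmpl);
        char *p;
        int ok;

        if (fd < 0)
                return 0;
        (void)write(fd, "hello", 5);   /* 5-byte file */
        p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                close(fd);
                unlink(tmpl);
                return 0;
        }
        ok = memcmp(p, "hello", 5) == 0 && p[5] == 0 && p[4095] == 0;
        munmap(p, 4096);
        close(fd);
        unlink(tmpl);
        return ok;
}
```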
The Filesystem Stat Interface
The df command displays information about the filesystem usage such as the
number of free and used blocks. Through the super_block operations vector,
uxfs exports the ux_statfs() function, which is called in response to df
invoking the statfs system call (once for each filesystem). The ux_statfs()
function can be found on lines 1194 to 1210. The function prototype is shown
below:
int
ux_statfs(struct super_block *sb, struct statfs *buf)
The df command will make a call to the statfs() system call for each mounted
filesystem. Here is the prototype for statfs():
int statfs(const char *path, struct statfs *buf)
Note that it also uses the statfs structure which is defined below:
struct statfs
{
long f_type; /* type of filesystem (see below) */
long f_bsize; /* optimal transfer block size */
long f_blocks; /* total data blocks in file system */
long f_bfree; /* free blocks in fs */
long f_bavail; /* free blocks avail to non-superuser */
long f_files; /* total file nodes in file system */
long f_ffree; /* free file nodes in fs */
fsid_t f_fsid; /* file system id */
long f_namelen; /* maximum length of filenames */
};
As mentioned earlier in the book, understanding the requirements of user level
programs is essential to understanding some of the features that must be
provided by filesystems. The information passed through the statfs structure
corresponds to filesystem limits, such as the total number of files and blocks in
the filesystem, and existing free resources, such as the number of available files
and data blocks.
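The following user-space sketch shows how ux_statfs() might fill in these fields from superblock counts; the constants, magic value, and name-length figure are assumptions, not the real uxfs limits:

```c
#include <assert.h>

#define UX_MAXFILES   32
#define UX_MAXBLOCKS  470         /* assumed total data blocks */
#define UX_MAGIC      0x58494e55  /* illustrative f_type value */

/* Mirror of the statfs fields the filesystem fills in; the real
 * struct statfs is defined by the kernel headers. */
struct toy_statfs {
        long f_type, f_bsize, f_blocks, f_bfree, f_bavail;
        long f_files, f_ffree, f_namelen;
};

/* The fields come either from compile-time limits or from the
 * free-resource counts kept in the uxfs superblock. */
static void toy_statfs(int s_nbfree, int s_nifree,
                       struct toy_statfs *buf)
{
        buf->f_type    = UX_MAGIC;
        buf->f_bsize   = 512;
        buf->f_blocks  = UX_MAXBLOCKS;
        buf->f_bfree   = s_nbfree;
        buf->f_bavail  = s_nbfree;   /* no reserved blocks */
        buf->f_files   = UX_MAXFILES;
        buf->f_ffree   = s_nifree;
        buf->f_namelen = 28;         /* assumed maximum name length */
}
```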
The following example shows a breakpoint being set within kdb to stop when
the kernel enters ux_statfs(). The debugger is entered by hitting the Break
key as indicated by kdb when it is entered:
Entering kdb (current=0xc03b0000, pid 0) on processor 0 due to Keyboard Entry
[0]kdb> bp ux_statfs
Instruction(i) BP #0 at 0xd08bb400 ([uxfs]ux_statfs)
is enabled globally adjust
[0]kdb> bl
Instruction(i) BP #0 at 0xd08bb400 ([uxfs]ux_statfs)
is enabled globally adjust
[0]kdb> go
The bl command displays the existing breakpoints. This is breakpoint number 0
as indicated by "BP #0 ". Thus, to clear the breakpoint, the bc command can be
invoked passing 0 as an argument.
# df -k /mnt
Filesystem 1k-blocks Used Available Use% Mounted on
Instruction(i) breakpoint #0 at 0xd08bb400 (adjusted)
0xd08bb400 ux_statfs
Entering kdb (current=0xcd31c000, pid 1509) on processor 0 due to
Breakpoint @ 0xd08bb400
[0]kdb> bt
EBP EIP Function(args)
0xcd31df38 0xd08bb400 [uxfs]ux_statfs (0xcc2be400, 0xcd31df50,0xffffffda,
uxfs .text 0xd08bb060 0xd08bb400 0xd08bb460
0xc0141ea2 vfs_statfs+0xa2 (0xcc2be400, 0xcd31df50, 0x43, ..
kernel .text 0xc0100000 0xc0141e00 0xc0141f20
0xcd31dfbc 0xc0141f58 sys_statfs+0x38 (0x8052bb8, 0xbffff760, ..
kernel .text 0xc0100000 0xc0141f20 0xc0141fb0
0xc01077cb system_call+0x33
kernel .text 0xc0100000 0xc0107798 0xc01077d0
[0]kdb> go
When the df command is run and ux_statfs() is reached, the breakpoint is hit
and the kernel enters kdb. The bt command can then display the stack backtrace
showing that the kernel was entered by a system call that then called through
sys_statfs() and vfs_statfs() before entering ux_statfs().
The fields of the statfs structure can be obtained from either predefined
defaults in ux_fs.h or from summary information stored in the superblock.
Shown below is the result of a call to df following creation and mounting of a new filesystem:
# ./mkfs /dev/fd0
# insmod ./uxfs
# mount -t uxfs /dev/fd0 /mnt
# df -k
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda2 15120648 2524836 11827716 18% /
/dev/hda1 102454 11147 86017 12% /boot
/dev/hda5 497829 8240 463887 2% /home
none 127076 0 127076 0% /dev/shm
/dev/fd0 1000 1 999 1% /mnt
In the example that follows, a directory is created. A uxfs directory involves
allocating an inode and one data block to hold the "." and ".." entries plus any
subsequent entries added to the directory. Note that the single block allocated for
the directory is reflected in the information displayed.
# mkdir /mnt/dir
# df -k
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda2 15120648 2524836 11827716 18% /
/dev/hda1 102454 11147 86017 12% /boot
/dev/hda5 497829 8240 463887 2% /home
none 127076 0 127076 0% /dev/shm
/dev/fd0 1000 2 998 1% /mnt
Similarly, df can also display inode allocation information based on the
f_files and f_ffree fields of the statfs structure as displayed below:
# df -i /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/fd0 32 4 28 13% /mnt
# mkdir /mnt/mydir
# df -i /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/fd0 32 5 27 16% /mnt
When first run on an empty filesystem, there are 4 inodes used out of the 32
available (UX_MAXFILES) inodes. Creating a directory uses an additional inode,
which is reflected in the f_ffree field of the statfs structure and displayed
by df above.