Steve Pate : ext2 & ext3
The first filesystem developed as part of Linux was a clone of the Minix
filesystem. At that time, the Minix filesystem stored its block addresses in
16-bit integers, which restricted the size of the filesystem to 64MB. Also,
directory entries were fixed in size, so filenames were limited to 14
characters. Minix
filesystem support was replaced in 1992 by the ext filesystem, which supported
filesystem sizes up to 2GB and filename sizes up to 255 characters. However, ext
inodes did not have separate access, modification, and creation time stamps, and
linked lists were used to manage free blocks and inodes, resulting in
fragmentation and less-than-ideal performance.
These inadequacies were addressed by both the Xia filesystem and the ext2
filesystem (which was modelled on the BSD Fast File System), both of which
provided a number of enhancements, including a better on-disk layout for
managing filesystem resources. The improvements in ext2 far
outweighed those of Xia, and ext2 became the de facto standard on Linux.
The following sections first describe the ext2 filesystem, followed by a
description of how the filesystem has evolved over time to produce the ext3
filesystem which supports journaling and therefore fast recovery.
Features of the ext2 Filesystem
Shown below are the main features supported by ext2:
4TB filesystems. This required changes within the VFS layer. Note that the
maximum file and filesystem size are properties of the underlying filesystem
and the kernel implementation.
255-byte filenames. Directory entries are variable in length with a maximum
size of 255 bytes.
Selectable file semantics. With a mount option, the administrator can choose
whether to have BSD or SVR4 file semantics. This has an effect on the group
ID chosen when a file is created. With BSD semantics, files are created with
the same group ID as the parent directory. For System V semantics, if a
directory has the set group ID bit set, new files inherit the group ID bit of the
parent directory and subdirectories inherit the group ID and set group ID bit;
otherwise, files and directories inherit the primary group ID of the calling
process.
Multiple filesystem block sizes. Block sizes of 1024, 2048, and 4096 bytes can
be specified as an option to mkfs.
Reserved space. Up to 5 percent of the filesystem can be reserved for root-only
files, allowing some recovery in the case of a full filesystem.
Per-file attributes. Attributes can be set on a file or directory to affect
subsequent file access. This is described in detail in the next section.
BSD-like synchronous updates. A mount option ensures that all meta-data
(inodes, bitmaps, indirects and directories) are written to disk synchronously
when modified. This increases filesystem integrity, albeit at the expense
of performance.
Periodic filesystem checks. To enforce filesystem integrity, ext2 has two ways
of ensuring that a full fsck is invoked on the filesystem. A count is kept of
how many times the filesystem is mounted read/write. When it reaches a
specified count, a full fsck is invoked. Alternatively, a time-based system
can be used to ensure that the filesystem is cleaned on a regular basis.
Fast symbolic links. As with VxFS, symbolic links are stored in the inode itself
rather than in a separate allocated block.
The following sections describe some of these features in more detail.
Per-File Attributes
In addition to the features listed in the last section, there is a set of per-file
attributes which can be set using the chattr command and displayed using the
lsattr command. The supported attributes are:
EXT2_SECRM_FL. With this attribute set, whenever a file is truncated the
data blocks are first overwritten with random data. This ensures that once a
file is deleted, it is not possible for the file data to resurface at a later stage in
another file.
EXT2_UNRM_FL. This attribute is used to allow a file to be undeleted.
EXT2_SYNC_FL. With this attribute, file meta-data, including indirect blocks,
is always written synchronously to disk following an update. Note, though,
that this does not apply to regular file data.
EXT2_COMPR_FL. The file is compressed. All subsequent access must use
compression and decompression.
EXT2_APPEND_FL. With this attribute set, a file can only be opened in
append mode (O_APPEND) for writing. The file cannot be deleted by
anyone.
EXT2_IMMUTABLE_FL. If this attribute is set, the file can only be read and
cannot be deleted by anyone.
Attributes can be set on both regular files and directories. Attributes that are set
on directories are inherited by files created within the directory.
The following example shows how the immutable attribute can be set on a file.
The passwd file is first copied into the current directory and is shown to be
writable by root. The chattr command is called to set the attribute, which can
then be displayed by calling lsattr. The two operations following show that it is
then no longer possible to remove the file or extend it:
# cp /etc/passwd .
# ls -l passwd
-rw-r--r-- 1 root root 960 Jan 28 17:35 passwd
# chattr +i passwd
# lsattr passwd
---i-------- passwd
# rm passwd
rm: cannot unlink 'passwd': Operation not permitted
# cat >> passwd
bash: passwd: Permission denied
Note that at the time of writing, not all of the file attributes are implemented.
The ext2 Disk Layout
The layout of structures on disk is shown in Figure 9.6. Aside from the boot
block, the filesystem is divided into a number of fixed size block groups. Each
block group manages a fixed set of inodes and data blocks and contains a copy of
the superblock that is shown as follows. Note that the first block group starts at
an offset of 1024 bytes from the start of the disk slice or volume.
struct ext2_super_block {
unsigned long s_inodes_count; /* Inodes count (in use)*/
unsigned long s_blocks_count; /* Blocks count (in use) */
unsigned long s_r_blocks_count; /* Reserved blocks count */
unsigned long s_free_blocks_count; /* Free blocks count */
unsigned long s_free_inodes_count; /* Free inodes count */
unsigned long s_first_data_block; /* First Data Block */
unsigned long s_log_block_size; /* Block size */
long s_log_frag_size; /* Fragment size */
unsigned long s_blocks_per_group; /* # Blocks per group */
unsigned long s_frags_per_group; /* # Fragments per group */
unsigned long s_inodes_per_group; /* # Inodes per group */
unsigned long s_mtime; /* Mount time */
unsigned long s_wtime; /* Write time */
unsigned short s_mnt_count; /* Mount count */
short s_max_mnt_count; /* Maximal mount count */
unsigned short s_magic; /* Magic signature */
unsigned short s_state; /* File system state */
unsigned short s_errors; /* Error handling */
unsigned long s_lastcheck; /* time of last check */
unsigned long s_checkinterval; /* max. time between checks */
};
Many of the fields shown here are self-explanatory and describe the usage of
inodes and data blocks within the block group. The magic number for ext2 is
0xEF53. The fields toward the end of the superblock are used to determine when
a full fsck should be invoked (either based on the number of read/write mounts
or a specified time).
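Note also that the block size is not stored directly: s_log_block_size records
a power-of-two shift from a base of 1024 bytes. A minimal sketch of the
arithmetic (the helper name is ours, not kernel code):

```c
#include <assert.h>

/* ext2 stores the block size as a power-of-two shift from a base of
 * 1024 bytes, so s_log_block_size of 0 -> 1024, 1 -> 2048, 2 -> 4096.
 * Illustrative helper only; this is not the kernel's code. */
static unsigned long ext2_block_size(unsigned long s_log_block_size)
{
    return 1024UL << s_log_block_size;
}
```

The floppy filesystem built later in this section reports "Block size=1024
(log=0)", which corresponds to a shift value of zero.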
When writing sequentially to a file, ext2 tries to preallocate space in units of 8
contiguous blocks. Unused preallocation is released when the file is closed, so no
space is wasted. This is used to help prevent fragmentation, a situation under
which the majority of the blocks in the file are spread throughout the disk because
contiguous blocks may be unavailable. Contiguous blocks are also good for
performance because when files are accessed sequentially there is minimal disk
head movement.
It is said that ext2 does not need defragmentation under normal load as long as
there is 5 percent of free space on a disk. However, over time continuous addition
and removal of files of various size will undoubtedly result in fragmentation to
some degree. There is a defragmentation tool for ext2 called defrag, but users
are cautioned about its use: if a power outage occurs while defrag is running,
the filesystem can be damaged.
The block group is described by the following structure:
struct ext2_group_desc {
unsigned long bg_block_bitmap; /* Blocks bitmap block */
unsigned long bg_inode_bitmap; /* Inodes bitmap block */
unsigned long bg_inode_table; /* Inodes table block */
unsigned short bg_free_blocks_count; /* Free blocks count */
unsigned short bg_free_inodes_count; /* Free inodes count */
unsigned short bg_used_dirs_count; /* Directories count */
};
This structure basically points to other components of the block group, with the
first three fields referencing specific block numbers on disk. By allocating inodes
and disk blocks within the same block group, it is possible to improve
performance because disk head movement may be reduced. The
bg_used_dirs_count field records the number of inodes in the group that are
used for directories. This count is used as part of the scheme to balance
directories across the different block groups and to help locate files and their
parent directories within the same block group.
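The mapping from an inode number to its block group follows directly from the
superblock's s_inodes_per_group value. The sketch below shows the arithmetic
(the function names are ours, not the kernel's); inode numbers start at 1:

```c
#include <assert.h>

/* Locating an inode: compute its block group, then its slot within
 * that group's inode table. Inode numbers start at 1, hence "- 1".
 * Illustrative sketch only. */
static unsigned long ino_to_group(unsigned long ino,
                                  unsigned long inodes_per_group)
{
    return (ino - 1) / inodes_per_group;
}

static unsigned long ino_to_index(unsigned long ino,
                                  unsigned long inodes_per_group)
{
    return (ino - 1) % inodes_per_group;
}
```

With the 184 inodes per group of the floppy filesystem built below, inode 12
lives in group 0 at index 11 of the inode table.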
To better see how the block group structures are used in practice, the following
example, using a small ext2 filesystem, shows how structures are set up when a
file is allocated. Firstly, a filesystem is made on a floppy disk as follows:
# mkfs /dev/fd0
mke2fs 1.24a (02-Sep-2001)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
184 inodes, 1440 blocks
72 blocks (5.00%) reserved for the super user
First data block=1
1 block group
8192 blocks per group, 8192 fragments per group
184 inodes per group
Writing inode tables: 0/1done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 35 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
Analysis of the on-disk structures can be achieved using the debugfs command.
The show_super_stats displays the superblock and the disk group structures.
With the -h option, only the superblock is displayed:
# debugfs /dev/fd0
debugfs 1.24a (02-Sep-2001)
debugfs: show_super_stats -h
Filesystem volume name:
Last mounted on:
Filesystem UUID: e4e5f20a-f5f3-4499-8fe0-183d9f87a5ba
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: filetype sparse_super
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 184
Block count: 1440
Reserved block count: 72
Free blocks: 1399
Free inodes: 173
First block: 1
Block size: 1024
Fragment size: 1024
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 184
Inode blocks per group: 23
Last mount time: Wed Dec 31 16:00:00 1969
Last write time: Fri Feb 8 16:11:59 2002
Mount count: 0
Maximum mount count: 35
Last checked: Fri Feb 8 16:11:58 2002
Check interval: 15552000 (6 months)
Next check after: Wed Aug 7 17:11:58 2002
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Group 0: block bitmap at 3, inode bitmap at 4, inode table at 5
1399 free blocks, 173 free inodes, 2 used directories
The block group information is shown separate from the superblock. It shows the
block numbers where the various structural information is held. For example, the
inode bitmap for this block group is stored at block 4. Recall from the information
displayed when the filesystem was made that the block size is 1024 bytes. This is
stored in the s_log_block_size field in the superblock.
Further information about the block group can be displayed with the
dumpe2fs command as follows:
# dumpe2fs /dev/fd0
dumpe2fs 1.24a (02-Sep-2001)
...
Group 0: (Blocks 1-1439)
Primary Superblock at 1, Group Descriptors at 2-2
Block bitmap at 3 (+2), Inode bitmap at 4 (+3)
Inode table at 5-27 (+4)
1399 free blocks, 173 free inodes, 2 directories
Free blocks: 41-1439
Free inodes: 12-184
There are 184 inodes per group in the example here. The first nonreserved
inode, inode 11, is occupied by the lost+found directory. Thus, the first inode
available for general users is inode 12. The following example shows how all
inodes can be used but without all of the space being consumed:
# cd /mnt
# i=12
# while [ $i -lt 188 ] ; do > $i ; i=`expr $i + 1` ; done
bash: 185: No space left on device
bash: 186: No space left on device
bash: 187: No space left on device
# df -k
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda3 19111092 1844084 17267008 10% /
/dev/hda1 21929 3615 17182 18% /boot
shmfs 127780 0 127780 0% /dev/shm
/dev/fd0 1412 15 1325 2% /mnt
So, although the filesystem is only 2 percent full, all of the inodes have been
allocated. This represents one of the difficulties that filesystems have faced over
the years, where the number of inodes is statically allocated when the filesystem
is made.
The following example shows the statistics of an allocated file:
# cp /etc/passwd /mnt ; umount /mnt
# debugfs /dev/fd0
debugfs 1.24a (02-Sep-2001)
debugfs: ls -l /
2 40755 0 0 1024 13-Feb-2002 20:20 .
2 40755 0 0 1024 13-Feb-2002 20:20 ..
11 40755 0 0 12288 13-Feb-2002 20:18 lost+found
12 100644 0 0 2064 13-Feb-2002 20:20 passwd
debugfs: stat <12>
Inode: 12 Type: regular Mode: 0644 Flags: 0x0 Generation: 59537
User: 0 Group: 0 Size: 2064
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 6
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x3c6b3af9 -Wed Feb 13 20:20:09 2002
atime: 0x3c6b3af8 -Wed Feb 13 20:20:08 2002
mtime: 0x3c6b3af8 -Wed Feb 13 20:20:08 2002
BLOCKS:
(0-2):41-43
TOTAL: 3
In this case, the file is displayed by inode number. The size of the file is 2064 bytes
which results in three blocks being allocated: blocks 41 to 43. Recall from
displaying the block group information shown previously that the first data block
started at block 41.
ext2 On-Disk Inodes
The ext2 on-disk inode structure is defined by the ext2_inode structure as
follows:
struct ext2_inode {
__u16 i_mode; /* File mode */
__u16 i_uid; /* Low 16 bits of Owner Uid */
__u32 i_size; /* Size in bytes */
__u32 i_atime; /* Access time */
__u32 i_ctime; /* Creation time */
__u32 i_mtime; /* Modification time */
__u32 i_dtime; /* Deletion Time */
__u16 i_gid; /* Low 16 bits of Group Id */
__u16 i_links_count; /* Links count */
__u32 i_blocks; /* Blocks count */
__u32 i_flags; /* File flags */
__u32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
__u32 i_generation; /* File version (for NFS) */
__u32 i_file_acl; /* File ACL */
__u32 i_dir_acl; /* Directory ACL */
__u32 i_faddr; /* Fragment address */
struct {
__u8 l_i_frag; /* Fragment number */
__u8 l_i_fsize; /* Fragment size */
} linux2;
};
The first several fields are self-explanatory. The i_blocks field records the
number of blocks that the file has allocated. This value is in 512-byte units.
These blocks are stored as either direct data blocks in i_block[] or are
referenced through indirect blocks within the same array. For example, consider
the passwd file copied to an ext2 filesystem as shown above. Because the file is
2064 bytes in size, three 1024 byte blocks are required. The actual block count
shown is 6 (512 byte blocks).
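The arithmetic behind that count of 6 can be sketched as follows (a
simplification: for larger files, i_blocks also counts the indirect blocks
themselves, which this sketch ignores):

```c
#include <assert.h>

/* Why a 2064-byte file on a 1024-byte-block filesystem shows a block
 * count of 6: i_blocks is measured in 512-byte units. This sketch
 * ignores indirect blocks, which i_blocks also counts for large files. */
static unsigned long fs_blocks_needed(unsigned long size_bytes,
                                      unsigned long block_size)
{
    return (size_bytes + block_size - 1) / block_size;   /* round up */
}

static unsigned long i_blocks_for(unsigned long size_bytes,
                                  unsigned long block_size)
{
    return fs_blocks_needed(size_bytes, block_size) * (block_size / 512);
}
```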
The inode i_block[] array has EXT2_N_BLOCKS (15) pointers to blocks of
data. The first EXT2_NDIR_BLOCKS (12) entries in the array are direct pointers to
data blocks. The i_block[12] element points to an indirect block of pointers to
data blocks. The i_block[13] element points to a double indirect block for
which each element points to an indirect block. The i_block[14] element
points to a triple indirect block of pointers to double indirects.
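With 4-byte block pointers, the number of data blocks reachable through
i_block[] for a given block size can be computed as below (a sketch; it counts
data blocks only and ignores the 32-bit i_size limit):

```c
#include <assert.h>

/* Data blocks addressable through i_block[]: 12 direct pointers plus
 * one single, one double, and one triple indirect tree. Each block of
 * pointers holds block_size / 4 four-byte entries. Illustrative only. */
static unsigned long long max_data_blocks(unsigned long block_size)
{
    unsigned long long p = block_size / 4;   /* pointers per block */
    return 12 + p + p * p + p * p * p;
}
```

For 1024-byte blocks this gives 16,843,020 addressable blocks, or roughly 16GB
of file data.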
Various inode numbers are reserved which explains why the first inode
allocated has an inode number of 12 (lost+found is 11). Some reserved inodes
are:
EXT2_BAD_INO (1). This file contains a list of bad blocks on the file system.
EXT2_ROOT_INO (2). This is the root directory of the file system.
EXT2_ACL_IDX_INO (3). ACL inode.
EXT2_ACL_DATA_INO (4). ACL inode.
EXT2_BOOT_LOADER_INO (5). The file contains the boot loader.
EXT2_UNDEL_DIR_INO (6). This file is used for file undelete.
EXT2_FIRST_INO (11). This is the first inode that does not have a special
meaning and can be used for other purposes.
There are many different inode flags that can be stored in i_flags. These map
to the file attributes that can be set with chattr.
The i_faddr field is used in the case where the fragment size and block size
are not equal. If the file does not require an exact number of filesystem-sized
blocks, the last portion of the file data is stored in a fragment. The location of the
fragment is stored in this field.
Repairing Damaged ext2 Filesystems
The e2fsck program is used to repair filesystem inconsistencies that can occur
following a system crash. The process is divided into five separate passes,
which are listed below. The information shown here is based on material that
appears in the Linux System Administrators Guide [WIRZ95]:
Pass 1. This phase takes the longest time to execute, because all of the inodes
have to be read into memory and checked.
In this phase, e2fsck checks each inode in the filesystem to ensure the
file mode is valid and that all of the blocks in the inode are valid block
numbers. During pass 1, bitmaps indicating which blocks and inodes are in
use are compiled, to be used later.
If e2fsck notices data blocks that are mapped by more than one inode, it
can either clone the duplicated blocks so that each inode has its own copy, or
remove the blocks from one or more of the inodes.
To reduce the I/O time necessary in future passes, critical filesystem
information is cached in memory, including the location on disk of all of the
directory blocks on the filesystem. This removes the need to re-read the
directory inodes during pass 2.
Pass 2. In this phase directories are validated. Because directory entries do not
span disk blocks, each directory block can be checked individually without
reference to other directory blocks. The directory blocks are checked to
make sure that the directory entries are valid and contain references to
inode numbers that are in use (as determined by pass 1).
For the first directory block in each directory inode, the "." and ".."
entries are checked to make sure they exist, and that the inode number for
the "." entry matches the current directory.
Pass 2 also caches information concerning the parent directory in which
each directory is linked. If a directory is referenced by more than one
directory, the second reference of the directory is treated as an illegal hard
link and is removed.
Note that at the end of pass 2, nearly all disk I/O that e2fsck needs to
perform is complete. Information required by passes 3, 4, and 5 are cached in
memory; hence, the remaining passes of e2fsck are largely CPU bound and
typically account for only 5 to 10 percent of the total running time.
Pass 3. In this phase, the directory connectivity is checked by tracing the path of
each directory back to the root using information that was cached during
pass 2. At this time, the ".." entry for each directory is also checked to make
sure it is valid. Any directories that cannot be traced back to the root are
linked to the lost+found directory.
Pass 4. In this phase, e2fsck checks the reference counts for all inodes by
iterating over all the inodes and comparing the link counts (which were
cached in pass 1) against internal counters calculated during passes 2 and 3.
Any undeleted files with a zero link count are placed in lost+found during
this pass.
Pass 5. In this last phase e2fsck checks the validity of the filesystem summary
information. It compares the block and inode bitmaps which were
constructed during the previous passes against the actual bitmaps on the
filesystem and corrects the on-disk copies if necessary.
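The pass-4 comparison can be illustrated with a toy sketch (this is not
e2fsck's code; a real repair would also reconnect orphans via lost+found):

```c
#include <assert.h>

/* Toy version of the pass-4 check: compare each inode's recorded link
 * count (cached during pass 1) against the number of directory
 * references counted during passes 2 and 3. Returns how many inodes
 * would need their link counts repaired. Not e2fsck's actual code. */
static int count_link_mismatches(const unsigned short *recorded,
                                 const unsigned short *counted,
                                 int ninodes)
{
    int mismatches = 0;
    for (int i = 0; i < ninodes; i++)
        if (recorded[i] != counted[i])
            mismatches++;
    return mismatches;
}
```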
The e2fsck program is designed to run as quickly as possible. Because
filesystem checking programs tend to be disk-bound, this was done by optimizing
the algorithms used by e2fsck so that filesystem structures are not repeatedly
accessed from the disk. In addition, the order in which inodes and directories
are checked is sorted by block number to reduce the amount of time spent in
disk seeks.
Tuning an ext2 Filesystem
The tune2fs program can be used to change the various tunable parameters of
an ext2 filesystem. Some of the different tunables that can be changed are:
-c max-mount-counts. This option adjusts the maximum number of read/write
mounts between two filesystem checks.
-e error-behavior. When errors are detected, the behavior of the ext2
kernel code can be altered with this option. The value of error-behavior
can be continue, in which case the kernel continues with normal execution;
remount-ro, which forces the kernel to remount the filesystem read-only;
or panic, in which case the kernel will panic.
-u user. This option sets the user who can benefit from the reserved blocks
when the filesystem becomes full. The value of user can be a numerical user
ID or a user name.
For further information on tune2fs see the tune2fs(8) manual page.
Resizing ext2 Filesystems
The resize2fs command can be used to increase or decrease the size of an ext2
filesystem. Note that the filesystem must be unmounted before the resize can
take place. The resize2fs program does not manipulate the size of the underlying
partition. To increase the size of a filesystem, the partition must be increased first
using fdisk. Similarly, to decrease the size of an ext2 filesystem, the partition
must be resized with fdisk following the call to resize2fs.
If an ext2 filesystem resides on an LVM (Logical Volume Manager) volume, the
e2fsadm command can be used to resize both the filesystem and the underlying
logical volume.
The ext3 Filesystem
The ext3 filesystem was introduced to solve one specific problem, namely the
amount of time it takes to perform a filesystem check following a system crash.
As described in the section VxFS Journaling, earlier in this chapter, these times
can be significant, measured in many hours, if the filesystem is very large in size.
Note that large in this case is actually a property of the amount of structural data
(inodes) and not specifically the size of the filesystem.
Another goal behind ext3 was to make as few changes to the underlying ext2
code base as possible because ext2 is small in size, easy to maintain, robust, and
well understood.
The use of ext3 was positioned in such a way that it is easy to transition
between ext2 and ext3 filesystems and vice versa.
The actual journaling layer is separate from ext3. The filesystem understands
the concept of a transaction (when one starts, when it finishes), but it is not
actually responsible for the journaling.
How to Use an ext3 Filesystem
A new ext3 filesystem can be created by mkfs or by converting an existing ext2
filesystem. To create a new ext3 filesystem, mkfs is called as follows:
# mkfs -j /dev/sda5
To convert an existing ext2 filesystem to an ext3 filesystem, the tune2fs
command can be invoked as follows:
# tune2fs -j /dev/sda5
Note that the command can be invoked on either a mounted or unmounted
filesystem. If invoked on a mounted filesystem, the journal will appear as a
visible file (.journal). If invoked on an unmounted filesystem or if mkfs -j is
run when making the filesystem, the journal will not be visible.
To actually mount the filesystem, the ext3 filesystem type must be specified:
# mount -t ext3 /dev/sda5 /mnt1
Conversion back to ext2 can be achieved by using the tune2fs command as
follows:
# tune2fs -O ^has_journal /dev/sda5
or by replaying the log to make the filesystem clean and then simply
mounting it as an ext2 filesystem.
Data Integrity Models in ext3
As with VxFS, there is a set of choices about the type and level of journaling to be
performed. Users can choose among the following options, which are passed to
mount.
data=writeback. This option limits data integrity guarantees so that file
data itself is not journaled. The filesystem is, however, guaranteed to be
structurally sound at all times.
data=ordered. This mode, which is the default, ensures that data is
consistent at all times. The data is actually written to the file before the
transaction is logged. This ensures that there is no stale data in any
filesystem block after a crash.
data=journal. This option writes all file data through the journal. This
means that the data is actually written to disk twice. This option provides the
best guarantees in terms of filesystem integrity but because data is written
through the journal, performance can be significantly impacted and the time
for recovery after a crash can be much greater.
How Does ext3 Work?
The design of ext3 was presented in [TWEE98]. To provide a transaction
mechanism, all meta-data-related data blocks must be logged in the journal. There
are three distinct types of blocks in question:
Journal blocks. An update to an inode, for example, will write the entire
filesystem block to which the inode belongs into the journal. In [TWEE98],
Stephen Tweedie claims that this is a relatively cheap method due to the
sequential nature in which data is written to the journal, and that by
following this simple approach, there is little complexity in the kernel and
therefore less CPU overhead.
Descriptor blocks. These blocks describe other journal blocks and are written
to the journal before the journal blocks are written. Because the journal
blocks are the actual meta-data blocks that must be written, the descriptor
blocks are used to record information about the journal blocks, such as the
disk block on which they reside.
Header blocks. The header blocks are written throughout the journal. They
record the start and end of the journal together with a sequence number that
is used during recovery to locate the order in which the blocks were written.
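For illustration, the journal's metadata blocks begin with a small common tag
in the journaling (JBD) layer used by ext3. The sketch below follows the JBD
header layout, though the exact field names should be treated as an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Common header beginning each journal metadata block (descriptor,
 * commit, and journal superblock blocks). Fields are stored big-endian
 * on disk. Sketch based on the JBD layer's on-disk header; treat the
 * names as illustrative rather than definitive. */
struct journal_header {
    uint32_t h_magic;      /* marks the block as journal metadata */
    uint32_t h_blocktype;  /* descriptor, commit, and so on */
    uint32_t h_sequence;   /* transaction sequence, used in recovery */
};
```

The sequence number is what allows recovery to establish the order in which
journal blocks were written, as described above.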
As with VxFS, transactions are delayed in memory to aid performance. With ext3,
a set of transactions is batched into a compound transaction and committed to
the journal on disk; this operation is called a commit. While a commit is in
progress, a new compound transaction is started that will record any further
changes to the filesystem while the previous compound transaction is being
written to disk.
Crash recovery is performed by walking through the journal and writing any
journal blocks to their correct location on disk. Because this is an idempotent
operation, a crash in the middle of recovery does not matter because the process
can be repeated any number of times with exactly the same effect.
Summary
There are many different UNIX filesystems and to scratch the surface on all of
them would easily fill a book of this size. The three filesystems described in the
chapter represent a good cross section of filesystems from the UNIX and Linux
operating systems and cover the commercial filesystem market (VxFS), the most
widely documented and ported filesystem (UFS), and the most popular open
source filesystems (ext2 and ext3).
Only a few other filesystems have been documented in any detail. [HANC01]
describes the AdvFS filesystem developed by Digital, which is the main
filesystem of their Tru64 operating system. [KELL96] describes IBM's JFS
filesystem.
To understand filesystem internals it is always best to start with one of the
simple filesystems such as the original System V filesystem as documented in
[LION96]. If studying Linux, the ext2 filesystem on one of the earlier kernels is a
good place to start before looking at the more elaborate, and therefore more
complex, filesystems.
Pseudo Filesystems
When people think of filesystems, they tend to think of a file hierarchy of files and
directories that are all stored on disk somewhere. However, there are a number of
filesystem types that provide a host of useful information but which have no
physical backing store (disk storage). The most well known pseudo filesystem is
/proc, which is used by the ps command as well as various debuggers.
This chapter describes some of the more well known pseudo filesystem types
and provides a basic implementation of the ps command using the Solaris /proc
filesystem.
The /proc Filesystem
The /proc filesystem was first introduced in 8th Edition UNIX and was described
in Tom Killian's 1984 USENIX paper "Processes as Files" [KILL84].
The /proc filesystem was designed to replace the ptrace() system call, with
the advantage that the full process address space is visible and can be
manipulated with read() and write() system calls. This contrasts with the
interface offered by ptrace(), the system call traditionally used by
debuggers, which provides only a word-at-a-time interface.
Roger Faulkner and Ron Gomes ported the research version of /proc to SVR4
and presented their work in another USENIX paper: "The Process File System
and Process Model in UNIX System V" [FAUL91]. At that time, Faulkner was
with Sun Microsystems and Gomes with AT&T Bell Laboratories. As described in
the paper, future work was intended to restructure /proc from a flat file system
into a directory hierarchy describing a process. That work was undertaken at
both Sun and USL and will be described later.
In the early /proc implementation, whose name is derived from the directory
on which it is mounted, there is an entry in the directory for each process in the
system. The name of the file displayed corresponds to the process ID, while the
size of the file represents the size of the process address space. The file
permissions correspond to the user who owns the process.
Figure 11.1 shows at a high level how the /proc filesystem is implemented.
Standard file-related system calls such as open(), read(), and write() are
handled at the filesystem-independent layer in the same manner as for other
filesystem types. Much of the information about a process is held in the process
table (traditionally in the array proc[]). To open a specific process file, the /proc
filesystem must scan the process table looking for an entry whose p_pid field
matches the pathname component passed.
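The lookup just described can be sketched as follows (a toy stand-in: the
structure and function names are ours, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of the early /proc pathname lookup: scan the process
 * table for an entry whose p_pid matches the process ID parsed from
 * the pathname component. The structure here is a stand-in for the
 * real kernel proc structure. */
struct proc_entry { int p_pid; };

static struct proc_entry *proc_lookup(struct proc_entry *table,
                                      int nentries, int pid)
{
    for (int i = 0; i < nentries; i++)
        if (table[i].p_pid == pid)
            return &table[i];
    return NULL;   /* no such process: the open() fails */
}
```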
One of the most widely used commands that access /proc is ps. Its role is to
open each file in the /proc directory and then access the process status through
an ioctl() system call. This was originally represented by the prstatus
structure, which could be obtained by opening the file and issuing the
PIOCSTATUS ioctl command. With the SVR4 implementation of /proc, there
were over 40 different ioctl commands that could be issued, many of which
dealt with debugging.
Note that the /proc filesystem does not have to be mounted on the /proc
directory. It can in fact be mounted multiple times, which allows it to be used in
chroot() environments.
The Solaris /proc Implementation
With the introduction of user-level threads of execution, the notion of /proc
changed substantially from the single threaded process-based model of previous
versions of UNIX. Each entry in /proc is a directory under which all of the
information about a specific process is collected.
As an example, consider the following process, which is run in the
background. Using the process ID that is returned, the contents of the
/proc/3707 are displayed:
$ sleep 10000&
[1] 3707
$ cd /proc/3707
$ ls -l
total 1618
-rw------- 1 spate fcf 1630208 May 28 21:24 as
-r-------- 1 spate fcf 152 May 28 21:24 auxv
-r-------- 1 spate fcf 36 May 28 21:24 cred
--w------- 1 spate fcf 0 May 28 21:24 ctl
lr-x------ 1 spate fcf 0 May 28 21:24 cwd ->
dr-x------ 2 spate fcf 8208 May 28 21:24 fd
-r--r--r-- 1 spate fcf 120 May 28 21:24 lpsinfo
-r-------- 1 spate fcf 912 May 28 21:24 lstatus
-r--r--r-- 1 spate fcf 536 May 28 21:24 lusage
dr-xr-xr-x 3 spate fcf 48 May 28 21:24 lwp
-r-------- 1 spate fcf 1728 May 28 21:24 map
dr-x------ 2 spate fcf 544 May 28 21:24 object
-r-------- 1 spate fcf 2048 May 28 21:24 pagedata
-r--r--r-- 1 spate fcf 336 May 28 21:24 psinfo
-r-------- 1 spate fcf 1728 May 28 21:24 rmap
lr-x------ 1 spate fcf 0 May 28 21:24 root ->
-r-------- 1 spate fcf 1440 May 28 21:24 sigact
-r-------- 1 spate fcf 1232 May 28 21:24 status
-r--r--r-- 1 spate fcf 256 May 28 21:24 usage
-r-------- 1 spate fcf 0 May 28 21:24 watch
-r-------- 1 spate fcf 2736 May 28 21:24 xmap
The contents of some of these files are C structures. For each of the structures that
can be accessed, the procfs.h header file can be referenced for further
information. Where structures are described, the file can be opened and the
structure read directly from offset 0 within the file. A primitive ps example,
shown in the section Accessing Files in the Solaris /proc Filesystem, later in this
chapter, demonstrates how this is achieved.
Some of the files make reference to an LWP, a lightweight process. The LWP
model is used to provide support for multiple threads of control within a process.
Grouping threads into an LWP alters the scheduling properties of the different
threads.
The various files contained within /proc on a per-process basis are:
as. Opening this file gives access to the address space of the process. This
allows the caller to find a specific address using lseek() and then either
read from or write to the address using read() and write().
auxv. This file contains dynamic linker information.
cred. The process credentials, defined by the pcred structure, can be found
here. This includes information such as the real and effective user IDs, real
and effective group IDs, group, and supplementary group information.
ctl. This write-only file is used for process control and accounting. A request
may be made to stop or start a process or enable process event tracing.
cwd. This file is a symbolic link to the process's current working directory.
fd. This directory contains files that correspond to the files that the process
has open. There is one entry per open file.
lpsinfo, lstatus, lusage. These files give information about each of
the process LWPs. Note that there can be multiple LWPs per process; each
contains one or more threads.
map. This file contains an array of pmap structures, each of which describes a
segment within the virtual address range of the process.
object. Each address space segment maps an underlying file. This directory
contains read-only files that are referenced by the map and pagedata files.
Opening one of these files gives a file descriptor for the specific mapped file.
pagedata. Opening this file allows the caller to track address space
references and modifications on a per-page basis.
psinfo. This file gives general information about the state of the process that
is used by the ps command. The psinfo structure, defined in procfs.h,
can simply be read from this file.
rmap. Similar to the map file, this file contains an array of prmap structures.
These segments are reserved by the operating system for structures such as
the stack.
root. This file is a symbolic link to the process's root directory.
sigact. This file contains an array of sigaction structures which define
the disposition of signals associated with the traced process.
status. The information stored in this file, underpinned by the pstatus
structure, gives a fairly detailed account about the state of the process. This
includes a set of flags that indicate whether the process is runnable,
stopped, being single-stepped, and so on. Process group and session
information, memory size, and tracing data are some of the other types of
information that can be found in this file.
usage. This file, underpinned by the prusage structure, gives a wealth of
timing-related information about the process.
watch. This file contains an array of pwatch structures, which enable a
process to be debugged. The controlling process can set breakpoints in the
process by writing a PCWATCH message through the ctl file.
The lwp directory contains further information about each lightweight process.
Accessing Files in the Solaris /proc Filesystem
To demonstrate how to access files within /proc, the following simple program
gives an idea of how the ps program is implemented. Much of the information
that is displayed by ps can be accessed through the psinfo file. Reading from
this file returns data underpinned by the psinfo structure. The following
program takes a process ID as an argument and reads the corresponding psinfo
for that process. It then displays some of the information.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <procfs.h>

int
main(int argc, char *argv[])
{
    struct psinfo ps;
    char fname[256];
    int fd;

    sprintf(fname, "/proc/%s/psinfo", argv[1]);
    fd = open(fname, O_RDONLY);
    if (fd < 0) {
        perror(fname);
        return 1;
    }
    read(fd, (char *)&ps, sizeof(struct psinfo));
    printf("UID\tPID\tPPID\tCMD\n");
    printf("%d\t%d\t%d\t%s\n",
        (int)ps.pr_uid, (int)ps.pr_pid, (int)ps.pr_ppid, ps.pr_psargs);
    return 0;
}
Shown below is a simple run of the program, which displays information about
the sleep process shown earlier:
$ ./mps 3707
UID PID PPID CMD
824 3707 1 sleep 100000
The psinfo file for each /proc entry is readable by anyone. Thus, it is possible
for any user to write a more elaborate version of the preceding program that
displays entries for all processes.
Tracing and Debugging with /proc
The ctl file allows one process to control another process through a rich set of
functions provided by the /proc filesystem. Although all of these functions won't
be described here, the aim is to highlight the type of features available and show
how a process can be traced or debugged.
Access to the ctl file, which is write only, is achieved by writing an operational
code to the file together with any additional data required for the operation in
question. The controlling process tracks three different types of events, namely:
Signals. A stop based on a signal is handled in all cases where the signal is
detected, whether on return from a system call or trap, or during process
wakeup.
System calls. The process is stopped either when the kernel is entered to
process a system call or is just about to exit from the kernel back to user
space after the system call has been processed.
Faults. There are a number of different fault types that can be managed, some
of which depend on the type of architecture on which the operating system
is running. Fault types include illegal instructions, breakpoints, memory
access, and trace traps (used for single stepping).
The truss command is a prime example of a utility that controls another
process. Its role is to display the system calls made by another process including
the system call arguments and return values. The PCSENTRY and PCSEXIT
control functions determine whether a process stops on entry to or exit from a
system call. The system calls to be traced are held in the sysset_t structure,
which is passed along with the PCSENTRY and PCSEXIT control functions. The
prfillset() function can be used to build the complete set of system calls,
because truss will monitor all system calls. For a more controlled trace, the set
of system calls monitored can be altered using the praddset() and
prdelset() library functions.
There are a number of different control messages that both stop and start a
process. As an example of those functions that are relevant to truss, the PCSTOP
function directs the process to stop on an event of interest and waits for it to stop.
An event of interest is defined by invoking PCSTRACE (signals to be traced),
PCSFAULT (faults to be traced), PCSENTRY (system call entry), or PCSEXIT
(system call exit). The PCRUN control function makes the process runnable again.
The following pseudo code gives a high-level view of how the truss utility
can be implemented:
prfillset(&syscalls)
PCSENTRY(syscalls)
PCSEXIT(syscalls)
do {
        PCSTOP()
        extract system call arguments
        PCRUN()
        PCSTOP()
        extract system call return value
        display system call type, arguments and return value
        PCRUN()
} while (syscall type != exit);
Although this is a simplification, it demonstrates the power of the control
functions implemented by the /proc filesystem.
There are a large number of control functions that make a debugger writer's
life much easier. If the debugger is interested in fault types, the following are
relevant:
FLTBPT. A breakpoint trap.
FLTTRACE. A trace trap (used for single stepping).
FLTWATCH. A watchpoint trap (used to trap on memory access).
The PCSFAULT control function can be used to set the faults to be traced. To put a
breakpoint on a specific memory access, the PCWATCH function can be used to
specify the address to be watched and whether an event should be triggered for
read, write, or execute access. This can be used in conjunction with the stop and
start control functions.
Anyone wishing to study how a real debugger makes use of /proc should look
at the Solaris implementation of gdb, the GNU debugger whose source is freely
available.
The Specfs Filesystem
Devices, whether block or character, are represented by special files in the
filesystem. As the number of UNIX filesystem types increased, it was found that
each filesystem was duplicating effort when managing access to the devices
themselves.
Having multiple special files in the namespace caused an additional problem in
that there could be multiple buffers in the buffer cache corresponding to the same
block on disk. Considering how files are accessed, returning a filesystem vnode
for a device file is incorrect. For example, consider the case where the device file
resides on a UFS filesystem. Returning a vnode that has the v_op field of the
vnode set to the list of UFS vnode operations will lead to problems. First, the open
vnode operation on UFS or any other filesystem really has no function to perform
for regular files. Second, many of the operations that are applicable to regular files
are not applicable to device files. To make matters worse, if the vnode goes
inactive, the filesystem may attempt to close the device even though it is open
through access to another special file that references the same device.
All of these problems can be solved by adding additional logic inside the
filesystem. However, consideration must be given on how to handle device access
for each vnode operation. Furthermore, reference counting to determine when the
last close on a device occurs is left up to the device driver. All in all, this leads to a
situation that has a lot of duplication and is prone to errors.
To solve these problems, a new filesystem type, specfs, was introduced in SVR4.
The specfs filesystem is not visible to users in that it cannot be mounted or seen
from within the namespace.
During a VOP_LOOKUP() operation, instead of returning a vnode which
corresponds to the special file, the filesystem makes a call to specvp() which
returns a new specfs vnode, that the filesystem must return from the lookup
operation. This vnode points to a specfs node (snode), a private specfs data
structure that references the real vnode of the filesystem.
In the case where one device has more than one entry in the namespace, the
snode also points to a common specfs vnode. It is through this common vnode
that device access actually takes place.
The following example shows the linkage between two device special files and
the common specfs vnode that represents both. This is also shown in Figure 11.2.
First of all consider the following simple program, which simply opens a file and
pauses awaiting a signal:
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int fd;

    fd = open(argv[1], O_RDONLY);
    pause();
    return 0;
}
As shown below, a new special file is created with the same major and minor
number as /dev/null:
# ls -l /dev/null
crw-r--r-- 1 root other 13, 2 May 30 09:17 /dev/null
# mknod mynull c 13 2
# ls -l mynull
crw-r--r-- 1 root other 13, 2 May 30 09:17 mynull
and the program is run as follows:
# ./dopen /dev/null &
[1] 3715
# ./dopen mynull &
[2] 3719
Using crash, it is possible to trace through the list of file related structures
starting out at the file descriptor for each process, to see which underlying
vnodes they actually reference. First, the process table slots are located where the
two processes reside:
# crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> p ! grep dopen
336 s 3719 3713 3719 3713 0 46 dopen load
363 s 3715 3713 3715 3713 0 46 dopen load
Starting with the process that is accessing the mynull special file, the user area is
displayed to locate the open files:
> user 336
...
OPEN FILES, POFILE FLAGS, AND THREAD REFCNT:
[0]: F 300106fc690, 0, 0 [1]: F 300106fc690, 0, 0
[2]: F 300106fc690, 0, 0 [3]: F 300106fca10, 0, 0
...
The file structure and its corresponding vnode are then displayed as shown:
> file 300106fca10
ADDRESS RCNT TYPE/ADDR OFFSET FLAGS
300106fca10 1 SPEC/300180a1bd0 0 read
> vnode 300180a1bd0
VCNT VFSMNTED VFSP STREAMP VTYPE RDEV VDATA VFILOCKS VFLAG
1 0 300222d8578 0 c 13,2 300180a1bc8 0 -
> snode 300180a1bc8
SNODE TABLE SIZE = 256
HASH-SLOT MAJ/MIN REALVP COMMONVP NEXTR SIZE COUNT FLAGS
- 13,2 3001bdcdf50 30001b5d5b0 0 0 0
The REALVP field references the vnode for the special file within the filesystem
that references mynull.
For the process that opens the /dev/null special file, the same sequence of
operations is followed as shown:
> user 363
...
Note that for the snode displayed here, the COMMONVP field is identical to the
COMMONVP field shown for the process that referenced mynull.
To some readers, much of what has been described may sound like overkill.
However, device access has changed substantially since the inception of specfs.
By consolidating all device access, only specfs needs to be changed. Filesystems
still make the same specvp() call that they were making 15 years ago and
therefore have not had to make any changes as device access has evolved.
The BSD Memory-Based Filesystem (MFS)
The BSD team developed an unusual but interesting approach to memory-based
filesystems as documented in [MCKU90]. Their goals were to improve upon the
various RAM disk-based filesystems that had traditionally been used.
A RAM disk is typically a contiguous section of memory that has been set
aside to emulate a disk slice. A RAM disk-based device driver is the interface
between this area of memory and the rest of the kernel. Filesystems access the
RAM disk just as they would any other physical device. The main difference is
that the driver employs memory to memory copies rather than copying between
memory and disk.
The paper describes the problems inherent with RAM disk-based filesystems.
First of all, they occupy dedicated memory. A large RAM disk therefore locks
down memory that could be used for other purposes. If many of the files in the
RAM disk are not being used, this is particularly wasteful of memory. One of the
other negative properties of RAM disks, which the BSD team did not initially
attempt to solve, was the triple copies of data. When a file is read, it is copied
from the file's location on the RAM disk into a buffer cache buffer and then out to
the user's buffer. Although this is faster than accessing the data on disk, it is
incredibly wasteful of memory.
The BSD MFS Architecture
Figure 11.3 shows the overall architecture of the BSD MFS filesystem. To create
and mount the filesystem, the following steps are taken:
1. A call to newfs is made indicating that the filesystem will be memory-based.
2. The newfs process allocates an area of memory within its own address space
in which to store the filesystem. This area of memory is then initialized with
the new filesystem structure.
3. The newfs command then calls into the kernel to mount the filesystem.
This is handled by the mfs filesystem type that creates a device vnode to
reference the RAM disk together with the process ID of the caller.
4. The UFS mount entry point is called, which performs standard UFS mount
time processing. However, instead of calling spec_strategy() to access
the device, as it would for a disk-based filesystem, it calls
mfs_strategy(), which interfaces with the memory-based RAM disk.
One unusual aspect of the design is that the newfs process does not exit. Instead,
it stays in the kernel acting as an intermediary between UFS and the RAM disk.
As requests for read and write operations enter the kernel, UFS is invoked as
with any other disk-based UFS filesystem. The difference appears at the
filesystem/driver interface. As highlighted above, UFS calls mfs_strategy()
in place of the typical spec_strategy(). This involves waking up the newfs
process, which performs a copy between the appropriate area of the RAM disk
and the I/O buffer in the kernel. After I/O is completed, the newfs process goes
back to sleep in the kernel awaiting the next request.
When the filesystem is unmounted, the device close routine is invoked. After
flushing any pending I/O requests, the mfs_mount() call returns, causing the
newfs process to exit and the RAM disk to be discarded.
Performance and Observations
Analysis showed MFS to perform at about twice the speed of a filesystem on disk
for raw read and write operations and multiple times better for meta-data
operations (file creates, etc). The benefit over the traditional RAM disk approach
is that because the data within the RAM disk is part of the process address space,
it is pageable just like any other process data. This ensures that if data within the
RAM disk isn't being used, it can be paged to the swap device.
There is a disadvantage with this approach; a large RAM disk will consume a
large amount of swap space and therefore could reduce the overall amount of
memory available to other processes. However, swap space can be increased, so
MFS still offers advantages over the traditional RAM disk-based approach.
The Sun tmpfs Filesystem
Sun developed a memory-based filesystem that used the facilities offered by the
virtual memory subsystem [SNYD90]. This differs from RAM disk-based
filesystems in which the RAM disk simply mirrors a copy of a disk slice. The goal
of the design was to increase performance for file reads and writes, allow
dynamic resizing of the filesystem, and avoid adversely affecting overall system
performance.
To the user, the tmpfs filesystem looks like any other UNIX filesystem in that it
provides full UNIX file semantics.
Chapter 7 described the SVR4 filesystem architecture on which tmpfs is based.
In particular, the section An Overview of the SVR4 VM Subsystem in Chapter 7,
described the SVR4/Solaris VM architecture. Familiarity with these sections is
essential to understanding how tmpfs is implemented. Because tmpfs is heavily
tied to the VM subsystem, it is not portable between different versions of UNIX.
However, this does not preclude development of a similar filesystem on the other
architectures.
Architecture of the tmpfs Filesystem
In SVR4, files accessed through the read() and write() system calls go
through the seg_map kernel segment driver, which maintains a cache of recently
accessed pages of file data. Memory-mapped files are backed by a seg_vn kernel
segment that references the underlying vnode for the file. In the case where there
is no backing file, the SVR4 kernel provides anonymous memory that is backed by
swap space. This is described in the section Anonymous Memory in Chapter 7.
Tmpfs uses anonymous memory to store file data and therefore competes with
memory used by all processes in the system (for example, for stack and data
segments). Because anonymous memory can be paged to a swap device, tmpfs
data is also susceptible to paging.
Figure 11.4 shows how the tmpfs filesystem is implemented. The vnode
representing the open tmpfs file references a tmpfs tmpnode structure, which is
similar to an inode in other filesystems. Information within this structure
indicates whether the file is a regular file, directory, or symbolic link. In the case of
a regular file, the tmpnode references an anonymous memory header that
contains the data backing the file.
File Access through tmpfs
Reads and writes through tmpfs function in a very similar manner to other
filesystems. File data is read and written through the seg_map driver. When a
write occurs to a tmpfs file that has no data yet allocated, an anon structure is
allocated, which references the actual pages of the file. When a file grows the
anon structure is extended.
Mapped files are handled in the same way as files in a regular filesystem. Each
mapping is underpinned by a segment vnode.
Performance and Other Observations
Testing performance of tmpfs is highly dependent on the type of data being
measured. Many file operations that manipulate data may show only a marginal
improvement in performance, because meta-data is typically cached in memory.
For structural changes to the filesystem, such as file and directory creations, tmpfs
shows a great improvement in performance since no disk access is performed.
[SNYD90] also shows a test under which the UNIX kernel was recompiled. The
overall time for a UFS filesystem was 32 minutes and for tmpfs, 27 minutes.
Filesystems such as VxFS, which provide a temporary filesystem mode under which
nearly all transactions are delayed in memory, could close this gap significantly.
One aspect that is difficult to measure occurs when tmpfs file data competes for
virtual memory with the applications that are running on the system. The amount
of memory on the system available for applications is a combination of physical
memory and swap space. Because tmpfs file data uses the same memory, the
overall memory available for applications can be largely reduced.
Overall, the deployment of tmpfs is highly dependent on the type of workload
that is running on a machine together with the amount of memory available.
262 UNIX Filesystems: Evolution, Design, and Implementation
CHAPTER 13 Clustered and Distributed Filesystems
With the advent of computer networks, which appeared in the 1970s and became
more widespread in the 1980s, the need to share files between machines became
essential. Initially, UNIX tools such as uucp and ftp were used to transfer files
from one machine to another. However, disks were still relatively expensive; this
resulted in a need to access files across the network without local storage.
The 1980s saw a number of different distributed filesystems make an
appearance in the UNIX community, including Sun's Network Filesystem (NFS),
AT&T's Remote File Sharing (RFS), and CMU's Andrew File System (AFS), which
evolved into the DCE Distributed File Service (DFS). Some of the distributed
filesystems faded as quickly as they appeared. By far, NFS has been the most
successful, being used on tens of thousands of UNIX and non-UNIX operating
systems throughout the world.
Distributed filesystems operate around a client/server model, with one of the
machines owning an underlying disk-based filesystem and serving files to clients
through some well-defined protocol. The protocol and means of transferring files
to the user is transparent, with UNIX file semantics being a key goal.
Clustered filesystems by contrast treat a collection of machines and disks as a
single entity and provide a fully coherent view of the filesystem from any of the
nodes. To the user, clustered filesystems present a single coherent view of the
filesystem and may or may not offer full UNIX file semantics. Clustered
filesystems as well as local filesystems can be exported for use with NFS.
Distributed Filesystems
Unlike local filesystems where the storage is physically attached and only
accessible by processes that reside on the same host machine, distributed
filesystems allow access to files on a remote machine through use of a
well-defined protocol. Distributed filesystems employ a client/server model
where a single filesystem server can serve files to multiple clients.
Regardless of the type of distributed filesystem, one goal that is absolutely
essential to all of these filesystems is the need to provide UNIX file semantics
when accessing remote files from the client.
There have been numerous distributed filesystems developed for UNIX over
the last 20 years. Many of them have come and gone. The most successful
distributed filesystem by far is the Sun Network Filesystem (NFS) which appeared
in the mid 1980s. Although not as feature-rich as filesystems such as the DCE
Distributed File Service (DFS), the simplicity of the NFS protocol, together with the
fact that the NFS protocol is in public domain, resulted in it being ported to many
different operating systems, UNIX and non-UNIX alike.
The following sections describe some of the main UNIX distributed
filesystems with particular emphasis on NFS.
The Network File System (NFS)
With the advent of networks providing connectivity between computers, it
became feasible to provide interfaces through which user programs could access
files across a network using the same mechanisms by which they accessed files
on a local machine.
Hard disks were still relatively expensive in the late 1970s and early 1980s. By
providing a client/server file protocol such as NFS, hardware designers were free
to build diskless workstations or at least workstations with a minimal amount of
local storage.
NFS Background and History
The Network File System (NFS) was initially developed by Sun Microsystems in
the early to mid 1980s and has been a huge success, with ports to just about every
operating system, UNIX and non-UNIX alike. In the paper that described the first
two versions of NFS [SAND85], the goals of NFS were to:
Provide machine and OS independence. This goal was to ensure that NFS
could work on UNIX and non-UNIX operating systems. The client/server
protocols were to be simple enough that they could be implemented on
PCs; at the time, a DOS-based environment.
Provide resilience to server crashes. If the server crashes, the clients who are
currently accessing the server must be able to recover. Furthermore, the
client should be unable to tell the difference between a server that crashed
and restarted and one that was just slow in responding.
Provide transparent file access. In order for the filesystem to succeed, it was
important that applications could access files through NFS in the same
manner in which they could access files on a local disk.
Maintain UNIX semantics on the client. To satisfy the above goal in UNIX
environments, it was imperative that NFS provide UNIX file semantics to the
applications on the client.
Have acceptable performance. As stated in [SAND85], "People will not want to
use NFS if it is no faster than the existing network utilities, such as rcp, even
if it is easier to use." The performance targets were set at 80 percent of local
disk access.
There were three pieces that comprised NFS: the protocol, the client, and the
server. All three will be described throughout the following sections.
The original NFS implementation, as described in [SAND85], encompassed
both version 1 and 2 of the protocol when it first became available to the public in
SunOS 2.0 in 1985. The first version of the protocol was only used internally
within Sun.
At the time of writing, NFS implementations adhering to version 3 of the
protocol have been in common use for several years and version 4
implementations are starting to appear. NFS is very well understood by tens of
thousands of people throughout the world, which is a great testament to its
success and an indicator as to why it will remain one of the most dominant
distributed filesystems for many years to come.
The Version 1 and 2 NFS Protocols
As described in The Sun VFS/Vnode Architecture in Chapter 7, the SunOS
filesystem architecture was redesigned and implemented to accommodate
multiple filesystem types and provide support for NFS. This was not the only
feature of the kernel that NFS depended on; it also relied on the Sun
Remote Procedure Call (RPC) infrastructure, which provided a synchronous,
cross-network mechanism for one process (or the kernel) to call another process on a
different machine. This allowed the NFS protocol to be broken down into a set of
procedures specifying their arguments, the results, and the effects.
To communicate across the network, NFS used the User Datagram Protocol
(UDP) on top of the Internet Protocol (IP). Because the protocol was designed to be
independent of machine architectures and operating systems, the encoding of the
protocol messages and their responses was sensitive to issues such as endianness
(the order in which bytes are packed into a machine word). This resulted in the
use of the External Data Representation (XDR) specification.
The use of RPC and XDR are described in the following section. Before
describing how they are used, it first helps to describe the actual procedure calls
NFS introduced for communication across the network. The version 2 protocol
was documented in [RFC1094], which describes the procedure calls shown in
Table 13.1. Most of the operations are self-explanatory. The null procedure is
used to ping the server and can also be used as a means of measuring the
round-trip time. The statfs procedure returns information that can be displayed
when making a call to the df command.
The file referenced by these procedures is identified by a file handle. This is an opaque
data structure provided by the server in response to a lookup request. The client
should never try to interpret the contents of the file handle. File handles are
constructed using a combination of operating system-specific information and
information provided by the filesystem. For the latter, the information must
provide a means to locate the file, so the file handle is typically a combination of
filesystem specific information together with the inode number of the file and its
generation count.
Many of the procedures deal with file attributes. Not surprisingly, the
attributes correspond to the various fields of the stat structure as described in
the section Basic File Properties in Chapter 2.
The NFS server is stateless in that there is no information kept on the server
about past requests. This avoids any complicated crash recovery mechanism. If
the client does not receive a response from the server within a specific period of
time, the request is retried until it succeeds. This tolerates a server crash and
reboot, ensuring that the client cannot detect the difference between a server
crash and a server that is simply slow in responding.
The version 2 protocol also requires that any file writes are synchronous. This
meets the objective of achieving UNIX file semantics.
Within the file handle, the inode generation count is typically encoded.
Consider the case when a file is opened and a file handle is returned. If the file is
removed on the server and the inode is reused later for a new file, the file handle
is no longer valid because it refers to the old file. To distinguish between the old
and new files, UNIX filesystems contain a generation count that is incremented
each time the inode is reused. By using the generation count within the file
handle, the stale file handle will be detected by the server when the deleted file is
referenced.
One of the main goals of NFS was to make it portable across different
operating systems. [ROSE86] demonstrated early ports to both an SVR2.2-based
version of UNIX and the Sequent Dynix operating system, a System V/BSD
hybrid. There were a number of different PC (DOS-based) implementations of
NFS. [CALL00] describes the various issues encountered with porting NFS to
these platforms.
NFS Client/Server Communications
NFS relies on both the Remote Procedure Call (RPC) [RFC1057] and eXternal Data
Representation (XDR) [RFC1014] specifications as a means of communicating
between client and server. XDR allows for the description and encoding of
different data types. Its goal is to allow communication between machines with
different underlying architectures. RPC provides an infrastructure for creating
client/server applications whereby an application can call a function provided by
the server just as it would call a function within its own address space.
The XDR format assumes that bytes (8-bit) are portable in that the encoding of
bytes does not change from one architecture or hardware device to another.
Building on top of bytes, the XDR specification requires data types to be
constructed from multiples of four bytes of data. If data types require a number of
bytes that is not exactly divisible by 4, any unused bytes will be zero-filled. The
ordering of the bytes is such that, if read from a byte stream, the high order byte is
always read first. This is called big-endian (the biggest digit in a number comes
first). XDR also uses a standard representation for floating point numbers
(following the IEEE standard).
To give a simple example of how XDR is used, consider the encoding of an
integer that is defined as a 32-bit data type. The integer is encoded as four
bytes with the most significant byte first, so if this data type were being read
from a regular file as a series of bytes, byte 0, the most significant byte,
would be read first.
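The two core XDR rules described above, big-endian byte order and padding to a four-byte boundary, can be sketched with Python's standard `struct` module. The function names are invented for the example:

```python
import struct

# Minimal sketch of XDR encoding rules: big-endian byte order, with
# every item padded to a multiple of four bytes.
def xdr_int(value):
    # 32-bit signed integer, most significant byte first
    return struct.pack(">i", value)

def xdr_string(s):
    # XDR strings: a 4-byte length, then the bytes, zero-padded
    # out to a 4-byte boundary
    data = s.encode("ascii")
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

print(xdr_int(1).hex())           # -> 00000001 (byte 0 is most significant)
print(xdr_string("abcde").hex())  # -> 000000056162636465000000
```

The five-character string occupies twelve bytes on the wire: four for the length, five for the data, and three zero-fill bytes to reach the next four-byte boundary.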
The XDR specification defines a number of primitive data types including
signed and unsigned integers, booleans, hyper integers (64 bits), fixed and
variable length opaque data types, and strings. It also defines a wide range of
structured data types including arrays, unions, linked lists, and structures. In
basic terms, any data type that can be defined in the most popular high-level
languages can be encoded within XDR.
The RPC mechanism used for NFS was derived from Sun RPC, a simple way
to provide a remote procedure call mechanism between two processes on
different machines. To the caller of such a procedure, there is no difference
between calling an RPC function and calling a local function.
The RPC protocol [RFC1057] can be implemented on top of several different
transport protocols. In the case of the early NFS implementations, this was based
on UDP/IP. Within the last ten years, a move to using TCP/IP has been made in
many environments (typically to avoid packet loss when going through routers).
Description of the RPC protocol is beyond the scope of this book. The NFS
specification [RFC1094] defines the NFS protocol as an RPC program and
[CALL00] provides a more detailed description of the protocol itself.
The overall architecture of NFS in a VFS/vnode architecture is shown in
Figure 13.1. This shows the placement of NFS, XDR, and RPC within the kernel.
To the rest of the kernel, NFS appears as any other filesystem type.
Exporting, Mounting, and Accessing NFS Filesystems
Table 13.1 shows the different NFS protocol procedures. One thing that is missing
from this list of functions is the means by which the very first file handle is
obtained, the handle of the root directory from which subsequent lookup
operations and other procedures can be performed. This is achieved through a
separate mount protocol that returns the file handle for the root directory.
There were two reasons for separating the mount protocol from the NFS
protocol itself (note that both are described in [RFC1094]). First, the means by
which access checking is performed on the server is typically implemented in
user space, which can make use of a number of different security mechanisms.
Because the NFS protocol is implemented within the kernel for performance
reasons, it was felt that it was easiest to allow for this separation. The second
reason that the protocols were separated was that the designers thought a single
pathname to file handle procedure would tie the implementation of NFS too
closely with UNIX.
It was envisaged that the pathname to file handle translation
may be implemented by a protocol that differs from the mount protocol.
The initial mount protocol consisted of six different procedures:
MOUNTPROC_NULL. This procedure performs no specific function. It is used to
ping the server to measure the round-trip time, which can be used to
determine how to time out NFS procedure calls to the server.
MOUNTPROC_MNT. This procedure takes a pathname and returns a file handle
that corresponds to the pathname.
MOUNTPROC_DUMP. This function returns a list of clients and the exported
filesystems that they have mounted. This is used by the UNIX commands
showmount and dfmounts that list the clients that have NFS mounted
filesystems together with the filesystems that they have mounted.
MOUNTPROC_UMNT. This procedure is used to inform the server that the NFS
filesystem on the client has been unmounted.
MOUNTPROC_UMNTALL. This procedure is sent by a client following a reboot
(or after a crash). This prevents the server from maintaining stale mount
entries in the event that the client crashed before sending corresponding
MOUNTPROC_UMNT messages.
MOUNTPROC_EXPORT. This procedure returns a list of exported filesystems.
The mount protocol remained unchanged for a number of years. Since then, only
one additional procedure has been added, MOUNTPROC_PATHCONF, which
retrieves additional information from the server allowing NFS filesystems to
comply with the POSIX pathconf() system call.
Using NFS
Although the NFS implementations differ from one platform to the next in the
way in which filesystems are exported, using NFS is trivial when the appropriate
client and server software is running. NFS daemons/threads are typically started
when the system bootstraps and enters multiuser mode.
As an example, consider the case in Solaris where a server called srv wishes to
export a filesystem mounted on /fs1 to any client. The easiest way to achieve
this is to place an entry in /etc/dfs/dfstab that specifies the filesystem to be
shared and/or exported. This ensures that the filesystem will be exported for use
on each reboot. If no options are needed for this filesystem, the following line is
sufficient:
share -F nfs /fs1
On the client side, a mount call can then be made to mount the filesystem as
follows:
# mount -F nfs srv:/fs1 /mnt
# mount | grep mnt
/mnt on srv:/fs1 remote/read/write/setuid/dev=2fc004
on Wed Jun 19 07:25:18 2002
Once mounted, the filesystem is then usable just as any local filesystem.
The Version 3 NFS Protocol
Despite its huge success, the NFS version 2 protocol was not without problems,
leading to the introduction of the NFS version 3 protocol [RFC1813]. The two
main problems with the version 2 protocol are highlighted below:
Only files up to 4GB in size could be accessed. This limitation was exposed
early on when running NFS on large machines but was rapidly becoming a
problem in most environments.
Because all writes were required to be written synchronously, write
intensive applications suffered a performance bottleneck. There were
numerous workarounds for this problem, including some that violated
the NFS protocol by performing writes asynchronously.
The goals of those involved in designing NFS version 3 were to solve the two
problems described above, tidy up the existing version 2 protocol, and add some
minor features, but at the same time limit the scope of the version 3 protocol to
avoid it becoming too bloated to be implemented in a timely manner.
[PAWL94] provides an overview of the process the designers of the version 3
protocol went through, the problems inherent in the version 2 protocol, the goals
behind the version 3 protocol, the changes introduced with the version 3
protocol, and various implementation and performance-related information. In
this paper, they identify twelve main areas in which the protocol was enhanced:
All arguments within the protocol, such as file sizes and file offsets, were
widened from 32 to 64 bits. This solved the 4GB file restriction.
The write model was changed to introduce a write/commit phase that
allowed for asynchronous writes. (This will be described further).
A new ACCESS procedure was introduced to solve permission checking
problems when mapping the ID of the superuser. This procedure works in
the presence of ACLs (Access Control Lists).
In the original protocol, some of the procedures required a subsequent call
in order to retrieve file attributes. In the new protocol, all procedures
returned file attributes.
In the original protocol, writes were limited to 8KB per procedure call. This
restriction was relaxed in the new protocol.
The READDIRPLUS procedure was introduced that returned both a file
handle and attributes. This eliminated some lookup calls when scanning a
directory.
The file handle size in the version 2 protocol was a fixed, 32-byte opaque
data type. In version 3, it was changed to be of variable size up to a
maximum of 64 bytes.
The CREATE procedure was modified to allow for exclusive file creates.
This solved a workaround in the version 2 protocol whereby a LOOKUP
was followed by a CREATE, which left a window in which another client
could create the file.
The version 2 protocol limited the size of filenames and pathnames to 255
and 1024 characters respectively. In version 3, this was replaced by variable
length strings which could be agreed on between the client and server.
The version 3 protocol tightened the errors that could be returned from the
server. All error values are enumerated, and no errors outside of the list are
permitted.
For the set of file attributes, the blocksize field was removed. The
blocks field was changed to used, which records the total number of bytes
used by the file.
A new error type, NFS3ERR_JUKEBOX, was introduced. In a Hierarchical
Storage Management (HSM) environment, a request may be made to the
server to read a file that has been migrated to tape. The time to read the
data back in from tape could be quite large. This error informs the client
that the operation is in progress and that the call should be retried. It
also allows the client to display a message on the user's console if
applicable.
Writes in UNIX are asynchronous by default unless the O_SYNC or O_DSYNC
flags are passed to open(). Forcing asynchronous writes to be synchronous
inevitably affects performance. With NFS version 3, the client can send a number
of asynchronous WRITE requests that it can then commit to disk on the server at
a later date by issuing a COMMIT request. Once it receives a COMMIT request,
the server cannot return until all data has been flushed to disk. In some regards,
the COMMIT request is similar to calling fsync(). The most noticeable
difference is that the COMMIT request does not necessarily cover the data for the
whole file. It does however allow the client to flush all data when a file is closed
or to break up a large synchronous write request into a number of smaller writes,
all of which are performed asynchronously but followed by a COMMIT request.
This itself is an important enhancement because it allows the filesystem on the
server or the disk driver to coalesce a number of writes in a single large write,
which is more efficient. The use of asynchronous writes should not affect the
crash/recovery properties of NFS because the client is required to keep a copy of
all data to be written to the file until a COMMIT is issued.
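The write/commit model described above can be sketched in miniature. This is an illustrative model only: the class names are invented, and a real client detects a server reboot through a changed write verifier before deciding to resend, rather than resending unconditionally as this sketch does:

```python
# Sketch of the NFS version 3 write/commit model. The client keeps a
# copy of every unstable write until a COMMIT succeeds, so a server
# crash before the COMMIT loses nothing.
class Server:
    def __init__(self):
        self.memory = {}             # unstable writes, lost on crash
        self.disk = {}               # stable storage

    def write_unstable(self, offset, data):
        self.memory[offset] = data

    def commit(self):
        # The server may not reply to COMMIT until all data is stable.
        self.disk.update(self.memory)
        self.memory.clear()

    def crash(self):
        self.memory.clear()          # unstable data is lost

class Client:
    def __init__(self, server):
        self.server = server
        self.pending = []            # copies kept until COMMIT succeeds

    def write(self, offset, data):
        self.server.write_unstable(offset, data)
        self.pending.append((offset, data))

    def commit(self):
        # Sketch simplification: always replay pending writes first. A
        # real client replays only after noticing a changed verifier.
        for offset, data in self.pending:
            self.server.write_unstable(offset, data)
        self.server.commit()
        self.pending.clear()

srv = Server()
cli = Client(srv)
cli.write(0, b"hello")
srv.crash()                          # unstable data lost before COMMIT
cli.commit()                         # client replays from its copies
print(srv.disk[0])                   # -> b'hello'
```

The key property is visible in the sketch: because the client holds its own copies until the COMMIT returns, asynchronous writes do not weaken the crash/recovery guarantees.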
The READDIRPLUS procedure, while it can be extremely effective, also
presents problems. The procedure was introduced to minimize the number of
over-the-wire LOOKUP requests once a READDIR procedure had been invoked.
This would typically be the case when issuing an ls -F request on a directory.
Because the implementation of READDIRPLUS is significantly more
expensive than READDIR, it should be used with caution. Typically, the
operation should be performed only when first accessing the directory in order to
populate the DNLC (or other name cache, depending on the underlying OS). The
operation should then be performed again only in the case where the cache was
invalidated for the directory due to a directory modification.
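The caching policy just described amounts to a simple decision, sketched below with invented names; real clients key this off the DNLC or an equivalent name cache:

```python
# Sketch of the policy above: pay for READDIRPLUS only on first access
# to a directory or after its cache entry has been invalidated.
def choose_readdir_op(name_cache, directory):
    entry = name_cache.get(directory)
    if entry is None or entry["invalidated"]:
        return "READDIRPLUS"   # populate the name cache in one call
    return "READDIR"           # cache is warm; the cheaper call suffices

cache = {}
print(choose_readdir_op(cache, "/src"))   # -> READDIRPLUS
cache["/src"] = {"invalidated": False}
print(choose_readdir_op(cache, "/src"))   # -> READDIR
```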
Because many of the goals of NFS version 3 were to improve performance, the
proof of its success was therefore dependent on how well it performed.
[PAWL94] documented a number of different performance-related tests that
showed that the version 3 protocol did in fact meet its objectives.
The NFS Lock Manager Protocol
One decision that the NFS team made when designing the NFS protocol was to
omit file locking. One of the main reasons for this was that to support record
locking, state would need to be maintained on the server, which would
dramatically increase the complexity of NFS implementations.
However, file locking was not something that could be easily overlooked and
was therefore implemented in SunOS as the Network Lock Manager (NLM).
Various iterations of the NLM protocol appeared, with NLM version 3 being the
version that was most widely used with NFS version 2. Unfortunately, due to the
complexity of implementing support for locking, the NLM protocol was not
widely implemented. With the introduction of NFS version 3 [RFC1813], the
definition of NLM (version 4) was included with the NFS specification but was
still a separate protocol. The NLM protocol also relied on the Network Status
Monitor protocol that was required in order to notify clients and servers of a crash
such that lock state could be recovered.
Crash/recovery involves coordination between both clients and server as
shown here:
Server crash. When locks are handed to clients, the server maintains a list of
clients and the locks that they own. If the server crashes, this information is
lost. When the server reboots, a status monitor runs and sends a message to
all known clients. The lock manager on each client is notified and is given an
opportunity to reclaim all locks it owns for files on the server. There is a fixed
amount of time (or grace period) in which the clients can respond. Note
however, that this is not an ideal situation as notification of clients may be
delayed, because the status monitor typically informs all clients in a serial
manner. This window may be reduced by multithreading the status monitor.
Client crash. If the client crashes, any locks that the client holds on the server
must be cleaned up. When the client resumes, a message is sent to the server
to clean up its locks. Through use of a client state number, which is
incremented on reboot, the server is able to detect that the client has been
rebooted and removes any of the locks that were held by the client before it
crashed/rebooted.
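The client state number mechanism can be sketched as follows. The structure is illustrative only, but it shows how a bare monotonic counter is enough for the server to notice a reboot and discard the dead client's locks:

```python
# Sketch of client crash recovery via a state number that the client
# increments on every reboot. Names are invented for the example.
class LockServer:
    def __init__(self):
        self.client_state = {}   # client -> last seen state number
        self.locks = {}          # client -> set of locked files

    def notify(self, client, state_number):
        last = self.client_state.get(client)
        if last is not None and state_number > last:
            # State number increased: the client rebooted, so any
            # locks it held before the crash are discarded.
            self.locks.pop(client, None)
        self.client_state[client] = state_number

    def lock(self, client, path):
        self.locks.setdefault(client, set()).add(path)

srv = LockServer()
srv.notify("clientA", 1)
srv.lock("clientA", "/fs1/file")
srv.notify("clientA", 2)         # clientA rebooted
print(srv.locks.get("clientA"))  # -> None
```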
Since the NLM was not widely adopted, version 4 of the NFS protocol has been
extended to include file locking. This is described in the next section.
The Version 4 NFS Protocol and the Future of NFS
NFS was designed for local area networks and was put in place before the
widespread adoption of the World Wide Web. As time goes by there is more of a
need to use NFS in wide area networks. This raises questions on security and
further highlights the need for NFS to address cross-platform issues. Although
one of the goals of the original NFS implementations was to support non-UNIX
platforms, the protocol was still heavily geared towards UNIX environments.
The goals of the version 4 protocol are to address the problems highlighted
above and to provide additional features that were omitted in the version 3
protocol (version 3 changes were kept to a minimum to ensure that it could be
designed and implemented in a timely manner). Although the version 4 protocol
involves some substantial changes, the goals are to allow small incremental
changes that do not require a complete overhaul of the protocol. The time
between the version 2 and 3 protocols was approximately 8 years, which is similar
to the time between the version 3 and 4 protocols. A minor revision to the version
4 protocol allows new features to be added to the version 4 protocol in a much
more timely manner, say a 1 to 2 year timeframe.
The version 4 protocol is described in [RFC3010]. Following are the main
changes that are part of the new protocol. Note that the changes introduced with
the version 4 protocol are substantial and only covered briefly here.
Compound procedures. Many file-related operations over NFS require a large
number of procedure calls. In a local area network this is not such a great
issue. However, when operating in a wide area network the effect on
performance is much more noticeable. By combining a number of procedure
calls into a single, compound procedure, the amount of over-the-wire
communications can be reduced considerably resulting in much better
performance.
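The effect of compounding can be sketched with a toy evaluator. The operation names (PUTROOTFH, LOOKUP, READ) follow the version 4 protocol; everything else here is invented for illustration:

```python
# Sketch of a version 4 compound: several operations travel in one
# request and are evaluated in order on the server, sharing a
# "current file handle" as they go.
def run_compound(server_fs, ops):
    current_fh = None
    results = []
    for op, arg in ops:
        if op == "PUTROOTFH":
            current_fh = "/"                      # start at the root
        elif op == "LOOKUP":
            current_fh = current_fh.rstrip("/") + "/" + arg
        elif op == "READ":
            results.append(server_fs[current_fh])
    return results

fs = {"/home/spate/notes": b"data"}
# One round trip resolves the whole path and reads the file, where the
# version 2 and 3 protocols would need a call per component.
print(run_compound(fs, [("PUTROOTFH", None),
                        ("LOOKUP", "home"),
                        ("LOOKUP", "spate"),
                        ("LOOKUP", "notes"),
                        ("READ", None)]))         # -> [b'data']
```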
Internationalization. In previous versions of the protocol, file names were
handled as an opaque byte stream. Although they were typically limited to
a 7-bit US ASCII representation, they were commonly encoded in 8-bit
ISO-Latin-1. Problems occurred because there was no way to specify the
type of encoding within XDR. This limited the use of NFS in environments
where there may be mixed character sets. To provide better support for
internationalization, file and directory names will be encoded with UTF-8.
Volatile file handles. The NFS version 2 and version 3 protocols provided a
single file handle type with one set of semantics. This file handle is defined
as having a value that is fixed for the lifetime of the filesystem to which it
refers. As an example, a file handle on UNIX comprises, amongst other
things, the inode number and generation count. Because inodes can be freed
and reallocated, the generation count of the inode is increased when reused
to ensure that a client file handle that refers to the old file cannot now refer
to the new file even though the inode number stays the same. There have
also been some implementations that are unable to correctly implement the
traditional file handle which inhibits the adoption of NFS on some
platforms.
The NFS version 4 protocol divides file handles into both persistent file
handles, which describe the traditional file handle, and volatile file handles. In
the case of volatile file handles, the server may not always be able to
determine whether the file handle is still valid. If it detects that a file handle
is in fact invalid, it returns an NFS4ERR_STALE error message. If however it
is unable to determine whether the file handle is valid, it can return an
NFS4ERR_FHEXPIRED error message. Clients are able to detect whether a
server can handle persistent and volatile file handles and therefore act
accordingly.
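The distinction between the two errors can be sketched from the server's side. The error numbers follow RFC 3010; the server structure itself is invented for the example:

```python
# Sketch of the server-side check: NFS4ERR_STALE when the handle is
# known to be dead, NFS4ERR_FHEXPIRED when a volatile handle's
# validity can no longer be determined.
NFS4ERR_STALE = 70          # error numbers as defined in RFC 3010
NFS4ERR_FHEXPIRED = 10014

def check_handle(server, fh):
    entry = server["handles"].get(fh)
    if entry == "dead":
        return NFS4ERR_STALE        # known to refer to a removed file
    if entry is None:
        # A volatile handle the server no longer remembers: it cannot
        # tell whether the handle is still valid.
        return NFS4ERR_FHEXPIRED
    return entry                    # still valid: the file it names

server = {"handles": {1: "fileA", 2: "dead"}}
print(check_handle(server, 1))      # -> fileA
print(check_handle(server, 2))      # -> 70
print(check_handle(server, 3))      # -> 10014
```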
Attribute classes. The set of attributes that were passed over the wire with
earlier versions of the protocol were very UNIX-centric in that the
information returned by the server was sufficient to respond to a stat()
call on the client. In some environments, these attributes are meaningless,
and in some cases, servers are unable to provide valid values.
In NFS version 4, the set of file attributes is divided into three different
classes, namely mandatory, recommended, and named attributes.
The mandatory set of attributes contain information such as the file type
and size, information about file handle expiration times, whether hard links
and symbolic links are supported, and whether the file has named data
streams/attributes.
The set of recommended attributes contain information such as the type of
ACLs (Access Control Lists) that the filesystem supports, the ACLs
themselves, information about the owner and group, access timestamps, and
quota attributes. It also contains information about the filesystem such as
free space, total number of files, files available for use, and filesystem limits
such as the maximum filename length and maximum number of links.
Named attributes, also called named data streams, allow a single file to have
multiple streams of opaque bytes unlike the traditional UNIX model of
supporting a single stream of bytes per file. To access named data streams
over NFS version 4, the OPENATTR procedure can retrieve a virtual attribute
directory under which READDIR and LOOKUP procedures can be used to view
and access the named attributes.
Better namespace handling. Both NFS version 2 and 3 servers export a set of
independent pieces of their overall namespace and do not allow NFS clients
to cross mountpoints on the server, because NFS expects all lookup
operations to stay within a single filesystem. In NFS v4, the server provides a
single root file handle through which clients can obtain file handles for any of
the accessible exports.
NFS v4 servers can be made browsable by bridging exported subtrees of
the namespace with a pseudo filesystem and allowing clients to cross server
mountpoints. The tree constructed by the server is a logical view of all the
different exports.
File locking. As mentioned earlier, locking is not part of the NFS version 2 or
version 3 protocols which rely instead on the Network Lock Manager (NLM)
protocol, described in the section The NFS Lock Manager Protocol, earlier in
the chapter. The NLM protocol was not, however, widely adopted.
NFS version 4 provides for both UNIX file-level locking functions and
Windows-based share locking functions. NFS version 4 supports both record
and byte range locking functions.
Client side caching. Most NFS clients cache both file data and attributes as
much as possible. When moving more towards wide area networks, the cost
of a cache miss can be significant. The problem with the version 2 and
version 3 protocols is that NFS does not provide a means to support cache
coherency between multiple clients, which can sometimes lead to invalid file
data being read.
NFS version 4 does not provide cache coherency between clients but
defines a limited set of caching guarantees to allow locks and share
reservations to be used without destructive interference from client-side
caching. NFS v4 also provides a delegation scheme that allows clients to make
decisions that were traditionally made by the server.
The delegation mechanism is an important feature in terms of
performance because it limits the number of procedure calls that would
typically go between client and server when accessing a file. When another
client attempts to access a file for which a delegation has been granted, the
server invokes an RPC to the client holding the delegation. The client is then
responsible for flushing any file information, including data, that has
changed before responding to the recall notice. Only after the first client has
responded to the revoke request will the second client be allowed to access
the file.
The NFS version 4 specification provides many details for the different
types of delegations that can be granted and therefore the type of caching
that can be performed on the client.
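The recall sequence described above can be sketched as follows. The classes and paths are invented for the example; a real server issues the recall as a callback RPC and the client flushes dirty data over NFS WRITE/COMMIT:

```python
# Sketch of delegation recall: when a second client opens a delegated
# file, the server recalls the delegation and the first client must
# flush its changed data before the recall completes.
class DelegServer:
    def __init__(self):
        self.delegations = {}        # path -> client holding a delegation

    def open(self, client, path):
        holder = self.delegations.get(path)
        if holder is not None and holder is not client:
            holder.recall(path)      # callback RPC to the first client
            del self.delegations[path]
        self.delegations.setdefault(path, client)

class DelegClient:
    def __init__(self):
        self.dirty = {"/f": b"cached"}   # locally cached modifications
        self.flushed = {}

    def recall(self, path):
        # Flush changed data before responding to the recall.
        self.flushed[path] = self.dirty.pop(path)

srv = DelegServer()
a, b = DelegClient(), DelegClient()
srv.open(a, "/f")                    # a is granted a delegation
srv.open(b, "/f")                    # triggers a recall to a
print(a.flushed["/f"])               # -> b'cached'
```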
Built-in security. NFS has relied on the UNIX-centric user ID mechanisms to
provide security. This has generally not been a problem because NFS has
largely been used within private networks. However, because one of the
goals of the version 4 protocol is to widen the use of NFS to wide area
networks, this level of security is insufficient. The basic NFS security
mechanisms are being extended through use of the RPCSEC_GSS
framework. The RPCSEC_GSS mechanisms are implemented at the RPC
layer and are capable of providing both private key schemes, such as
Kerberos version 5, and public key schemes.
The NFS version 4 protocol is a significant rework of the version 3 protocol. It
provides a wide range of features aimed at continuing its success as it becomes
more widely adopted in wide area networks, and it provides better support for
building a distributed filesystem for heterogeneous operating systems.
There was a huge amount of investment in NFS prior to version 4. Because
NFS version 4 attempts to address many of the prior limitations of the earlier
versions, including more attention to non-UNIX operating systems, NFS is likely
to grow in popularity.
The NFS Automounter
One feature that is part of some distributed filesystems is the ability to provide a
unified namespace across a number of different clients. For example, to see
/home/spate on several different clients would require exporting the
filesystem from the server on which the filesystem resides and NFS mounting it
on all of the required clients. If the mount point is permanent, the appropriate
entries must be placed in the vfstab/fstab table. Obviously this model does not
scale well when dealing with hundreds of filesystems and a very large number of
clients and servers.
This problem was resolved by introduction of the automounter. This aids in
creation of a unified namespace while keeping the number of mounted
filesystems to only those filesystems that are actually in use. The automounter is
simple in nature. When a user attempts to access a file that crosses a mount point
within the boundaries of the automounter, the NFS filesystem is first mounted
prior to allowing the access to proceed. This is shown in Figure 13.2.
The first automounters were implemented as user space daemons, which
typically mount themselves on those directories that require automounter
services and masquerade as NFS servers. When an attempt is made to access a file
within one of these filesystems, the kernel sends an NFS LOOKUP call to the server,
in this case the automounter. The automounter then NFS mounts the real
filesystem onto a directory somewhere within its own mount space. The real
filesystem is then referenced through symbolic links. For example, in Figure 13.2,
the filesystem to be mounted on /mnt/fs2 may be mounted on /auto/fs2, and
/mnt/fs2 will be a symbolic link to this directory.
In many environments it is usual to see a combination of standard NFS
mounted filesystems and automounted filesystems. The automounter should be
used for filesystems that are not accessed frequently, such as manual pages,
source code repositories, and so on. User directories and bin directories are
examples of directories that are typically mounted through standard NFS means.
Another common use of the automounter is to use it in conjunction with the
Network Information Service (NIS) in an environment where user home directories
are distributed throughout the network. In this way, NIS centrally manages all of
the NFS mounts from one of the servers. Although each user's home directory
physically resides on only one server, the same server is configured to export the
home directory to all hosts on the network. Each host on the network runs the
automounter as an NFS client and can therefore mount a user's home directory.
This allows the user to log in to any host and have access to his/her home
directory. In this environment, file access is enhanced transparently while the use
of the automounter avoids the overhead caused by dozens of active but unused
NFS mounted filesystems.
Automounter Problems and the Autofs Filesystem
[CALL93] highlighted some of the problems inherent with using the
automounter and provided details about autofs, a new automounter that solved
the problems described. The type of problems that the original automounter
exhibited are as follows:
Symbolic links. The preceding section described how the automounter
actually NFS mounts the filesystem on a temporary directory and refers to it
through a symbolic link. Because the goal of the automounter is only to
mount filesystems when required, it periodically unmounts the filesystem if
there is no activity for a predetermined amount of time.
However, if a process issues a getcwd() system call, the real path may
be cached which references the temporary directory structure, that is, where
the filesystem is actually mounted. If the path is used later, there is no
guarantee that the filesystem is still mounted and the automounter is unable
to detect that access is being requested. The user process may therefore see
the local directory structure and thus unpredictable results.
Adding new mountpoints. The list of filesystems that the automounter
manages is consulted only when the automounter first starts. A
workaround is to terminate and restart the automounter, which is obviously
not an ideal solution.
Performance. The method of sending NFS requests to the automounter when
crossing its mount point, together with the management of symbolic links,
is more time consuming than accessing an NFS filesystem directly.
Single threading. Because the automounter is single threaded it can only
handle one request at a time. Therefore, when in the process of mounting an
NFS filesystem, all subsequent access is blocked.
The autofs filesystem replaced the user-level automounter daemon with an
in-kernel filesystem type. The automounter daemon is still retained. However,
when it starts, it mounts autofs in place for each of the filesystems that is to be
managed. In the previous example, the autofs filesystem would be mounted on
/mnt. When access is detected to, say /mnt/fs2, autofs invokes an RPC request
to communicate with the automounter daemon that NFS mounts the filesystem
on /mnt/fs2. Once this is achieved, the autofs filesystem does not intercept any
further operations. This eradicates the symbolic link problem and therefore
increases the overall performance.
The Remote File Sharing Service (RFS)
At the time Sun was designing NFS, AT&T was working on the development of
another distributed filesystem, Remote File Sharing (RFS) [RIFK86]. The design
goals were quite different for RFS in that they wanted complete UNIX file
semantics. This included access to remote devices as well as providing
UNIX-level file and record locking. Coverage of different operating systems was
not a goal for RFS and thus its implementation was heavily restricted to System V
UNIX environments.
The RFS Architecture
In a manner similar to NFS, the RFS client is able to mount a directory that is
exported by the server. Exporting of filesystems involved advertising a resource
name with a name server. The RFS client could receive details of available resources
from the name server and mount a filesystem without prior knowledge of the
server which owned the filesystem. RFS required a separate name service
protocol in order to manage all resources. Servers would issue an advertise
procedure to the name server, which then registered the resource in the name
server database. When the client requested information about a specific resource,
the name server would return the name of the server on which the resource was
located. Communication could then take place between client and server to
actually mount the filesystem.
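The advertise/query flow described above can be sketched as a tiny name service. The class and resource names are invented for the example:

```python
# Sketch of the RFS name-service flow: servers advertise resources,
# clients query the name server, then mount from the owning server.
class NameServer:
    def __init__(self):
        self.db = {}                  # resource name -> owning server

    def advertise(self, server, resource):
        self.db[resource] = server    # register the resource

    def query(self, resource):
        # The client learns which server to contact for the mount,
        # without prior knowledge of where the resource lives.
        return self.db.get(resource)

ns = NameServer()
ns.advertise("srvA", "SRC_TREE")
print(ns.query("SRC_TREE"))           # -> srvA
```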
RFS also relied on an RPC protocol that provided a procedure on the server for
every file-related system call. XDR was used to encode data types but only where
the machine architecture of the client and server differed. On the server side, the
goal was to emulate the environment of the client and provide a context similar to
one on the caller to handle the remote procedure calls. This was used to provide
management of the process user and group IDs, umask, and so on. This emulation
was a little awkward for some operations. For example, when performing a
lookup on a pathname, if an RFS mount point was to be crossed, the remainder of
the pathname was sent to the server to be resolved. If a series of ".." components
took the pathname out of the RFS mounted filesystem, the operation had to be
aborted and completed on the client.
To provide for full UNIX semantics including file and record locking, RFS was
required to provide a stateful server. This required the server to maintain
reference counts for every open call from every client, file and record lock
information on the server, and information about the state of named pipes. RFS
also maintained a list of all client mounts. If either the server or one of the clients
crashed, there was a significant amount of crash/recovery to be performed. If a
client crashed, the server was required to remove all traces of the client. Amongst
other things, this included decrementing reference counts, releasing any locks
that client held, and so on.
Server side failure resulted in the ENOLINK error message being returned when
any attempts were made to access files on the server. All inodes/vnodes on the
client that accessed RFS files were marked to indicate the failure such that further
attempts at access would return ENOLINK without any attempt to communicate
with the server.
The overall architecture of RFS is shown in Figure 13.3. Unlike NFS, RFS
requires a reliable, virtual circuit transport service. In the figure this is shown as
TCP/IP. A virtual circuit is established during the mount and remains in existence
for the duration of the mount. For each client/server pair, the virtual circuit is
shared if a client mounts more than one RFS filesystem.
Another big difference between RFS and NFS is the support for client-side
caching. RFS implements a write-through cache on each client; that is, writes are
always sent to the server at the time the write occurs, but the data is cached on
the client for subsequent access. Obviously this presents a challenge with respect
to cache coherency when a write occurs to a file whose data is cached on one or more clients.
RFS must invalidate cached copies of the data on clients other than the one from
which the write is issued. The client-side caching of file data is subsequently
disabled either until the process that issued the write closes the file or a
predetermined amount of time has passed since the last write to the file.
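The write-through scheme and its invalidation rules can be modeled with a short
sketch. This is an illustrative reconstruction, not RFS source; the structure and
function names are hypothetical.

```c
#include <string.h>

#define MAX_CLIENTS 8

/* Per-client cache state for a single RFS file (hypothetical model). */
struct rfs_file {
    int cached[MAX_CLIENTS];         /* client holds a cached copy      */
    int cache_disabled[MAX_CLIENTS]; /* caching suspended after a write */
};

/* A read on 'client' populates its cache unless caching is disabled. */
void rfs_read(struct rfs_file *f, int client)
{
    if (!f->cache_disabled[client])
        f->cached[client] = 1;
}

/* A write goes straight to the server (write-through). Cached copies on
 * all other clients are invalidated, and their caching of this file is
 * disabled until the writing process closes the file. */
void rfs_write(struct rfs_file *f, int writer)
{
    for (int c = 0; c < MAX_CLIENTS; c++) {
        if (c == writer)
            continue;
        f->cached[c] = 0;
        f->cache_disabled[c] = 1;
    }
}

/* Closing the file re-enables caching on the other clients. */
void rfs_close(struct rfs_file *f)
{
    memset(f->cache_disabled, 0, sizeof f->cache_disabled);
}
```

The model ignores the timeout path mentioned above; in RFS, caching could also
resume after a predetermined interval with no further writes.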
Differences between RFS and NFS
When RFS was introduced, there were a number of differences between RFS and
NFS as defined by the version 2 protocol. Some of these differences were:
NFS is stateless, whereas RFS is stateful and requires a primary name server
that coordinates RFS activity.
RFS can map user and group IDs from client to server based on the presence of
a mapping table. NFS by contrast requires that the IDs are the same on both
client and server. More specifically, NFS implementations assume the use
of NIS to maintain a consistent user database across the network.
RFS allows access to device files across the network while devices are not
accessible across NFS.
RFS names resources (the directories that are advertised), and these names
are communicated to the primary server. This is not required in NFS.
RFS requires a connection-mode virtual circuit environment, while NFS
runs in a connectionless state.
RFS provides support for mandatory file and record locking. This is not
defined as part of the NFS protocol.
NFS can run in heterogeneous environments, while RFS is restricted to
UNIX environments and in particular System V UNIX.
RFS guarantees that when files are opened in append mode (O_APPEND) the
write is appended to the file. This is not guaranteed in NFS.
In an NFS environment, the administrator must know the machine
name from which the filesystem is being exported. This is alleviated
with RFS through use of the primary server.
When reading through this list, it appears that RFS has more features to offer and
would therefore be a better offering in the distributed filesystem arena than NFS.
However, the goals of both projects differed in that RFS supported full UNIX
semantics, whereas for NFS the protocol provided semantics that were close
enough for most of the environments in which it was used.
The fact that NFS was widely publicized and the specification was publicly
open, together with the simplicity of its design and the fact that it was designed to
be portable across operating systems, resulted in its success and the rather quick
death of RFS, which was replaced by NFS in SVR4.
RFS was never open to the public in the same way that NFS was. Because it was
part of the UNIX operating system and required a license from AT&T, it stayed
within the SVR3 area and had little widespread usage. It would be a surprise if
there were still RFS implementations in use today.
The Andrew File System (AFS)
The Andrew File System (AFS) [MORR86] was developed in the early to mid 1980s
at Carnegie Mellon University (CMU) as part of Project Andrew, a joint project
between CMU and IBM to develop an educational-based computing
infrastructure. There were a number of goals for the AFS filesystem. First, UNIX
binaries had to run on clients without modification, which required that the
filesystem be implemented in the kernel. AFS also had to provide a single,
unified namespace such that users would be able to access their files wherever they
resided in the network. To help performance, aggressive client-side caching
would be used. AFS also allowed groups of files to be migrated from one server to
another without loss of service, to help load balancing.
The AFS Architecture
An AFS network, shown in Figure 13.4, consists of a group of cells that all reside
under /afs. Issuing a call to ls /afs will display the list of AFS cells. A cell is a
collection of servers that are grouped together and administered as a whole. In the
academic environment, each university may be a single cell. Even though each
cell may be local or remote, all users will see exactly the same file hierarchy
regardless of where they are accessing the filesystem.
Within a cell, there are a number of servers and clients. Servers manage a set of
volumes that are held in the Volume Location Database (VLDB). The VLDB is
replicated on each of the servers. Volumes can be replicated over a number of
different servers. They can also be migrated to enable load balancing or to move
a user's files from one location to another based on need. All of this can be done
without interrupting access to the volume. The migration of volumes is achieved
by cloning the volume, which creates a stable snapshot. To migrate the volume,
the clone is moved first while access is still allowed to the original volume. After
the clone has moved, any writes to the original volume are replayed to the clone
volume.
Client-Side Caching of AFS File Data
Clients each require a local disk in order to cache files. The caching is controlled
by a local cache manager. In earlier AFS implementations, whenever a file was
opened, it was first copied in its entirety to the local disk on the client. This
quickly became problematic as file sizes increased, so later AFS versions
performed the copying in 64KB chunks. Note that, in addition to file
data, the cache manager also caches file meta-data, directory information, and
symbolic links.
When retrieving data from the server, the client obtains a callback. If another
client is modifying the data, the server must inform all clients that their cached
data may be invalid. If only one client holds a callback, it can operate on the file
without supervision of the server until a time comes for the client to notify the
server of changes, for example, when the file is closed. The callback is broken if
another client attempts to modify the file. With this mechanism, there is a
potential for callbacks to go astray. To help alleviate this problem, clients with
callbacks send probe messages to the server on a regular basis. If a callback is
missed, the client and server work together to restore cache coherency.
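The server-side callback bookkeeping described above can be sketched as
follows; this is a simplified illustration with hypothetical names, not AFS
source.

```c
#define MAX_CLIENTS 8

/* Server-side callback state for one file (illustrative model only). */
struct afs_file {
    int callback[MAX_CLIENTS]; /* client holds a callback promise */
};

/* Fetching file data grants the fetching client a callback. */
void afs_fetch(struct afs_file *f, int client)
{
    f->callback[client] = 1;
}

/* When a client stores changes back, the server breaks the callbacks
 * held by every other client, telling them their cached copy may now
 * be stale. Returns the number of callbacks broken. */
int afs_store(struct afs_file *f, int writer)
{
    int broken = 0;
    for (int c = 0; c < MAX_CLIENTS; c++) {
        if (c != writer && f->callback[c]) {
            f->callback[c] = 0;
            broken++;
        }
    }
    return broken;
}
```

A client whose callback has been broken must refetch the data (and a new
callback) from the server before trusting its cache again.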
AFS does not provide fully coherent client side caches. A client typically makes
changes locally until the file is closed at which point the changes are
communicated with the server. Thus, if multiple clients are modifying the same
file, the client that closes the file last will write back its changes, which may
overwrite another client's changes even with the callback mechanism in place.
Where Is AFS Now?
A number of the original designers of AFS formed their own company, Transarc,
which went on to produce commercial implementations of AFS for a number of
different platforms. The technology developed for AFS also became the basis of
DCE DFS, the subject of the next section. Transarc was later acquired by IBM and,
at the time of this writing, the history of AFS is looking rather unclear, at least
from a commercial perspective.
The DCE Distributed File Service (DFS)
The Open Software Foundation started a project in the late 1980s to define a secure,
robust distributed environment for enterprise computing. The overall project was
called the Distributed Computing Environment (DCE). The goal behind DCE was to
draw together the best of breed technologies into one integrated solution, produce
the Application Environment Specification (AES), and to release source code as an
example implementation of the standard. In 1989, OSF put out a Request For
Technology, an invitation to the computing industry asking them to bid
technologies in each of the identified areas. For the distributed filesystem
component, Transarc won the bid, having persuaded OSF of the value of their
AFS-based technology.
The resulting Distributed File Service (DFS) technology bore a close resemblance
to the AFS architecture. The RPC mechanisms of AFS were replaced with DCE
RPC and the virtual filesystem architecture was replaced with VFS+ that allowed
local filesystems to be used within a DFS framework, and Transarc produced the
Episode filesystem that provided a wide number of features.
DCE / DFS Architecture
The cell nature of AFS was retained, with a DFS cell comprising a number of
servers and clients. DFS servers run services that make data available and
monitor and control other services. The DFS server model differed from the
original AFS model, with some servers performing one of a number of different
functions:
File server. The server that runs the services necessary for storing and
exporting data. This server holds the physical filesystems that comprise the
DFS namespace.
System control server. This server is responsible for updating other servers
with replicas of system configuration files.
Fileset database server. The Fileset Location Database (FLDB) master and
replicas are stored here. The FLDB is similar to the volume database in AFS.
The FLDB holds information about both system and user filesets.
Backup database server. This holds the master and replicas of the backup
database which holds information used to backup and restore system and
user files.
Note that a DFS server can perform one or more of these tasks.
The fileset location database stores information about the locations of filesets.
Each readable/writeable fileset has an entry in the FLDB that includes
information about the fileset's replicas and clones (snapshots).
DFS Local Filesystems
A DFS local filesystem manages an aggregate, which can hold one or more filesets
and is physically equivalent to a filesystem stored within a standard disk
partition. The goal behind the fileset concept was to make it smaller than a disk
partition and therefore more manageable. As an example, a single filesystem is
typically used to store a number of user home directories. With DFS, the
aggregate may hold one fileset per user.
Aggregates also support fileset operations not found on standard UNIX
partitions, including the ability to move a fileset from one DFS aggregate to
another or from one server to another for load balancing across servers. This is
comparable to the migration performed by AFS.
UNIX partitions and filesystems can also be made visible in the DFS
namespace if they adhere to the VFS+ specification, a modification to the native
VFS/vnode architecture with additional interfaces to support DFS. Note
however that these partitions can store only a single fileset (filesystem) regardless
of the amount of data actually stored in the fileset.
DFS Cache Management
DFS enhanced the client-side caching of AFS by providing fully coherent client
side caches. Whenever a process writes to a file, clients should not see stale data.
To provide this level of cache coherency, DFS introduced a token manager that
keeps a reference of all clients that are accessing a specific file.
When a client wishes to access a file, it requests a token for the type of
operation it is about to perform, for example, a read or write token. In some
circumstances, tokens of the same class allow shared access to a file; two clients
reading the same file would thus obtain the same class of token. However, some
tokens are incompatible with tokens of the same class, a write token being the
obvious example. If a client wishes to obtain a write token for a file on which a
write token has already been issued, the server is required to revoke the first
client's write token, allowing the second write to proceed. When a client receives a
request to revoke a token, it must first flush all modified data before responding
to the server.
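A minimal sketch of the token manager's logic might look like the following;
the names and the two-token simplification are assumptions made for
illustration, not the DCE DFS interfaces.

```c
#define MAX_CLIENTS 8

enum dfs_token { TOKEN_NONE, TOKEN_READ, TOKEN_WRITE };

/* Read tokens on the same file are compatible with each other; a write
 * token is incompatible with every other token (simplified rule). */
int tokens_compatible(enum dfs_token held, enum dfs_token wanted)
{
    if (held == TOKEN_NONE)
        return 1;
    return held == TOKEN_READ && wanted == TOKEN_READ;
}

struct dfs_file {
    enum dfs_token token[MAX_CLIENTS];
    int dirty[MAX_CLIENTS]; /* client caches modified data */
    int flushes;            /* revocations that forced a flush */
};

/* Grant 'wanted' to 'client', first revoking any incompatible tokens.
 * A client losing a write token must flush its modified data before
 * the server can respond to the requester. */
void dfs_request_token(struct dfs_file *f, int client, enum dfs_token wanted)
{
    for (int c = 0; c < MAX_CLIENTS; c++) {
        if (c == client || tokens_compatible(f->token[c], wanted))
            continue;
        if (f->token[c] == TOKEN_WRITE && f->dirty[c]) {
            f->dirty[c] = 0; /* flush modified data to the server */
            f->flushes++;
        }
        f->token[c] = TOKEN_NONE;
    }
    f->token[client] = wanted;
}
```

Because the flush happens before the grant, a reader that obtains its token
after a writer never sees stale data, which is precisely the coherency
guarantee DFS set out to provide.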
The Future of DCE / DFS
The overall DCE framework and particularly the infrastructure required to
support DFS was incredibly complex, which made many OS vendors question the
benefits of supporting DFS. As such, the number of implementations of DFS were
small and adoption of DFS equally limited. The overall DCE program came to a
halt in the late 1990s, leaving a small number of operating systems supporting
their existing DCE efforts. As NFS evolves and new, distributed filesystem
paradigms come into play, the number of DFS installations is likely to decline
further.
Clustered Filesystems
With distributed filesystems, there is a single point of failure in that if the server
(that owns the underlying storage) crashes, service is interrupted until the server
reboots. In the event that the server is unable to reboot immediately, the delay in
service can be significant.
With most critical business functions now heavily reliant on computer-based
technology, this downtime is unacceptable. In some business disciplines, seconds
of downtime can cost a company significant amounts of money.
By making hardware and software more reliable, clusters provide the means by
which downtime can be minimized, if not removed altogether. In addition to
increasing the reliability of the system, by pooling together a network of
interconnected servers, the potential for improvements in both performance and
manageability make cluster-based computing an essential part of any large
enterprise.
The following sections describe the clustering components, both software and
hardware, that are required in order to provide a clustered filesystem (CFS). There
are typically a large number of components that are needed in addition to
filesystem enhancements in order to provide a fully clustered filesystem. After
describing the basic components of clustered environments and filesystems, the
VERITAS clustered filesystem technology is used as a concrete example of how a
clustered filesystem is constructed.
Later sections describe some of the other clustered filesystems that are
available today.
The following sections only scratch the surface of clustered filesystem
technology. For a more in depth look at clustered filesystems, you can refer to
Dilip Ranade's book Shared Data Clusters [RANA02].
What Is a Clustered Filesystem?
In simple terms, a clustered filesystem is a collection of servers (also
called nodes) that work together to provide a single, unified view of the same
filesystem. A process running on any of these nodes sees exactly the same view
of the filesystem as a process on any other node. Any changes by any of the
nodes are immediately reflected on all of the other nodes.
Clustered filesystem technology is complementary to distributed filesystems.
Any of the nodes in the cluster can export the filesystem, which can then be
viewed across the network using NFS or another distributed filesystem
technology. In fact, each node can export the filesystem, which could be mounted
on several clients.
Although not all clustered filesystems provide identical functionality, the goals
of clustered filesystems are usually stricter than distributed filesystems in that a
single unified view of the filesystem together with full cache coherency and
UNIX semantics, should be a property of all nodes within the cluster. In essence,
each of the nodes in the cluster should give the appearance of a local filesystem.
There are a number of properties of clusters and clustered filesystems that
enhance the capabilities of a traditional computer environment, namely:
Resilience to server failure. Unlike a distributed filesystem environment
where a single server crash results in loss of access, failure of one of the servers
in a clustered filesystem environment does not impact access to the cluster
as a whole. One of the other servers in the cluster can take over
responsibility for any work that the failed server was doing.
Resilience to hardware failure. A cluster is also resilient to a number of
different hardware failures, such as loss of part of the network or of the disks.
Because access to the cluster is typically through one of a number of
different routes, requests can be rerouted as and when necessary
independently of what has failed. Access to disks is also typically through a
shared network.
Application failover. Failure of one of the servers can result in loss of service
to one or more applications. However, by having the same application set in
a hot standby mode on one of the other servers, a detected problem can result
in a failover to one of the other nodes in the cluster. A failover results in one
machine taking the place of the failed machine. Because a single server
failure does not prevent access to the cluster filesystem on another node, the
application downtime is kept to a minimum; the only work to perform is to
restart the applications. Any form of system restart is largely taken out of the
picture.
Increased scalability. Performance can typically be increased by simply adding
another node to the cluster. In many clustered environments, this may be
achieved without bringing down the cluster.
Better management. Managing a set of distributed filesystems involves
managing each of the servers that export filesystems. A cluster and clustered
filesystem can typically be managed as a whole, reducing the overall cost of
management.
As clusters become more widespread, this increases the choice of underlying
hardware. If much of the reliability and enhanced scalability can be derived from
software, the hardware base of the cluster can be moved from more traditional,
high-end servers to low cost, PC-based solutions.
Clustered Filesystem Components
To achieve the levels of service and manageability described in the previous
section, there are several components that must work together to provide a
clustered filesystem. The following sections describe the various components that
are generic to clusters and cluster filesystems. Later sections put all these
components together to show how complete clustering solutions can be
constructed.
Hardware Solutions for Clustering
When building clusters, one of the first considerations is the type of hardware that
is available. The typical computer environment comprises a set of clients
communicating with servers across Ethernet. Servers typically have local storage
connected via standards such as SCSI or proprietary based I/O protocols.
While Ethernet and communication protocols such as TCP/IP are unlikely to
be replaced as the communication medium between one machine and the next,
the host-based storage model has been evolving over the last few years. Although
SCSI attached storage will remain a strong player in a number of environments,
the choice for storage subsystems has grown rapidly. Fibre channel, which allows
the underlying storage to be physically separate from the server through use of a
fibre channel adaptor in the server and a fibre switch, enables construction of
storage area networks or SANs.
Figure 13.5 shows the contrast between traditional host-based storage and
shared storage through use of a SAN.
Cluster Management
Because all nodes within the cluster are presented as a whole, there must be a
means by which the clusters are grouped and managed together. This includes the
ability to add and remove nodes to or from the cluster. It is also imperative that
any failures within the cluster are communicated as soon as possible, allowing
applications and system services to recover.
These types of services are required by all components within the cluster
including filesystem, volume management, and lock management.
Failure detection is typically achieved through some type of heartbeat
mechanism for which there are a number of methods. For example, a single
master node can be responsible for pinging slave nodes that must respond
within a predefined amount of time to indicate that all is well. If a slave does not
respond before this time or a specific number of heartbeats have not been
acknowledged, the slave may have failed; this then triggers recovery
mechanisms.
Employing a heartbeat mechanism is obviously prone to failure if the master
itself dies. This can however be solved by having multiple masters along with the
ability for a slave node to be promoted to a master node if one of the master
nodes fails.
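The master/slave heartbeat scheme can be sketched as a simple per-node miss
counter; this is an illustrative model, and the names and threshold are
hypothetical.

```c
#define MAX_NODES  8
#define MAX_MISSED 3 /* consecutive misses before a node is declared failed */

struct cluster {
    int missed[MAX_NODES]; /* consecutive unacknowledged heartbeats */
    int failed[MAX_NODES];
};

/* Called by the master once per heartbeat interval; acked[i] is nonzero
 * if slave i responded in time. A slave that misses MAX_MISSED
 * consecutive heartbeats is declared failed, triggering recovery. */
void heartbeat_tick(struct cluster *c, const int *acked, int nnodes)
{
    for (int i = 0; i < nnodes; i++) {
        if (acked[i])
            c->missed[i] = 0;
        else if (++c->missed[i] >= MAX_MISSED)
            c->failed[i] = 1; /* start recovery for node i */
    }
}
```

Requiring several consecutive misses, rather than declaring failure on the
first one, makes the mechanism tolerant of transient network delays.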
Cluster Volume Management
In larger server environments, disks are typically managed through use of a
Logical Volume Manager. Rather than exporting physical disk slices on which
filesystems can be made, the volume manager exports a set of logical volumes.
Volumes look very similar to standard disk slices in that they present a
contiguous set of blocks to the user. Underneath the covers, a volume may
comprise a number of physically disjoint portions of one or more disks.
Mirrored volumes (RAID-1) provide resilience to disk failure by providing one or
more identical copies of the logical volume. Each mirrored volume is stored on a
different disk.
In addition to these basic volume types, volumes can also be striped (RAID 0).
For a striped volume the volume must span at least two disks. The volume data is
then interleaved across these disks. Data is allocated in fixed-sized units called
stripes. For example, Figure 13.6 shows a logical volume where the data is striped
across three disks with a stripe size of 64KB.
The first 64KB of data is written to disk 1, the second 64KB of data is written to
disk 2, the third to disk 3, and so on. Because the data is spread across multiple
disks, this increases both read and write performance because data can be read
from or written to the disks concurrently.
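The mapping from a logical volume offset to a disk and an offset within that
disk follows directly from the description above; the function below is an
illustrative sketch of that arithmetic, not volume manager code, and numbers
the disks from 0 rather than from 1 as in Figure 13.6.

```c
/* Location of a logical byte within a striped volume. */
struct stripe_loc {
    int disk;                  /* which disk holds the byte    */
    unsigned long disk_offset; /* byte offset within that disk */
};

/* Map a logical offset, given the stripe size in bytes and the number
 * of disks the volume is striped across. Stripes are assigned to the
 * disks round-robin: stripe 0 to disk 0, stripe 1 to disk 1, and so on. */
struct stripe_loc stripe_map(unsigned long offset,
                             unsigned long stripe_size, int ndisks)
{
    unsigned long stripe = offset / stripe_size; /* global stripe index */
    struct stripe_loc loc;

    loc.disk = (int)(stripe % (unsigned long)ndisks);
    loc.disk_offset = (stripe / (unsigned long)ndisks) * stripe_size
                    + offset % stripe_size;
    return loc;
}
```

With a 64KB stripe size and three disks, offsets 0, 64KB, and 128KB land on
the first, second, and third disks respectively, and offset 192KB wraps back
to the first disk at its second stripe.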
Volume managers can also implement software RAID-5, whereby data is
protected by parity information computed from the corresponding stripes on
the other disks and distributed across the disks in the volume.
In a SAN-based environment where all servers have shared access to the
underlying storage devices, management of the storage and allocation of logical
volumes must be coordinated between the different servers. This requires a
clustered volume manager: a set of cooperating volume managers, one per server,
that communicate to present a single unified view of the storage. This prevents one
server from overwriting the configuration of another server.
A logical volume created on one node in the cluster is visible to all other
nodes in the cluster. This allows parallel applications to run across the cluster and
see the same underlying raw volumes. As an example, Oracle RAC (Real
Application Clusters), formerly Oracle Parallel Server (OPS), can run on each node in the
cluster and access the database through the clustered volume manager.
Clustered volume managers are resilient to a server crash. If one of the servers
crashes, there is no loss of configuration since the configuration information is
shared across the cluster. Applications running on other nodes in the cluster see
no loss of data access.
Cluster Filesystem Management
The goal of a clustered filesystem is to present an identical view of the same
filesystem from multiple nodes within the cluster. As shown in the previous
sections on distributed filesystems, providing cache coherency between these
different nodes is not an easy task. Another difficult issue concerns lock
management between different processes accessing the same file.
Clustered filesystems have additional problems in that they must share the
resources of the filesystem across all nodes in the system. Taking a read/write
lock in exclusive mode on one node is inadequate if another process on another
node can do the same thing at the same time. Node joins and node failures must
also be taken into consideration: what happens if one of the nodes in the
cluster fails? The recovery mechanisms
involved are substantially different from those found in the distributed filesystem
client/server model.
The local filesystem must be modified substantially to take these
considerations into account. Each operation that is provided by the filesystem
must be modified to become cluster aware. For example, take the case of mounting
a filesystem. One of the first operations is to read the superblock from disk, mark
it dirty, and write it back to disk. If the mount command is invoked again for this
filesystem, it will quickly complain that the filesystem is dirty and that fsck
needs to be run. In a cluster, the mount command must know how to respond to
the dirty bit in the superblock.
A transaction-based filesystem is essential for providing a robust, clustered
filesystem because if a node in the cluster fails and another node needs to take
ownership of the filesystem, recovery needs to be performed quickly to reduce
downtime. There are two models in which clustered filesystems can be
constructed, namely:
Single transaction server. In this model, only one of the servers in the cluster,
the primary node, performs transactions. Although any node in the cluster
can perform I/O, if any structural changes are needed to the filesystem, a
request must be sent from the secondary node to the primary node in order to
perform the transaction.
Multiple transaction servers. With this model, any node in the cluster can
perform transactions.
Both types of clustered filesystems have their advantages and disadvantages.
While the single transaction server model is easier to implement, the primary
node can quickly become a bottleneck in environments where there is a lot of
meta-data activity.
There are also two approaches to implementing clustered filesystems. Firstly, a
clustered view of the filesystem can be constructed by layering the cluster
components on top of a local filesystem. Although simpler to implement,
without knowledge of the underlying filesystem implementation, difficulties can
arise in supporting various filesystem features.
The second approach is for the local filesystem to be cluster aware. Any
features that are provided by the filesystem must also be made cluster aware. All
locks taken within the filesystem must be cluster aware and reconfiguration in the
event of a system crash must recover all cluster state.
The section The VERITAS SANPoint Foundation Suite describes the various
components of a clustered filesystem in more detail.
Cluster Lock Management
Filesystems, volume managers, and other system software require different lock
types to coordinate access to their data structures, as described in Chapter 10. This
obviously holds true in a cluster environment. Consider the case where two
processes are trying to write to the same file. The process which obtains the inode
read/write lock in exclusive mode is the process that gets to write to the file first.
The other process must wait until the first process relinquishes the lock.
In a clustered environment, these locks, which are still based on primitives
provided by the underlying operating system, must be enhanced to provide
distributed locks, such that they can be queried and acquired by any node in the
cluster. The infrastructure required to perform this service is provided by a
distributed or global lock manager (GLM).
Developing a Filesystem for the Linux Kernel
This chapter describes the implementation of uxfs, a simple filesystem written
to demonstrate how filesystems are developed for the Linux kernel. In addition
to the kernel code, the mkfs and fsdb commands used to create and examine uxfs
filesystems are described, along with debugging techniques based on printk(),
kdb, and gdb. The filesystem was developed for the 2.4.18 kernel. The source
for uxfs can be downloaded from the following Web site:
www.wiley.com/compbooks/pate
The uxfs filesystem is intentionally small and simple. It uses 512-byte
blocks, defined by UX_BSIZE in ux_fs.h, and supports a maximum of 470 data
blocks (UX_MAXBLOCKS) and 32 inodes (UX_MAXFILES). Inodes 0 and 1 are not
used; inode 2 is the root directory and inode 3 is lost+found, leaving 28
inodes for user files. The superblock is stored in block 0 and holds the free
inode and free data block maps in addition to basic filesystem state:
struct ux_superblock {
        __u32 s_magic;
        __u32 s_mod;
        __u32 s_nifree;
        __u32 s_inode[UX_MAXFILES];
        __u32 s_nbfree;
        __u32 s_block[UX_MAXBLOCKS];
};
Each inode records the file's mode, link count, timestamps, ownership, size,
and the addresses of its data blocks:
struct ux_inode {
        __u32 i_mode;
        __u32 i_nlink;
        __u32 i_atime;
        __u32 i_mtime;
        __u32 i_ctime;
        __s32 i_uid;
        __s32 i_gid;
        __u32 i_size;
        __u32 i_blocks;
        __u32 i_addr[UX_DIRECT_BLOCKS];
};
Inodes are stored one per block starting at block 8, so the root directory
inode resides in block 10 and the lost+found inode in block 11; the remaining
inodes occupy blocks 12 through 39. Data blocks start at block 50, with the
root directory's data held in block 50 and lost+found's in block 51. Each
inode holds 9 direct block addresses, so the maximum file size is (9 * 512) =
4608 bytes. Directory entries are fixed in size, with filenames limited to 28
characters, giving a 32-byte entry. Despite its simplicity, uxfs supports
hard links, symbolic links, and rename operations, and it interfaces to the
kernel through the four operations vectors: super_operations,
file_operations, address_space_operations, and inode_operations.
Linux Kernel Source
This section shows how to download the Linux kernel source and how to find
your way around the kernel source tree to locate files that are of most interest to
filesystem development. Later sections show how to configure the kernel to
match the hardware on your system, to compile it, and then install the newly
built kernel. Both the LILO and GRUB bootloaders are described.
The Linux kernel source can be retrieved from the following Web site:
www.kernel.org
The home page of www.kernel.org shows the latest versions of the kernel. For
example, the following line showed the latest stable version at the time of this
writing:
The latest stable version of the Linux kernel is: 2.4.18 2002-07-10 00:40
UTC F V VI Changelog
The Web site also describes the state of the different kernels including the latest
stable version. Click on the kernel version to download the latest kernel. Clicking
on Changelog will display all of the updates to the latest kernel.
All of the kernels since Linux's inception can be found at this site. Follow the
links through to the source repositories and locate the kernel of your choice. To
use the source in the book as is, you need the 2.4.18 kernel. Alternatively, as
described earlier, newer versions of the filesystem can be obtained from the
following Web site:
www.wiley.com/compbooks/pate
Also at the site is information about which Linux kernels and the various Linux
distributions that uxfs supports.
To locate the required kernel source, follow the various pointers. As an
example, from the home page follow the link to Linux repository, including kernel
source, then kernel and 2.4. This will take you to the following link:
www.kernel.org/pub/linux/kernel/v2.4/
The kernel source is a compressed tar archive. Once the file has been
downloaded, it should be uncompressed and untarred. The kernel source usually
resides under /usr/src, although this is not mandatory. One possibility is to
untar the archive in /usr/src and set a symlink to point to the directory. For
example, if the bzip2-compressed archive has been placed in /usr/src, perform
the following steps:
# bunzip2 linux-2.4.18.tar.bz2
# mv linux linux.orig
# tar xvf linux-2.4.18.tar
# mv linux linux-2.4.18
# ln -s linux-2.4.18 linux
Extracting the files from the tar archive will place them in the directory linux in
the current working directory by default. The command to move the old linux
directory aside may be replaced with something more suitable to your
environment. Alternatively, the source can be extracted in a separate directory
and then moved into /usr/src/linux-2.4.18. Be careful not to overwrite any
existing Linux kernel source trees.
What's in the Kernel Source Tree
There are many files and directories in the Linux kernel source tree. This section
provides an overview of how the kernel source tree is laid out to allow readers
to easily locate the various kernel subsystems or specific files.
arch. This directory contains a directory for each of the different machine
architectures that Linux supports including Intel, Sparc, MIPS, and IBM s390.
CREDITS. This file lists all of the major contributors to the kernel together with
information about their area of expertise or contribution.
Documentation. There is a whole host of documentation distributed with
the kernel source. The filesystems directory contains information about
some of the different Linux filesystems in addition to generic
filesystem-related information.
drivers. This directory contains all of the Linux device drivers.
fs. This is the directory that will be of most relevance to people interested in
filesystems together with the mm directory that contains much of the page
cache/data I/O management code. Files in the fs directory implement the
dcache, buffer cache, inode cache, and file-related system call handling. Also
within the fs directory is a directory for each of the Linux filesystems.
Within their respective directories are the filesystem source files themselves.
include. All of the kernel header files can be accessed within this directory.
This directory contains architecture-specific header files in addition to
header files that are common across all architectures. The common header
files can be found in the linux subdirectory. The fs.h header file is of
particular importance to filesystem writers. The dcache.h header file
defines the structures used by the Linux dcache.
init. This directory contains functions that are executed during kernel
bootstrap.
ipc. This directory contains source applicable to System V IPC (Inter Process
Communication) including semaphores, shared memory, and message
queues.
kdb. If the kdb patch is installed, this directory contains source for the kernel
debugger. Note that the kdb patch also changes other files throughout the
kernel.
kernel. This directory contains core kernel routines such as process
management, system call handling, module management, and so on.
lib. Some of the standard C library functions have counterparts in the kernel.
The source can be found in this directory.
MAINTAINERS. This file lists the people who are responsible for various parts
of the kernel.
mm. This directory contains all of the memory management code that is not
specific to one architecture or another. The Linux page cache management
routines can be found in this directory.
net. All of the networking protocols (TCP, UDP, IP, etc.) are stored in this
directory.
There are too many files and directories to describe here. However, for readers
interested in learning about filesystems, the include, fs, and mm directories are
where most of the filesystem-related structures and routines can be found. There
are also a few interesting files in the drivers/block directory for those
wishing to look at the filesystem/driver interfaces in more detail.
Configuring the Kernel
Before building the kernel, it is necessary to determine the kernel configuration.
There are many components that are part of the kernel source tree that you will
not need as part of your kernel. For example, there are numerous different device
drivers for the various SCSI adaptors. If you don't have a need for SCSI access,
building support into the kernel is unnecessary. Thus, you need to determine
what hardware configuration you have and therefore which kernel components
are required.
There are several different methods of defining the configuration. The Linux
kernel HOWTO should be consulted in addition to the notes described here.
There are multiple copies of the HOWTO available across the World Wide Web.
You can find it at the following Web site:
www.tldp.org/HOWTO/Kernel-HOWTO.html
One of the easiest ways to determine which components of the kernel are needed
is to install the kernel source when the Linux operating system is installed. This
will result in a configuration file for the installed kernel being available for
consultation. It is then possible to copy the configuration file from the installed
kernel source tree to the new kernel source tree as follows:
# cp /usr/src/linux-2.4.18-3/.config /usr/src/linux-2.4.18/.config
Care must be taken here. If the new kernel being installed has a substantially
different configuration from the installed kernel, some options may not be
available. However, this method should suffice in most cases.
One method of defining the configuration is to run the following command for
both the installed kernel and the new kernel source. For example, for Red Hat 7.3
run the following:
# cd /usr/src/linux-2.4.18-3
# make menuconfig
And for the new kernel do the following:
# cd /usr/src/linux-2.4.18
# make menuconfig
By having both windows side by side it is easy to see which components you need
to select for the new kernel by browsing through the configuration of the current
kernel. The alternative method is to fully understand what type of hardware you
have. When comparing the configurations side by side, it is a safe bet to select
everything for the new kernel that is selected in the current kernel.
Items are selected if noted by an asterisk. Loadable kernel modules are denoted
by the letter "M." Instructions are available at the top of the screen to indicate how
to select. Pressing Enter expands the menu to the next level. Pressing the Escape
key takes you back up a level.
Once you have completed changing the configuration, a series of Escape key
sequences will prompt you as to whether you wish to save and exit. Note that you
do not need to save the configuration for the current kernel. This is particularly
important if you have accidentally made any changes. After saving the
configuration and exiting the program, the following message appears:
Saving your kernel configuration..
*** End of Linux kernel configuration.
*** Check the top-level Makefile for additional configuration.
*** Next, you must run make dep
Follow the instructions by issuing the following commands:
# make dep
# make clean
The first step builds all of the necessary kernel dependencies based on the set of
options chosen during the kernel configuration process. The next step is to ensure
that the build environment is clean such that a subsequent kernel compilation will
not pick up any precompiled files that do not match the configuration chosen.
The next step, which is the longest, is to compile the kernel. This can be
achieved by typing the following:
# make bzImage
..
objcopy -O binary -R .note -R .comment -S compressed/bvmlinux
compressed/bvmlinux.out
tools/build -b bbootsect bsetup compressed/bvmlinux.out CURRENT > bzImage
Root device is (3, 2)
Boot sector 512 bytes.
Setup is 2536 bytes.
System is 1301 kB
warning: kernel is too big for standalone boot from floppy
make[1]: Leaving directory '/usr/src/linux-2.4.18/arch/i386/boot'
Once the process is complete, the compressed kernel, which is called bzImage,
will be placed in the directory arch/i386/boot. This should be copied to
/boot and given a unique name as follows:
# cp arch/i386/boot/bzImage /boot/linux.spate
Note the name of the file that the kernel was copied to. This should be given an
easy to remember name and should not overwrite any existing kernels that are
already in /boot. One exception to this rule is when you are building kernels
frequently and you know which kernels can be safely overwritten.
Because many of the kernel components were probably selected to be kernel
modules, they must be compiled and installed as follows:
# make modules
# make modules_install
The modules are compiled and installed under the /lib/modules directory.
There is one subdirectory for each kernel version. For example, in the case of the
kernel being used here, the modules will reside under:
/lib/modules/2.4.18
It is important to remember to compile and install the modules selected during
configuration, a task that is often easy to forget. Without the modules in place,
the kernel may not boot.
Installing and Booting the New Kernel
The next step is to configure the boot loader to recognize the new kernel. Most
Linux distributions either use LILO or GRUB as the boot loader. This section
describes how to use LILO, the most commonly used boot loader.
Consider the following lines taken from one specific /etc/lilo.conf file
that was created as part of Red Hat 7.3 installation:
image=/boot/vmlinuz-2.4.18-3
label=linux
initrd=/boot/initrd-2.4.18-3.img
read-only
root=/dev/hda2
The image field specifies the kernel to bootstrap. When lilo runs and displays
the list of bootable kernels, it displays the names found next to the label field, in
this case linux. The initrd field specifies an initial root disk (RAM disk) that
will be used prior to checking and mounting the real root filesystem. The root
field specifies where the root disk can be found.
In order to bootstrap the new kernel, copy these lines to the end of the file and
change both the image and label lines as follows:
image=/boot/linux.spate
label=linux.spate
initrd=/boot/initrd-2.4.18-3.img
read-only
root=/dev/hda2
This creates an entry for the new kernel and leaves the existing entry for the
default kernel unchanged. Note that it is important not to modify any of the
configuration information for the kernel installed as part of the Linux installation.
It is imperative to have a kernel that boots safely because there will be times when
building new kernels where device drivers are accidentally omitted. For example,
it is not uncommon when building a kernel for the first few times to omit vital
information such as the correct disk drivers, rendering the new kernel
unbootable.
The final step is to run lilo to install information about the new kernel in the
master boot record:
# lilo
A successful run of lilo should not display anything. Once completed, you will
see an entry corresponding to your kernel (the label field) next time the machine
is rebooted.
Using GRUB to Handle Bootstrap
Many Linux distributions are now using the GRUB (GRand Unified Bootloader)
boot loader. This is extremely rich in features but operates in a different manner to
LILO. However, adding a new kernel is not difficult. The /etc/grub.conf file
is used in a similar manner to /etc/lilo.conf. However, adding an entry to
this file is sufficient. GRUB does not need to be run to install the information in
the master boot record.
For further information on GRUB, see the grub manual page.
Booting the New Kernel
The next step is to reboot the machine. Once the machine boots, lilo displays the
list of kernels that it is able to bootstrap. The newly installed kernel should be
visible. This can be selected using the arrow keys and loaded by pressing Enter. If
all goes well, the new kernel will boot as expected.
To verify that the kernel requested is running, the uname command can be
used to display the kernel version as follows:
# uname -a
Linux x.y.com 2.4.18 #2 SMP Tue Jul 30 18:55:27 PDT 2002 i686 unknown
The kernel version is shown in bold. There will be times when you reboot the
machine and lilo automatically boots a kernel by default and you often wonder
which kernel is running when you return to the machine. It is typically a good
idea to have the default kernel set to the kernel that was installed when the Linux
operating system was installed.
Installing Debugging Support
Analyzing the filesystem source code is one way to learn about how filesystems
work. However, with this method alone it is extremely difficult to truly
understand the flow through the kernel and filesystem in response to certain
operations. There is no better method than installing and using one of the
different kernel debuggers allowing you to stop in specific functions, display
stack backtraces and function arguments, and print other useful information.
There are three main methods under which a filesystem or indeed any other
part of the kernel can be debugged. The first approach involves using the kernel
printk() command which is very similar to printf(). The second approach
involves using a standalone debugger such as kdb whereby flow can be stopped
by placing explicit breakpoints or by entering a special key sequence to enter the
debugger. The third approach involves the use of two machines connected
through a serial cable and over which gdb can be used for source level
debugging.
The following sections describe each of these approaches. The amount of work
to perform each task is considerably different with printk() being the simplest
approach while the gdb approach involves more time to set up and an additional
machine. For readers who wish to experiment and have access to all the available
resources it is recommended that you start with printk() first, then move to
kdb, and finally to gdb.
The following sections assume some familiarity with debugging concepts.
The printk Approach to Debugging
One of the oldest and easiest styles of debugging is the printf() method. By
placing printf() statements throughout the code it is possible to display
information about the running program. This is useful for development or
simply to follow the flow through the program.
Linux provides the printk() function for kernel/module writers to use.
With the exception of the name change, it can be used in the same manner in
which printf() can be called. One method employed when writing uxfs was to
place a printk() at the start of each entry point to the filesystem. When typing
various commands at the user prompt, it is then easy to see which functions in the
filesystem are called.
Because Linux supports loadable modules, and the time to recompile and
reload a module is on the order of seconds, this is the easiest way to watch how the
filesystem works in practice and should be the method initially followed by
anyone new to kernel development who wants to understand how the kernel
works. To get a better idea of how the filesystem-related kernel functions work,
printk() calls can be placed throughout the kernel, and various structures can
be displayed.
Using the SGI kdb Debugger
The kdb debugger is a built-in debugger. It must be compiled with the kernel in
order for it to be used. It can be used to set breakpoints, display memory,
disassemble instructions, and display machine configuration such as the register
set. The debugger operates around the kernel symbol table, and therefore
functions and structures can be accessed by name.
The source code for kdb, which was developed by engineers at SGI (Silicon
Graphics Inc), can be downloaded from the SGI Web site. The home page for kdb
is as follows:
http://oss.sgi.com/projects/kdb/
Note that when following the link to the download section, the directories
displayed are for the versions of kdb and not versions of the Linux kernel. For the
kernel used to develop uxfs (2.4.18), kdb version 2.1 must be used (later
versions did not support this kernel at the time of writing).
The README file in the download directory contains instructions on which files
to download. This file should be consulted prior to downloading. Note that there
may be several versions for the same kernel. The README file specifies how to
interpret the version numbers of the patches.
There are two patch files to download. The first is common across all different
machine architectures and the second is specific to the machine architecture on
which you're running. After downloading the patches, they can be applied as
follows:
# cd /usr/src/linux-2.4.18
# patch -p1 < ../kdb-v2.1-2.4.18-common-
patching file kernel/sysctl.c
patching file kernel/ksyms.c
patching file kernel/Makefile
patching file init/main.c
..
patching file Documentation/kdb/kdb_env.man
patching file Documentation/kdb/kdb.mm
patching file Documentation/kdb/kdb_bp.man
patching file Documentation/kdb/slides
# patch -p2 < ../kdb-v2.1-2.4.18-i386-
patching file include/asm-i386/hw_irq.h
patching file include/asm-i386/keyboard.h
patching file include/asm-i386/ptrace.h
patching file arch/i386/vmlinux.lds
..
patching file arch/i386/kdb/kdbasupport.c
patching file arch/i386/kdb/ansidecl.h
patching file arch/i386/kdb/bfd.h
patching file arch/i386/kdb/ChangeLog
Once the patch has been successfully applied, the kernel configuration must be
changed to incorporate kdb. Under the section marked Kernel hacking, select the
option Built-in Kernel Debugger support and select the KDB modules. The kernel
must then be built (make dep ; make bzImage) and reinstalled as described
in the section Configuring the Kernel earlier in the chapter.
Included with the kdb patch is documentation on how the debugger works,
the commands that are available, and so on. The debugger can be entered by
pressing the BREAK key. The kdb prompt is then displayed as follows:
Entering kdb (current=0xc03b0000, pid 0) on processor 0 due to Keyboard Entry
[0]kdb>
The ? command can be used to display the available commands. Shown below is
a summary of the more commonly used commands. Examples of how they are
used in practice will be shown throughout the chapter.
bp. Set or display a breakpoint.
bph. Set a hardware breakpoint.
bc. Clear a breakpoint.
bl. List the current breakpoints.
bt. Display the stack backtrace for the current process.
go. Exit the debugger and restart kernel execution.
id. Disassemble instructions.
md. Display the contents of the specified address.
mds. Display memory symbolically.
mm. Modify memory.
reboot. Reboot the machine immediately.
rd. Display the register contents.
ss. Single step (one instruction at a time).
ssb. Single step the CPU until a branch is reached.
The kdb(8) man page describes the other commands.
Source Level Debugging with gdb
The GNU debugger gdb has been available for many years, typically being used
to debug user-level programs. However, by connecting machines together over a
serial line in a host/target configuration, gdb can also be used to debug the Linux
kernel. This requires a patch to the kernel to include a kgdb driver through which
gdb on the host machine can communicate. Although this requires an extra
machine and some additional setup work, the ease of use of debugging the kernel
at source level is well worth the extra work. It is also easier to see how the kernel
works because not only can breakpoints be added to show the flow through the
kernel, but function arguments can be displayed along with the source code
corresponding to the position at which the breakpoint is hit.
There are multiple patches for kernel-level gdb debugging. The following Web
page:
http://kgdb.sourceforge.net/
is the homepage for kgdb. It references all of the patches and contains detailed
instructions on gdb setup. The following sections highlight some of the main
points. For complete details, refer to the kgdb homepage.
Connecting the Host and Target Machines
The first step for gdb debugging is to connect the two machines together and
verify that data can be passed through the link. The machines must be connected
through a standard null modem between the serial ports of the machines as
shown in Figure 14.2.
Serial ports support transmission rates from 110 baud up to 115,200 baud. The
default baud rate for a serial port is 9,600. This is generally adequate for simple
debugging although higher baud rates are preferred if a lot of information will be
transmitted over the wire. This will certainly be the case when displaying
multiple thread stacks.
Once the link is in place, the speed of the serial port on each machine must be
identical. This can be verified on each machine as follows:
# stty < /dev/ttyS0
speed 9600 baud; line = 0;
min = 0; time = 10;
-brkint -icrnl -imaxbel
-opost -onlcr
-isig -icanon -iexten -echo -echoe -echok -echoctl -echoke
The baud rate is shown here as 9,600. If the baud rate differs between the two
machines, the following call to the stty command can set the baud rate:
# stty ispeed 9600 ospeed 9600 < /dev/ttyS0
Assuming that the baud rate is the same on both machines and the cable is in
place, the link can be tested by simply echoing a string through the cable on one
end and reading it on another as follows:
Host Target
# cat /dev/ttyS0 # echo hello > /dev/ttyS0
hello
If any problems are encountered, review the troubleshooting guide on the kgdb
kernel Web site.
Downloading the kgdb Patch
The download section of the kgdb kernel Web site contains the kernel patches for
specific Linux kernels. Each patch is an ASCII file that contains a set of diffs. Once
downloaded, the patch to build kgdb into the kernel can be applied to the kernel
as follows:
# cd /usr/src/linux
# patch -p1 < ../linux-2.4.18-kgdb-1.5.patch
patching file Documentation/Configure.help
patching file Documentation/i386/gdb-serial.txt
patching file Makefile
patching file arch/i386/Makefile
patching file arch/i386/config.in
patching file arch/i386/kernel/Makefile
..
patching file kernel/ksyms.c
patching file kernel/sched.c
Once the patch has been applied, the kernel configuration must be updated to
include the kgdb options. Under the Kernel Debugging section, select the
following line:
KGDB: Remote (serial) kernel debugging with gdb (NEW)
and then select each of the kgdb suboptions. Note that the Verbose BUG() reporting
option should not be selected.
After saving the kernel configuration, run the following:
# make dep
# make clean
# make bzImage
to build the new kernel. As described in earlier sections, the kernel will be found
under the arch/i386/boot directory.
Installing the kgdb-Modified Kernel
To install the new kernel, the entry in lilo.conf must be changed to instruct the
kernel to wait, on bootstrap, for a connection from gdb on the host machine.
Shown below is an entry in lilo.conf for the new kernel:
image=/boot/linux.gdb
label=linux.gdb
initrd=/boot/initrd-2.4.18-3.img
read-only
root=/dev/hda2
append="gdb gdbttyS=0 gdbbaud=9600"
This instructs the kgdb stub which serial port to use (/dev/ttyS0) and the baud
rate that was established earlier during gdb configuration.
When the new kernel bootstraps, the following message is displayed:
Waiting for connection from remote gdb..
To connect to the target machine, gdb must be run on the host and the following
commands should be entered:
# gdb
GNU gdb Red Hat Linux (5.1.90CVS-5)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you
are welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "i386-redhat-linux"
(gdb) target remote /dev/ttyS0
Remote debugging using /dev/ttyS0
0xc011323d in ?? ()
(gdb) c
Continuing.
PCI: PCI BIOS revision 2.10 entry at 0xfbfee, last bus=
PCI: Using configuration type
..
The "target remote" command specifies the serial port to connect to in order
to communicate with the kernel. The c command then continues execution.
To break into the debugger and instruct it where to access the symbolic
debugging information, hit Control-C as follows:
Program received signal SIGTRAP, Trace/breakpoint trap.
0xc011323d in ?? ()
(gdb) symbol-file /usr/src/linux/vmlinux
Reading symbols from /usr/src/linux/vmlinux...done.
The debugger now has enough information to debug the kernel.
gdb and Module Interactions
Because uxfs is a loadable module, gdb knows nothing about the location of the
module in memory or where to locate the module's symbolic information.
The loadmodule script, also located on the kgdb Web site, must be used to
load the module. It is assumed that the module source and binary are located on
the host machine and that it is possible to rcp from the host to the target.
Before running loadmodule, the GDBSCRIPTS variable, located at the top of
the script, must be altered to point to a directory where it can install a script for
use with gdb. As an example:
GDBSCRIPTS=/home/spate/uxfs/tools/gdbscripts
The script can then be run as follows:
# loadmodule target-machine ../kern/uxfs
Copying ../kern/uxfs to linux
Loading module ../kern/uxfs
Generating script /home/spate/uxfs/tools/gdbscripts/loadlinuxuxfs
Once completed, the module should be loaded on the target machine and the
script generated is displayed. This should be run from within gdb. Control-C will
get you into gdb from which the script can be executed as follows:
Program received signal SIGTRAP, Trace/breakpoint trap.
breakpoint () at gdbstub.c:1177
1177
(gdb) so /home/spate/uxfs/tools/gdbscripts/loadlinuxuxfs
add symbol table from file "/home/spate/uxfs/kern/uxfs" at
.text_addr = 0xd0854060
.rodata_addr = 0xd0855c60
__ksymtab_addr = 0xd085618c
__archdata_addr = 0xd08562b0
__kallsyms_addr = 0xd08562b0
.data_addr = 0xd08568c0
.bss_addr = 0xd0856a60
The setup of gdb is now complete. Control-C can be invoked at any time the
debugger needs to be entered to add break points and so on. Use of gdb for
kernel-level debugging will be shown throughout the chapter.
Building the uxfs Filesystem
The source code for all of the files that are needed to build the uxfs filesystem for
the 2.4.18 kernel is included at the end of the chapter. This includes the source for
mkfs and fsdb, the kernel makefile, and the kernel source. The source tree
downloaded from www.wiley.com/compbooks/spate is a gzipped tar
archive. Download to any directory and issue the following commands:
# gunzip uxfs.tar.gz
# tar xvf uxfs.tar
# ls
uxfs.tar uxfs
# ls uxfs
cmds kern
Commands can be easily built. All that is required is for the ux_fs.h header file to
be located in the "../kern" directory. To build each of the commands, go to the
cmds directory and issue the following:
# make mkfs
cc mkfs.c -o mkfs
# make fsdb
cc fsdb.c -o fsdb
The commands can then be used.
The kernel makefile is relatively straightforward as follows:
KERNELDIR = /usr/src/linux
include $(KERNELDIR)/.config
FLAGS = -D__KERNEL__ -DMODULE $(VERCFLAGS)
GLOBAL_CFLAGS = -g -I$(KERNELDIR)/include $(FLAGS)
M_OBJS = ux_dir.o ux_alloc.o ux_file.o ux_inode.o
M_TARGET = uxfs
SRCS = $(M_OBJS:.o=.c)
CFLAGS = $(GLOBAL_CFLAGS) $(EXTRA_CFLAGS)
$(M_TARGET) : $(M_OBJS)
ld -r -o $@ $(M_OBJS)
$(M_OBJS) : %.o : %.c
$(CC) -c $(CFLAGS) -o $@ $<
all: uxfs
clean:
rm -f $(M_OBJS) $(M_TARGET)
To build the kernel source, the KERNELDIR variable at the top of the Makefile
must be changed to reference the kernel source directory. Figure 14.3 shows how
KERNELDIR is set to reference the 2.4.18 source tree.
Once this variable has been set, the kernel can be built as follows:
# make uxfs
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_dir.o ux_dir.c
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_alloc.o ux_alloc.c
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_file.o ux_file.c
cc -c -g -I/usr/src/linux/include -D__KERNEL__ -DMODULE -o ux_inode.o ux_inode.c
ld -r -o uxfs ux_dir.o ux_alloc.o ux_file.o ux_inode.o
This produces the uxfs module that can then be loaded into the kernel. This is
shown later in the chapter.
Creating a uxfs Filesystem
The first step when developing a new filesystem is to write a mkfs command to
place the initial filesystem layout on disk. This includes the following tasks:
Create and initialize the filesystem superblock and write it to disk.
Create a root directory inode and lost+found directory inode. For each
of the inodes, ensure that the "." and ".." entries are in place and for the root
directory, add an entry for lost+found.
Account for allocation of the two directories within the inode map.
Account for allocation of two blocks used for the root and lost+found
directories.
The code for mkfs can be found on lines 104 to 262. For uxfs, it is a fairly simple
program. As with the kernel, it uses various structure definitions and
information from ux_fs.h including superblock structural information, inode
formats, directory entries and various filesystem boundaries such as the
maximum number of blocks and inodes.
Before the filesystem is implemented, it is important to verify the information
that mkfs writes to disk. Thus, the next program to write is fsdb, which can read
back and display various superblock and inode information.
The fsdb command (lines 264 to 393) is very simple. It accepts two commands
that allow the superblock or a specified inode to be displayed.
The first task is to read the superblock into memory (lines 365 to 369), validate
it, and keep it in memory for the duration of the program. From here, it can
access any information it needs to about inodes or data blocks.
The remainder of the main() loop involves reading commands and then
calling additional routines. For now, only the superblock or an inode can be
displayed. By entering 'q', the program will exit.
The following output from fsdb shows the two commands being run on a
newly created filesystem:
# ./mkfs /dev/fd0
# ./fsdb /dev/fd0
uxfsdb >
Superblock contents:
s_magic = 0x58494e55
s_mod = UX_FSCLEAN
s_nifree = 28
s_nbfree = 468
uxfsdb > i2
inode number 2
i_mode = 41ed
i_nlink =
i_atime = Wed Aug 21 09:55:16 2002
i_mtime = Wed Aug 21 09:55:16 2002
i_ctime = Wed Aug 21 09:55:16 2002
i_uid =
i_gid =
i_size = 512
i_blocks =
i_addr[ 0] = 50 i_addr[ 1] = 0 i_addr[ 2] = 0 i_addr[ 3] =
i_addr[ 4] = 0 i_addr[ 5] = 0 i_addr[ 6] = 0 i_addr[ 7] =
i_addr[ 8] = 0 i_addr[ 9] = 0 i_addr[10] = 0 i_addr[11] =
i_addr[12] = 0 i_addr[13] = 0 i_addr[14] = 0 i_addr[15] =
Directory entries:
inum[ 2],name[.]
inum[ 2],name[..]
inum[ 3],name[lost+found]
uxfsdb > q
There are many more features that could be added to fsdb. Some of these
changes will be imperative when completing the exercises at the end of the
chapter.
Module Initialization and Deinitialization
When writing a loadable kernel module, there are three different things that need
to be defined:
A declaration giving information about the type of module
A function to be called when the module is loaded. This can perform any
initialization functions including registering the filesystem type with the
kernel.
A function to be called when the module is unloaded. This can clean up
any remaining filesystem structures and unregister the filesystem.
The various components that are applicable to uxfs are shown in ux_inode.c
on lines 1304 to 1317. The module_init() call specifies the function to be run
when the module is loaded while the module_exit() function specifies the
function to be called when the module is unloaded. Both of these functions
perform little work other than registering and unregistering the filesystem driver
respectively. The DECLARE_FSTYPE_DEV() macro is shown below:
#define DECLARE_FSTYPE(var,type,read,flags) \
struct file_system_type var = { \
        name:           type, \
        read_super:     read, \
        fs_flags:       flags, \
        owner:          THIS_MODULE, \
}

#define DECLARE_FSTYPE_DEV(var,type,read) \
        DECLARE_FSTYPE(var,type,read,FS_REQUIRES_DEV)
The kernel maintains a list of all such structures, one per filesystem. The entry for
uxfs is added when calling register_filesystem(). When a mount system
call enters the kernel, the filesystem name passed to mount is compared with the
name field of each file_system_type structure. If a match is found, the
read_super function is called to mount the filesystem.
The rmmod command is used to remove a kernel module. If there are still
filesystems mounted, the removal will fail; otherwise the kernel calls the module
exit function, which in the case of uxfs, is the exit_uxfs_fs() function. The
only action to perform is to call unregister_filesystem().
Testing the New Filesystem
The following examples show how a uxfs filesystem is created, how the kernel
module is loaded, the filesystem is unmounted, and how the module is unloaded.
Modules are loaded and unloaded with the insmod and rmmod commands. Note
that by default, the insmod command will attempt to look under
/lib/modules/ to locate the requested module. For
example, if the pathname is not specified as shown below, insmod will fail even
though the requested module is in the current directory. For this reason "./uxfs"
must be specified.
# ./mkfs /dev/fd0
# insmod ./uxfs
# lsmod
Module Size Used by Not tainted
uxfs 8608 0 (unused)
ext3 71968 2 (autoclean)
jbd 66208 2 (autoclean) [ext3]
# mount -t uxfs /dev/fd0 /mnt
# mount
/dev/hda2 on / type ext3 (rw)
none on /proc type proc (rw)
/dev/hda1 on /boot type ext3 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/hda5 on /home type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/fd0 on /mnt type uxfs (rw)
# rmmod uxfs
uxfs: Device or resource busy
# umount /mnt
# rmmod uxfs
# lsmod
Module Size Used by Not tainted
ext3 71968 2 (autoclean)
jbd 66208 2 (autoclean) [ext3]
The sequence of commands here is merely to illustrate the basics of how to get a
uxfs filesystem mounted. The module displayed by lsmod is the name of the
actual binary and does not bear any resemblance to the source code.
Mounting and Unmounting the Filesystem
The ux_read_super() function is called to mount a uxfs filesystem. This
function is declared through the DECLARE_FSTYPE_DEV() macro and becomes
known to the Linux kernel when the filesystem is registered. The code for this
function can be found in ux_inode.c on lines 1240 to 1302.
The ux_read_super() function takes three arguments as shown in
ux_inode.c on line 1234 and repeated below:
ux_read_super(struct super_block *s, void *data, int silent)
There is one super_block structure per mounted filesystem. One of the tasks to
be performed by ux_read_super() is to initialize this structure by filling in the
following fields:
s_magic. This field holds the magic number of the filesystem, which for uxfs
is 0x58494e55. This field has little practical value.
s_blocksize. This field holds the filesystem block size, which in the case of
uxfs is 512 bytes (UX_BSIZE).
s_op. This field holds the super_operations vector, a set of functions that
either deal with the filesystem as a whole or allow inodes to be read,
written, and deleted.
s_root. This field is set to reference the dentry for the root inode. This is
described in more detail later.
The data argument is used by the kernel to pass any arguments that were
passed to mount. At this stage, uxfs does not accept any command line
arguments to mount, so this parameter is ignored. The silent argument, if set,
allows the filesystem writer to display more detailed information when the
filesystem is running; this is useful for debugging.
The ux_read_super() function must also perform the following tasks:
Call set_blocksize() to specify to the underlying driver layer the units
of I/O that will be passed through when accessing data through the buffer
cache. Note that all subsequent I/O must be in fixed-size chunks.
Allocate and initialize a root inode for the filesystem. This will be
explained in more detail later.
The following example shows how to set a breakpoint in gdb, display a stack
backtrace, and show how to display various structures. First of all, after the
module is loaded, but before a call is made to mount a filesystem, a breakpoint
is set in ux_read_super(). Hitting Control-C will enter gdb from which the
breakpoint can be set:
(gdb) b ux_read_super
Breakpoint 1 at 0xd08557ca: file ux_inode.c, line 237.
(gdb)
Continuing.
In response to mounting the filesystem, the breakpoint will be hit as follows:
# mount -f uxfs /dev/fd0 /mnt
Breakpoint 1, ux_read_super (s=0xcf15a400, data=0x0, silent=0)
at ux_inode.c:237
237 dev = s->s_dev;
(gdb) list
232 struct ux_fs *fs;
233 struct buffer_head *bh;
234 struct inode *inode;
235 kdev_t dev;
236
237 dev = s->s_dev;
238 set_blocksize(dev, UX_BSIZE);
239 s->s_blocksize = UX_BSIZE;
240 s->s_blocksize_bits = UX_BSIZE_BITS;
241
The list command displays the source code from the point at which the
breakpoint has been hit. The bt command can be used to display the current stack
backtrace as follows:
(gdb) bt
#0 ux_read_super (s=0xcf15a400, data=0x0, silent=0) at ux_inode.c:237
#1 0xc0143868 in get_sb_bdev (fs_type=0xd0856a44,
dev_name=0xccfe8000 "/dev/fd0", flags=0, data=0x0) at super.c:697
#2 0xc0143d2d in do_kern_mount (type=0xccfe9000 "uxfs", flags=0,
name=0xccfe8000 "/dev/fd0", data=0x0) at super.c:879
#3 0xc0156ff1 in do_add_mount (nd=0xcd011f5c, type=0xccfe9000 "uxfs",
flags=0, mnt_flags=0, name=0xccfe8000 "/dev/fd0", data=0x0)
at namespace.c:630
#4 0xc01572b7 in do_mount (dev_name=0xccfe8000 "/dev/fd0",
dir_name=0xcf80f000 "/mnt", type_page=0xccfe9000 "uxfs",
flags=3236757504, data_page=0x0) at namespace.c:746
#5 0xc015737f in sys_mount (dev_name=0x805b418 "/dev/fd0",
dir_name=0x805b428 "/mnt", type=0x805b438 "uxfs", flags=3236757504,
data=0x0) at namespace.c:779
#6 0xc010730b in system_call ()
The arguments to the function at the current position in the stack trace
(ux_read_super()) can be displayed with the print (p) command. Note that
gdb understands C constructs:
(gdb) print *(struct super_block *)0xcf15a400
$1 = {s_list = {next = 0xc0293840, prev = 0xcf6df400}, s_dev = 512,
s_blocksize = 0, s_blocksize_bits = 0 '\000', s_dirt = 0 '\000',
s_maxbytes = 2147483647, s_type = 0xd0856a44, s_op = 0x0, dq_op = 0x0,
s_flags = 0, s_magic = 0, s_root = 0x0, s_umount = {count = -65535,
wait_lock = {lock = 1}, wait_list = {next = 0xcf15a43c,
prev = 0xcf15a43c}}, s_lock = {count = {counter = 0}, sleepers = 0,
wait = {lock = {lock = 1}, task_list = {next = 0xcf15a450,
prev = 0xcf15a450}}}, s_count = 1073741824, s_active = {counter = 1}
s_dirty = 0,
..
Later examples show some of the other features of gdb.
Scanning for a Uxfs Filesystem
The first task to perform when mounting the filesystem is to read the superblock
from disk. This involves a call to sb_bread() to read block 0 of the device on
which the superblock resides. The sb_bread() function is merely a wrapper
around bread() that extracts the device from the s_dev field of the
super_blockstructure. Thus the following calls are equivalent:
bh = sb_bread(sb, block);
bh = bread(sb->s_dev, block, sb->s_blocksize);
On return from sb_bread(), a buffer_head structure will reference the data
read from the device. Note that each call to sb_bread() must be followed at
some stage by a call to brelse() to release the buffer. An attempt to reread the
same block from disk prior to calling brelse() will cause the filesystem to
block. The data read from disk can be referenced by accessing the b_data field.
Because the superblock is located at offset 0 within block 0, the ux_superblock
structure can be referenced as shown in line 1253:
usb = (struct ux_superblock *)bh->b_data;
The first check to perform is to validate that this is a uxfs filesystem. Verification
is achieved by checking for presence of the uxfs magic number. Assuming that
this is detected and the superblock is not marked UX_FSDIRTY, the filesystem
can be mounted. Because all of the inode and data block information is stored in
the uxfs superblock, it is imperative to keep the superblock in memory at all
times. A ux_fs structure is allocated to keep hold of the buffer_head used to
read the superblock. This makes it easy to access the ux_superblock structure
from either the Linux super_block structure or from a Linux inode. This is
shown in Figure 14.4. Note that the buffer is not released until the filesystem is
unmounted.
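The cast-and-validate step can be mimicked in user space. The sketch below uses a hypothetical two-field superblock layout (the real struct ux_superblock in ux_fs.h also holds the inode and data block maps); only the magic-number and clean/dirty checks described above are shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplification of the on-disk uxfs superblock; the real
 * layout lives in ux_fs.h. */
#define UX_MAGIC   0x58494e55   /* spells "UNIX" */
#define UX_FSCLEAN 0
#define UX_BSIZE   512

struct ux_superblock_sim {
    uint32_t s_magic;
    uint32_t s_mod;             /* UX_FSCLEAN or UX_FSDIRTY */
};

/* Mimics the checks ux_read_super() performs after sb_bread():
 * cast the raw block to the superblock structure and validate it. */
static int ux_validate_sb(const char *b_data)
{
    const struct ux_superblock_sim *usb =
        (const struct ux_superblock_sim *)b_data;

    return usb->s_magic == UX_MAGIC && usb->s_mod == UX_FSCLEAN;
}
```

A block whose first bytes do not carry the magic number is rejected, which is exactly how a non-uxfs device is detected at mount time.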
Access to the ux_fs structure can be achieved through either the Linux
super_block structure or indirectly from the Linux inode structure as follows:
struct super_block *sb = inode->i_sb;
struct ux_fs *fs = (struct ux_fs *)sb->s_private;
struct ux_superblock *usb = fs->u_sb;
Because all exported uxfs functions are passed through either the super_block
or an inode structure as an argument, it is always possible to get access to the
uxfs superblock.
Reading the Root Inode
The final step when mounting the filesystem is to read in the root inode and
instantiate it in the dcache. This is achieved through a call to iget() followed by
a call to d_alloc_root().
The call to iget() will involve a call back into the filesystem to actually read
the inode from disk. Subsequent calls to iget() for the same inode will find the
entry in the cache avoiding the need for further filesystem access. For details on
how uxfs reads inodes see the section Reading an Inode from Disk a little later in the
chapter. The Linux kernel calls find_inode() (fs/inode.c) to scan the inode
cache for the inode. If not found, a call to get_new_inode() is made.
The call to d_alloc_root() is a wrapper to d_instantiate() that
initializes the d_sb field of the dentry structure to reference the new
super_block structure. Note that accessing any further inodes will involve
access to dentries that already exist and that have been initialized by the kernel.
At this stage, the mount is complete. The super_block structure has been
initialized, the root directory is accessible through the Linux inode cache/dcache,
and the kernel has access to the array of functions exported by the root inode
through which subsequent operations can be performed.
As another example of how to use gdb, a breakpoint can be set on the
ux_read_inode()function as follows:
(gdb) b ux_read_inode
Breakpoint 2 at 0xd0855312: file ux_inode.c, line 54.
(gdb)
Continuing.
As with the gdb example earlier, the source code can be displayed at the point
where the breakpoint is hit:
Breakpoint 2, ux_read_inode (inode=0xcd235460) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) list
49 void
50 ux_read_inode(struct inode *inode)
51 {
52 struct buffer_head *bh;
53 struct ux_inode *di;
54 unsigned long ino = inode->i_ino;
55 int block;
56
57 if (ino < UX_ROOT_INO || ino > UX_MAXFILES) {
58 printk("uxfs: Bad inode number %lu\n", ino);
and the stack backtrace is displayed to locate the flow through the kernel from
function to function. In the stack backtrace below, you can see the call from
ux_read_super() to iget() to read the root inode. Notice the inode number
(2) passed to iget().
(gdb) bt
#0 ux_read_inode (inode=0xcd235460) at ux_inode.c:54
#1 0xc015411a in get_new_inode (sb=0xcf15a400, ino=2, head=0xcfda3820,
find_actor=0, opaque=0x0) at inode.c:871
#2 0xc015439a in iget4 (sb=0xcf15a400, ino=2, find_actor=0, opaque=0x0)
at inode.c:984
#3 0xd0855bfb in iget (sb=0xcf15a400, ino=2)
at /usr/src/linux/include/linux/fs.h:1328
#4 0xd08558c3 in ux_read_super (s=0xcf15a400, data=0x0, silent=0)
at ux_inode.c:272
#5 0xc0143868 in get_sb_bdev (fs_type=0xd0856a44,
dev_name=0xccf35000 "/dev/fd0", flags=0, data=0x0) at super.c:697
#6 0xc0143d2d in do_kern_mount (type=0xccf36000 "uxfs", flags=0,
..
Finally, the inode structure passed to ux_read_inode() can be displayed.
Because the inode has not been read from disk, the in-core inode is only partially
initialized. The i_ino field is correct, but some of the other fields are invalid at
this stage.
(gdb) print *(struct inode *)0xcd235460
$2 = {i_hash = {next = 0xce2c7400, prev = 0xcfda3820}, i_list = {
next = 0xcf7aeba8, prev = 0xc0293d84}, i_dentry = {next = 0xcd235470,
prev = 0xcd235470}, i_dirty_buffers = {next = 0xcd235478,
prev = 0xcd235478}, i_dirty_data_buffers = {next = 0xcd235480,
prev = 0xcd235480}, i_ino = 2, i_count = {counter = 1}, i_dev = 512,
i_mode = 49663, i_nlink = 1, i_uid = 0, i_gid = 0,
i_rdev = 512, i_size = 0,
Because the address of the inode structure is known, it may be displayed at any
time. Simply enter gdb and run the above command once more.
Writing the Superblock to Disk
The uxfs superblock contains information about which inodes and data blocks
have been allocated along with a summary of both pieces of information. The
superblock resides in a single UX_MAXBSIZE buffer, which is held throughout the
duration of the mount. The usual method of ensuring that dirty buffers are
flushed to disk is to mark the buffer dirty as follows:
mark_buffer_dirty(bh);
However, the uxfs superblock is not released until the filesystem is unmounted.
Each time the superblock is modified, the s_dirt field of the superblock is set to
1. This informs the kernel that the filesystem should be notified on a periodic
basis by the kupdate daemon, which runs at regular intervals to flush dirty
buffers to disk. The kupdate() routine can be found in the Linux kernel source
in fs/buffer.c. To follow the flow from kupdate() through to the filesystem,
the following commands are run:
# ./mkfs /dev/fd0
# mount -t uxfs /dev/fd0 /mnt
# touch /mnt/file
Because a new file is created, a new inode is allocated that requires information in
the superblock to be updated. As part of this processing, which will be described
in more detail later in the chapter, the s_dirt field of the in-core superblock is set
to 1 to indicate that the superblock has been modified.
The ux_write_super() function (lines 1218 to 1229) is called to write the
superblock to disk. Setting a breakpoint in ux_write_super() using kdb as
follows:
Entering kdb (current=0xcbe20000, pid 1320) on processor 0 due to
Keyboard Entry
[0]kdb> bp ux_write_super
Instruction(i) BP #1 at 0xd08ab788 ([uxfs]ux_write_super)
is enabled globally adjust
and creating the new file as shown will eventually result in the breakpoint being
hit, as follows:
Entering kdb (current=0xc1464000, pid 7) on processor 0 due to Breakpoint
@ 0xd08ab788
[0]kdb> bt
EBP EIP Function(args)
0xc1465fc4 0xd08ab788 [uxfs]ux_write_super (0xcc53b400, 0xc1464000)
uxfs .text 0xd08aa060 0xd08ab788 0xd08ab7c4
0xc014b242 sync_supers+0x142 (0x0, 0xc1464000)
kernel .text 0xc0100000 0xc014b100 0xc014b2c0
0xc1465fd4 0xc0149bd6 sync_old_buffers+0x66 (0xc1464000, 0x10f00,
0xcffe5f9c, 0xc0105000)
kernel .text 0xc0100000 0xc0149b70 0xc0149cf0
0xc1465fec 0xc014a223 kupdate+0x273
kernel .text 0xc0100000 0xc0149fb0 0xc014a230
0xc01057c6 kernel_thread+0x26
kernel .text 0xc0100000 0xc01057a0 0xc01057e0
Note the call from kupdate() to sync_old_buffers(). Following through,
the kernel code shows an inline function, write_super(), which actually calls
into the filesystem as follows:
if (sb->s_root && sb->s_dirt)
        if (sb->s_op && sb->s_op->write_super)
                sb->s_op->write_super(sb);
Thus, the write_super entry of the superblock_operations vector is
called. For uxfs, the buffer holding the superblock is simply marked dirty.
Although this doesn't flush the superblock to disk immediately, it will be written
shortly afterwards as part of kupdate() processing.
The only other task to perform by ux_write_super() is to set the s_dirt
field of the in-core superblock back to 0. If left at 1, ux_write_super() would
be called every time kupdate() runs and would, for all intents and purposes,
lock up the system.
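The s_dirt handshake can be illustrated with a small user-space analogue. The structure and function names below are hypothetical stand-ins, not the kernel code; the point is that write_super is invoked only when the flag is set, and must clear it.

```c
#include <assert.h>

/* User-space analogue (not the kernel code) of the s_dirt handshake:
 * modifying the superblock sets s_dirt; the periodic sync calls
 * write_super only when s_dirt is set, and write_super clears it. */
struct sim_super {
    int s_dirt;
    int writes;                 /* counts calls to sim_write_super() */
};

static void sim_write_super(struct sim_super *sb)
{
    sb->writes++;               /* the real ux_write_super() marks the
                                   superblock buffer dirty here */
    sb->s_dirt = 0;             /* forgetting this would cause a flush on
                                   every kupdate run */
}

static void sim_kupdate(struct sim_super *sb)
{
    if (sb->s_dirt)
        sim_write_super(sb);
}
```

Running sim_kupdate() twice after one modification results in a single write, which is the behavior the text describes; leaving s_dirt set would write on every pass.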
Unmounting the Filesystem
Dirty buffers and inodes are flushed to disk separately and are therefore not
really part of unmounting the filesystem. If the filesystem is busy when an
unmount command is issued, the kernel returns EBUSY to the user without
communicating with the filesystem.
If there are no open files on the system, dirty buffers and inodes are flushed to
disk and the kernel makes a call to the put_super function exported through
the superblock_operations vector. For uxfs, this function is
ux_put_super() (lines 1176 to 1188).
The path when entering ux_put_super() is as follows:
Breakpoint 4, ux_put_super (s=0xcede4c00) at ux_inode.c:167
167 struct ux_fs *fs = (struct ux_fs *)s->s_private;
(gdb) bt
#0 ux_put_super (s=0xcede4c00) at ux_inode.c:167
#1 0xc0143b32 in kill_super (sb=0xcede4c00) at super.c:800
#2 0xc01481db in path_release (nd=0xc9da1f80)
at /usr/src/linux-2.4.18/include/linux/mount.h:50
#3 0xc0156931 in sys_umount (name=0x8053d28 "/mnt", flags=0)
at namespace.c:395
#4 0xc015694e in sys_oldumount (name=0x8053d28 "/mnt")
at namespace.c:406
#5 0xc010730b in system_call ()
There are only two tasks to be performed by ux_put_super():
Mark the buffer holding the superblock dirty and release it.
Free the structure used to hold the ux_fs structure that was allocated
during ux_read_super().
If there are any inodes or buffers used by the filesystem that have not been freed,
the kernel will free them and display a message on the console about their
existence. There are places within uxfs where this will occur. See the exercises at
the end of the chapter for further information.
Directory Lookups and Pathname Resolution
There are three main entry points into the filesystem for dealing with pathname
resolution, namely ux_readdir(), ux_lookup(), and ux_read_inode().
One interesting way to see how these three functions work together is to consider
the interactions between the kernel and the filesystem in response to the user
issuing an ls command on the root directory. When the filesystem is mounted,
the kernel already has a handle on the root directory, which exports the following
operations:
struct inode_operations ux_dir_inops = {
create: ux_create,
lookup: ux_lookup,
mkdir: ux_mkdir,
rmdir: ux_rmdir,
link: ux_link,
unlink: ux_unlink,
};
struct file_operations ux_dir_operations = {
read: generic_read_dir,
readdir: ux_readdir,
fsync: file_fsync,
};
The kernel has two calls at a directory level for name resolution. The first is to call
ux_readdir() to obtain the names of all the directory entries. After the
filesystem is mounted, the only inode in memory is the root inode so this
operation can only be invoked on the root inode. Given a filename, the
ux_lookup() function can be called to look up a name relative to a directory.
This function is expected to return the inode for the name if found.
The following two sections describe each of these operations in more detail.
Reading Directory Entries
When issuing a call to ls, the ls command needs to know about all of the entries
in the specified directory or the current working directory if ls is typed without
any arguments. This involves calling the getdents() system call. The prototype
for getdents() is as follows:
int getdents(unsigned int fd, struct dirent *dirp, unsigned int count)
The dirp pointer references an area of memory whose size is specified in count.
The kernel will try to read as many directory entries as possible. The number of
bytes read is returned from getdents(). The dirent structure is shown below:
struct dirent
{
long d_ino; /* inode number */
off_t d_off; /* offset to next dirent */
unsigned short d_reclen; /* length of this dirent */
char d_name [NAME_MAX+1]; /* file name (null-terminated) */
};
To read all directory entries, ls may need to call getdents() multiple times
depending on the size of the buffer passed in relation to the number of entries in
the directory.
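At the C library level, the same loop is normally written with readdir(3), which wraps the getdents() system call and hands back one entry per call. A minimal sketch:

```c
#include <dirent.h>
#include <stddef.h>

/* Counts the entries in a directory the way ls ultimately does:
 * repeated reads that each return the next entry.  readdir(3) is the
 * portable wrapper over the getdents() system call described above. */
static int count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *dp;
    int n = 0;

    if (dir == NULL)
        return -1;
    while ((dp = readdir(dir)) != NULL)
        n++;                    /* each dp carries d_ino and d_name */
    closedir(dir);
    return n;
}
```

Each iteration corresponds to consuming one dirent from the buffer filled by the kernel; when the buffer is exhausted, the library issues another getdents() call transparently.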
To fill in the buffer passed to the kernel, multiple calls may be made into the
filesystem through the ux_readdir() function. The definition of this function
is as follows:
int
ux_readdir(struct file *filp, void *dirent, filldir_t filldir)
Each time the function is called, the current offset within the directory is
increased. The first step taken by ux_readdir() is to map the existing offset
into a block number as follows:
pos = filp->f_pos;
blk = (pos + 1) / UX_BSIZE;
blk = uip->i_addr[blk];
On first entry pos will be 0 and therefore the block to read will be i_addr[0].
The buffer corresponding to this block is read into memory and a search is made
to locate the required filename. Each block comprises
UX_DIRS_PER_BLOCK ux_dirent structures. Assuming that the entry in the
block at the appropriate offset is valid (d_ino is not 0), the filldir() routine, a
generic kernel function used by all filesystems, is called to copy the entry to the
user's address space.
For each directory entry found, or if a null directory entry is encountered, the
offset within the directory is incremented as follows:
filp->f_pos += sizeof(struct ux_dirent);
to record where to start the next read if ux_readdir() is called again.
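The offset arithmetic can be tried in user space. The sketch assumes a 32-byte ux_dirent, giving 16 entries per 512-byte block; for such aligned offsets the plain pos / UX_BSIZE division yields the same block index as the (pos + 1) / UX_BSIZE form shown in the listing.

```c
#define UX_BSIZE           512
#define UX_DIRENT_SIZE      32   /* assumed size of struct ux_dirent */
#define UX_DIRS_PER_BLOCK  (UX_BSIZE / UX_DIRENT_SIZE)

/* Maps a byte offset within a directory (filp->f_pos) to the index of
 * the direct block (the i_addr[] slot) and the entry within that block. */
static void dir_pos_to_slot(long pos, int *blk, int *entry)
{
    *blk   = (int)(pos / UX_BSIZE);
    *entry = (int)((pos % UX_BSIZE) / UX_DIRENT_SIZE);
}
```

On first entry pos is 0, which lands on entry 0 of i_addr[0]; after one full block of entries the offset reaches 512 and the scan moves to i_addr[1].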
Filename Lookup
From a filesystem perspective, pathname resolution is a fairly straightforward
affair. All that is needed is to provide the lookup() function of the
inode_operations vector that is passed a handle for the parent directory and a
name to search for. Recall from the ux_read_super() function described in the
section Reading the Root Inode earlier in the chapter, after the superblock has been
read into memory and the Linux super_block structure has been initialized, the
root inode must be read into memory and initialized. The uxfs
ux_inode_operations vector is assigned to the i_op field of the root inode.
From there, filenames may be searched for, and once those directories are brought
into memory, a subsequent search may be made.
The ux_lookup() function in ux_dir.c (lines 838 to 860) is called passing
the parent directory inode and a partially initialized dentry for the filename to
look up. The next section gives examples showing the arguments passed.
There are two cases that must be handled by ux_lookup():
The name does not exist in the specified directory. In this case an EACCES
error is returned and the kernel marks the dentry as
negative. If another search is requested for the same name, the kernel finds
the negative entry in the dcache and will return an error to the user. This
method is also used when creating new files and directories and will be
shown later in the chapter.
The name is located in the directory. In this case the filesystem should
call iget() to allocate a new Linux inode.
The main task performed by ux_lookup() is to call ux_find_entry() as
follows:
inum = ux_find_entry(dip, (char *)dentry->d_name.name);
Note that the d_name field of the dentry has already been initialized to reference
the filename. The ux_find_entry() function in ux_inode.c (lines 1031 to
1054) loops through all of the blocks in the directory (i_addr[]) making a call to
sb_bread() to read each appropriate block into memory.
For each block, there can be UX_DIRS_PER_BLOCK ux_dirent structures. If a
directory entry is not in use, the d_ino field will be set to 0. Figure 14.5 shows the
root directory inode and how entries are laid out within the inode data blocks. For
each entry read, a check is made that the inode number (d_ino) is not zero,
indicating that the directory entry is valid. If the entry is valid, a string
comparison is made between the name requested (stored in the dentry) and the
entry in the directory (d_name). If the names match, the inode number is
returned.
If there is no match in any of the directory entries, 0 is returned. Note that inode
0 is unused so callers can detect that the entry is not valid.
Once a valid entry is found, ux_lookup() makes a call to iget() to bring the
inode into memory, which will call back into the filesystem to actually read the
inode.
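The scan performed by ux_find_entry() can be sketched in user space over an in-memory copy of a directory block. The structure below is a simplified stand-in for struct ux_dirent, and the 28-byte name limit is an assumption for illustration:

```c
#include <string.h>

#define UX_NAMELEN 28           /* assumed uxfs filename limit */

/* Hypothetical in-memory copy of one directory block's entries. */
struct sim_dirent {
    unsigned int d_ino;         /* 0 means the slot is free */
    char         d_name[UX_NAMELEN];
};

/* Mirrors the core of ux_find_entry(): skip free slots, compare names,
 * and return the inode number, or 0 (inode 0 is never allocated). */
static unsigned int find_entry(const struct sim_dirent *dirs, int ndirs,
                               const char *name)
{
    int i;

    for (i = 0; i < ndirs; i++) {
        if (dirs[i].d_ino != 0 && strcmp(dirs[i].d_name, name) == 0)
            return dirs[i].d_ino;
    }
    return 0;
}
```

Because inode 0 is unused, the single return value distinguishes "found" from "not found" without a separate status code, just as the text notes.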
Filesystem/Kernel Interactions for Listing Directories
This section shows the kernel/filesystem interactions when running ls on the
root directory. The two main entry points into the filesystem for dealing with
name resolution, which were described in the last two sections, are
ux_lookup() and ux_readdir(). To obtain further information about a
filename, ux_read_inode() must be called to bring the inode into memory.
The following example sets a breakpoint on all three functions and then an ls is
issued on a filesystem that has just been mounted. The filesystem to be mounted
has the lost+found directory (inode 3) and a copy of the passwd file (inode 4).
There are no other files.
First, the breakpoints are set in gdb as follows:
(gdb) b ux_lookup
Breakpoint 8 at 0xd0854b32: file ux_dir.c, line 367.
(gdb) b ux_readdir
Breakpoint 9 at 0xd0854350
(gdb) b ux_read_inode
Breakpoint 10 at 0xd0855312: file ux_inode.c, line 54.
The filesystem is then mounted and the first breakpoint is hit as follows:
# mount -f uxfs /dev/fd0 /mnt
Breakpoint 10, ux_read_inode (inode=0xcd235280) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) p inode->i_ino
$19 = 2
This is a request to read inode number 2 and is called as part of the
ux_read_super() operation described in the section Mounting and Unmounting
the Filesystem earlier in the chapter. The print (p) command in gdb can be used
to display information about any of the parameters passed to the function.
Just to ensure that the kernel is still in the process of mounting the filesystem, a
portion of the stack trace is displayed as follows, which shows the call to
ux_read_super():
(gdb) bt
#0 ux_read_inode (inode=0xcd235280) at ux_inode.c:54
#1 0xc015411a in get_new_inode (sb=0xcf15a400, ino=2, head=0xcfda3820,
find_actor=0, opaque=0x0) at inode.c:871
#2 0xc015439a in iget4 (sb=0xcf15a400, ino=2, find_actor=0, opaque=0x0)
at inode.c:984
#3 0xd0855bfb in iget (sb=0xcf15a400, ino=2)
at /usr/src/linux/include/linux/fs.h:1328
#4 0xd08558c3 in ux_read_super (s=0xcf15a400, data=0x0, silent=0)
at ux_inode.c:272
..
The next step is to run ls /mnt, which will result in numerous calls into the
filesystem. The first such call is:
# ls /mnt
Breakpoint 9, 0xd0854350 in ux_readdir (filp=0xcd39cc60,
dirent=0xccf0dfa0, filldir=0xc014dab0 <filldir64>)
This is a request to read directory entries from the root directory. This can be
shown by displaying the inode number of the directory on which the operation is
taking place. Note how C-like constructs can be used within gdb:
(gdb) p ((struct inode *)(filp->f_dentry->d_inode))->i_ino
$20 = 2
Here is the stack backtrace:
(gdb) bt
#0 0xd0854350 in ux_readdir (filp=0xcd39cc60, dirent=0xccf0dfa0,
filldir=0xc014dab0 <filldir64>)
#1 0xc014d64e in vfs_readdir (file=0xcd39cc60, filler=0xc014dab0,
buf=0xccf0dfa0) at readdir.c:27
#2 0xc014dc2d in sys_getdents64 (fd=3, dirent=0x8058730, count=512)
at readdir.c:311
#3 0xc010730b in system_call ()
Although ls may make repeated calls to getdents(), the kernel records the last
offset within the directory after the previous call to readdir(). This can be used
by the filesystem to know which directory entry to read next. The ux_readdir()
routine obtains this offset as follows:
pos = filp->f_pos;
It can then read the directory at that offset or advance further into the directory if
the slot at that offset is unused. Either way, when a valid entry is found, it is
copied to the user buffer and the offset is advanced to point to the next entry.
Following this call to ux_readdir(), there are two subsequent calls. Without
looking too deeply, one can assume that ls will read all directory entries first.
The next breakpoint hit is a call to ux_lookup() as follows:
Breakpoint 8, ux_lookup (dip=0xcd235280, dentry=0xcd1e9ae0) at
ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *)
The dip argument is the root directory and the dentry is a partially initialized
entry in the dcache. The name to lookup can be found within the dentry
structure as follows:
(gdb) p dentry->d_name
$23 = {name = 0xcd1e9b3c "lost+found", len = 10, hash = 4225228667}
The section Filename Lookup earlier in the chapter showed how the name can be
found in the directory and, if found, ux_lookup() will call iget() to read the
inode into memory. Thus, the next breakpoint is as follows:
Breakpoint 10, ux_read_inode (inode=0xcf7aeba0) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) p inode->i_ino
$24 = 3
The inode number being looked up is inode number 3, which is the inode
number for the lost+found directory. The stack backtrace at this point is:
(gdb) bt
#0 ux_read_inode (inode=0xcf7aeba0) at ux_inode.c:54
#1 0xc015411a in get_new_inode (sb=0xcf15a400, ino=3, head=0xcfda3828,
find_actor=0, opaque=0x0) at inode.c:871
#2 0xc015439a in iget4 (sb=0xcf15a400, ino=3, find_actor=0, opaque=0x0)
at inode.c:984
#3 0xd0854e73 in iget (sb=0xcf15a400, ino=3)
at /usr/src/linux/include/linux/fs.h:1328
#4 0xd0854b93 in ux_lookup (dip=0xcd235280, dentry=0xcd1e9ae0)
at ux_dir.c:379
#5 0xc01482c0 in real_lookup (parent=0xcd1e9160,
name=0xccf0df5c, flags=0)
at namei.c:305
#6 0xc0148ba4 in link_path_walk (name=0xcf80f00f "", nd=0xccf0df98)
at namei.c:590
#7 0xc014943a in __user_walk (name=0x0, flags=8, nd=0xccf0df98)
at namei.c:841
#8 0xc0145877 in sys_lstat64 (filename=0xbffff950 "/mnt/lost+found",
statbuf=0x805597c, flags=1108542220) at stat.c:352
#9 0xc010730b in system_call ()
Thus, the ls command has obtained the lost+found directory entry through
calling readdir() and is now invoking a stat() system call on the file. To
obtain the information to fill in the stat structure, the kernel needs to bring the
inode into memory in which to obtain the appropriate information.
There are two more calls to ux_readdir() followed by the next breakpoint:
Breakpoint 8, ux_lookup (dip=0xcd235280, dentry=0xcd1e90e0) at ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *)
(gdb) p dentry->d_name
$26 = {name = 0xcd1e913c "passwd", len = 6, hash = 3467704878}
This is also invoked in response to the stat() system call. And the final
breakpoint hit is:
Breakpoint 10, ux_read_inode (inode=0xcd0c4c00) at ux_inode.c:54
54 unsigned long ino = inode->i_ino;
(gdb) p inode->i_ino
$27 = 4
in order to read the inode and fill in the fields of the stat structure.
Although not shown here, another method to help understand the flow of
control when reading directory entries is either to modify the ls source code itself
to see the calls it is making or use the ls program (shown in Chapter 2).
Inode Manipulation
Previous sections have already highlighted some of the interactions between the
kernel, the inode cache, and the filesystem. When a lookup request is made into
the filesystem, uxfs locates the inode number and then calls iget() to read the
inode into memory. The following sections describe the inode cache/filesystem
interactions in more detail. Figure 14.6 can be consulted for a high-level view of
these interactions.
Reading an Inode from Disk
The ux_read_inode() function (lines 1061 to 1109) is called from the kernel
iget()function to read an inode into memory. This is typically called as a result
of the kernel calling ux_lookup(). A partially initialized inode structure is
passed to ux_read_inode()as follows:
void
ux_read_inode(struct inode *inode)
and the inode number of the inode can be found in inode->i_ino. The role of
ux_read_inode() is simply to read the inode into memory and copy relevant
fields of the disk-based inode into the inode structure
passed.
This is a relatively straightforward task in uxfs. The inode number must be
converted into a block number within the filesystem and then read through the
buffer cache into memory. This is achieved as follows:
block = UX_INODE_BLOCK + ino;
bh = sb_bread(inode->i_sb, block);
Recall that each uxfs inode is held in its own block on disk and inode 0 starts at
the block number defined by UX_INODE_BLOCK.
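This mapping is simple enough to check in isolation. The value of UX_INODE_BLOCK below is an assumption for illustration; the real value comes from ux_fs.h.

```c
#define UX_INODE_BLOCK 8        /* assumed start of the inode blocks */
#define UX_ROOT_INO    2

/* Each uxfs inode occupies its own 512-byte block, so the inode number
 * maps to a block number by a simple offset, as in ux_read_inode(). */
static int inode_to_block(unsigned long ino)
{
    return (int)(UX_INODE_BLOCK + ino);
}
```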
Once read into memory, a copy is made of the inode to the location within the
in-core inode defined by the i_private field. This address is at the end of the
in-core inode where the union of filesystem dependent information is stored. The
i_private field is defined in ux_fs.h as follows:
#define i_private u_generic_ip
Before freeing the buffer, the in-core inode fields are updated to reflect the on-disk
inode. Such information is used by the kernel for operations such as handling the
stat()system call.
One additional task to perform in ux_read_inode()is to initialize the i_op,
i_fop, and i_mapping fields of the inode structure with the operations
applicable to the file type. The set of operations that are applicable to a directory
are different to the set of operations that are applicable to regular files. The
initialization of both types of inodes can be found on lines 1088 to 1097 and
duplicated here:
if (di->i_mode & S_IFDIR) {
        inode->i_mode |= S_IFDIR;
        inode->i_op = &ux_dir_inops;
        inode->i_fop = &ux_dir_operations;
} else if (di->i_mode & S_IFREG) {
        inode->i_mode |= S_IFREG;
        inode->i_op = &ux_file_inops;
        inode->i_fop = &ux_file_operations;
        inode->i_mapping->a_ops = &ux_aops;
}
Operations such as reading directory entries are obviously not applicable to
regular files, while various I/O operations are not applicable to directories.
Allocating a New Inode
There is no operation exported to the kernel to allocate a new inode. However, in
response to requests to create a directory, regular file, and symbolic link, a new
inode needs to be allocated. Because uxfs does not support symbolic links, new
inodes are allocated when creating regular files or directories. In both cases, there
are several tasks to perform:
Call new_inode() to allocate a new in-core inode.
Call ux_ialloc() to allocate a new uxfs disk inode.
Initialize both the in-core and the disk inode.
Mark the superblock dirty; the free inode array and summary have been
modified.
Mark the inode dirty so that the new contents will be flushed to disk.
Information about the creation of regular files and directories is the subject of the
sections File Creation and Link Management and Creating and Removing Directories
later in the chapter. This section only describes the ux_ialloc() function that
can be found in the filesystem source code on lines 413 to 434.
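The allocation scan itself can be sketched in user space. The sketch assumes the superblock tracks inodes in a fixed-size in-use array plus a free-count summary, which is the model the text describes; the names and sizes are illustrative, not the uxfs source.

```c
#define UX_MAXFILES 32          /* assumed size of the inode table */

/* The uxfs superblock holds an array marking inodes free or allocated
 * plus a summary count.  This mirrors the scan a function like
 * ux_ialloc() performs: find a free slot, mark it in use, and update
 * the summary. */
struct sim_sb {
    int s_inode[UX_MAXFILES];   /* 1 = allocated, 0 = free */
    int s_nifree;               /* summary: number of free inodes */
};

static int sim_ialloc(struct sim_sb *sb)
{
    int i;

    if (sb->s_nifree == 0)
        return 0;               /* no inodes left; 0 is never valid */
    for (i = 0; i < UX_MAXFILES; i++) {
        if (sb->s_inode[i] == 0) {
            sb->s_inode[i] = 1;
            sb->s_nifree--;
            return i;           /* caller also marks the superblock dirty */
        }
    }
    return 0;
}
```

Because the array and summary both change, the caller must mark the superblock dirty afterwards, which is the fourth step in the list above.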
Writing an Inode to Disk
Each time an inode is modified, the inode must be written to disk before the
filesystem is unmounted. This includes allocating or removing blocks or
changing inode attributes such as timestamps.
Within uxfs itself, there are several places where the inode is modified. The
only thing that these functions need to perform is to mark the inode dirty as
follows:
mark_inode_dirty(inode);
The kernel will call the ux_write_inode() function to write the dirty inode to
disk. This function, which can be found on lines 1115 to 1141, is exported through
the superblock_operationsvector.
The following example uses kdb to set a breakpoint on ux_write_inode()
in order to see where the function is called from.
[0]kdb> bp ux_write_inode
The breakpoint can be easily hit by copying files into a uxfs filesystem. The stack
backtrace when the breakpoint is encountered is as follows:
Instruction(i) BP #0 at 0xd08cd4c8 ([uxfs]ux_write_inode)
is enabled globally adjust
Entering kdb (current=0xc1464000, pid 7) on processor 0 due to Breakpoint
@ 0xd08cd4c8
[0]kdb> bt
EBP EIP Function(args)
0xc1465fc8 0xd08cd4c8 [uxfs]ux_write_inode (0xc77f962c, 0x0, 0xcf9a8868,
0xcf9a8800, 0xc1465fd4)
uxfs .text 0xd08cc060 0xd08cd4c8 0xd08cd5c0
0xc015d738 sync_unlocked_inodes+0x1d8 (0xc1464000)
kernel .text 0xc0100000 0xc015d560
0xc015d8e0
0xc1465fd4 0xc0149bc8 sync_old_buffers+0x58 (0xc1464000, 0x10f00,
0xcffe5f9c, 0xc0105000)
kernel .text 0xc0100000 0xc0149b70
0xc0149cf0
0xc1465fec 0xc014a223 kupdate+0x273
kernel .text 0xc0100000 0xc0149fb0
0xc014a230
0xc01057c6 kernel_thread+0x26
kernel .text 0xc0100000 0xc01057a0
0xc01057e0
As with flushing the superblock when dirty, the kupdate daemon locates dirty
inodes and invokes ux_write_inode() to write them to disk.
The tasks to be performed by ux_write_inode() are fairly straightforward:
Locate the block number where the inode resides. This can be found by
adding the inode number to UX_INODE_BLOCK.
Read the inode block into memory by calling sb_bread().
Copy fields of interest from the in-core inode to the disk inode, then copy
the disk inode to the buffer.
Mark the buffer dirty and release it.
Because the buffer is marked dirty, the periodic run of kupdate will
write it to disk.
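Assuming the one-inode-per-block layout implied above, the block arithmetic in the first step can be sketched in user-space C. The value shown for UX_INODE_BLOCK is illustrative, not taken from ux_fs.h:

```c
#include <assert.h>

/* Illustrative layout constants; the real values live in ux_fs.h. */
#define UX_BSIZE        512   /* filesystem block size */
#define UX_INODE_BLOCK  8     /* first block of the inode table (assumed) */

/* With one uxfs inode per 512-byte block, locating the block that
 * holds an inode is simply a matter of adding the inode number to
 * the start of the inode table. */
static long ux_inode_block(long inum)
{
        return UX_INODE_BLOCK + inum;
}
```

The simplicity of this mapping is one reason uxfs can read and write inodes with a single sb_bread() call.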
Deleting Inodes
There are two cases where inodes need to be freed. The first case occurs when a
directory needs to be removed; this is described in the section Creating and
Removing Directories later in the chapter. The second case occurs when the inode
link count reaches zero.
Recall that a regular file is created with a link count of 1. The link count is
incremented each time a hard link is created. For example:
# touch A
# touch B
# ln A C
Files A and B are created with a link count of 1. The call to ln creates a directory
entry for file C and increments the link count of the inode to which A refers. The
following commands:
# rm A
# rm B
result in calls to the unlink() system call. Because B has a link count of 1, the
file will be removed. However, file A has a link count of 2; in this case, the link
count is decremented and the directory entry for A is removed, but the file still
remains and can be accessed through C.
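The link-count rules just described can be modeled in a few lines of user-space C; the toy_ names are illustrative and do not appear in the uxfs source:

```c
#include <assert.h>

/* Minimal model of an inode's link count, following the
 * touch/ln/rm sequence described in the text. */
struct toy_inode { int i_nlink; };

static void toy_create(struct toy_inode *ip) { ip->i_nlink = 1; }
static void toy_link(struct toy_inode *ip)   { ip->i_nlink++; }

/* Returns 1 if the last link was removed and the file should now
 * be freed (the case in which ux_delete_inode() is invoked). */
static int toy_unlink(struct toy_inode *ip)
{
        return --ip->i_nlink == 0;
}
```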
To show the simple case where a file is created and removed, a breakpoint on
ux_delete_inode() can be set in kdb as follows:
[0]kdb> bp ux_delete_inode
Instruction(i) BP #0 at 0xd08bc5c0 ([uxfs]ux_delete_inode)
is enabled globally adjust
[0]kdb> go
and the following commands are executed:
# touch /mnt/file
# rm /mnt/file
A regular file (file) is created with a link count of 1. As described in previous
chapters of the book, the rm command invokes the unlink() system call. For a
file that has a link count of 1, this will result in the file being removed as shown
below when the stack backtrace is displayed:
Entering kdb (current=0xcaae6000, pid 1398)
on processor 0 due to Breakpoint @ 0xd08bc5c0
[0]kdb> bt
EBP EIP Function(args)
0xcab81f34 0xd08bc5c0 [uxfs]ux_delete_inode (0xcaad2824, 0xcaad2824,
0xcac4d484, 0xcabc6e0c)
uxfs .text 0xd08bb060 0xd08bc5c0 0xd08bc6b4
0xc015f1f4 iput+0x114 (0xcaad2824, 0xcac4d4e0, 0xcab81f98,
0xcaad2824, 0xcac4d484)
kernel .text 0xc0100000 0xc015f0e0 0xc015f3a0
0xcab81f58 0xc015c466 d_delete+0xd6 (0xcac4d484, 0xcac4d56c, 0xcab81f98,
0x0, 0xcabc6e0c)
kernel .text 0xc0100000 0xc015c390 0xc015c590
0xcab81f80 0xc01537a8 vfs_unlink+0x1e8 (0xcabc6e0c, 0xcac4d484,
0xcac4d56c, 0xcffefcf8, 0xcea16005)
kernel .text 0xc0100000 0xc01535c0 0xc01537e0
0xcab81fbc 0xc0153878 sys_unlink+0x98 (0xbffffc50, 0x2, 0x0,
0xbffffc50, 0x0)
kernel .text 0xc0100000 0xc01537e0 0xc01538e0
0xc01077cb system_call+0x33
kernel .text 0xc0100000 0xc0107798 0xc01077d0
d_delete() is called first to update the dcache. If possible, the kernel
will attempt to make a negative dentry, which will simplify a future lookup
operation if the same name is requested. Inside iput(), if the link count of the
inode reaches zero, the kernel knows that there are no further references to the
file, so the filesystem is called to remove it.
The ux_delete_inode() function (lines 1148 to 1168) needs to perform the
following tasks:
Free any data blocks that the file references. This involves updating the
s_nbfree field and the s_block[] array of the superblock.
Free the inode by incrementing the s_nifree field and marking the slot in
the s_inode[] array of the superblock.
Mark the superblock dirty so it will be flushed to disk to reflect the
changes.
Call clear_inode() to free the in-core inode.
As with many functions that deal with inodes and data blocks in uxfs, the tasks
performed by ux_delete_inode() and others are greatly simplified because all
of the information is held in the superblock.
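As a sketch of the bookkeeping involved, the following user-space model mirrors the superblock updates described above. The array sizes and flag values are assumptions, not the real ux_fs.h definitions:

```c
#include <assert.h>

#define UX_MAXFILES    32
#define UX_MAXBLOCKS   470   /* assumed total block count */
#define UX_BLOCK_FREE  0
#define UX_BLOCK_INUSE 1
#define UX_INODE_FREE  0
#define UX_INODE_INUSE 1

/* In-memory model of the uxfs superblock summary fields that
 * ux_delete_inode() updates. */
struct toy_sb {
        int s_nifree;
        int s_inode[UX_MAXFILES];
        int s_nbfree;
        int s_block[UX_MAXBLOCKS];
};

/* Return a data block to the free pool. */
static void toy_free_block(struct toy_sb *sb, int blk)
{
        sb->s_block[blk] = UX_BLOCK_FREE;
        sb->s_nbfree++;
}

/* Return an inode to the free pool. */
static void toy_free_inode(struct toy_sb *sb, int inum)
{
        sb->s_inode[inum] = UX_INODE_FREE;
        sb->s_nifree++;
}
```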
File Creation and Link Management
Before creating a file, many UNIX utilities will invoke the stat() system call to
see if the file exists. This will involve the kernel calling the ux_lookup()
function. If the file name does not exist, the kernel will store a negative dentry in
the dcache. Thus, if there are additional calls to stat() for the same file, the
kernel can see that the file doesn't exist without an additional call to the
filesystem.
Shown below is the output from the strace command when using the cp
command to copy file to foo:
lstat64("foo", 0xbffff8a0) = -1 ENOENT (No such file or directory)
stat64("file", {st_mode=S_IFREG|0644, st_size=0, ...}) =
open("file", O_RDONLY|O_LARGEFILE) =
open("foo", O_WRONLY|O_CREAT|O_LARGEFILE, 0100644) = 4
The cp command invokes the stat() system call on both files before calling
open() to create the new file.
The following example shows the call to ux_lookup() in response to the cp
command calling the stat() system call:
Breakpoint 5, ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed3a0)
at ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *
(gdb) bt
#0 ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed3a0) at ux_dir.c:367
#1 0xc01482c0 in real_lookup (parent=0xcb5ed320, name=0xc97ebf5c,
flags=0)
at namei.c:305
#2 0xc0148ba4 in link_path_walk (name=0xcb0f700b "", nd=0xc97ebf98)
at namei.c:590
#3 0xc014943a in __user_walk
name=0xd0856920 "\220D\205,K\205AK\205< L\205"
flags=9, nd=0xc97ebf98)
at namei.c:841
#4 0xc0145807 in sys_stat64 (filename=0x8054788 "file"
statbuf=0xbffff720, flags=1108542220)
at stat.c:337
#5 0xc010730b in system_call (
The kernel allocates the dentry before calling ux_lookup(). Notice the address
of the dentry (0xcb5ed3a0) shown above.
Because the file does not exist, the cp command will then call open() to create
the file. This results in the kernel invoking the ux_create() function to create
the file as follows:
Breakpoint 6, 0xd0854494 in ux_create
(dip=0xcd73cba0, dentry=0xcb5ed3a0, mode=33188)
(gdb) bt
#0 0xd0854494 in ux_create (dip=0xcd73cba0, dentry=0xcb5ed3a0,
mode=33188)
#1 0xc014958f in vfs_create (dir=0xcd73cba0, dentry=0xcb5ed3a0,
mode=33188)
at namei.c:958
#2 0xc014973c in open_namei (pathname=0xcb0f7000 "foo"
flag=32834,
mode=33188, nd=0xc97ebf74) at namei.c:1034
#3 0xc013cd67 in filp_open (filename=0xcb0f7000 "foo"
flags=32833,
mode=33188) at open.c:644
#4 0xc013d0d0 in sys_open (filename=0x8054788 "foo"
flags=32833, mode=33188)
at open.c:788
#5 0xc010730b in system_call (
Note the address of the dentry passed to ux_create(). This is the same as the
address of the dentry passed to ux_lookup(). If the file is created successfully,
the dentry will be updated to reference the newly created inode.
The ux_create() function (lines 629 to 691) has several tasks to perform:
Call ux_find_entry() to check whether the file exists. If it does exist, an
error is returned.
Call the kernel new_inode() routine to allocate a new in-core inode.
Call ux_ialloc() to allocate a new uxfs inode. This will be described in
more detail later.
Call ux_diradd() to add the new filename to the parent directory. This is
passed to ux_create() as the first argument (dip).
Initialize the new inode and call mark_inode_dirty() for both the
new inode and the parent inode to ensure that they will be written to
disk.
The ux_ialloc() function (lines 413 to 434) is very straightforward, working on
fields of the uxfs superblock. After checking to make sure there are still inodes
available (s_nifree > 0), it walks through the s_inode[] array until it finds
a free entry. The entry is marked UX_INODE_INUSE, the s_nifree field is
decremented, and the inode number is returned.
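The scan can be sketched in user-space C as follows; the constants are assumed stand-ins for the ux_fs.h definitions:

```c
#include <assert.h>

#define UX_MAXFILES     32
#define UX_INODE_FREE   0
#define UX_INODE_INUSE  1

/* Sketch of the ux_ialloc()-style scan: find the first free slot
 * in the superblock inode array, mark it in use, decrement the
 * free count, and return the inode number. -1 models the
 * "no inodes left" error return. */
static int toy_ialloc(int s_inode[], int *s_nifree)
{
        int i;

        if (*s_nifree == 0)
                return -1;
        for (i = 0; i < UX_MAXFILES; i++) {
                if (s_inode[i] == UX_INODE_FREE) {
                        s_inode[i] = UX_INODE_INUSE;
                        (*s_nifree)--;
                        return i;
                }
        }
        return -1;
}
```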
The ux_diradd() function (lines 485 to 539) is called to add the new filename
to the parent directory. There are two cases that ux_diradd() must deal with:
There is space in one of the existing directory blocks. In this case, the name
of the new file and its inode number can be written in place. The buffer read
into memory, which will hold the new entry, must be marked dirty and
released.
There is no more space in any of the existing directory blocks. In this
case, a new block must be allocated to the directory in which to
store the name and inode number. This is achieved by calling the
ux_block_alloc() function (lines 441 to 469).
When reading through the existing set of directory entries to locate an empty slot,
each directory block must be read into memory. This involves cycling through the
data blocks in i_addr[], from 0 to i_blocks.
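The in-place case of ux_diradd() can be sketched in user-space C; the entry layout shown is illustrative and may differ from the real ux_dirent definition:

```c
#include <assert.h>
#include <string.h>

/* Illustrative directory-entry layout; d_ino == 0 marks a free
 * slot, as in the text. */
struct toy_dirent {
        unsigned int d_ino;
        char         d_name[28];
};

/* Scan the entries in one directory block for a free slot and
 * write the new name and inode number in place. Returns 0 on
 * success, -1 if the block is full (the caller would then
 * allocate a new directory block). */
static int toy_diradd(struct toy_dirent *block, int nentries,
                      unsigned int inum, const char *name)
{
        int i;

        for (i = 0; i < nentries; i++) {
                if (block[i].d_ino == 0) {
                        block[i].d_ino = inum;
                        strncpy(block[i].d_name, name,
                                sizeof(block[i].d_name) - 1);
                        return 0;
                }
        }
        return -1;
}
```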
Creating a hard link involves adding a new filename to the filesystem and
incrementing the link count of the inode to which it refers. In some respects, the
paths followed are very similar to ux_create() but without the creation of a
new uxfs inode.
The ln command will invoke the stat() system call to check whether both
filenames already exist. Because the name of the link does not exist, a negative
dentry will be created. The ln command then invokes the link() system call,
which will enter the filesystem through ux_link(). The prototype for
ux_link() is as follows and the source can be found on lines 866 to 887:
int
ux_link(struct dentry *old, struct inode *dip, struct dentry *new)
Thus when executing the following command:
$ ln filea fileb
the old dentry refers to filea while new is a negative dentry for fileb,
which will have been established on a prior call to ux_lookup().
These arguments can be analyzed by setting a breakpoint on ux_link() and
running the above ln command.
Breakpoint 11, ux_link (old=0xcf2fe740, dip=0xcf23a240, new=0xcf2fe7c0)
at ux_dir.c:395
395
(gdb) bt
#0 ux_link (old=0xcf2fe740, dip=0xcf23a240, new=0xcf2fe7c0)
at ux_dir.c:395
#1 0xc014adc4 in vfs_link (old_dentry=0xcf2fe740, dir=0xcf23a240,
new_dentry=0xcf2fe7c0) at namei.c:1613
#2 0xc014aef0 in sys_link (oldname=0xbffffc20 "filea"
newname=0xbffffc26 "fileb") at namei.c:1662
#3 0xc010730b in system_call (
The gdb command can be used to display the arguments passed to ux_link()
as follows:
(gdb) p new
$9 = (struct dentry *) 0xcf2fe7c0
(gdb) p *old
$10 = {d_count = {counter = 1}, d_flags = 0, d_inode = 0xcd138260,
  d_parent = 0xcb5ed920, d_hash = {next = 0xc2701750, prev = 0xcfde6168},
  d_lru = {next = 0xcf2fe758, prev = 0xcf2fe758}, d_child = {
    next = 0xcb5ed948, prev = 0xcf2fe7e0}, d_subdirs = {
    next = 0xcf2fe768, prev = 0xcf2fe768}, d_alias = {
    next = 0xcd138270, prev = 0xcd138270}, d_mounted = 0,
  d_name = {name = 0xcf2fe79c "filea", len = 5, hash = 291007618},
  d_time = 0, d_op = 0x0, d_sb = 0xcede4c00, d_vfs_flags = 8,
  d_fsdata = 0x0, d_iname = "filea\0g\0\0\0\0\0\0\0\0"}
(gdb) p old->d_name.name
$11 = (unsigned char *) 0xcf2fe79c "filea"
(gdb) p new->d_name.name
$12 = (unsigned char *) 0xcf2fe81c "fileb"
Thus the dentry for old is completely instantiated and references the inode for
filea. The name field of the dentry for new has been set but the dentry has
not been initialized further.
There is not a great deal of work for ux_link() to perform. In addition to
calling ux_diradd() to add the new name to the parent directory, it increments
the link count of the inode, calls d_instantiate() to map the negative
dentry to the inode, and marks it dirty.
The unlink() system call is managed by the ux_unlink() function (lines
893 to 902). All that this function needs to do is decrement the inode link count
and mark the inode dirty. If the link count reaches zero, the kernel will invoke
ux_delete_inode() to actually remove the inode from the filesystem.
Creating and Removing Directories
At this point, readers should be familiar with the mechanics of how the kernel
looks up a filename and creates a negative dentry before creating a file.
Directory creation is a little different in that the kernel performs the lookup rather
than the application calling stat() first. This is shown as follows:
Breakpoint 5, ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed420)
at ux_dir.c:367
367 struct ux_inode *uip = (struct ux_inode *
(gdb) bt
#0 ux_lookup (dip=0xcd73cba0, dentry=0xcb5ed420) at ux_dir.c:367
#1 0xc01492f2 in lookup_hash (name=0xc97ebf98, base=0xcb5ed320)
at namei.c:781
#2 0xc0149cd1 in lookup_create (nd=0xc97ebf90, is_dir=1)
at namei.c:1206
#3 0xc014a251 in sys_mkdir (pathname=0xbffffc1c "/mnt/dir", mode=511)
at namei.c:1332
#4 0xc010730b in system_call (
Because the filename won't be found (assuming it doesn't already exist), a
negative dentry is created and then passed into ux_mkdir() (lines 698 to 780) as
follows:
Breakpoint 7, 0xd08546d0 in ux_mkdir (dip=0xcd73cba0, dentry=0xcb5ed420,
mode=493)
(gdb) bt
#0 0xd08546d0 in ux_mkdir (dip=0xcd73cba0, dentry=0xcb5ed420, mode=493)
#1 0xc014a197 in vfs_mkdir (dir=0xcd73cba0, dentry=0xcb5ed420,
mode=493)
at namei.c:1307
#2 0xc014a282 in sys_mkdir (pathname=0xbffffc1c "/mnt/dir", mode=511)
at namei.c:1336
#3 0xc010730b in system_call (
Note that the dentry address is the same for both functions.
The initial steps performed by ux_mkdir() are very similar to the steps taken
by ux_create(), which was described earlier in the chapter, namely:
Call new_inode() to allocate a new in-core inode.
Call ux_ialloc() to allocate a new uxfs inode and call ux_diradd() to
add the new directory name to the parent directory.
Initialize the in-core inode and the uxfs disk inode.
One additional step that must be performed is to allocate a block to the new
directory in which to store the entries for "." and "..". The ux_block_alloc()
function is called, which returns the block number allocated. This must be stored
in i_addr[0], i_blocks must be set to 1, and the size of the inode (i_size) is
set to 512, which is the size of the data block.
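The directory-block initialization step can be sketched as follows; the entry layout is again illustrative:

```c
#include <assert.h>
#include <string.h>

/* Illustrative directory-entry layout, as before. */
struct toy_dirent {
        unsigned int d_ino;
        char         d_name[28];
};

/* Sketch of the extra ux_mkdir() step: the single block allocated
 * to a new directory starts with entries for "." and "..", after
 * which the remaining slots are free (d_ino == 0). */
static void toy_init_dirblock(struct toy_dirent *block, int nentries,
                              unsigned int self, unsigned int parent)
{
        memset(block, 0, nentries * sizeof(*block));
        block[0].d_ino = self;
        strcpy(block[0].d_name, ".");
        block[1].d_ino = parent;
        strcpy(block[1].d_name, "..");
}
```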
To remove a directory, the ux_rmdir() function (lines 786 to 831) is
called. The first step performed by ux_rmdir() is to check the link count of the
directory inode. If it is greater than 2, the directory is not empty and an error is
returned. Recall that a newly created directory has a link count of 2
(for both "." and "..").
The stack backtrace when entering ux_rmdir() is shown below:
Breakpoint 8, 0xd0854a0c in ux_rmdir (dip=0xcd73cba0, dentry=0xcb5ed420)
(gdb) bt
#0 0xd0854a0c in ux_rmdir (dip=0xcd73cba0, dentry=0xcb5ed420)
#1 0xc014a551 in vfs_rmdir (dir=0xcd73cba0, dentry=0xcb5ed420)
at namei.c:1397
#2 0xc014a696 in sys_rmdir (pathname=0xbffffc1c "/mnt/dir"
at namei.c:1443
#3 0xc010730b in system_call (
The dip argument is for the parent directory and the dentry argument is for the
directory to be removed.
The tasks to be performed by ux_rmdir() are as follows:
Call ux_dirdel() to remove the directory name from the parent
directory. This is described in more detail later.
Free all of the directory blocks.
Free the inode by incrementing the s_nifree field of the superblock
and marking the slot in s_inode[] to indicate that the inode is free.
The ux_dirdel() function (lines 545 to 576) walks through each of the directory
blocks comparing the d_name field of each ux_dirent structure found with the
name passed. If a match is found, the d_ino field is set to 0 to indicate that the
slot is free. This is not an ideal solution because if many files are created and
removed in the same directory, there will be a fair amount of unused space.
However, for the purpose of demonstrating a simple filesystem, it is the easiest
solution to implement.
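A user-space sketch of the deletion scan, using the same illustrative entry layout as before:

```c
#include <assert.h>
#include <string.h>

/* Illustrative directory-entry layout. */
struct toy_dirent {
        unsigned int d_ino;
        char         d_name[28];
};

/* Sketch of the ux_dirdel()-style scan over one directory block:
 * find the named entry and mark its slot free by zeroing d_ino.
 * Returns 0 on success, -1 if the name was not found. */
static int toy_dirdel(struct toy_dirent *block, int nentries,
                      const char *name)
{
        int i;

        for (i = 0; i < nentries; i++) {
                if (block[i].d_ino != 0 &&
                    strcmp(block[i].d_name, name) == 0) {
                        block[i].d_ino = 0;   /* slot is now free */
                        return 0;
                }
        }
        return -1;
}
```

Note that the name bytes are left in place; only the zeroed d_ino marks the slot as reusable, which is why deleted names can linger on disk.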
File I/O in uxfs
File I/O is typically one of the most difficult areas of a filesystem to implement.
To increase filesystem performance, this is one area where a considerable amount
of time is spent. In Linux, it is very easy to provide a fully working filesystem
while spending a minimal amount of time on the I/O paths. There are many
generic functions in Linux that the filesystem can call to handle all the
interactions with the page cache and buffer cache.
The section File I/O in the 2.4 Linux Kernel in Chapter 8 describes some of the
interactions with the page cache. Because this chapter presents a simplified view
of filesystem activity, the page cache internals won't be described. Instead, the
following sections show how the kernel interacts with the ux_get_block()
function exported by uxfs. This function can be used to read data from a file or
allocate new data blocks and write data.
First of all, consider the main entry points into the filesystem for file I/O.
These are exported through the file_operations structure as follows:
struct file_operations ux_file_operations = {
llseek: generic_file_llseek,
read: generic_file_read,
write: generic_file_write,
mmap: generic_file_mmap,
};
So for all of the main file I/O related operations, the filesystem defers to the
Linux generic file I/O routines. The same is true for operations on any of the
mapped file interactions, whether for user-level mappings or for handling
operations within the page cache. The address space related operations are:
struct address_space_operations ux_aops = {
readpage: ux_readpage,
writepage: ux_writepage,
sync_page: block_sync_page,
prepare_write: ux_prepare_write,
commit_write: generic_commit_write,
bmap: ux_bmap,
};
For all of the functions defined in this vector, uxfs also makes calls to generic
kernel routines. For example, consider the ux_readpage() function (lines 976 to
980), which is also shown here:
int
ux_readpage(struct file *file, struct page *page)
{
return block_read_full_page(page, ux_get_block);
}
For each of the uxfs routines exported, uxfs makes a call to a generic kernel
function and passes the ux_get_block() routine. Before showing the flow into
the filesystem for file I/O, the subject of the next three sections, it is first helpful to
show how ux_get_block() (lines 929 to 968) works:
int
ux_get_block(struct inode *inode, long block,
struct buffer_head *bh_result, int create)
The ux_get_block() function is called whenever the kernel needs to access part
of a file that is not already cached. The block argument is the logical block within
the file such that block 0 maps to file offset 0, block 1 maps to file offset 512, and
so on. The create argument indicates whether the kernel wants to read from or
write to the file. If create is 0, the kernel is reading from the file. If create is 1,
the filesystem will need to allocate storage at the offset referenced by block.
Taking the case where block is 0, the filesystem must fill in the appropriate
fields of the buffer_head as follows:
bh_result->b_dev = inode->i_dev;
bh_result->b_blocknr = uip->i_addr[block];
The kernel will then perform the actual read of the data. In the case where
create is 1, the filesystem must allocate a new data block by calling
ux_block_alloc() and set the appropriate i_addr[] slot to reference the new
block. Once allocated, the buffer_head structure must be initialized prior to the
kernel performing the I/O operation.
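The mapping logic can be sketched in user-space C; the toy_ functions and the i_addr[] size are assumptions standing in for the uxfs definitions:

```c
#include <assert.h>

#define UX_DIRECT_BLOCKS 9   /* assumed size of i_addr[] */

/* Model of the per-file block map kept in the uxfs disk inode. */
struct toy_inode {
        int i_blocks;
        int i_addr[UX_DIRECT_BLOCKS];
};

/* Toy allocator standing in for ux_block_alloc(): hands out
 * sequential disk block numbers starting at 100. */
static int next_free = 100;
static int toy_block_alloc(void) { return next_free++; }

/* Sketch of ux_get_block(): map a logical file block to a disk
 * block, allocating one when create != 0 and none is mapped yet.
 * Returns the disk block number, or -1 on a read past the end. */
static int toy_get_block(struct toy_inode *ip, int block, int create)
{
        if (block >= UX_DIRECT_BLOCKS)
                return -1;                    /* file too large */
        if (block < ip->i_blocks)
                return ip->i_addr[block];     /* already mapped */
        if (!create)
                return -1;                    /* read past EOF */
        ip->i_addr[block] = toy_block_alloc();
        ip->i_blocks = block + 1;
        return ip->i_addr[block];
}
```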
Reading from a Regular File
The filesystem does not do anything specific for reading from regular files. In
place of the read operation (file_operations vector), the filesystem specifies
the generic_file_read() function.
To show how the filesystem is entered, a breakpoint is set on
ux_get_block() and the passwd file is read from a uxfs filesystem by running
the cat program. Looking at the size of passwd:
# ls -l /mnt/passwd
-rw-r--r--    1 root     root         1203 Jul 24 07:51 /mnt/passwd
there will be three data blocks to access. When the first breakpoint is hit:
Breakpoint 1, ux_get_block (inode=0xcf23a420,
block=0, bh_result=0xc94f4740, create=0)
at ux_file.c:21
21 struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=0, bh_result=0xc94f4740,
create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1250fc0,
get_block=0xd0855094 <ux_get_block>) at buffer.c:1781
#2 0xd08551ba in ux_readpage (file=0xcd1c9360, page=0xc1250fc0)
at ux_file.c:67
#3 0xc012e773 in do_generic_file_read (filp=0xcd1c9360,
ppos=0xcd1c9380,
desc=0xc96d1f5c, actor=0xc012eaf0 <file_read_actor>)
at filemap.c:1401
#4 0xc012ec72 in generic_file_read (filp=0xcd1c9360, buf=0x804eb28 ""
count=4096, ppos=0xcd1c9380) at filemap.c:1594
#5 0xc013d7c8 in sys_read (fd=3, buf=0x804eb28 "", count=4096)
at read_write.c:162
#6 0xc010730b in system_call (
there are two uxfs entry points shown. The first is a call to ux_readpage(). This
is invoked to read a full page of data into the page cache. The routines for
manipulating the page cache can be found in mm/filemap.c. The second is the
call to ux_get_block(). Because file I/O is performed in multiples of the system page
size, the block_read_full_page() function is called to fill a page. In the case
of the file being read, there are only three blocks of 512 bytes, thus not enough to
fill a whole page (4KB). The kernel must therefore read in as much data as
possible, and then zero-fill the rest of the page.
The block argument passed to ux_get_block() is 0 so the filesystem will
initialize the buffer_headso that the first 512 bytes are read from the file.
The next time that the breakpoint is hit:
Breakpoint 1, ux_get_block (inode=0xcf23a420,
block=1, bh_result=0xc94f46e0, create=0)
at ux_file.c:21
21
struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=1,
bh_result=0xc94f46e0, create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1250fc0,
..
the kernel passes block 1 so the next 512 bytes will be read from the file. The final
call to ux_get_block() is shown below:
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=2,
bh_result=0xc94f4680, create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1250fc0,
get_block=0xd0855094 ) at buffer.c:1781
#2 0xd08551ba in ux_readpage (file=0xcd1c9360, page=0xc1250fc0)
at ux_file.c:67
The kernel passes block 2 so the final 512 bytes will be read from the file.
For uxfs, reading from files is extremely simple. Once the get_block()
function has been written, there is very little other work for the filesystem to do.
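The arithmetic behind the three-block read, and the zero-fill step performed by block_read_full_page(), can be sketched in user-space C:

```c
#include <assert.h>
#include <string.h>

#define UX_BSIZE   512
#define PAGE_SIZE_ 4096   /* trailing underscore avoids a clash
                           * with any system PAGE_SIZE macro */

/* Number of 512-byte blocks needed to hold a file of the given
 * size: a 1203-byte file occupies three blocks, as in the text. */
static long blocks_for(long size)
{
        return (size + UX_BSIZE - 1) / UX_BSIZE;
}

/* Sketch of the zero-fill step: after copying the valid bytes of
 * a short file, the rest of the page must read as zeroes. */
static void fill_page(char *page, const char *data, long size)
{
        memcpy(page, data, size);
        memset(page + size, 0, PAGE_SIZE_ - size);
}
```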
Writing to a Regular File
The mechanisms for writing to files are very similar to those used when reading
regular files. Consider the following commands, this time to copy the passwd file
to a uxfs filesystem:
# ls -l /etc/passwd
-rw-r--r--1 root root 1336 Jul 24 14:28 /etc/passwd
# cp /etc/passwd /mnt
Setting a breakpoint on ux_get_block() once more and running the above cp
command, the first breakpoint is hit as follows:
Breakpoint 1, ux_get_block (inode=0xcd710440,
block=0, bh_result=0xc96b72a0, create=1)
at ux_file.c:21
struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcd710440, block=0,
bh_result=0xc96b72a0, create=1)
at ux_file.c:21
#1 0xc014074b in __block_prepare_write (inode=0xcd710440,
page=0xc125e640, from=0, to=1024,
get_block=0xd0855094 <ux_get_block>)
at buffer.c:1641
#2 0xc0141071 in block_prepare_write (page=0xc125e640, from=0, to=1024,
get_block=0xd0855094 <ux_get_block>) at buffer.c:1960
#3 0xd08551dd in ux_prepare_write (file=0xcd1c9160, page=0xc125e640,
from=0, to=1024)
at ux_file.c:74
#4 0xc013085f in generic_file_write (file=0xcd1c9160,
buf=0xbffff160
"root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\ndaem
on:x:2:2:daemon:/sbin:/sbin/nologin\nadm:x:3:4:adm:/var/adm:/sbin/nologi
n\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\nsync:x:5:0:sync:/"...
count=1024, ppos=0xcd1c9180) at filemap.c:3001
#5 0xc013d8e8 in sys_write (fd=4,
buf=0xbffff160
"root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\ndaem
on:x:2:2:daemon:/sbin:/sbin/nologin\nadm:x:3:4:adm:/var/adm:/sbin/nologi
n\nlp:x:4:7:lp:/var/spool/lpd:/sbin/nologin\nsync:x:5:0:sync:/"...
count=1024) at read_write.c:188
#6 0xc010730b in system_call (
This time the create flag is set to 1, indicating that a block must be allocated to
the file. Once the block has been allocated, the buffer_head can be initialized
and the first 512 bytes of passwd can be copied to the buffer. Once the buffer and
inode are marked dirty, both will be flushed to disk.
The next breakpoint is hit, and this time the block argument is set to 1, which
will result in another block being allocated to cover the file range 512 to 1023.
Breakpoint 1, ux_get_block (inode=0xcd710440,
block=1, bh_result=0xc96b7240, create=1)
at ux_file.c:21
21 struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcd710440, block=1,
bh_result=0xc96b7240, create=1)
at ux_file.c:21
The final breakpoint is hit as follows:
Breakpoint 1, ux_get_block (inode=0xcd710440, block=2,
bh_result=0xc9665900, create=1)
at ux_file.c:21
21 struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcd710440, block=2,
bh_result=0xc9665900, create=1)
at ux_file.c:21
and this time the block argument is set to 2, indicating that the final block
needed should be allocated. As with reading from regular files, writing to
regular files is also easy for the filesystem to implement.
Memory-Mapped Files
Although this section won't describe the mechanics of how memory-mapped
files work in the Linux kernel, it is easy to show how the filesystem can support
mapped files through the same mechanisms used for reading from and writing to
regular files.
In place of the mmap function, exported through the file_operations
vector, uxfs requests that generic_file_mmap() be called. All that the
filesystem needs to provide is the get_block() interface.
To demonstrate how the filesystem is involved, a breakpoint is set in
ux_get_block() and a file is mapped for read-only access. The first address of
the mapping is then touched, which will create a page fault. The stack trace when
ux_get_block() is entered is as follows:
Breakpoint 1, ux_get_block (inode=0xcf23a420,
block=0, bh_result=0xc94bbba0, create=0)
at ux_file.c:21
struct super_block *sb = inode->i_sb;
(gdb) bt
#0 ux_get_block (inode=0xcf23a420, block=0,
bh_result=0xc94bbba0, create=0)
at ux_file.c:21
#1 0xc0140b1f in block_read_full_page (page=0xc1238340,
get_block=0xd0855094
at buffer.c:1781
#2 0xd08551ba in ux_readpage (file=0xcd1c97e0, page=0xc1238340)
at ux_file.c:67
#3 0xc012dd92 in page_cache_read (file=0xcd1c97e0, offset=3441203168)
at filemap.c:714
#4 0xc012ddef in read_cluster_nonblocking (file=0xcd1c97e0,
offset=3475219664, filesize=1)
at filemap.c:739
#5 0xc012f389 in filemap_nopage (area=0xc972a300, address=1073823744,
unused=0)
at filemap.c:1911
#6 0xc012b512 in do_no_page (mm=0xcf996d00, vma=0xc972a300,
address=1073823744, write_access=0, page_table=0xc91e60a0)
at memory.c:1249
#7 0xc012b76c in handle_mm_fault (mm=0xcf996d00, vma=0xc972a300,
address=1073823744, write_access=0)
at memory.c:1339
#8 0xc011754a in do_page_fault (regs=0xc952dfc4, error_code=4)
at fault.c:263
#9 0xc01073fc in error_code (
The kernel is entered, not through a system call, but in response to a fault. Because
there are no pages backing the mapped file in the user address space, when the
process attempts to access the file, a page fault occurs. The kernel establishes
where the page of memory is mapped to and must then fill in the page from the
appropriate file.
The ux_readpage() function is entered, which calls back into the memory
manager. To fill in the page of data, the kernel will make repeated calls into
ux_get_block() until either a page of data has been read or the end of the file
has been reached. If the latter occurs, the kernel must zero-fill the page so that, if
the process accesses within the same page but beyond the end of the file, it will
read zeroes.
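This zero-fill behavior is visible from user space on any filesystem. The following sketch maps a short file and checks that the bytes beyond end-of-file, but within the page, read as zeroes:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the page containing a 5-byte file and verify that accesses
 * beyond end-of-file within that page read as zeroes.
 * Returns 1 when the zero-fill behavior holds. */
static int check_zero_fill(void)
{
        char tmpl[] = "/tmp/uxfs_demo_XXXXXX";
        int fd = mkstemp(tmpl);
        char *p;
        int ok;

        if (fd < 0)
                return 0;
        (void)write(fd, "hello", 5);   /* 5-byte file */
        p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                close(fd);
                unlink(tmpl);
                return 0;
        }
        ok = memcmp(p, "hello", 5) == 0 && p[5] == 0 && p[4095] == 0;
        munmap(p, 4096);
        close(fd);
        unlink(tmpl);
        return ok;
}
```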
The Filesystem Stat Interface
The df command displays information about the filesystem usage such as the
number of free and used blocks. Through the super_block operations vector,
uxfs exports the ux_statfs() function, which is called in response to df
invoking the statfs system call (once for each filesystem). The ux_statfs()
function can be found on lines 1194 to 1210. The function prototype is shown
below:
int
ux_statfs(struct super_block *sb, struct statfs *buf)
The df command will make a call to the statfs() system call for each mounted
filesystem. Here is the prototype for statfs():
int statfs(const char *path, struct statfs *buf)
Note that it also uses the statfs structure which is defined below:
struct statfs
{
long f_type; /* type of filesystem (see below) */
long f_bsize; /* optimal transfer block size */
long f_blocks; /* total data blocks in file system */
long f_bfree; /* free blocks in fs */
long f_bavail; /* free blocks avail to non-superuser */
long f_files; /* total file nodes in file system */
long f_ffree; /* free file nodes in fs */
fsid_t f_fsid; /* file system id */
long f_namelen; /* maximum length of filenames */
};
As mentioned earlier in the book, understanding the requirements of user level
programs is essential to understanding some of the features that must be
provided by filesystems. The information passed through the statfs structure
corresponds to filesystem limits, such as the total number of files and blocks in
the filesystem, and existing free resources, such as the number of available files
and data blocks.
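The following user-space sketch shows how ux_statfs() might fill in these fields from superblock counts; the constants, magic value, and name-length figure are assumptions, not the real uxfs limits:

```c
#include <assert.h>

#define UX_MAXFILES   32
#define UX_MAXBLOCKS  470         /* assumed total data blocks */
#define UX_MAGIC      0x58494e55  /* illustrative f_type value */

/* Mirror of the statfs fields the filesystem fills in; the real
 * struct statfs is defined by the kernel headers. */
struct toy_statfs {
        long f_type, f_bsize, f_blocks, f_bfree, f_bavail;
        long f_files, f_ffree, f_namelen;
};

/* The fields come either from compile-time limits or from the
 * free-resource counts kept in the uxfs superblock. */
static void toy_statfs(int s_nbfree, int s_nifree,
                       struct toy_statfs *buf)
{
        buf->f_type    = UX_MAGIC;
        buf->f_bsize   = 512;
        buf->f_blocks  = UX_MAXBLOCKS;
        buf->f_bfree   = s_nbfree;
        buf->f_bavail  = s_nbfree;   /* no reserved blocks */
        buf->f_files   = UX_MAXFILES;
        buf->f_ffree   = s_nifree;
        buf->f_namelen = 28;         /* assumed maximum name length */
}
```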
The following example shows a breakpoint being set within kdb to stop when
the kernel enters ux_statfs(). The debugger is entered by hitting the Break
key as indicated by kdb when it is entered:
Entering kdb (current=0xc03b0000, pid 0) on processor 0 due to Keyboard Entry
[0]kdb> bp ux_statfs
Instruction(i) BP #0 at 0xd08bb400 ([uxfs]ux_statfs)
is enabled globally adjust
[0]kdb> bl
Instruction(i) BP #0 at 0xd08bb400 ([uxfs]ux_statfs)
is enabled globally adjust
[0]kdb> go
The bl command displays the existing breakpoints. This is breakpoint number 0
as indicated by "BP #0 ". Thus, to clear the breakpoint, the bc command can be
invoked passing 0 as an argument.
# df -k /mnt
Filesystem 1k-blocks Used Available Use% Mounted on
Instruction(i) breakpoint #0 at 0xd08bb400 (adjusted)
0xd08bb400 ux_statfs
Entering kdb (current=0xcd31c000, pid 1509) on processor 0 due to
Breakpoint @ 0xd08bb400
[0]kdb> bt
EBP EIP Function(args)
0xcd31df38 0xd08bb400 [uxfs]ux_statfs (0xcc2be400, 0xcd31df50,0xffffffda,
uxfs .text 0xd08bb060 0xd08bb400 0xd08bb460
0xc0141ea2 vfs_statfs+0xa2 (0xcc2be400, 0xcd31df50, 0x43, ..
kernel .text 0xc0100000 0xc0141e00 0xc0141f20
0xcd31dfbc 0xc0141f58 sys_statfs+0x38 (0x8052bb8, 0xbffff760, ..
kernel .text 0xc0100000 0xc0141f20 0xc0141fb0
0xc01077cb system_call+0x33
kernel .text 0xc0100000 0xc0107798 0xc01077d0
[0]kdb> go
When the df command is run and ux_statfs() is reached, the breakpoint is hit
and the kernel enters kdb. The bt command can then display the stack backtrace
showing that the kernel was entered by a system call that then called through
sys_statfs() and vfs_statfs() before entering ux_statfs().
The fields of the statfs structure can be obtained from either predefined
defaults in ux_fs.h or from summary information stored in the superblock.
Shown below is the result of a call to df following creation and mounting of a new filesystem:
# ./mkfs /dev/fd0
# insmod ./uxfs
# mount -t uxfs /dev/fd0 /mnt
# df -k
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda2 15120648 2524836 11827716 18% /
/dev/hda1 102454 11147 86017 12% /boot
/dev/hda5 497829 8240 463887 2% /home
none 127076 0 127076 0% /dev/shm
/dev/fd0 1000 1 999 1% /mnt
In the example that follows, a directory is created. A uxfs directory involves
allocating an inode and one data block to hold the "." and ".." entries plus any
subsequent entries added to the directory. Note that the single block allocated for
the directory is reflected in the information displayed.
# mkdir /mnt/dir
# df -k
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/hda2 15120648 2524836 11827716 18% /
/dev/hda1 102454 11147 86017 12% /boot
/dev/hda5 497829 8240 463887 2% /home
none 127076 0 127076 0% /dev/shm
/dev/fd0 1000 2 998 1% /mnt
Similarly, df can also display inode allocation information based on the
f_files and f_ffree fields of the statfs structure as displayed below:
# df -i /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/fd0 32 4 28 13% /mnt
# mkdir /mnt/mydir
# df -i /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/fd0 32 5 27 16% /mnt
When first run on an empty filesystem, there are 4 inodes used out of the 32
available (UX_MAXFILES) inodes. Creating a directory uses an additional inode,
which is reflected in the f_ffree field of the statfs structure and displayed
by df above.