Steve Pate: UNIX Filesystems: Evolution, Design, and Implementation
This book describes the filesystem that is common to all versions of UNIX and Linux.
The author covers the programming aspects of file I/O,
describes the internals of various versions of UNIX,
and examines popular filesystems such as UFS, ext2, and the VERITAS filesystem, VxFS.
The book includes examples that you can experiment with.
File Concepts
To get a complete picture of the filesystem, the main concepts must be understood first.
This chapter explains those basic concepts.
Programmers who are new to UNIX will find much of it useful.
The implementation of the well-known ls utility is examined in detail,
together with the libraries and system calls it relies on.
In UNIX, almost everything is represented as a file,
and most operations reduce to file operations.
A directory can be opened and read in much the same way
as a regular file can be opened and read.
UNIX File Types
There are two main file types: regular files and directories.
Regular files include text files, documents, executable files, and so on.
Directories organize the filesystem into a hierarchical structure.
The full set of file types is as follows:
Regular files. These files hold data of any kind,
and the filesystem does not interpret their contents in any particular way.
Directories. They give the filesystem its structure.
Directories can index the files they contain in an arbitrary order.
Symbolic links.
A symbolic link, also called a symlink,
is a file that refers to another file by name.
Removing a symlink has no effect on the file it refers to.
Hard links.
A hard link differs from a symlink in that it is simply another name for the same file. The file keeps a link count,
which is incremented by one each time a hard link to it is created.
Each time such a link is removed, the count is decremented by one.
When the count reaches zero, the file itself is removed.
Named pipes. A named pipe is a bi-directional IPC (Inter Process
Communication) mechanism that connects two processes.
It differs from a traditional UNIX pipe, which can be used only by related processes.
Special files.
A special file refers to a device such as a disk.
To access the device, the special file is opened.
Xenix special file.
Semaphores and shared memory segments in the Xenix operating system
can be managed from UNIX.
A special file of zero length can represent either a semaphore or a shared memory segment.
The properties of a file of any type can be obtained with the stat() system call.
The ls utility calls it implicitly.
File Descriptors
Let us look at a few examples in C.
Example:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd;

    fd = open("/etc/passwd", O_RDONLY);   /* returns -1 on failure */
    printf("fd = %d\n", fd);
    close(fd);
    return 0;
}
$ make open
cc open.c -o open
$ ./open
fd = 3
A file must first be opened with open().
The call needs the three header files included above; its prototype and description from the manual page are:
int open(const char *path, int oflag, ...);
DESCRIPTION
The open() function establishes the connection between
a file and a file descriptor. It creates an ..
If the call succeeds, it returns a file descriptor, which is then passed to other system calls
such as read(), write(), and lseek().
In the example the descriptor is stored in the variable fd.
Basic File Properties
Running ls -l shows the following properties for each file:
File type and access permissions
Number of links to the file
Owner and group
File size
Date of last modification
File name
The ls command does the following:
1. Reads the entries of the current directory
2. Displays the properties of each file
To do this, ls calls getdents() to read the directory entries and stat() to obtain the properties of each file:
#include <sys/types.h>
#include <sys/stat.h>
int stat(const char *path, struct stat *buf);
The first character of each line printed by ls -l identifies the file type:
-  regular file
d  directory
l  symbolic link
p  named pipe
c  character special file
b  block special file
The stat() system call fills in the following structure:
struct stat {
dev_t st_dev; /* ID of device containing file */
ino_t st_ino; /* Inode number / file serial number */
mode_t st_mode; /* File mode */
nlink_t st_nlink; /* Number of links to file */
uid_t st_uid; /* User ID of file */
gid_t st_gid; /* Group ID of file */
dev_t st_rdev; /* Device ID for char/blk special file */
off_t st_size; /* File size in bytes (regular file) */
time_t st_atime; /* Time of last access */
time_t st_mtime; /* Time of last data modification */
time_t st_ctime; /* Time of last status change */
long st_blksize; /* Preferred I/O block size */
blkcnt_t st_blocks; /* Number of 512 byte blocks allocated */
};
A simple implementation of the ls command is shown below:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/dirent.h>
#include <sys/unistd.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <strings.h>
#include <time.h>
#include <errno.h>
#include <pwd.h>
#include <grp.h>

#define BUFSZ 1024

void perms(mode_t mode);   /* prints the type and permission string; not shown */

int main(void)
{
    struct dirent *dir;
    struct stat   st;
    struct passwd *pw;
    struct group  *grp;
    char          buf[BUFSZ], *bp, *ftime;
    int           dfd, nread;

    dfd = open(".", O_RDONLY);
    bzero(buf, BUFSZ);
    /* getdents() is used as declared on the author's system; many other
       systems expect programs to use opendir()/readdir() instead */
    while ((nread = getdents(dfd, (struct dirent *)buf, BUFSZ)) > 0) {
        bp = buf;
        while (bp < buf + nread) {          /* walk the entries in this buffer */
            dir = (struct dirent *)bp;
            if (dir->d_reclen == 0)
                break;
            if (stat(dir->d_name, &st) == 0) {
                ftime = ctime(&st.st_mtime);
                ftime[16] = '\0'; ftime += 4;   /* keep "Mmm dd hh:mm" */
                pw = getpwuid(st.st_uid);
                grp = getgrgid(st.st_gid);
                perms(st.st_mode);
                printf("%3d %-8s %-7s %9ld %s %s\n",
                       (int)st.st_nlink, pw->pw_name, grp->gr_name,
                       (long)st.st_size, ftime, dir->d_name);
            }
            bp = bp + dir->d_reclen;        /* step to the next entry */
        }
        bzero(buf, BUFSZ);
    }
    close(dfd);
    return 0;
}
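The getdents() system call is invoked repeatedly until the whole directory has been read; each call returns as many entries as fit in the buffer.
The program should be tested on a directory that holds a large number of files, so that more than one call is needed.
In addition to stat() there are two further system calls that return the same information:
#include <sys/types.h>
#include <sys/stat.h>
int lstat(const char *path, struct stat *buf);
int fstat(int fildes, struct stat *buf);
stat() and lstat() differ in how they treat symbolic links: stat() follows the link and reports on the file it points to, whereas lstat() reports on the link itself. fstat() takes an open file descriptor instead of a pathname.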
The File Mode Creation Mask
Consider creating a file with the touch command:
$ touch myfile
$ ls -l myfile
-rw-r--r-- 1 spate fcf 0 Feb 16 11:14 myfile
A file of zero length is created.
The file is tagged with the user ID and group ID of the creating user.
It is readable and writable (rw-) by the owner and readable (r--) by members of the group fcf and by everyone else.
The permissions that a newly created file receives are controlled by the file mode creation mask, which is managed with the umask command.
The mask can be displayed in numeric or symbolic form:
$ umask
022
$ umask -S
u=rwx,g=rx,o=rx
To change the mask, umask is called with a three-digit octal argument
whose digits apply to the owner, the group, and others.
Each digit is a sum of read (r=4), write (w=2), and execute (x=1); a bit that is set in the mask is removed from the permissions of new files.
The default mask is typically 022, as can be seen when a file is created with
touch:
$ umask
022
$ strace touch myfile 2>&1 | grep open | grep myfile
open("myfile",
O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY|O_LARGEFILE, 0666) =
$ ls -l myfile
-rw-r--r-- 1 spate fcf 0 Apr 4 09:45 myfile
A mask of 022 removes write permission for the group and for others.
touch asks for the file to be created with mode 0666.
The result is 0666 & ~022 = 0644, which gives the permissions -rw-r--r--.
Changing File Permissions
Several commands can change the attributes of a file.
The best known is chmod:
chmod [ -fR ] absolute-mode file ...
chmod [ -fR ] symbolic-mode-list file ...
The permissions rwxr--r-- are equivalent to the octal mode 744. To set them, chmod is called as follows:
$ ls -l myfile
-rw------- 1 spate fcf 0 Mar 6 10:09 myfile
$ chmod 744 myfile
$ ls -l myfile
-rwxr--r-- 1 spate fcf 0 Mar 6 10:09 myfile*
Or, in symbolic form:
$ ls -l myfile
-rw------- 1 spate fcf 0 Mar 6 10:09 myfile
$ chmod u+x,a+r myfile
$ ls -l myfile
-rwxr--r-- 1 spate fcf 0 Mar 6 10:09 myfile*
In symbolic form the permissions are changed for the user (u), group (g), others (o), or all (a).
A permission can be added (+), removed (-), or set exactly (=):
$ ls -l myfile
-rw------- 1 spate fcf 0 Mar 6 10:09 myfile
$ chmod u=rwx,g=r,o=r myfile
$ ls -l myfile
-rwxr--r-- 1 spate fcf 0 Mar 6 10:09 myfile*
The -R option applies the change recursively to a directory and its contents:
$ ls -ld mydir
drwxr-xr-x 2 spate fcf 4096 Mar 30 11:06 mydir/
$ ls -l mydir
total
-rw-r--r-- 1 spate fcf 0 Mar 30 11:06 fileA
-rw-r--r-- 1 spate fcf 0 Mar 30 11:06 fileB
$ chmod -R a+w mydir
$ ls -ld mydir
drwxrwxrwx 2 spate fcf 4096 Mar 30 11:06 mydir/
$ ls -l mydir
total
-rw-rw-rw- 1 spate fcf 0 Mar 30 11:06 fileA
-rw-rw-rw- 1 spate fcf 0 Mar 30 11:06 fileB
Example:
$ find mydir -print | xargs chmod a+
The chmod command is built on the following system calls:
#include <sys/types.h>
#include <sys/stat.h>
int chmod(const char *path, mode_t mode);
int fchmod(int fildes, mode_t mode);
The mode argument is a bitwise OR of the fields shown in Table 2.1. Some of the
flags can be combined as shown below:
S_IRWXU. This is the bitwise OR of S_IRUSR, S_IWUSR and S_IXUSR
S_IRWXG. This is the bitwise OR of S_IRGRP, S_IWGRP and S_IXGRP
S_IRWXO. This is the bitwise OR of S_IROTH, S_IWOTH and S_IXOTH
Changing File Ownership
When a file is created, the user and group IDs are set to those of the caller.
Occasionally it is useful to change ownership of a file or change the group in
which the file resides. Only the root user can change the ownership of a file,
although the owner of a file can change the file's group ID to another group of which the
owner is a member.
There are three calls that can be used to change the file's user and group as
shown below:
#include <sys/types.h>
#include <unistd.h>
int chown(const char *path, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *path, uid_t owner, gid_t group);
The difference between chown() and lchown() is that the lchown() system
call operates on the symbolic link specified rather than the file to which it points.
Table 2.1 Permissions Passed to chmod()
PERMISSION   DESCRIPTION
S_IRWXU      Read, write, execute/search by owner
S_IRUSR      Read permission by owner
S_IWUSR      Write permission by owner
S_IXUSR      Execute/search permission by owner
S_IRWXG      Read, write, execute/search by group
S_IRGRP      Read permission by group
S_IWGRP      Write permission by group
S_IXGRP      Execute/search permission by group
S_IRWXO      Read, write, execute/search by others
S_IROTH      Read permission by others
S_IWOTH      Write permission by others
S_IXOTH      Execute/search permission by others
S_ISUID      Set-user-ID on execution
S_ISGID      Set-group-ID on execution
S_ISVTX      On directories, set the restricted deletion flag
In addition to setting the user and group IDs of the file, it is also possible to set
the effective user and effective group IDs such that if the file is executed, the caller
effectively becomes the owner of the file for the duration of execution. This is a
commonly used feature in UNIX. For example, the passwd command is a setuid
binary. When the command is executed it must gain an effective user ID of root in
order to change the passwd(F) file. For example:
$ ls -l /etc/passwd
-r--r--r-- 1 root other 157670 Mar 14 16:03 /etc/passwd
$ ls -l /usr/bin/passwd
-r-sr-sr-x 3 root sys 99640 Oct 6 1998 /usr/bin/passwd*
Because the passwd file is not writable by others, changing it requires that the
passwd command run as root as noted by the s shown above. When run, the
process runs as root allowing the passwd file to be changed.
The setuid() and setgid() system calls enable the user and group IDs to
be changed. Similarly, the seteuid() and setegid() system calls enable the
effective user and effective group ID to be changed:
#include <unistd.h>
int setuid(uid_t uid);
int seteuid(uid_t euid);
int setgid(gid_t gid);
int setegid(gid_t egid);
Handling permissions checking is a task performed by the kernel.
Changing File Times
When a file is created, there are three timestamps associated with the file as
shown in the stat structure earlier. These are the time that the file was last
accessed, the time of last data modification, and the time of last status change
(st_ctime, which is not a creation time).
On occasion it is useful to change the access and modification times. One
particular use is in a programming environment where a programmer wishes to
force re-compilation of a module. The usual way to achieve this is to run the
touch command on the file and then recompile. For example:
$ ls -l hello*
-rwxr-xr-x 1 spate fcf 13397 Mar 30 11:53 hello*
-rw-r--r-- 1 spate fcf 31 Mar 30 11:52 hello.c
$ make hello
make: 'hello' is up to date.
$ touch hello.c
$ ls -l hello.c
-rw-r--r-- 1 spate fcf 31 Mar 30 11:55 hello.c
$ make hello
cc hello.c -o hello
The system calls utime() and utimes() can be used to change both the access
and modification times. In some versions of UNIX, utimes() is simply
implemented by calling utime().
#include <sys/types.h>
#include <utime.h>
int utime(const char *filename, struct utimbuf *buf);
#include <sys/time.h>
int utimes(char *filename, struct timeval *tvp);
struct utimbuf {
    time_t actime;   /* access time */
    time_t modtime;  /* modification time */
};
struct timeval {
    long tv_sec;   /* seconds */
    long tv_usec;  /* microseconds */
};
By running strace, truss etc., it is possible to see how a call to touch maps
onto the utime() system call as follows:
$ strace touch myfile 2>&1 | grep utime
utime("myfile", NULL) =
To change just the access time of the file, the touch command must first
determine what the modification time of the file is. In this case, the call sequence
is a little different as the following example shows:
$ strace touch -a myfile
..
time([984680824]) = 984680824
open("myfile",
O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY|O_LARGEFILE, 0666) =
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) =
close(3) =
utime("myfile", [2001/03/15-10:27:04, 2001/03/15-10:26:23]) =
In this case, the current time is obtained through calling time(). The file is then
opened and fstat() called to obtain the file's modification time. The call to
utime() then passes the original modification time and the new access time.
Truncating and Removing Files
Removing files is something that people just take for granted in the same vein as
pulling up an editor and creating a new file. However, the internal operation of
truncating and removing files can be a particularly complicated operation as later
chapters will show.
There are two calls that can be invoked to truncate a file:
#include <unistd.h>
int truncate(const char *path, off_t length);
int ftruncate(int fildes, off_t length);
The confusing aspect of truncation is that through the calls shown here it is
possible to truncate upwards, thus increasing the size of the file! If the value of
length is less than the current size of the file, the file size will be changed and
storage above the new size can be freed. However, if the value of length is
greater than the current size, the file size is increased to length and the added
region reads back as zeros; whether storage is allocated for it at that point
depends on the filesystem.
To remove a file, the unlink() system call can be invoked:
#include <unistd.h>
int unlink(const char *path);
The call is appropriately named since it does not necessarily remove the file but
decrements the file's link count. If the link count reaches zero, the file is indeed
removed as the following example shows:
$ touch myfile
$ ls -l myfile
-rw-r--r-- 1 spate fcf 0 Mar 15 11:09 myfile
$ ln myfile myfile2
$ ls -l myfile*
-rw-r--r-- 2 spate fcf 0 Mar 15 11:09 myfile
-rw-r--r-- 2 spate fcf 0 Mar 15 11:09 myfile2
$ rm myfile
$ ls -l myfile*
-rw-r--r-- 1 spate fcf 0 Mar 15 11:09 myfile2
$ rm myfile2
$ ls -l myfile*
ls: myfile*: No such file or directory
When myfile is created it has a link count of 1. Creation of the hard link
(myfile2) increases the link count. In this case there are two directory entries
(myfile and myfile2), but they point to the same file.
To remove myfile, the unlink() system call is invoked, which decrements
the link count and removes the directory entry for myfile.
Directories
There are a number of routines that relate to directories. As with other simple
UNIX commands, they often have a close correspondence to the system calls that
they call, as shown in Table 2.2.
The arguments passed to most directory operations are dependent on where in
the file hierarchy the caller is at the time of the call, together with the pathname
passed to the command:
Current working directory. This is where the calling process is at the time of
the call; it can be obtained through use of pwd from the shell or getcwd()
from within a C program.
Absolute pathname. An absolute pathname is one that starts with the
character /. Thus to get to the base filename, the full pathname starting at /
must be parsed. The pathname /etc/passwd is absolute.
Relative pathname. A relative pathname does not contain / as the first
character and starts from the current working directory. For example, to
reach the same passwd file by specifying passwd the current working
directory must be /etc.
Table 2.2 Directory Related Operations
COMMAND   SYSTEM CALL         DESCRIPTION
mkdir     mkdir()             Make a new directory
rmdir     rmdir()             Remove a directory
pwd       getcwd()            Display the current working directory
cd        chdir(), fchdir()   Change directory
chroot    chroot()            Change the root directory
The following example shows how these calls can be used together:
$ cat dir.c
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/param.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    printf("cwd = %s\n", getcwd(NULL, MAXPATHLEN));
    mkdir("mydir", S_IRWXU);
    chdir("mydir");
    printf("cwd = %s\n", getcwd(NULL, MAXPATHLEN));
    chdir("..");
    rmdir("mydir");
    return 0;
}
$ make dir
cc -o dir dir.c
cwd = /h/h065/spate/tmp
cwd = /h/h065/spate/tmp/mydir
Special Files
A special file is a file that has no associated storage but can be used to gain access
to a device. The goal here is to be able to access a device using the same
mechanisms by which regular files and directories can be accessed. Thus, callers
are able to invoke open(), read(), and write() in the same way that these
system calls can be used on regular files.
One noticeable difference between special files and other file types can be seen
by issuing an ls command as follows:
$ ls -l /dev/vx/*dsk/homedg/h
brw------- 1 root root 142,4002 Jun 5 1999 /dev/vx/dsk/homedg/h
crw------- 1 root root 142,4002 Dec 5 21:48 /dev/vx/rdsk/homedg/h
In this example there are two device files denoted by the b and c as the first
character displayed on each line. This letter indicates the type of device that this
file represents. Block devices are represented by the letter b while character
devices are represented by the letter c. For block devices, data is accessed in
fixed-size blocks while for character devices data can be accessed in multiple
different sized blocks ranging from a single character upwards.
Device special files are created with the mknod command as follows:
mknod name b major minor
mknod name c major minor
For example, to create the above two files, execute the following commands:
# mknod /dev/vx/dsk/homedg/h b 142 4002
# mknod /dev/vx/rdsk/homedg/h c 142 4002
The major number is used to point to the device driver that controls the device,
while the minor number is a private field used by the device driver.
The mknod command is built on top of the mknod() system call:
#include <sys/stat.h>
int mknod(const char *path, mode_t mode, dev_t dev);
The mode argument specifies the type of file to be created, which can be one of
the following:
S_IFIFO. FIFO special file (named pipe).
S_IFCHR. Character special file.
S_IFDIR. Directory file.
S_IFBLK. Block special file.
S_IFREG. Regular file.
The file access permissions are also passed in through the mode argument. The
permissions are constructed from a bitwise OR for which the values are the same
as for the chmod() system call as outlined in the section Changing File Permissions
earlier in this chapter.
Symbolic Links and Hard Links
Symbolic links and hard links can be created using the ln command, which in
turn maps onto the link() and symlink() system calls. Both prototypes are
shown below:
#include <unistd.h>
int link(const char *existing, const char *new);
int symlink(const char *name1, const char *name2);
The section Truncating and Removing Files earlier in this chapter described hard
links and showed the effects that link() and unlink() have on the underlying
file. Symbolic links are managed in a very different manner by the filesystem as
the following example shows:
$ echo "Hello world" > myfile
$ ls -l myfile
-rw-r--r-- 1 spate fcf 12 Mar 15 12:17 myfile
$ cat myfile
Hello world
$ strace ln -s myfile mysymlink 2>&1 | grep link
execve("/bin/ln", ["ln", "-s", "myfile",
"mysymlink"], [/* 39 vars */]) =
lstat("mysymlink", 0xbffff660) = -1 ENOENT (No such file/directory)
symlink("myfile", "mysymlink") =
$ ls -l my*
-rw-r--r-- 1 spate fcf 12 Mar 15 12:17 myfile
lrwxrwxrwx 1 spate fcf 6 Mar 15 12:18 mysymlink -> myfile
$ cat mysymlink
Hello world
$ rm myfile
$ cat mysymlink
cat: mysymlink: No such file or directory
The ln command checks to see if a file called mysymlink already exists and then
calls symlink() to create the symbolic link. There are two things to notice here.
First of all, after the symbolic link is created, the link count of myfile does not
change. Secondly, the size of mysymlink is 6 bytes, which is the length of the
string myfile.
Because creating a symbolic link does not change the file it points to in any way,
after myfile is removed, mysymlink does not point to anything as the example
shows.
Named Pipes
Although Inter Process Communication is beyond the scope of a book on
filesystems, since named pipes are stored in the filesystem as a separate file type,
they should be given some mention here.
A named pipe is a means by which unrelated processes can communicate. A
simple example will show how this all works:
$ mkfifo mypipe
$ ls -l mypipe
prw-r--r-- 1 spate fcf 0 Mar 13 11:29 mypipe
$ echo "Hello world" > mypipe &
[1] 2010
$ cat < mypipe
Hello world
[1]+ Done echo "Hello world" >mypipe
The mkfifo command makes use of the mknod() system call.
The filesystem records the fact that the file is a named pipe. However, it has no
storage associated with it and other than responding to an open request, the
filesystem plays no role in the IPC mechanisms of the pipe. Pipes themselves
traditionally used storage in the filesystem for temporarily storing the data.
Summary
It is difficult to provide an introductory chapter on file-based concepts without
digging into too much detail. The chapter provided many of the basic functions
available to view files, return their properties and change these properties.
To better understand how the main UNIX commands are implemented and
how they interact with the filesystem, the GNU fileutils package provides
excellent documentation, which can be found online at:
www.gnu.org/manual/fileutils/html_mono/fileutils.html
and the source for these utilities can be found at:
ftp://alpha.gnu.org/gnu/fetish
CHAPTER 3
User File I/O
Building on the principles introduced in the last chapter, this chapter describes
the major file-related programmatic interfaces (at a C level) including basic file
access system calls, memory mapped files, asynchronous I/O, and sparse files.
To reinforce the material, examples are provided wherever possible. Such
examples include simple implementations of various UNIX commands including
cat, cp, and dd.
The previous chapter described many of the basic file concepts. This chapter
goes one step further and describes the different interfaces that can be called to
access files. Most of the APIs described here are at the system call level. Library
calls typically map directly to system calls so are not addressed in any detail here.
The material presented here is important for understanding the overall
implementation of filesystems in UNIX. By understanding the user-level
interfaces that need to be supported, the implementation of filesystems within the
kernel is easier to grasp.
Library Functions versus System Calls
System calls are functions that transfer control from the user process to the
operating system kernel. Functions such as read() and write() are system
calls. The process invokes them with the appropriate arguments, control transfers
to the kernel where the system call is executed, results are passed back to the
calling process, and finally, control is passed back to the user process.
Library functions typically provide a richer set of features. For example, the
fread() library function reads a number of elements of data of specified size
from a file. While presenting this formatted data to the user, internally it will call
the read() system call to actually read data from the file.
Library functions are implemented on top of system calls. The decision
whether to use system calls or library functions is largely dependent on the
application being written. Applications wishing to have much more control over
how they perform I/O in order to optimize for performance may well invoke
system calls directly. If an application writer wishes to use many of the features
that are available at the library level, this could save a fair amount of
programming effort. System calls can consume more time than invoking library
functions because they involve transferring control of the process from user
mode to kernel mode. However, the implementation of different library functions
may not meet the needs of the particular application. In other words, whether to
use library functions or system calls is not an obvious choice because it very
much depends on the application being written.
Which Header Files to Use?
The UNIX header files are an excellent source of information to understand
user-level programming and also kernel-level data structures. Most of the header
files that are needed for user level programming can be found under
/usr/include and /usr/include/sys.
The header files that are needed are shown in the manual page of the library
function or system call to be used. For example, using the stat() system call
requires the following two header files:
#include <sys/types.h>
#include <sys/stat.h>
int stat(const char *path, struct stat *buf);
The stat.h header file defines the stat structure. The types.h header file
defines the types of each of the fields in the stat structure.
Header files that reside in /usr/include are used purely by applications.
Those header files that reside in /usr/include/sys are also used by the
kernel. Using stat() as an example, a reference to the stat structure is passed
from the user process to the kernel, the kernel fills in the fields of the structure
and then returns. Thus, in many circumstances, both user processes and the
kernel need to understand the same structures and data types.
The Six Basic File Operations
Most file creation and file I/O needs can be met by the six basic system calls
shown in Table 3.1. This section uses these commands to show a basic
implementation of the UNIX cat command, which is one of the easiest of the
UNIX commands to implement.
However, before giving its implementation, it is necessary to describe the terms
standard input, standard output, and standard error. As described in the section File
Descriptors in Chapter 2, the first file that is opened by a user process is assigned a
file descriptor value of 3. When the new process is created, it typically inherits the
first three file descriptors from its parent. These file descriptors (0, 1, and 2) have a
special meaning to routines in the C runtime library and refer to the standard
input, standard output, and standard error of the process respectively. When
using library routines, a file stream is specified that determines where data is to be
read from or written to. Some functions such as printf() write to standard
output by default. For other routines such as fprintf(), the file stream must be
specified. For standard output, stdout may be used and for standard error,
stderr may be used. Similarly, when using routines that require an input stream,
stdin may be used. Chapter 5 describes the implementation of the standard I/O
library. For now simply consider them as a layer on top of file descriptors.
When directly invoking system calls, which requires file descriptors, the
constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO may be
used. These values are defined in unistd.h as follows:
#define STDIN_FILENO 0
#define STDOUT_FILENO 1
#define STDERR_FILENO 2
Looking at the implementation of the cat command, the program must be able to
use standard input, output, and error to handle invocations such as:
$ cat # read from standard input
$ cat file # read from 'file'
$ cat file > file2 # redirect standard output
Thus there is a small amount of parsing to be performed before the program knows
which file to read from and which file to write to. The program source is shown
below:
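#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define BUFSZ 512

/* Chooses the input and output descriptors; not shown here */
int get_fds(int argc, char *argv[], int *ifd, int *ofd);

int main(int argc, char *argv[])
{
    char buf[BUFSZ];
    int  ifd, ofd, nread;

    get_fds(argc, argv, &ifd, &ofd);
    /* copy until end of file (0) or a read error (-1) */
    while ((nread = read(ifd, buf, BUFSZ)) > 0) {
        write(ofd, buf, nread);
    }
    return 0;
}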
Table 3.1 The Six Basic System Calls Needed for File I/O
SYSTEM CALL   FUNCTION
open()        Open an existing file or create a new file
creat()       Create a new file
close()       Close an already open file
lseek()       Seek to a specified position in the file
read()        Read data from the file from the current position
write()       Write data starting at the current position
As previously mentioned, there is actually very little work to do in the main
program. The get_fds() function, which is not shown here, is responsible for
assigning the appropriate file descriptors to ifd and ofd based on the following
input:
$ mycat
ifd = STDIN_FILENO
ofd = STDOUT_FILENO
$ mycat file
ifd = open(file, O_RDONLY)
ofd = STDOUT_FILENO
$ mycat > file
ifd = STDIN_FILENO
ofd = open(file, O_WRONLY | O_CREAT)
$ mycat fileA > fileB
ifd = open(fileA, O_RDONLY)
ofd = open(fileB, O_WRONLY | O_CREAT)
The following examples show the program running:
$ mycat > testfile
Hello world
$ mycat testfile
Hello world
$ mycat testfile > testfile2
$ mycat testfile2
Hello world
$ mycat
Hello
Hello
world
world
To modify the program, one exercise to try is to implement the get_fds()
function. Some additional exercises to try are:
1. Number all output lines (cat -n). Parse the input strings to detect the -n.
2. Print all tabs as ^I and place a $ character at the end of each line (cat -ET).
The previous program reads the whole file and writes out its contents.
Commands such as dd allow the caller to seek to a specified block in the input file
and output a specified number of blocks.
Reading sequentially from the start of the file in order to get to the part which
the user specified would be particularly inefficient. The lseek() system call
allows the file pointer to be modified, thus allowing random access to the file. The
declaration for lseek() is as follows:
#include <sys/types.h>
#include <unistd.h>
off_t lseek(int fildes, off_t offset, int whence);
The offset and whence arguments dictate where the file pointer should be
positioned:
If whence is SEEK_SET the file pointer is set to offset bytes.
If whence is SEEK_CUR the file pointer is set to its current location plus
offset.
If whence is SEEK_END the file pointer is set to the size of the file plus
offset.
When a file is first opened, the file pointer is set to 0 indicating that the first byte
read will be at an offset of 0 bytes from the start of the file. Each time data is read,
the file pointer is incremented by the amount of data read such that the next read
will start from the offset in the file referenced by the updated pointer. For
example, if the first read of a file is for 1024 bytes, the file pointer for the next read
will be set to 0 + 1024 = 1024. Reading another 1024 bytes will start from byte
offset 1024. After that read the file pointer will be set to 1024 + 1024 = 2048
and so on.
By seeking throughout the input and output files, it is possible to see how the
dd command can be implemented. As with many UNIX commands, most of the
work is done in parsing the command line to determine the input and output
files, the starting position to read, the block size for reading, and so on. The
example below shows how lseek() is used to seek to a specified starting offset
within the input file. In this example, all data read is written to standard output:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char *buf;
    int fd;
    ssize_t nread;
    off_t offset;
    size_t iosize;

    if (argc != 4) {
        printf("usage: mydd filename offset size\n");
        exit(1);
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        printf("unable to open file\n");
        exit(1);
    }
    offset = (off_t)atol(argv[2]);
    iosize = (size_t)atol(argv[3]);
    buf = (char *)malloc(iosize);
    if (buf == NULL)
        exit(1);

    /* Position the file pointer, then read and echo up to iosize bytes. */
    lseek(fd, offset, SEEK_SET);
    nread = read(fd, buf, iosize);
    if (nread > 0)
        write(STDOUT_FILENO, buf, nread);
    return 0;
}
Using a large file as an example, try different offsets and sizes and determine the
effect on performance. Also try multiple runs of the program. Some of the effects
seen may not be as expected. The section Data and Attribute Caching, a bit later in
this chapter, discusses some of these effects.
Duplicate File Descriptors
The section File Descriptors, in Chapter 2, introduced the concept of file
descriptors. Typically a file descriptor is returned in response to an open() or
creat() system call. The dup() system call allows a user to duplicate an
existing open file descriptor.
#include <unistd.h>
int dup(int fildes);
There are a number of uses for dup() that are really beyond the scope of this
book. However, the shell often uses dup() when connecting the input and output
streams of processes via pipes.
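One property of dup() worth seeing in code is that the new descriptor refers to
the same open file as the original, so the two share a single file pointer. The
sketch below is only an illustration; the function name dup_offset_demo() and
the idea of passing in a scratch path are assumptions made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `path`, duplicate its descriptor, write ten bytes through the
 * original, and return the file pointer as seen through the duplicate.
 * Returns -1 on failure. The scratch file is removed before returning.
 */
off_t dup_offset_demo(const char *path)
{
    off_t off = -1;
    int fd, newfd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    newfd = dup(fd);                /* lowest-numbered unused descriptor */
    if (newfd >= 0) {
        if (write(fd, "0123456789", 10) == 10)
            off = lseek(newfd, 0, SEEK_CUR);
        close(newfd);
    }
    close(fd);
    unlink(path);
    return off;
}
```

Although nothing was ever written through the duplicate, the offset it reports
is 10, because the write through the original descriptor advanced the shared
file pointer.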
Seeking and I/O Combined
The pread() and pwrite() system calls combine the effects of lseek() and
read() (or write()) into a single system call. This saves one system call per
I/O, although the gain will only really be visible in an application that has a
very I/O-intensive workload. Both interfaces are supported by the Single UNIX
Specification and should be available in most UNIX environments. The
definition of these interfaces is as follows:
#include <unistd.h>
ssize_t pread(int fildes, void *buf, size_t nbyte, off_t offset);
ssize_t pwrite(int fildes, const void *buf, size_t nbyte,
               off_t offset);
The example below continues on from the dd program described earlier and
shows pread() and pwrite() used in place of separate lseek(), read(), and
write() calls:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char *buf;
    int ifd, ofd;
    ssize_t nread;
    off_t inoffset, outoffset;
    size_t insize, outsize;

    if (argc != 7) {
        printf("usage: mydd infilename in_offset"
               " in_size outfilename out_offset"
               " out_size\n");
        exit(1);
    }
    ifd = open(argv[1], O_RDONLY);
    if (ifd < 0) {
        printf("unable to open %s\n", argv[1]);
        exit(1);
    }
    ofd = open(argv[4], O_WRONLY);
    if (ofd < 0) {
        printf("unable to open %s\n", argv[4]);
        exit(1);
    }
    inoffset = (off_t)atol(argv[2]);
    insize = (size_t)atol(argv[3]);
    outoffset = (off_t)atol(argv[5]);
    outsize = (size_t)atol(argv[6]);
    buf = (char *)malloc(insize);
    if (buf == NULL)
        exit(1);
    if (insize < outsize)
        outsize = insize;

    /* Read at inoffset, then write no more than was actually read. */
    nread = pread(ifd, buf, insize, inoffset);
    if (nread > 0)
        pwrite(ofd, buf,
            ((size_t)nread < outsize) ? (size_t)nread : outsize, outoffset);
    return 0;
}
The simple example below shows how the program is run:
$ cat fileA
0123456789
$ cat fileB
$ mydd2 fileA 2 4 fileB 4 3
$ cat fileA
0123456789
$ cat fileB
----234-
To indicate how performance may be improved through the use of pread()
and pwrite(), the I/O loop in this example and in the earlier one was repeated
1 million times, and calls to time() were used to determine how many seconds
each loop took to execute.
For the pread()/pwrite() combination the average time to complete the
I/O loop was 25 seconds, while for the lseek()/read() and
lseek()/write() combinations the average time was 35 seconds, a
considerable difference.
This test shows the advantage of pread() and pwrite() at its best. In
general, though, if an lseek() is immediately followed by a read() or
write(), the two calls should be combined.
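Apart from saving a system call, pread() differs from lseek() followed by
read() in that it leaves the file pointer untouched. The following sketch
illustrates this under stated assumptions: pread_demo() is a hypothetical helper
written for this example, and it creates and removes its own scratch file.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write a known pattern to `path`, then fetch `len` bytes at `offset` with
 * pread(). Returns the file pointer afterwards so the caller can see that
 * pread() did not move it, or -1 on failure. `out` must hold len + 1 bytes.
 */
off_t pread_demo(const char *path, off_t offset, size_t len, char *out)
{
    static const char data[] = "0123456789";
    off_t before = -1, after = -1;
    ssize_t n = -1;
    int fd;

    assert(path != NULL && out != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, strlen(data)) == (ssize_t)strlen(data)) {
        before = lseek(fd, 0, SEEK_CUR);    /* just past the data */
        n = pread(fd, out, len, offset);    /* positioned read */
        after = lseek(fd, 0, SEEK_CUR);
    }
    close(fd);
    unlink(path);
    if (n < 0 || before != after)
        return -1;
    out[n] = '\0';
    return after;
}
```

Because the pointer is not shared state for pread() and pwrite(), several
threads or processes can use the same descriptor at different offsets without
coordinating their seeks.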
Data and Attribute Caching
There are a number of flags that can be passed to open() that control various
aspects of the I/O. Also, some filesystems support additional but nonstandard
methods for improving I/O performance.
First, there are three options, supported under the Single UNIX Specification,
that can be passed to open() and that affect subsequent I/O operations.
When a write takes place, there are two items that must be written to disk,
namely the file data and the file's inode. An inode is the object stored on disk that
describes the file, including the properties seen by calling stat() together with
a block map of all data blocks associated with the file.
The three options that are supported from a standards perspective are:
1. O_SYNC. For all types of writes, whether allocation is required or not, the data
and any meta-data updates are committed to disk before the write returns.
For reads, the access time stamp will be updated before the read returns.
2. O_DSYNC. When a write occurs, the data will be committed to disk before the
write returns but the file's meta-data may not be written to disk at this stage.
This will result in better I/O throughput because, if implemented efficiently
by the filesystem, the number of inode updates will be minimized,
effectively halving the number of writes. Typically, if the write results in an
allocation to the file (a write over a hole or beyond the end of the file) the
meta-data is also written to disk. However, if the write does not involve an
allocation, the timestamps will typically not be written synchronously.
3. O_RSYNC. If both the O_RSYNC and O_DSYNC flags are set, the read returns
after the data has been read and the file attributes have been updated on
disk, with the exception of file timestamps that may be written later. If there
are any writes pending that cover the range of data to be read, these writes
are committed before the read returns.
If both the O_RSYNC and O_SYNC flags are set, the behavior is identical to
that of setting O_RSYNC and O_DSYNC except that all file attributes changed
by the read operation (including all time attributes) must also be committed
to disk before the read returns.
Which option to choose depends on the application. For I/O-intensive
applications where timestamp updates are not particularly important, there can
be a significant performance boost by using O_DSYNC in place of O_SYNC.
VxFS Caching Advisories
Some filesystems provide nonstandard means of improving I/O performance by
offering additional features. For example, the VERITAS filesystem, VxFS,
provides the noatime mount option that disables access time updates; this is
usually fine for most application environments.
The following example shows the effect that selecting O_SYNC versus O_DSYNC
can have on an application:
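#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char buf[4096] = { 0 };
    int i, fd;

    /* With O_DSYNC, each write returns only once its data is on disk. */
    fd = open("myfile", O_WRONLY|O_DSYNC);
    if (fd < 0)
        return 1;
    for (i = 0 ; i < 1024 ; i++)
        write(fd, buf, 4096);
    return 0;
}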
A second program was built that is identical to the previous one except that it
sets O_SYNC in place of O_DSYNC. The output of the two programs is as follows:
# time ./sync
real 0m8.33s
user 0m0.03s
sys 0m1.92s
# time ./dsync
real 0m6.44s
user 0m0.02s
sys 0m0.69s
This clearly shows the increase in time when selecting O_SYNC. VxFS offers a
number of other advisories that go beyond what is currently supported by the
traditional UNIX standards. These options can only be accessed through use of
the ioctl() system call. These advisories give an application writer more
control over a number of I/O parameters:
VX_RANDOM. Filesystems try to determine the I/O pattern in order to perform
read ahead to maximize performance. This advisory indicates that the I/O
pattern is random and therefore read ahead should not be performed.
VX_SEQ. This advisory indicates that the file is being accessed sequentially. In
this case the filesystem should maximize read ahead.
VX_DIRECT. When data is transferred to or from the user buffer and disk, a
copy is first made into the kernel buffer or page cache, which is a cache of
recently accessed file data. Although this cache can significantly help
performance by avoiding a read of data from disk for a second access, the
double copying of data has an impact on performance. The VX_DIRECT
advisory avoids this double buffering by copying data directly between the
user's buffer and disk.
VX_NOREUSE. If data is only to be read once, the in-kernel cache is not
needed. This advisory informs the filesystem that the data does not need to
be retained for subsequent access.
VX_DSYNC. This option was in existence for a number of years before the
O_DSYNC mode was adopted by the UNIX standards committees. It can still
be accessed on platforms where O_DSYNC is not supported.
Before showing how these caching advisories can be used it is first necessary to
describe how to use the ioctl() system call. The definition of ioctl(), which
is not part of any UNIX standard, differs slightly from platform to platform by
requiring different header files. The basic definition is as follows:
#include <unistd.h>     /* Solaris */
#include <stropts.h>    /* Solaris, AIX and HP-UX */
#include <sys/ioctl.h>  /* Linux */
int ioctl(int fildes, int request, /* arg ... */);
Note that AIX does not, at the time of writing, support ioctl() calls on regular
files. Ioctl calls may be made to VxFS regular files, but the operation is not
supported generally.
The following program shows how the caching advisories are used in practice.
The program takes VX_SEQ, VX_RANDOM, or VX_DIRECT as an argument and
reads a 1MB file in 4096 byte chunks.
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "sys/fs/vx_ioctl.h"
/* The platform header that declares ioctl() is also needed. */

#define MB (1024 * 1024)

int
main(int argc, char *argv[])
{
    char *buf;
    int i, fd, advisory;
    long pagesize, pagemask;

    if (argc != 2)
        exit(1);
    if (strcmp(argv[1], "VX_SEQ") == 0) {
        advisory = VX_SEQ;
    } else if (strcmp(argv[1], "VX_RANDOM") == 0) {
        advisory = VX_RANDOM;
    } else if (strcmp(argv[1], "VX_DIRECT") == 0) {
        advisory = VX_DIRECT;
    } else {
        exit(1);
    }

    /* Allocate two pages and round up so that buf is page aligned. */
    pagesize = sysconf(_SC_PAGESIZE);
    pagemask = pagesize - 1;
    buf = (char *)malloc(2 * pagesize);
    if (buf == NULL)
        exit(1);
    buf = (char *)(((long)buf + pagesize) & ~pagemask);

    fd = open("myfile", O_RDWR);
    ioctl(fd, VX_SETCACHE, advisory);
    for (i = 0 ; i < MB ; i++) {
        read(fd, buf, 4096);
    }
    return 0;
}
The program was run three times passing each of the advisories in turn. The
times command was run to display the time to run the program and the amount
of time that was spent in user and system space.
VX_SEQ
real 2:47.6
user 5.9
sys 2:41.4
VX_DIRECT
real 2:35.7
user 6.7
sys 2:28.7
VX_RANDOM
real 2:43.6
user 5.2
sys 2:38.1
Although the time difference between the runs shown here is not significant, the
appropriate use of these caching advisories can have a significant impact on
overall performance of large applications.
Miscellaneous Open Options
Through use of the O_NONBLOCK and O_NDELAY flags that can be passed to
open(), applications can gain some additional control in the case where they
may block for reads and writes.
O_EXCL. If both O_CREAT and O_EXCL are set, a call to open() fails if the file
exists. If the O_CREAT option is not set, the effect of passing O_EXCL is
undefined.
O_NONBLOCK / O_NDELAY. These flags can affect subsequent reads and
writes. If both the O_NDELAY and O_NONBLOCK flags are set, O_NONBLOCK
takes precedence. Because both options are for use with pipes, they won't be
discussed further here.
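A common use of the O_CREAT and O_EXCL combination is to make sure that
only one process creates a given file. The sketch below shows the idea; the
function name create_exclusive() and its return convention are assumptions
made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to create `path` exclusively.
 * Returns 1 if this call created the file, 0 if it already existed,
 * and -1 on any other error.
 */
int create_exclusive(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        return 1;
    }
    return (errno == EEXIST) ? 0 : -1;
}
```

The test for existence and the creation happen as one step inside open(), so
two processes racing to create the same file cannot both succeed.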
File and Record Locking
If multiple processes are writing to a file at the same time, the result is
nondeterministic. Within the UNIX kernel, only one write to the same file may
proceed at any given time. However, if multiple processes are writing to the file,
the order in which they run can differ depending on many different factors.
Obviously this is highly undesirable and results in a need to lock files at an
application level, whether the whole file or specific sections of a file. Sections of a
file are also called records, hence file and record locking.
There are numerous uses for file locking. However, looking at database file
access gives an excellent example of the types of locks that applications require.
For example, it is important that all users wishing to view database records are
able to do so simultaneously. When updating records it is imperative that while
one record is being updated, other users are still able to access other records.
Finally it is imperative that records are updated in a time-ordered manner.
There are two types of locks that can be used to coordinate access to files,
namely mandatory and advisory locks. With advisory locking, it is possible for
cooperating processes to safely access a file in a controlled manner. Mandatory
locking is somewhat of a hack and will be described later. The majority of this
section will concentrate on advisory locking, sometimes called record locking.
Advisory Locking
There are three functions that can be used for advisory locking. These are
lockf(), flock(), and fcntl(). The flock() function, defined below:
/usr/ucb/cc [ flag ... ] file ...
#include <sys/file.h>
int flock(int fd, int operation);
was introduced in BSD UNIX and is not supported under the Single UNIX
Specification standard. It sets an advisory lock on the whole file. The lock type,
specified by the operation argument, may be exclusive (LOCK_EX) or shared
(LOCK_SH). If LOCK_NB is ORed into operation and the file is already locked,
the call fails with EAGAIN rather than blocking. The LOCK_UN operation removes
the lock.
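The effect of LOCK_NB can be sketched as follows. This is only an illustration:
flock_conflict_errno() is a hypothetical helper, and the example assumes a
platform (such as Linux or BSD) where flock() locks belong to the open file, so
that two separate open() calls on the same file conflict even within one
process. Where flock() is emulated on top of fcntl(), the second attempt may
well succeed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Take an exclusive flock() lock on `path` through one descriptor, then try
 * a non-blocking lock through a second, independently opened descriptor.
 * Returns the errno from the second attempt (0 if it unexpectedly succeeded),
 * or -1 if the setup itself failed. The scratch file is removed afterwards.
 */
int flock_conflict_errno(const char *path)
{
    int fd1, fd2, result = -1;

    assert(path != NULL);
    fd1 = open(path, O_RDWR | O_CREAT, 0600);
    fd2 = open(path, O_RDWR);
    if (fd1 >= 0 && fd2 >= 0 && flock(fd1, LOCK_EX) == 0) {
        if (flock(fd2, LOCK_EX | LOCK_NB) == 0)
            result = 0;
        else
            result = errno;         /* EWOULDBLOCK, usually equal to EAGAIN */
        flock(fd1, LOCK_UN);
    }
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    unlink(path);
    return result;
}
```

Without LOCK_NB, the second flock() call would simply block until the first
lock was released.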
The lockf() function, which is typically implemented as a call to fcntl(),
can be invoked to apply or remove an advisory lock on a segment of a file as
follows:
#include <unistd.h>
int lockf(int fildes, int function, off_t size);
To use lockf(), the file must have been opened with one of the O_WRONLY or
O_RDWR flags. The size argument specifies the number of bytes to be locked,
starting from the current file pointer. Thus, a call to lseek() should be made
prior to calling lockf(). If the value of size is 0 the file is locked from the
current offset to the end of the file.
The function argument can be one of the following:
F_LOCK. This command sets an exclusive lock on the file. If the file is already
locked, the calling process will block until the previous lock is relinquished.
F_TLOCK. This performs the same function as the F_LOCK command but will
not block; thus, if the file is already locked, EAGAIN is returned.
F_ULOCK. This command unlocks a segment of the file.
F_TEST. This command is used to test whether a lock exists for the specified
segment. If there is no lock for the segment, 0 is returned, otherwise -1 is
returned, and errno is set to EACCES.
If the segment to be locked contains a previous locked segment, in whole or part,
the result will be a new, single locked segment. Similarly, if F_ULOCK is specified,
the segment of the file to be unlocked may be a subset of a previously locked
segment or may cover more than one previously locked segment. If size is 0,
the file is unlocked from the current file offset to the end of the file. If the segment
to be unlocked is a subset of a previously locked segment, the result will be one
or two smaller locked segments.
It is possible to reach deadlock if two processes make a request to lock
segments of a file owned by each other. The kernel is able to detect this and, if the
condition would occur, EDEADLK is returned.
As mentioned above, lockf() is typically implemented on top of the
fcntl() system call, for which there are three commands that can be passed to
manage record locking. Recall the interface for fcntl():
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
int fcntl(int fildes, int cmd, ...);
All commands operate on the flock structure that is passed as the third
argument:
struct flock {
short l_type; /* F_RDLCK, F_WRLCK or F_UNLCK */
short l_whence; /* flag for starting offset */
off_t l_start; /* relative offset in bytes */
off_t l_len; /* size; if 0 then until EOF */
pid_t l_pid; /* process ID of lock holder */
};
The commands that can be passed to fcntl() are:
F_GETLK. This command returns the first lock that is covered by the flock
structure specified. The information that is retrieved overwrites the fields of
the structure passed.
F_SETLK. This command either sets a new lock or clears an existing lock
based on the value of l_type as shown above.
F_SETLKW. This command is the same as F_SETLK with the exception that
the process will block if the lock is held by another process.
Because record locking as defined by fcntl() is supported by all appropriate
UNIX standards, this is the routine that should ideally be used for application
portability.
The following code fragments show how advisory locking works in practice.
The first program, lock, which follows, sets a writable lock on the whole of the
file myfile and calls pause() to wait for a SIGUSR1 signal. After the signal
arrives, a call is made to unlock the file.
#include <sys/types.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Empty handler: its only job is to make pause() return. */
void
mysig(int signo)
{
    (void)signo;
}

int
main(void)
{
    struct flock lk;
    int fd, err;

    signal(SIGUSR1, mysig);

    fd = open("myfile", O_WRONLY);
    if (fd < 0)
        exit(1);

    /* A write lock from offset 0 with length 0 covers the whole file. */
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;
    lk.l_pid = getpid();

    err = fcntl(fd, F_SETLK, &lk);
    if (err < 0)
        exit(1);
    printf("lock: File is locked\n");
    pause();
    lk.l_type = F_UNLCK;
    err = fcntl(fd, F_SETLK, &lk);
    printf("lock: File is unlocked\n");
    return 0;
}
Note that the process ID of this process is placed in l_pid so that anyone
requesting information about the lock will be able to determine how to identify
this process.
The next program (mycatl) is a modified version of the cat program that will
only display the file if there are no write locks held on the file. If a lock is detected,
the program loops up to 5 times waiting for the lock to be released. Because the
lock will still be held by the lock program, mycatl will extract the process ID
from the flock structure returned by fcntl() and post a SIGUSR1 signal. This
is handled by the lock program which then unlocks the file.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Not shown here; assumed to copy the contents of fd to standard output. */
void catfile(int fd);

/* Return the process ID of a conflicting lock holder, or 0 if there is none. */
pid_t
is_locked(int fd)
{
    struct flock lk;

    lk.l_type = F_RDLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;
    lk.l_pid = 0;

    fcntl(fd, F_GETLK, &lk);
    return (lk.l_type == F_UNLCK) ? 0 : lk.l_pid;
}

int
main(void)
{
    int i, fd;
    pid_t pid;

    fd = open("myfile", O_RDONLY);
    if (fd < 0)
        exit(1);

    for (i = 0 ; i < 5 ; i++) {
        if ((pid = is_locked(fd)) == 0) {
            catfile(fd);
            exit(0);
        } else {
            printf("mycatl: File is locked ...\n");
            sleep(1);
        }
    }
    /* Still locked after five attempts: ask the holder to let go. */
    kill(pid, SIGUSR1);
    while ((pid = is_locked(fd)) != 0) {
        printf("mycatl: Waiting for lock release\n");
        sleep(1);
    }
    catfile(fd);
    return 0;
}
Note the use of fcntl() in the mycatl program. If no lock exists on the file that
would interfere with the lock requested (in this case the program is asking for a
read lock on the whole file), the l_type field is set to F_UNLCK. When the
program is run, the following can be seen:
$ cat myfile
Hello world
$ lock&
[1] 2448
lock: File is locked
$ mycatl
mycatl: File is locked ...
mycatl: File is locked ...
mycatl: File is locked ...
mycatl: File is locked ...
mycatl: File is locked ...
mycatl: Waiting for lock release
lock: File is unlocked
Hello world
[1]+ Exit 23 ./lock
The following example shows that advisory locking is ineffective if
processes are not cooperating:
$ lock&
[1] 2494
lock: File is locked
$ cat myfile
Hello world
$ rm myfile
$ jobs
[1]+ Running ./lock
In this case, although the file has a segment lock, a non-cooperating process can
still access the file, thus the real cat program can display the file and the file can
also be removed! Note that removing a file involves calling the unlink() system
call. The file is not actually removed until the last close. In this case the lock
program still has the file open. The file will actually be removed once the lock
program exits.
Mandatory Locking
As the previous example shows, if all processes accessing the same file do not
cooperate through the use of advisory locks, unpredictable results can occur.
Mandatory locking provides file locking between non-cooperating processes.
Unfortunately, the implementation, which arrived with SVR3, leaves something
to be desired.
Mandatory locking can be enabled on a file if the set group ID bit is switched
on and the group execute bit is switched off, a combination that does not
otherwise make any sense. Thus if the following were executed on a system
that supports mandatory locking:
$ lock&
[1] 12096
lock: File is locked
$ cat myfile # The cat program blocks here
the cat program will block until the lock is relinquished. Note that mandatory
locking is not supported by the major UNIX standards so further details will not
be described here.
File Control Operations
The fcntl() system call is designed to provide file control functions for open
files. The definition was shown in a previous section, File and Record Locking,
earlier in the chapter. It is repeated below:
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
int fcntl(int fildes, int cmd, ...);
The file descriptor refers to a previously opened file and the cmd argument is one
of the commands shown below:
F_DUPFD. This command returns a new file descriptor that is the lowest
numbered file descriptor available (and is not already open). The file
descriptor returned will be greater than or equal to the third argument. The
new file descriptor refers to the same open file as the original file descriptor
and shares any locks. The FD_CLOEXEC (see F_SETFD below) flag
associated with the new file descriptor is cleared to keep the file open across
calls to one of the exec functions.
F_GETFD. This command returns the flags associated with the specified file
descriptor. This is a little bit of a misnomer because there has only ever been
one flag, the FD_CLOEXEC flag that indicates that the file should be closed
following a successful call to exec().
F_SETFD. This command sets the file descriptor flags (in practice, the
FD_CLOEXEC flag) to the value given by the third argument.
F_GETFL. This command returns the file status flags and file access modes for
fildes. The file access modes can be extracted from the return value using
the mask O_ACCMODE. The access modes are O_RDONLY, O_WRONLY, and O_RDWR.
The file status flags, as described in the sections Data and Attribute Caching
and Miscellaneous Open Options, earlier in this chapter, can be either
O_APPEND, O_SYNC, O_DSYNC, O_RSYNC, or O_NONBLOCK.
F_SETFL. This command sets the file status flags for the specified file
descriptor.
F_GETLK. This command retrieves information about an advisory lock. See
the section File and Record Locking, earlier in this chapter, for further
information.
F_SETLK. This command clears or sets an advisory lock. See the section File
and Record Locking, earlier in this chapter, for further information.
F_SETLKW. This command also clears or sets an advisory lock but blocks until
the lock can be granted. See the section File and Record Locking, earlier in
this chapter, for further information.
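The flag-related commands are usually used in a read-modify-write pattern so
that flags other than the one being changed are preserved. The sketch below
illustrates this; the helper names is_writable(), set_append(), and
set_cloexec() are assumptions made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if fd was opened O_WRONLY or O_RDWR, 0 if O_RDONLY, -1 on error. */
int is_writable(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return (flags & O_ACCMODE) != O_RDONLY;
}

/* Turn on O_APPEND without disturbing the other file status flags. */
int set_append(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_APPEND);
}

/* Mark fd so that it is closed by a successful call to an exec function. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    assert(fd >= 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

Note that FD_CLOEXEC belongs to the descriptor rather than to the open file: a
descriptor obtained afterwards with F_DUPFD starts with the flag cleared, even
though it shares the file status flags, such as O_APPEND, with the original.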
Vectored Reads and Writes
If the data that a process reads from a file in a single read needs to be placed in
different areas of memory, this would typically involve more than one call to
read(). However, the readv() system call can be used to perform a single read
from the file but copy the data to multiple memory locations, which can cut
down on system call overhead and therefore increase performance in
environments where there is a lot of I/O activity. When writing to files, the
writev() system call can be used.
Here are the definitions for both functions:
#include <sys/uio.h>
ssize_t readv(int fildes, const struct iovec *iov, int iovcnt);
ssize_t writev(int fildes, const struct iovec *iov, int iovcnt);
Note that although multiple I/Os can be combined, they must all be contiguous
within the file. Each element of the iov array describes one buffer in memory:
struct iovec {
    void *iov_base; /* Address in memory of buffer for r/w */
    size_t iov_len; /* Size of the above buffer in memory */
};
Figure 3.1 shows how the transfer of data occurs for a read operation. The shading
on the areas of the file and the address space show where the data will be placed
after the read has completed.
The following program corresponds to the example shown in Figure 3.1:
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    struct iovec uiop[3];
    void *addr1, *addr2, *addr3;
    int fd;
    ssize_t nbytes;

    addr1 = malloc(4096);
    addr2 = malloc(4096);
    addr3 = malloc(4096);
    if (addr1 == NULL || addr2 == NULL || addr3 == NULL)
        exit(1);

    /* One read of 2048 contiguous bytes, scattered over three buffers. */
    uiop[0].iov_base = addr1; uiop[0].iov_len = 512;
    uiop[1].iov_base = addr2; uiop[1].iov_len = 512;
    uiop[2].iov_base = addr3; uiop[2].iov_len = 1024;

    fd = open("myfile", O_RDONLY);
    if (fd < 0)
        exit(1);
    nbytes = readv(fd, uiop, 3);
    printf("number of bytes read = %ld\n", (long)nbytes);
    return 0;
}
Note that readv() returns the number of bytes read. When this program runs,
the result is 2048 bytes, the sum of the iov_len fields of the three
iovec structures.
$ readv
number of bytes read = 2048
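The write side works the same way in reverse. As a minimal sketch, the
following function gathers three separate strings into one contiguous write;
the name write_record() and the key/separator/value layout are assumptions made
purely for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gather three separate strings into a single contiguous write.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t write_record(int fd, const char *key, const char *sep,
                     const char *value)
{
    struct iovec iov[3];

    assert(key != NULL && sep != NULL && value != NULL);
    /* writev() only reads from the buffers, so dropping const is safe. */
    iov[0].iov_base = (void *)key;
    iov[0].iov_len = strlen(key);
    iov[1].iov_base = (void *)sep;
    iov[1].iov_len = strlen(sep);
    iov[2].iov_base = (void *)value;
    iov[2].iov_len = strlen(value);

    return writev(fd, iov, 3);
}
```

The alternative would be either three write() calls or a copy of all three
strings into a temporary buffer; writev() avoids both.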
Asynchronous I/O
By issuing an I/O asynchronously, an application can continue with other work
rather than waiting for the I/O to complete. There have been numerous different
implementations of asynchronous I/O (commonly referred to as async I/O) over
the years. This section will describe the interfaces as supported by the Single
UNIX Specification.
As an example of where async I/O is commonly used, consider the Oracle
database writer process (DBWR), one of the main Oracle processes; its role is to
manage the Oracle buffer cache, a user-level cache of database blocks. This
involves responding to read requests and writing dirty (modified) buffers to
disk.
In an active database, the work of DBWR is complicated by the fact that it is
constantly writing dirty buffers to disk in order to allow new blocks to be read.
Oracle employs two methods to help alleviate some of the performance
bottlenecks. First, it supports multiple DBWR processes (called DBWR slave
processes); the second option, which greatly improves throughput, is through
use of async I/O. If I/O operations are being performed asynchronously, the
DBWR processes can be doing other work, whether flushing more buffers to
disk, reading data from disk, or other internal functions.
All of the Single UNIX Specification async I/O operations center around an
I/O control block defined by the aiocb structure as follows:
struct aiocb {
int aio_fildes; /* file descriptor */
off_t aio_offset; /* file offset */
volatile void *aio_buf; /* location of buffer */
size_t aio_nbytes; /* length of transfer */
int aio_reqprio; /* request priority offset */
struct sigevent aio_sigevent; /* signal number and value */
int aio_lio_opcode; /* operation to be performed */
};
The fields of the aiocb structure will be described throughout this section as the
various interfaces are described. The first interface to describe is aio_read():
cc [ flag... ] file... -lrt [ library... ]
#include <aio.h>
int aio_read(struct aiocb *aiocbp);
The aio_read() function will read aiocbp->aio_nbytes bytes from the file
associated with file descriptor aiocbp->aio_fildes into the buffer referenced
by aiocbp->aio_buf. The call returns when the I/O has been initiated. Note
that the requested operation takes place at the offset in the file specified by the
aio_offset field.
Similarly, to perform an asynchronous write operation, the function to call is
aio_write(), which is defined as follows:
cc [ flag... ] file... -lrt [ library... ]
#include <aio.h>
int aio_write(struct aiocb *aiocbp);
and the fields in the aio control block used to initiate the write are the same as for
an async read.
In order to retrieve the status of a pending I/O, there are two interfaces that can
be used. One involves the posting of a signal and will be described later; the other
involves the use of the aio_return() function as follows:
#include <aio.h>
ssize_t aio_return(struct aiocb *aiocbp);
The aio control block that was passed to aio_read() or aio_write() should be
passed to aio_return(). The result will either be the same as if a call to read() or
write() had been made or, if the operation is still in progress, the result is
undefined.
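Putting aio_read() and aio_return() together, a minimal sketch of an
asynchronous read might look as follows. It polls for completion with
aio_error(), which returns EINPROGRESS while the request is still pending. The
function name async_read_at() and the 1 millisecond polling interval are
assumptions made for this example, and on some systems the program must be
linked with -lrt.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Read up to `len` bytes at `offset` asynchronously, poll until the request
 * finishes, and return what read() would have returned (-1 on failure).
 */
ssize_t async_read_at(int fd, void *buf, size_t len, off_t offset)
{
    struct timespec nap = { 0, 1000000 };   /* 1 ms between polls */
    struct aiocb cb;
    ssize_t ret;
    int err;

    assert(buf != NULL);
    memset(&cb, 0, sizeof(cb));     /* give unused fields sane values */
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal on completion */

    if (aio_read(&cb) < 0)
        return -1;

    /* A real application would do useful work here instead of napping. */
    while ((err = aio_error(&cb)) == EINPROGRESS)
        nanosleep(&nap, NULL);

    ret = aio_return(&cb);          /* call once per completed request */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return ret;
}
```

Waiting in a loop like this defeats the purpose of async I/O, of course; the
point of the sketch is only to show the order in which the calls are made.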
The following example shows some interesting properties of an asynchronous
write:
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILESZ (1024 * 1024 * 64)

int
main(void)
{
    struct aiocb aio;
    void *buf;
    time_t time1, time2;
    int err;

    buf = malloc(FILESZ);
    if (buf == NULL)
        exit(1);

    /* Zero the control block so that unused fields hold sane values. */
    memset(&aio, 0, sizeof(aio));
    aio.aio_fildes = open("/dev/vx/rdsk/fs1", O_WRONLY);
    if (aio.aio_fildes < 0)
        exit(1);
    aio.aio_buf = buf;
    aio.aio_offset = 0;
    aio.aio_nbytes = FILESZ;
    aio.aio_reqprio = 0;

    time(&time1);
    err = aio_write(&aio);
    while ((err = aio_error(&aio)) == EINPROGRESS) {
        sleep(1);
    }
    time(&time2);
    printf("The I/O took %ld seconds\n", (long)(time2 - time1));
    return 0;
}
The program uses the raw device /dev/vx/rdsk/fs1 to write a single 64MB
buffer. The aio_error() call:
cc [ flag... ] file... -lrt [ library... ]
#include <aio.h>
int aio_error(const struct aiocb *aiocbp);
can be called to determine whether the I/O has completed, is still in progress, or
whether an error occurred. The return value from aio_error() will be 0 if the
I/O completed successfully, the error number that the equivalent read() or
write() would have set in errno if it failed, or EINPROGRESS
if the I/O is still pending. Note when the program is run:
# aiowrite
The I/O took 7 seconds
Thus if the process had issued the write through use of the write() system call, it
would have waited for 7 seconds before being able to do anything else. Through
the use of async I/O the process is able to continue processing and then find
out the status of the async I/O at a later time.
For async I/O operations that are still pending, the aio_cancel() function
can be used to cancel the operation:
cc [ flag... ] file... -lrt [ library... ]
#include <aio.h>
int aio_cancel(int fildes, struct aiocb *aiocbp);
The fildes argument refers to the open file on which a previously made async
I/O, as specified by aiocbp, was issued. If aiocbp is NULL, all pending async
I/O operations on fildes are canceled. Note that it is not always possible to
cancel an async I/O. In many cases, the I/O will be queued at the driver level
before the call from aio_read() or aio_write() returns.
As an example, following the above call to aio_write(), this code is inserted:
/* errstr is assumed to be declared as: const char *errstr; */
err = aio_cancel(aio.aio_fildes, &aio);
switch (err) {
case AIO_CANCELED:
    errstr = "AIO_CANCELED";
    break;
case AIO_NOTCANCELED:
    errstr = "AIO_NOTCANCELED";
    break;
case AIO_ALLDONE:
    errstr = "AIO_ALLDONE";
    break;
default:
    errstr = "Call failed";
}
printf("Error value returned %s\n", errstr);
and when the program is run, the following error value is returned:
Error value returned AIO_CANCELED
In this case, the I/O operation was canceled. Consider the same program but
instead of issuing a 64MB write, a small 512 byte I/O is issued:
Error value returned AIO_NOTCANCELED
In this case, the I/O was already in progress, so the kernel was unable to prevent
it from completing.
As mentioned above, the Oracle DBWR process will likely issue multiple I/Os
simultaneously and wait for them to complete at a later time. Multiple read()
and write() system calls can be combined through the use of readv() and
writev() to help cut down on system call overhead. For async I/O, the
lio_listio() function achieves the same result:
#include <aio.h>
int lio_listio(int mode, struct aiocb *const list[], int nent,
               struct sigevent *sig);
The mode argument can be either LIO_WAIT, in which case the requesting process
will block in the kernel until all I/O operations have completed, or LIO_NOWAIT,
in which case the kernel returns control to the user as soon as the I/Os have been
queued. The list argument is an array of nent pointers to aiocb structures. Note
that for each aiocb structure, the aio_lio_opcode field must be set to either
LIO_READ for a read operation, LIO_WRITE for a write operation, or LIO_NOP
in which case the entry will be ignored.
If the mode flag is LIO_NOWAIT, the sig argument specifies the signal that
should be posted to the process once the I/O has completed.
The following example uses lio_listio() to issue two async writes to
different parts of the file. Once the I/O has completed, the signal handler
aiohdlr() will be invoked; this displays the time that it took for both writes to
complete.
#include <aio.h>
#include <time.h>
#include <errno.h>
#include <signal.h>
#include <fcntl.h>      /* open() */
#include <stdio.h>      /* printf() */
#include <stdlib.h>     /* malloc() */
#include <string.h>     /* memset() */
#include <unistd.h>     /* pause() */

#define FILESZ (1024 * 1024 * 64)
time_t time1, time2;

/* Invoked once both writes have completed. */
void
aiohdlr(int signo)
{
    time(&time2);
    printf("Time for write was %ld seconds\n", (long)(time2 - time1));
}

int
main()
{
    struct sigevent mysig;
    struct aiocb *laio[2];
    struct aiocb aio1, aio2;
    void *buf;
    int fd;

    buf = malloc(FILESZ);
    fd = open("/dev/vx/rdsk/fs1", O_WRONLY);

    memset(&aio1, 0, sizeof(aio1));     /* clear fields not set below */
    aio1.aio_fildes = fd;
    aio1.aio_lio_opcode = LIO_WRITE;
    aio1.aio_buf = buf;
    aio1.aio_offset = 0;
    aio1.aio_nbytes = FILESZ;
    aio1.aio_reqprio = 0;
    laio[0] = &aio1;

    memset(&aio2, 0, sizeof(aio2));
    aio2.aio_fildes = fd;
    aio2.aio_lio_opcode = LIO_WRITE;
    aio2.aio_buf = buf;
    aio2.aio_offset = FILESZ;
    aio2.aio_nbytes = FILESZ;
    aio2.aio_reqprio = 0;
    laio[1] = &aio2;

    sigset(SIGUSR1, aiohdlr);
    memset(&mysig, 0, sizeof(mysig));
    mysig.sigev_signo = SIGUSR1;
    mysig.sigev_notify = SIGEV_SIGNAL;
    mysig.sigev_value.sival_ptr = (void *)laio;

    time(&time1);
    lio_listio(LIO_NOWAIT, laio, 2, &mysig);
    pause();            /* wait for SIGUSR1 */
    return 0;
}
The call to lio_listio() specifies that the program should not wait and that a
signal should be posted to the process after all I/Os have completed. Although
not described here, it is possible to use real-time signals through which
information can be passed back to the signal handler to determine which async
I/O has completed. This is particularly important when there are multiple
simultaneous calls to lio_listio(). Bill Gallmeister's book Posix.4:
Programming for the Real World [GALL95] describes how to use real-time signals.
When the program is run the following is observed:
# listio
Time for write was 12 seconds
which clearly shows the amount of time that this process could have been
performing other work rather than waiting for the I/O to complete.
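When the caller has nothing else to do while the I/O is in flight, LIO_WAIT is the simpler mode. The following is a minimal sketch, not taken from the example above: write_pair() is a hypothetical helper that issues two writes with one lio_listio() call and then collects the result of each request with aio_error() and aio_return().

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: write n bytes from b1 at off1 and n bytes from
 * b2 at off2 with a single lio_listio() call. Returns the total number
 * of bytes written, or -1 if either request failed. */
ssize_t
write_pair(int fd, const void *b1, off_t off1,
           const void *b2, off_t off2, size_t n)
{
    struct aiocb cb[2];
    struct aiocb *list[2] = { &cb[0], &cb[1] };
    ssize_t total = 0;
    int i, failed = 0;

    memset(cb, 0, sizeof(cb));
    cb[0].aio_fildes = fd;
    cb[0].aio_lio_opcode = LIO_WRITE;
    cb[0].aio_buf = (void *)b1;
    cb[0].aio_offset = off1;
    cb[0].aio_nbytes = n;
    cb[1].aio_fildes = fd;
    cb[1].aio_lio_opcode = LIO_WRITE;
    cb[1].aio_buf = (void *)b2;
    cb[1].aio_offset = off2;
    cb[1].aio_nbytes = n;

    /* LIO_WAIT blocks until both requests have completed. EIO means
     * at least one request failed; the loop below finds out which. */
    if (lio_listio(LIO_WAIT, list, 2, NULL) < 0 && errno != EIO)
        return -1;

    for (i = 0; i < 2; i++) {
        int err = aio_error(&cb[i]);
        ssize_t r = aio_return(&cb[i]);     /* reap the request */

        if (err != 0 || r < 0)
            failed = 1;
        else
            total += r;
    }
    return failed ? -1 : total;
}
```

With LIO_WAIT no signal is needed, so the sig argument is passed as NULL. On older systems the program may need to be linked with -lrt.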
Memory Mapped Files
In addition to reading and writing files through the use of read() and write(),
UNIX supports the ability to map a file into the process address space and read
and write to the file through memory accesses. This allows unrelated processes to
access files with either shared or private mappings. Mapped files are also used by
the operating system for executable files.
The mmap() system call allows a process to establish a mapping to an already
open file:
#include <sys/mman.h>
void *mmap(void *addr, size_t len, int prot, int flags,
           int fildes, off_t off);
The file is mapped from an offset of off bytes within the file for len bytes. Note
that the offset must be on a page size boundary. Thus, if the page size of the
system is 4KB, the offset must be 0, 4096, 8192 and so on. The size of the mapping
does not need to be a multiple of the page size although the kernel will round the
request up to the nearest page size boundary. For example, if off is set to 0 and
len is set to 2048, on systems with a 4KB page size, the mapping established will
actually be for 4KB.
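The rounding described above is simple arithmetic. As a small sketch, the two hypothetical helpers below (page_floor() and page_ceil() are illustrative names, not system interfaces) compute the page-aligned offset that contains a given byte and the length that a mapping will actually cover.

```c
#include <stddef.h>
#include <sys/types.h>

/* Round a file offset down to the start of the page containing it. */
off_t
page_floor(off_t off, long pagesz)
{
    return off - (off % pagesz);
}

/* Round a length up to a whole number of pages. */
size_t
page_ceil(size_t len, long pagesz)
{
    size_t p = (size_t)pagesz;

    return (len + p - 1) / p * p;
}
```

The pagesz argument would normally be the value returned by sysconf(_SC_PAGESIZE). With a 4KB page size, page_ceil(2048, 4096) gives 4096, matching the example in the text.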
Figure 3.2 shows the relationship between the pages in the user's address
space and how they relate to the file being mapped. The page size of the
underlying hardware platform can be determined by making a call to
sysconf() as follows:
#include <stdio.h>
#include <unistd.h>
int main() {
    printf("PAGESIZE = %ld\n", sysconf(_SC_PAGESIZE));
    return 0;
}
Typically the page size will be 4KB or 8KB. For example, as expected, when the
program is run on an x86 processor, the following is reported:
# ./sysconf
PAGESIZE = 4096
while for Sparc 9 based hardware:
# ./sysconf
PAGESIZE = 8192
Although it is possible for the application to specify the address to which the file
should be mapped, it is recommended that the addr field be set to 0 so that the
system has the freedom to choose which address the mapping will start from.
The operating system dynamic linker places parts of the executable program in
various memory locations. The amount of memory used differs from one process
to the next. Thus, an application should never rely on locating data at the same
place in memory even within the same operating system and hardware
architecture. The address at which the mapping is established is returned if the
call to mmap() is successful, otherwise MAP_FAILED is returned and errno is set.
Note that after the file has been mapped it can be closed and still accessed
through the mapping.
Before describing the other parameters, here is a very simple example showing
the basics of mmap():
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>      /* putchar() */
#include <stdlib.h>     /* exit() */
#include <unistd.h>

#define MAPSZ 4096

int
main()
{
    char *addr, c;
    int fd;

    fd = open("/etc/passwd", O_RDONLY);
    addr = (char *)mmap(NULL, MAPSZ,
                        PROT_READ, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping remains valid after close() */
    for (;;) {          /* print the first line of the file */
        c = *addr;
        putchar(c);
        addr++;
        if (c == '\n') {
            exit(0);
        }
    }
}
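The listing above omits error checking for brevity. The following is a minimal sketch of the same idea with the checks added; first_line_len() is a hypothetical helper, and it maps the whole file (using the size from fstat()) rather than a fixed MAPSZ so that it never reads past the end of the file.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: return the length of the first line of the file
 * (including the newline, if there is one), or -1 on error. */
long
first_line_len(const char *path)
{
    struct stat st;
    char *addr;
    long n = 0;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;              /* a zero-length mapping is invalid */
    }
    addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid */
    if (addr == MAP_FAILED)     /* failure is MAP_FAILED, not NULL */
        return -1;
    while (n < st.st_size) {    /* stop at end of file */
        if (addr[n++] == '\n')
            break;
    }
    munmap(addr, st.st_size);
    return n;
}
```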
The /etc/passwd file is opened and a call to mmap() is made to map the first
MAPSZ bytes of the file. A file offset of 0 is passed. The PROT_READ and
MAP_SHARED arguments describe the type of mapping and how it relates to other
processes that map the same file. The prot argument (in this case PROT_READ)
can be one of the following:
PROT_READ. The data can be read.
PROT_WRITE. The data can be written.
PROT_EXEC. The data can be executed.
PROT_NONE. The data cannot be accessed.
Note that the different access types can be combined. For example, to specify read
and write access a combination of (PROT_READ|PROT_WRITE) may be specified.
By specifying PROT_EXEC it is possible for application writers to produce their
own dynamic library mechanisms. The PROT_NONE argument can be used for
user level memory management by preventing access to certain parts of memory
at certain times. Note that PROT_NONE cannot be used in conjunction with any
other flags.
The flags argument can be one of the following:
MAP_SHARED. Any changes made through the mapping will be reflected back
to the mapped file and are visible by other processes calling mmap() and
specifying MAP_SHARED.
MAP_PRIVATE. Any changes made through the mapping are private to this
process and are not reflected back to the file.
MAP_FIXED. The addr argument should be interpreted exactly. This
argument will be typically used by dynamic linkers to ensure that program
text and data are laid out in the same place in memory for each process. If
MAP_FIXED is specified and the area specified in the mapping covers an
already existing mapping, the initial mapping is first unmapped.
Note that in some versions of UNIX, the flags have been enhanced to include
operations that are not covered by the Single UNIX Specification. For example,
on the Solaris operating system, the MAP_NORESERVE flag indicates that swap
space should not be reserved. This avoids unnecessary wastage of virtual
memory and is especially useful when mappings are read-only. Note, however,
that this flag is not portable to other versions of UNIX.
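The difference between the two main flags can be seen with a short experiment. The sketch below is illustrative only: poke_private() is a hypothetical helper that modifies the first byte of a file through a MAP_PRIVATE mapping and then reads the byte back from the file with pread(). Because the mapping is private, the file is expected to be unchanged.

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: store c in the first byte of a MAP_PRIVATE
 * mapping of the file, then return the first byte as read back from
 * the file itself. Returns -1 on error. The file must not be empty. */
int
poke_private(const char *path, char c)
{
    char *addr, back;
    int fd;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    addr = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }
    addr[0] = c;                /* modifies a private copy of the page */
    munmap(addr, 1);
    if (pread(fd, &back, 1, 0) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return (unsigned char)back; /* still the original byte */
}
```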
To give a more concrete example of the use of mmap(), an abbreviated
implementation of the cp utility is given. This is how some versions of UNIX
actually implement cp.
1 #include <sys/types.h>
2 #include <sys/stat.h>
3 #include <sys/mman.h>
4 #include <fcntl.h>
5 #include <unistd.h>
6 #include <stdio.h>
7 #include <stdlib.h>
8 #define MAPSZ 4096
9 int main(int argc, char *argv[])
10 {
11     struct stat st;
12     size_t iosz;
13     off_t off = 0;
14     void *addr;
15     int ifd, ofd;
16
17     if (argc != 3) {
18         printf("Usage: mycp srcfile destfile\n");
19         exit(1);
20     }
21     if ((ifd = open(argv[1], O_RDONLY)) < 0)
22         printf("Failed to open %s\n", argv[1]);
23
24     if ((ofd = open(argv[2],
25         O_WRONLY|O_CREAT|O_TRUNC, 0777)) < 0) {
26         printf("Failed to open %s\n", argv[2]);
27     }
28     fstat(ifd, &st);
29     if (st.st_size < MAPSZ) {
30         addr = mmap(NULL, st.st_size,
31                     PROT_READ, MAP_SHARED, ifd, 0);
32         printf("Mapping entire file\n");
33         close(ifd);
34         write (ofd, (char *)addr, st.st_size);
35     } else {
36         printf("Mapping file by MAPSZ chunks\n");
37         while (off <= st.st_size) {
38             addr = mmap(NULL, MAPSZ, PROT_READ,
39                         MAP_SHARED, ifd, off);
40             if (MAPSZ < (st.st_size - off)) {
41                 iosz = MAPSZ;
42             } else {
43                 iosz = st.st_size - off;
44             }
45             write (ofd, (char *)addr, iosz);
46             off += MAPSZ;
47         }
48     }
49 }
The file to be copied is opened and the file to copy to is created on lines 21-27. The
fstat() system call is invoked on line 28 to determine the size of the file to be
copied. The first call to mmap() attempts to map the whole file (line 30) for files of
size less than MAPSZ. If this is successful, a single call to write() can be issued to
write the contents of the mapping to the output file.
Otherwise, for files of MAPSZ bytes or more, the program loops (lines 37-47)
mapping sections of the file and writing them to the output file.
Note that in the example here, MAP_PRIVATE could be used in place of
MAP_SHARED since the file was only being read. Here is an example of the
program running:
$ cp mycp.c fileA
$ mycp fileA fileB
Mapping entire file
$ diff fileA fileB
$ cp mycp fileA
$ mycp fileA fileB
Mapping file by MAPSZ chunks
$ diff fileA fileB
Note that if the file is to be mapped in chunks, we keep making repeated calls to
mmap(). This is an extremely inefficient use of memory because each call to
mmap() will establish a new mapping without first tearing down the old
mapping. Eventually the process will either exceed its virtual memory quota or
run out of address space if the file to be copied is very large. For example, here is
a run of a modified version of the program that displays the addresses returned
by mmap():
$ dd if=/dev/zero of=20kfile bs=4096 count=
5+0 records in
5+0 records out
$ mycp_profile 20kfile newfile
Mapping file by MAPSZ chunks
map addr = 0x40019000
map addr = 0x4001a000
map addr = 0x4001b000
map addr = 0x4001c000
map addr = 0x4001d000
map addr = 0x4001e000
The different addresses show that each call to mmap() establishes a mapping at a
new address. To alleviate this problem, the munmap() system call can be used to
unmap a previously established mapping:
#include <sys/mman.h>
int munmap(void *addr, size_t len);
Thus, using the example above and adding the following line:
munmap(addr, iosz);
after line 46, the mapping established will be unmapped, freeing up both the
user's virtual address space and associated physical pages. Thus, running the
program again and displaying the addresses returned by calling mmap() shows:
$ mycp2 20kfile newfile
Mapping file by MAPSZ chunks
map addr = 0x40019000
map addr = 0x40019000
map addr = 0x40019000
map addr = 0x40019000
map addr = 0x40019000
map addr = 0x40019000
The program determines whether to map the whole file based on the value of
MAPSZ and the size of the file. One way to modify the program would be to
attempt to map the whole file regardless of size and only switch to mapping in
segments if the file is too large, causing the call to mmap() to fail.
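Putting these points together, the copy loop might be sketched as follows. This is one possible arrangement rather than the author's program: copy_mapped() and CHUNK are hypothetical names, the chunk size is assumed to be a multiple of the page size, every call is checked, and each chunk is unmapped before the next one is mapped.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* assumed to be a multiple of the page size */

/* Copy the file open on ifd to ofd one mapped chunk at a time.
 * Returns 0 on success or -1 on error. */
int
copy_mapped(int ifd, int ofd)
{
    struct stat st;
    off_t off = 0;

    if (fstat(ifd, &st) < 0)
        return -1;
    while (off < st.st_size) {
        size_t iosz = CHUNK;
        void *addr;
        ssize_t w;

        if ((off_t)iosz > st.st_size - off)
            iosz = (size_t)(st.st_size - off);  /* final short chunk */
        addr = mmap(NULL, iosz, PROT_READ, MAP_PRIVATE, ifd, off);
        if (addr == MAP_FAILED)
            return -1;
        w = write(ofd, addr, iosz);
        munmap(addr, iosz);         /* release before the next chunk */
        if (w != (ssize_t)iosz)
            return -1;              /* short writes treated as errors */
        off += iosz;
    }
    return 0;
}
```

Because the loop condition is off < st.st_size, no zero-length mapping is attempted when the file size is an exact multiple of the chunk size.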
After a mapping is established with a specific set of access protections, it may
be desirable to change these protections over time. The mprotect() system call
allows the protections to be changed:
#include <sys/mman.h>
int mprotect(void *addr, size_t len, int prot);
The prot argument can be one of PROT_READ, PROT_WRITE, PROT_EXEC,
PROT_NONE, or a valid combination of the flags as described above. Note that the
range of the mapping specified by a call to mprotect() does not have to cover
the entire range of the mapping established by a previous call to mmap(). The
kernel will perform some rounding to ensure that len is rounded up to the next
multiple of the page size.
The other system call that is of importance with respect to memory mapped
files is msync(), which allows modifications to the mapping to be flushed to the
underlying file:
#include <sys/mman.h>
int msync(void *addr, size_t len, int flags);
Again, the range specified by the combination of addr and len does not need to
cover the entire range of the mapping. The flags argument can be one of the
following:
MS_ASYNC. Perform an asynchronous write of the data.
MS_SYNC. Perform a synchronous write of the data.
MS_INVALIDATE. Invalidate any cached data.
Thus, a call to mmap() followed by modification of the data followed by a call to
msync() specifying the MS_SYNC flag is similar to a call to write() following a
call to open() and specifying the O_SYNC flag. Specifying the MS_ASYNC flag
is loosely synonymous with writing to a file opened without the O_SYNC flag. However,
calling msync() with the MS_ASYNC flag is likely to initiate the I/O while writing
to a file without specifying O_SYNC or O_DSYNC could result in data sitting in the
system page or buffer cache for some time.
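As a minimal sketch of the mmap(), modify, msync() sequence, the hypothetical helper update_mapped() below overwrites the start of an existing file through a shared mapping and flushes the change synchronously. It assumes that the file is already at least n bytes long and that n is greater than zero.

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace the first n bytes of an existing file
 * through a shared mapping. Returns 0 on success or -1 on error. */
int
update_mapped(const char *path, const char *data, size_t n)
{
    char *addr;
    int fd, rc;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    addr = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return -1;
    memcpy(addr, data, n);          /* modify the file through memory */
    rc = msync(addr, n, MS_SYNC);   /* returns once the data is written */
    munmap(addr, n);
    return rc;
}
```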
One unusual property of mapped files occurs when the pseudo device
/dev/zero is mapped. As one would expect, this gives access to a contiguous set
of zeroes covering any part of the mapping that is accessed. However, following a
mapping of /dev/zero, if the process was to fork, the mapping would be visible
by parent and child. If MAP_PRIVATE was specified on the call to mmap(), parent
and child will share the same physical pages of the mapping until a modification
is made, at which time the kernel will copy the page, making the modification
private to the process which issued the write.
If MAP_SHARED is specified, both parent and children will share the same
physical pages regardless of whether read or write operations are performed.
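The behavior across fork() can be observed directly. In the sketch below, value_after_child_write() is a hypothetical helper: it maps /dev/zero with the flags supplied, lets a child process store a value in the mapping, and returns what the parent sees afterwards. With MAP_SHARED the parent is expected to see the child's value; with MAP_PRIVATE it should still see zero.

```c
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: map /dev/zero with the given flags, let a child
 * store 42 in the mapping, and return the value the parent then reads
 * (or -1 on error). */
int
value_after_child_write(int flags)
{
    int *p, v, fd;
    pid_t pid;

    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return -1;
    p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, flags, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;
    pid = fork();
    if (pid < 0) {
        munmap(p, sizeof(int));
        return -1;
    }
    if (pid == 0) {             /* child: modify the mapping and exit */
        *p = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);      /* parent: wait, then look at the page */
    v = *p;
    munmap(p, sizeof(int));
    return v;
}
```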
64-Bit File Access (LFS)
32-bit operating systems have typically used a signed long integer as the offset to
files. This leads to a maximum file size of 2^31 - 1 bytes (2GB - 1). The amount of work to
convert existing applications to use a different size type for file offsets was
considered too great, and thus the Large File Summit was formed, a group of OS
and filesystem vendors who wanted to produce a specification that could allow
access to large files. The specification would then be included as part of the Single
UNIX Specification (UNIX 95 and onwards). The specification provided the
following concepts:
The off_t data type would support one of two or more sizes as the OS
and filesystem evolved to a full 64-bit solution.
An offset maximum which, as part of the interface, would give the maximum
offset that the OS/filesystem would allow an application to use. The offset
maximum is determined through a call to open() by specifying (or not)
whether the application wishes to access large files.
When applications attempt to read parts of a file beyond their
understanding of the offset maximum, the OS would return a new error
code, namely EOVERFLOW.
In order to provide both an explicit means of accessing large files as well as a
hidden and easily upgradable approach, there were two programmatic models.
The first allowed the size of off_t to be determined during the compilation and
linking process. This effectively sets the size of off_t and determines whether
the standard system calls such as read() and write() will be used or whether
the large file specific libraries will be used. Either way, the application continues
to use read(), write(), and related system calls, and the mapping is done
during the link time.
The second approach provided an explicit model whereby the size of off_t
was chosen explicitly within the program. For example, on a 32-bit OS, the size
of off_t would be 32 bits, and large files would need to be accessed through
use of the off64_t data type. In addition, specific calls such as open64() and
lseek64() would be required in order to access large files.
Today, the issue has largely gone away, with most operating systems
supporting large files by default.
Sparse Files
Due to their somewhat rare usage, sparse files are often not well understood and a
cause of confusion. For example, the VxFS filesystem up to version 3.5 allowed a
maximum filesystem size of 1TB but a maximum file size of 2TB. How can a
single file be larger than the filesystem in which it resides?
A sparse file is simply a file that contains one or more holes. This statement itself
is probably the reason for the confusion. A hole is a gap within the file for which
there are no allocated data blocks. For example, a file could contain a 1KB data
block followed by a 1KB hole followed by another 1KB data block. The size of the
file would be 3KB but there are only two blocks allocated. When reading over a
hole, zeroes will be returned.
The following example shows how this works in practice. First of all, a 20MB
filesystem is created and mounted:
# mkfs -F vxfs /dev/vx/rdsk/rootdg/vol2 20m
version 4 layout
40960 sectors, 20480 blocks of size 1024, log size 1024 blocks
unlimited inodes, largefiles not supported
20480 data blocks, 19384 free data blocks
1 allocation units of 32768 blocks, 32768 data blocks
last allocation unit has 20480 data blocks
# mount -F vxfs /dev/vx/dsk/rootdg/vol2 /mnt2
and the following program, which is used to create a new file, seeks to an offset of
64MB and then writes a single byte:
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#define IOSZ (1024 * 1024 * 64)
int main() {
    int fd;
    fd = open("/mnt2/newfile", O_CREAT | O_WRONLY, 0666);
    lseek(fd, IOSZ, SEEK_SET);      /* seek past EOF; nothing is allocated */
    write(fd, "a", 1);              /* file size becomes IOSZ + 1 */
    return 0;
}
The following shows the result when the program is run:
# ./lf
# ls -l /mnt2
total
drwxr-xr-x 2 root root 96 Jun 13 08:25 lost+found/
-rw-r--r-- 1 root other 67108865 Jun 13 08:28 newfile
# df -k | grep mnt2
/dev/vx/dsk/rootdg/vol2 20480 1110 18167 6% /mnt2
And thus, the filesystem which is only 20MB in size contains a file which is 64MB.
Note that, although the file size is 64MB, the actual space consumed is very low.
The 6 percent usage, as displayed by running df, shows that the filesystem is
mostly empty.
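The same comparison can be made from within a program by looking at the st_size and st_blocks fields returned by fstat(). The following is a sketch only: make_sparse() is a hypothetical helper, and it assumes that st_blocks is counted in 512-byte units, which is the common convention but is not guaranteed on every system.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create path, seek to offset hole, and write a
 * single byte. On success, store the file size in *size and the space
 * allocated (assuming 512-byte st_blocks units) in *alloc. */
int
make_sparse(const char *path, off_t hole, off_t *size, off_t *alloc)
{
    struct stat st;
    int fd;

    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0666);
    if (fd < 0)
        return -1;
    if (lseek(fd, hole, SEEK_SET) < 0 ||
        write(fd, "a", 1) != 1 ||
        fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    *size = st.st_size;
    *alloc = (off_t)st.st_blocks * 512;
    return 0;
}
```

On a filesystem that supports sparse files, *alloc should be far smaller than *size; on one that does not, the two values will be close.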
To help understand how sparse files can be useful, consider how storage is
allocated to a file in a hypothetical filesystem. For this example, consider a
filesystem that allocates storage to files in 1KB chunks and consider the
interaction between the user and the filesystem as follows:
User              Filesystem
create()          Create a new file
write(1k of a's)  Allocate a new 1k block for range 0 to 1023 bytes
write(1k of b's)  Allocate a new 1k block for range 1024 to 2047 bytes
close()           Close the file
In this example, following the close() call, the file has a size of 2048 bytes. The
data written to the file is stored in two 1k blocks. Now, consider the example
below:
User              Filesystem
create()          Create a new file
lseek(to 1k)      No effect on the file
write(1k of b's)  Allocate a new 1k block for range 1024 to 2047 bytes
close()           Close the file
The chain of events here also results in a file of size 2048 bytes. However, by
seeking to a part of the file that doesn't exist and writing, the allocation occurs at
the position in the file as specified by the file pointer. Thus, a single 1KB block is
allocated to the file. The two different allocations are shown in Figure 3.3.
Note that although filesystems will differ in their individual implementations,
each file will contain a block map mapping the blocks that are allocated to the file
and at which offsets. Thus, in Figure 3.3, the hole is explicitly marked.
So what use are sparse files and what happens if the file is read? All UNIX
standards dictate that if a file contains a hole and data is read from a portion of a
file containing a hole, zeroes must be returned. Thus when reading the sparse file
above, we will see the same result as for a file created as follows:
User              Filesystem
create()          Create a new file
write(1k of 0's)  Allocate a new 1k block for range 0 to 1023 bytes
write(1k of b's)  Allocate a new 1k block for range 1024 to 2047 bytes
close()           Close the file
Not all filesystems implement sparse files and, as the examples above show, from
a programmatic perspective, the holes in the file are not actually visible. The
main benefit comes from the amount of storage that is saved. Thus, if an
application wishes to create a file for which large parts of the file contain zeroes,
this is a useful way to save on storage and potentially gain on performance by
avoiding unnecessary I/Os.
The following program shows the example described above:
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>     /* memset() */
#include <unistd.h>

int
main()
{
    char buf[1024];
    int fd;

    memset(buf, 'a', 1024);
    fd = open("newfile", O_RDWR|O_CREAT|O_TRUNC, 0777);
    lseek(fd, 1024, SEEK_SET);      /* leave a 1KB hole at the start */
    write(fd, buf, 1024);
    return 0;
}
When the program is run the contents are displayed as shown below. Note the
zeroes for the first 1KB as expected.
$ od -c newfile
0000000 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \
0002000 a a a a a a a a a a a a a a a
0004000
If a write were to occur within the first 1KB of the file, the filesystem would have
to allocate a 1KB block even if the size of the write is less than 1KB. For example,
by modifying the program as follows:
memset(buf, 'b', 512);
fd = open("newfile", O_RDWR);
lseek(fd, 256, SEEK_SET);
write(fd, buf, 512);
and then running it on the previously created file, the resulting contents are:
$ od -c newfile
0000000 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \
0000400 b b b b b b b b b b b b b b b
0001400 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \
0002000 a a a a a a a a a a a a a a a
0004000
Therefore in addition to allocating a new 1KB block, the filesystem must zero fill
those parts of the block outside of the range of the write.
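From the application's point of view, the only observable effect is that the unwritten parts read back as zeroes. This can be checked with a small sketch such as the one below, where reads_as_zero() is a hypothetical helper that reads the first n bytes of a file and reports whether they are all zero.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if the first n bytes of the file all
 * read back as zero, 0 if any byte is non-zero, and -1 on error or if
 * the file is shorter than n bytes. */
int
reads_as_zero(const char *path, size_t n)
{
    char buf[256];
    size_t done = 0;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while (done < n) {
        size_t want = n - done < sizeof(buf) ? n - done : sizeof(buf);
        ssize_t r = read(fd, buf, want);
        ssize_t i;

        if (r <= 0) {               /* error or unexpected end of file */
            close(fd);
            return -1;
        }
        for (i = 0; i < r; i++) {
            if (buf[i] != 0) {
                close(fd);
                return 0;
            }
        }
        done += (size_t)r;
    }
    close(fd);
    return 1;
}
```

For the file created by the first program in this section, reads_as_zero("newfile", 1024) would be expected to return 1, whether or not the filesystem actually stores the first 1KB as a hole.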
The following example shows how this works on a VxFS filesystem. A new file
is created. The program then seeks to byte offset 8192 and writes 1024 bytes.
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
int main()
{
    int fd;
    char buf[1024];
    fd = open("myfile", O_CREAT | O_WRONLY, 0666);
    lseek(fd, 8192, SEEK_SET);      /* leave an 8KB hole */
    write(fd, buf, 1024);           /* contents of buf do not matter here */
    return 0;
}
In the output shown below, the program is run, the size of the new file is
displayed, and the inode number of the file is obtained:
# ./sparse
# ls -l myfile
-rw-r--r-- 1 root other 9216 Jun 13 08:37 myfile
# ls -i myfile
6 myfile
The VxFS fsdb command can show which blocks are assigned to the file. The
inode corresponding to the file created is displayed:
# umount /mnt2
# fsdb -F vxfs /dev/vx/rdsk/rootdg/vol2
# > 6i
inode structure at 0x00000431.0200
type IFREG mode 100644 nlink 1 uid 0 gid 1 size 9216
atime 992447379 122128 (Wed Jun 13 08:49:39 2001)
mtime 992447379 132127 (Wed Jun 13 08:49:39 2001)
ctime 992447379 132127 (Wed Jun 13 08:49:39 2001)
aflags 0 orgtype 1 eopflags 0 eopdata
fixextsize/fsindex 0 rdev/reserve/dotdot/matchino
blocks 1 gen 844791719 version 0 13 iattrino
de: 0 1096 0 0 0 0 0 0 0
des: 8 1 0 0 0 0 0 0 0
ie: 0
ies:
The de field refers to a direct extent (filesystem block) and the des field is the
extent size. For this file the first extent starts at block 0 and is 8 blocks (8KB) in
size. VxFS uses block 0 to represent a hole (note that block 0 is never actually
used). The next extent starts at block 1096 and is 1KB in length. Thus, although the
file is 9KB in size, it has only one 1KB block allocated to it.
Summary
This chapter provided an introduction to file I/O based system calls. It is
important to grasp these concepts before trying to understand how filesystems
are implemented. By understanding what the user expects, it is easier to see how
certain features are implemented and what the kernel and individual filesystems
are trying to achieve.
Whenever programming on UNIX, it is always a good idea to follow
appropriate standards to allow programs to be portable across multiple versions
of UNIX. The commercial versions of UNIX typically support the Single UNIX
Specification standard although this is not fully adopted in Linux and BSD. At the
very least, all versions of UNIX will support the POSIX.1 standard.
CHAPTER 4
The Standard I/O Library
Many users require functionality above and beyond what is provided by the basic
file access system calls. The standard I/O library, which is part of the ANSI C
standard, provides this extra level of functionality, avoiding the need for
duplication in many applications.
There are many books that describe the calls provided by the standard I/O
library (stdio). This chapter offers a different approach by describing the
implementation of the Linux standard I/O library showing the main structures,
how they support the functions available, and how the library calls map onto the
system call layer of UNIX.
The needs of the application will dictate whether the standard I/O library will
be used as opposed to basic file-based system calls. If extra functionality is
required and performance is not paramount, the standard I/O library, with its
rich set of functions, will typically meet the needs of most programmers. If
performance is key and more control is required over the execution of I/O,
understanding how the filesystem performs I/O and bypassing the standard I/O
library is typically a better choice.
Rather than describing the myriad of stdio functions available, which are well
documented elsewhere, this chapter provides an overview of how the standard
I/O library is implemented. For further details on the interfaces available, see
Richard Stevens' book Advanced Programming in the UNIX
Environment [STEV92] or consult the Single UNIX Specification.
The FILE Structure
Where system calls such as open() and dup() return a file descriptor through
which the file can be accessed, the stdio library operates on a FILE structure, or
file stream as it is often called. This is basically a character buffer that holds
enough information to record the current read and write file pointers and some
other ancillary information. On Linux, the _IO_FILE structure from which the
FILE structure is defined is shown below. Note that not all of the structure is
shown here.
struct _IO_FILE
{
char *_IO_read_ptr; /* Current read pointer */
char *_IO_read_end; /* End of get area. */
char *_IO_read_base; /* Start of putback and get area. */
char *_IO_write_base; /* Start of put area. */
char *_IO_write_ptr; /* Current put pointer. */
char *_IO_write_end; /* End of put area. */
char *_IO_buf_base; /* Start of reserve area. */
char *_IO_buf_end; /* End of reserve area. */
int _fileno;
int _blksize;
};
typedef struct _IO_FILE FILE;
Each of the structure fields will be analyzed in more detail throughout the
chapter. However, first consider a call to the open() and read() system calls:
fd = open("/etc/passwd", O_RDONLY);
read(fd, buf, 1024);
When accessing a file through the stdio library routines, a FILE structure will be
allocated and associated with the file descriptor fd, and all I/O will operate
through a single buffer. For the _IO_FILE structure shown above, _fileno is
used to store the file descriptor that is used on subsequent calls to read() or
write(), and _IO_buf_base represents the buffer through which the data will
pass.
Standard Input, Output, and Error
The standard input, output, and error for a process can be referenced by the file
descriptors STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO. To use the
stdio library routines on any of these files, their corresponding file streams
stdin, stdout, and stderr can also be used. Here are the definitions of all
three:
extern FILE *stdin;
extern FILE *stdout;
extern FILE *stderr;
All three file streams can be accessed without opening them in the same way that
the corresponding file descriptor values can be accessed without an explicit call to
open().
There are some standard I/O library routines that operate on the standard
input and output streams explicitly. For example, a call to printf() uses stdout
by default whereas a call to fprintf() requires the caller to specify a file stream.
Similarly, a call to getchar() operates on stdin while a call to getc() requires
the file stream to be passed. The declaration of getchar() could simply be:
#define getchar() getc(stdin)
Opening and Closing a Stream
The fopen() and fclose() library routines can be called to open and close a
file stream:
#include <stdio.h>
FILE *fopen(const char *filename, const char *mode);
int fclose(FILE *stream);
The mode argument points to a string that starts with one of the following
sequences. Note that these sequences are part of the ANSI C standard.
r, rb. Open the file for reading.
w, wb. Truncate the file to zero length or, if the file does not exist, create a new
file and open it for writing.
a, ab. Append to the file. If the file does not exist, it is first created.
r+, rb+, r+b. Open the file for update (reading and writing).
w+, wb+, w+b. Truncate the file to zero length or, if the file does not exist,
create a new file and open it for update (reading and writing).
a+, ab+, a+b. Append to the file. If the file does not exist it is created and
opened for update (reading and writing). Writing will start at the end of file.
Internally, the standard I/O library will map these flags onto the corresponding
flags to be passed to the open() system call. For example, r will map to
O_RDONLY, r+ will map to O_RDWR and so on. The process followed when
opening a stream is shown in Figure 4.1.
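The mapping from mode string to open() flags can be sketched as follows. This is an illustration of the idea rather than the code used by any particular library: mode_to_oflags() is a hypothetical function, and the b character is simply ignored because binary and text streams are treated identically on UNIX.

```c
#include <fcntl.h>
#include <string.h>

/* Illustrative mapping from an fopen() mode string to open() flags.
 * Returns -1 if the mode does not start with r, w, or a. */
int
mode_to_oflags(const char *mode)
{
    /* A '+' anywhere in the mode requests update (read and write). */
    int update = strchr(mode, '+') != NULL;

    switch (mode[0]) {
    case 'r':
        return update ? O_RDWR : O_RDONLY;
    case 'w':
        return (update ? O_RDWR : O_WRONLY) | O_CREAT | O_TRUNC;
    case 'a':
        return (update ? O_RDWR : O_WRONLY) | O_CREAT | O_APPEND;
    default:
        return -1;
    }
}
```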
The following example shows the effects of some of the library routines on the
FILE structure:
#include <stdio.h>

int
main()
{
    FILE *fp1, *fp2;
    char c;

    fp1 = fopen("/etc/passwd", "r");
    fp2 = fopen("/etc/mtab", "r");
    printf("Address of fp1 = %p\n", (void *)fp1);
    printf(" fp1->_fileno = 0x%x\n", fp1->_fileno);
    printf("Address of fp2 = %p\n", (void *)fp2);
    printf(" fp2->_fileno = 0x%x\n\n", fp2->_fileno);

    c = getc(fp1);      /* the first read allocates the stream's buffer */
    c = getc(fp2);
    printf(" fp1->_IO_buf_base = %p\n",
           (void *)fp1->_IO_buf_base);
    printf(" fp1->_IO_buf_end = %p\n",
           (void *)fp1->_IO_buf_end);
    printf(" fp2->_IO_buf_base = %p\n",
           (void *)fp2->_IO_buf_base);
    printf(" fp2->_IO_buf_end = %p\n",
           (void *)fp2->_IO_buf_end);
    return 0;
}
Note that, even following a call to fopen(), the library will not allocate space to
the I/O buffer unless the user actually requests data to be read or written. Thus,
the value of _IO_buf_base will initially be NULL. In order for a buffer to be
allocated in the program here, a call is made to getc() in the above example,
which will allocate the buffer and read data from the file into the newly allocated
buffer.
$ fpopen
Address of fp1 = 0x8049860
fp1->_fileno = 0x3
Address of fp2 = 0x80499d0
fp2->_fileno = 0x4
fp1->_IO_buf_base = 0x40019000
fp1->_IO_buf_end = 0x4001a000
fp2->_IO_buf_base = 0x4001a000
fp2->_IO_buf_end = 0x4001b000
Note that one can see the corresponding system calls that the library will make by
running strace, truss, etc.
$ strace fpopen 2>&1 | grep open
open("/etc/passwd", O_RDONLY) =
open("/etc/mtab", O_RDONLY) =
$ strace fpopen 2>&1 | grep read
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 827
read(4, "/dev/hda6 / ext2 rw 0 0 none /pr"..., 4096) = 157
Note that despite the program's request to read only a single character from each
file stream, the stdio library attempted to read 4KB from each file. Any
subsequent calls to getc() do not require another call to read() until all
characters in the buffer have been read.
There are two additional calls that can be invoked to open a file stream, namely
fdopen() and freopen():
#include <stdio.h>
FILE *fdopen(int fildes, const char *mode);
FILE *freopen(const char *filename,
              const char *mode, FILE *stream);
The fdopen() function can be used to associate a file stream
with an already open file descriptor. This function is typically used in conjunction with functions
that only return a file descriptor such as dup(), pipe(), and fcntl().
The freopen() function opens the file whose name is pointed to by
filename and associates the stream pointed to by stream with it. The original
stream (if it exists) is first closed. This is typically used to associate a file with one
of the predefined streams, standard input, output, or error. For example, if the
caller wishes to use functions such as printf() that operate on standard output
by default, but also wants to use a different file stream for standard output, this
function achieves the desired effect.
Standard I/O Library Buffering
The stdio library buffers data with the goal of minimizing the number of calls to
the read() and write() system calls. There are three different types of
buffering used:
Fully (block) buffered. As characters are written to the stream, they are
buffered up to the point where the buffer is full. At this stage, the data is
written to the file referenced by the stream. Similarly, reads will result in a
whole buffer of data being read if possible.
Line buffered. As characters are written to a stream, they are buffered up until
the point where a newline character is written. At this point the line of data
including the newline character is written to the file referenced by the
stream. Similarly for reading, characters are read up to the point where a
newline character is found.
Unbuffered. When an output stream is unbuffered, any data that is written to
the stream is immediately written to the file to which the stream is
associated.
The ANSI C standard dictates that standard input and output should be fully
buffered only if they do not refer to an interactive device, and that standard
error should never be fully buffered. Typically, standard input and output are
line buffered for terminal devices and fully buffered otherwise, while standard
error is unbuffered.
The setbuf() and setvbuf() functions can be used to change the buffering
characteristics of a stream as shown:
#include <stdio.h>
void setbuf(FILE *stream, char *buf);
int setvbuf(FILE *stream, char *buf, int type, size_t size);
The setbuf() function must be called after the stream is opened but before any
I/O to the stream is initiated. The buffer specified by the buf argument, which
must be at least BUFSIZ bytes, is used in place of the buffer that the stdio
library would use. If buf is NULL, the stream becomes unbuffered. This allows the
caller to optimize the number of calls to read() and write() based on the needs
of the application.
The setvbuf() function gives finer control over the buffering characteristics of
the stream. The C standard only defines its behavior when it is called after the
stream is opened and before any other operation is performed on the stream,
although some implementations accept later calls. The type argument can be one of
_IONBF (unbuffered), _IOLBF (line buffered), or _IOFBF (fully buffered). The buffer
specified by the buf argument must be at least size bytes. Prior to the next I/O,
this buffer will replace the buffer currently in use for the stream if one has
already been allocated. If buf is NULL, only the buffering mode will be changed.
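A small sketch of setvbuf() in use follows. It switches a newly opened stream to line buffering with a caller-supplied buffer and records the file size reported by the kernel before and after the newline is written. The helper names kernel_size() and line_buffer_demo() are assumptions for this example, and the sizes observed assume a typical implementation (such as glibc) that holds a partial line in the buffer.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Size in bytes that the kernel currently reports for the file behind fp. */
static long kernel_size(FILE *fp)
{
    struct stat st;
    return fstat(fileno(fp), &st) == 0 ? (long)st.st_size : -1L;
}

/*
 * Write a partial line and then the rest of the line, including its
 * newline, to a line-buffered stream, recording the file size after
 * each step. Returns 0 on success, -1 on error.
 */
int line_buffer_demo(const char *path, long *after_partial, long *after_newline)
{
    char buf[BUFSIZ];
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;

    /* Set the mode before any other operation on the stream. */
    if (setvbuf(fp, buf, _IOLBF, sizeof(buf)) != 0) {
        fclose(fp);
        return -1;
    }

    fputs("partial", fp);               /* no newline yet: stays in buf */
    *after_partial = kernel_size(fp);

    fputs(" line\n", fp);               /* newline pushes the line out */
    *after_newline = kernel_size(fp);

    /* buf must remain valid until the stream is closed, as it is here. */
    return fclose(fp) == 0 ? 0 : -1;
}
```

The buffer passed to setvbuf() must outlive the stream; closing the stream before the function returns keeps the automatic array safe to use.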
Whether full or line buffering is used, the fflush() function can be used to
force all of the buffered data to the file referenced by the stream as shown:
#include <stdio.h>
int fflush(FILE *stream);
Note that all output streams can be flushed by setting stream to NULL. One
further point worthy of mention concerns termination of a process. When a process
terminates normally, by calling exit() or returning from main(), any streams
that are currently open are flushed and closed before the process exits. This
does not happen if the process calls _exit() or is killed by a signal.
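The effect of fflush() can be observed with a sketch like the one below, which writes three bytes to a fully buffered stream and compares the file size reported by the kernel before and after the flush. The names buffered_size() and fflush_demo() are assumptions for this example, and it relies on a stream opened on a regular file being fully buffered by default, which is typical but implementation dependent.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Size in bytes that the kernel currently reports for the file behind fp. */
static long buffered_size(FILE *fp)
{
    struct stat st;
    return fstat(fileno(fp), &st) == 0 ? (long)st.st_size : -1L;
}

/*
 * Write "abc" to a stream, recording the file size before and after
 * fflush(). Returns 0 on success, -1 on error.
 */
int fflush_demo(const char *path, long *before, long *after)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;

    fputs("abc", fp);               /* held in the stdio buffer */
    *before = buffered_size(fp);    /* no write() has been issued yet */

    if (fflush(fp) != 0) {          /* forces the write() system call */
        fclose(fp);
        return -1;
    }
    *after = buffered_size(fp);

    return fclose(fp) == 0 ? 0 : -1;
}
```

Note that fflush() only moves data from the library's buffer to the kernel; it does not by itself guarantee that the data has reached the disk.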
Reading and Writing to/from a Stream
There are numerous stdio functions for reading and writing. This section
describes some of the functions available and shows a different implementation of
the cp program using various buffering options. The program shown below
demonstrates the effects on the FILE structure of reading a single character using
the getc() function:
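1 #include <stdio.h>
2 /* Relies on glibc's FILE layout (_fileno, _IO_buf_base, ...); not portable. */
3 int main(void)
4 {
5     FILE *fp;
6     int c;    /* int rather than char so that EOF can be represented */
7
8     fp = fopen("/etc/passwd", "r");
9     printf("Address of fp = 0x%lx\n", (unsigned long)fp);
10     printf(" fp->_fileno = 0x%x\n", fp->_fileno);
11     printf(" fp->_IO_buf_base = 0x%lx\n", (unsigned long)fp->_IO_buf_base);
12     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
13     /* The first getc() allocates the buffer and fills it. */
14     c = getc(fp);
15     printf(" fp->_IO_buf_base = 0x%lx (size = %ld)\n",
16            (unsigned long)fp->_IO_buf_base,
17            (long)(fp->_IO_buf_end - fp->_IO_buf_base));
18     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
19     c = getc(fp);
20     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
21 }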
Note, as shown in the output below, that the buffer is not allocated until the
first I/O is initiated. The default size of the buffer allocated here is 4KB. With
successive calls to getc(), the read pointer is incremented to reference the next
byte to read within the buffer. Figure 4.2 shows the steps that the stdio library
goes through to read the data.
$ fpinfo
Address of fp = 0x8049818
fp->_fileno = 0x3
fp->_IO_buf_base = 0x0
fp->_IO_read_ptr = 0x0
fp->_IO_buf_base = 0x40019000 (size = 4096)
fp->_IO_read_ptr = 0x40019001
fp->_IO_read_ptr = 0x40019002
By running strace on Linux, it is possible to see how the library reads the data
following the first call to getc(). Note that only those lines that reference the
/etc/passwd file are displayed here:
$ strace fpinfo
..
open("/etc/passwd", O_RDONLY) =
..
fstat(3, {st_mode=S_IFREG|0644, st_size=788, ...}) =
..
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 788
The call to fopen() results in a call to open() and the file descriptor returned is
stored in fp->_fileno as shown above. Note that although the program only
asked for a single character (line 14), the standard I/O library issued a 4KB read
to fill up the buffer. The next call to getc() did not require any further data to be
read from the file. Note that when the end of the file is reached, a subsequent call
to getc() will return EOF.
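Because getc() signals the end of the file by returning EOF, its result should be stored in an int rather than a char; otherwise a data byte such as 0xFF may be mistaken for EOF. The sketch below counts the bytes in a file one character at a time. The function name count_chars() is an assumption made for this example.

```c
#include <stdio.h>

/*
 * Count the bytes in the file at path by reading one character at a
 * time. Returns the count, or -1 if the file cannot be opened or read.
 */
long count_chars(const char *path)
{
    FILE *fp = fopen(path, "r");
    long n = 0;
    int c;      /* int: EOF lies outside the range of unsigned char */

    if (fp == NULL)
        return -1;

    /* Most calls are satisfied from the stdio buffer, not by read(). */
    while ((c = getc(fp)) != EOF)
        n++;

    /* EOF is returned for errors too, so distinguish the two cases. */
    if (ferror(fp))
        n = -1;

    fclose(fp);
    return n;
}
```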
The following example provides a simple cp program showing the effects of
using fully buffered, line buffered, and unbuffered I/O. The buffering option is
passed as an argument. The file to copy from and the file to copy to are hard
coded into the program for this example.
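#include <stdio.h>
#include <string.h>
#include <time.h>

int main(int argc, char **argv)
{
    time_t time1, time2;
    FILE *ifp, *ofp;
    int mode;
    int c;      /* int rather than char so that EOF can be detected */
    /* static: the buffers must stay valid for as long as the streams are open */
    static char ibuf[16384], obuf[16384];

    if (argc != 2) {
        fprintf(stderr, "usage: %s _IONBF|_IOLBF|_IOFBF\n", argv[0]);
        return 1;
    }
    if (strcmp(argv[1], "_IONBF") == 0) {
        mode = _IONBF;
    } else if (strcmp(argv[1], "_IOLBF") == 0) {
        mode = _IOLBF;
    } else {
        mode = _IOFBF;
    }

    ifp = fopen("infile", "r");
    ofp = fopen("outfile", "w");
    if (ifp == NULL || ofp == NULL) {
        perror("fopen");
        return 1;
    }

    /* Select the buffering mode before any I/O takes place. */
    setvbuf(ifp, ibuf, mode, sizeof(ibuf));
    setvbuf(ofp, obuf, mode, sizeof(obuf));

    time(&time1);
    while ((c = fgetc(ifp)) != EOF) {
        fputc(c, ofp);
    }
    time(&time2);
    fprintf(stderr, "Time for %s was %ld seconds\n", argv[1],
            (long)(time2 - time1));
    fclose(ifp);
    fclose(ofp);
    return 0;
}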
The input file has 68,000 lines of 80 characters each. When the program is run with
the different buffering options, the following results are observed:
$ ls -l infile
-rw-r--r-- 1 spate fcf 5508000 Jun 29 15:38 infile
$ wc -l infile
68000 infile
$ ./fpcp _IONBF
Time for _IONBF was 35 seconds
$ ./fpcp _IOLBF
Time for _IOLBF was 3 seconds
$ ./fpcp _IOFBF
Time for _IOFBF was 2 seconds
The reason for such a huge difference in performance can be seen by the number
of system calls that each option results in. For unbuffered I/O, each call to
fgetc() or fputc() produces a system call to read() or write(). Because the file
contains 5,508,000 bytes, this amounts to roughly 5.5 million reads and 5.5 million
writes! The system call pattern seen for unbuffered is as follows:
..
open("infile", O_RDONLY) =
open("outfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) =
time([994093607]) = 994093607
read(3, "0", 1) =
write(4, "0", 1) =
read(3, "1", 1) =
write(4, "1", 1) =
..
For line buffered, the number of system calls is reduced dramatically as the
system call pattern below shows. Note that data is still read in buffer-sized
chunks.
..
open("infile", O_RDONLY) =
open("outfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) =
time([994093688]) = 994093688
read(3, "01234567890123456789012345678901"..., 16384) = 16384
write(4, "01234567890123456789012345678901"..., 81) = 81
write(4, "01234567890123456789012345678901"..., 81) = 81
write(4, "01234567890123456789012345678901"..., 81) = 81
..
For the fully buffered case, all data is read and written in buffer size (16384 bytes)
chunks, reducing the number of system calls further as the following output
shows:
open("infile", O_RDONLY) =
open("outfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) =
read(3, "67890123456789012345678901234567"..., 4096) = 4096
write(4, "01234567890123456789012345678901"..., 4096) = 4096
read(3, "12345678901234567890123456789012"..., 4096) = 4096
write(4, "67890123456789012345678901234567"..., 4096) = 4096
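The difference between the three modes can be estimated with simple arithmetic. The sketch below computes how many calls are needed to transfer a file of a given size when each call moves at most a given number of bytes; it ignores the final read() that returns 0 at end of file. The function name calls_needed() is an assumption made for this example.

```c
#include <stdio.h>

/*
 * Number of read() or write() calls needed to transfer file_size
 * bytes when each call moves at most chunk bytes (rounding up).
 * Returns -1 for invalid arguments.
 */
long calls_needed(long file_size, long chunk)
{
    if (file_size < 0 || chunk <= 0)
        return -1;
    return (file_size + chunk - 1) / chunk;
}
```

For the 5,508,000-byte input file used above, this gives 5,508,000 calls when one byte is moved at a time, 68,000 calls for 81-byte lines, 1,345 calls for 4096-byte chunks, and 337 calls for 16384-byte chunks.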
Seeking through the Stream
Just as the lseek() system call can be used to set the file pointer in preparation
for a subsequent read or write, the fseek() library function can be called to set
the file pointer for the stream such that the next read or write will start from that
offset.
#include <stdio.h>
int fseek(FILE *stream, long int offset, int whence);
The offset and whence arguments are identical to those supported by the
lseek() system call. The following example shows the effect of calling
fseek() on the file stream:
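#include <stdio.h>

/* Relies on glibc's FILE layout (_IO_buf_base, _IO_read_ptr); not portable. */
int main(void)
{
    FILE *fp;
    int c;      /* int rather than char so that EOF can be represented */

    fp = fopen("infile", "r");
    if (fp == NULL) {
        perror("infile");
        return 1;
    }
    printf("Address of fp = 0x%lx\n", (unsigned long)fp);
    printf(" fp->_IO_buf_base = 0x%lx\n", (unsigned long)fp->_IO_buf_base);
    printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);

    c = getc(fp);                   /* first read fills the buffer */
    printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
    fseek(fp, 8192, SEEK_SET);      /* seek beyond the buffered data */
    printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
    c = getc(fp);                   /* buffer must be refilled */
    printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
    return 0;
}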
By calling getc(), a 4KB read is used to fill up the buffer pointed to by
_IO_buf_base. Because only a single character is returned by getc(), the read
pointer is only advanced by one. The call to fseek() modifies the read pointer as
shown below:
$ fpseek
Address of fp = 0x80497e0
fp->_IO_buf_base = 0x0
fp->_IO_read_ptr = 0x0
fp->_IO_read_ptr = 0x40019001
fp->_IO_read_ptr = 0x40019000
fp->_IO_read_ptr = 0x40019001
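Note that the seek resets the read pointer to the start of the buffer, and that the
second call to getc() must read from the file again because offset 8192 lies
outside the 4KB of data already buffered. Here are the relevant system calls: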
open("infile", O_RDONLY) =
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) =
read(3, "01234567890123456789012345678901"..., 4096) = 4096
write(1, ...) # display _IO_read_ptr
_llseek(3, 8192, [8192], SEEK_SET) =
write(1, ...) # display _IO_read_ptr
read(3, "12345678901234567890123456789012"..., 4096) = 4096
write(1, ...) # display _IO_read_ptr
The first call to getc() results in the call to read(). Seeking through the stream
results in a call to lseek(), which also resets the read pointer. The second call to
getc() then involves another call to read data from the file.
There are four other functions available that relate to the file position within the
stream, namely:
#include <stdio.h>
long ftell(FILE *stream);
void rewind(FILE *stream);
int fgetpos(FILE *stream, fpos_t *pos);
int fsetpos(FILE *stream, const fpos_t *pos);
The ftell() function returns the current file position. In the preceding example
following the call to fseek(), a call to ftell() would return 8192. The
rewind() function is equivalent to the following call, except that it also clears
the error indicator for the stream:
fseek(stream, 0, SEEK_SET)
The fgetpos() and fsetpos() functions are equivalent to ftell() and
fseek() (with SEEK_SET passed), but store the current file pointer in the
argument referenced by pos.
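As a brief sketch of how these calls combine, the function below saves the current position with fgetpos(), seeks elsewhere to read a single character, and then restores the saved position with fsetpos(). The function name peek_at() is an assumption made for this example.

```c
#include <stdio.h>

/*
 * Return the character stored at the given byte offset without
 * disturbing the caller's current position in the stream.
 * Returns EOF if the offset is past the end of the file or on error.
 */
int peek_at(FILE *fp, long offset)
{
    fpos_t saved;
    int c;

    if (fgetpos(fp, &saved) != 0)           /* remember where we are */
        return EOF;
    if (fseek(fp, offset, SEEK_SET) != 0)
        return EOF;

    c = getc(fp);

    if (fsetpos(fp, &saved) != 0)           /* go back to the saved position */
        return EOF;
    return c;
}
```

After the call returns, ftell() reports the same offset as before, so the next getc() by the caller continues from where it left off.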
Summary
There are numerous functions provided by the standard I/O library that often
reduce the work of an application writer. By aiming to minimize the number of
system calls, performance of some applications may be considerably improved.
Buffering offers a great deal of flexibility to the application programmer by
allowing finer control over how I/O is actually performed.
This chapter highlighted how the standard I/O library is implemented but
stops short of describing all of the functions that are available. Richard Stevens'
book Advanced Programming in the UNIX Environment [STEV92] provides more
details from a programming perspective. Herbert Schildt's book The Annotated
ANSI C Standard [SCHI93] provides detailed information on the stdio library as
supported by the ANSI C standard.