Linux Kernel Notes
Pramode C.E
1.2. A problem and its solution
There is a file called /usr/share/dict/words.
An anagram is a word formed by rearranging the letters of another word, for example:
top opt pot
The dictionary contains anagrams like these.
Let us try to write a program that shows all groups of anagrams
consisting of 5 words, 4 words, and so on.
To begin with, let us write a program that reads a word
from standard input, for example:
hello
and prints its letters in sorted order (its signature) followed by the word itself:
ehllo hello
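#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Character comparison for qsort(). */
static int cmp(const void *a, const void *b) { return *(const char *)a - *(const char *)b; }
int main(void)
{
    char s[100], t[100];
    while (scanf("%99s", s) == 1) {
        strcpy(t, s);                  /* keep the original word */
        qsort(s, strlen(s), 1, cmp);   /* the signature: letters in sorted order */
        printf("%s %s\n", s, t);
    }
    return 0;
}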
Compile this program, call it sign, and run the command:
cat /usr/share/dict/words | ./sign
A variant of the same command that converts all characters to lower case:
cat /usr/share/dict/words | tr 'A-Z' 'a-z' | ./sign
The next variant adds sorting, so that words with the same signature end up on adjacent lines:
cat /usr/share/dict/words | tr 'A-Z' 'a-z' | ./sign | sort
The next step is a second program, sameline, which prints all words that share a signature on one line:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char prev_sign[100] = "";
    char curr_sign[100], word[100];
    while (scanf("%99s%99s", curr_sign, word) == 2) {
        if (strcmp(prev_sign, curr_sign) == 0) {
            printf("%s ", word);
        } else { /* Signatures differ */
            printf("\n");
            printf("%s ", word);
            strcpy(prev_sign, curr_sign);
        }
    }
    return 0;
}
Run it:
cat /usr/share/dict/words | tr 'A-Z' 'a-z' | ./sign | sort | ./sameline
To pick out the anagram groups consisting of 3 or 4 words, you can use
the standard awk program:
cat /usr/share/dict/words | tr 'A-Z' 'a-z' | ./sign | sort | ./sameline | awk '{if(NF==3)print}'
2.1. The Unix Shell
The shell is one of the basic tools in Unix.
It is usually associated with the 'bash' command, but it can also be 'csh'.
One book on the shell that we recommend is
'The Unix Programming Environment' by Kernighan & Pike.
You should read at least 3-4 chapters of it before starting
anything serious.
To begin with, let us write a script that copies into the directory 'img'
only those pictures whose size exceeds 15 KB:
for i in `find /mydir -name '*.jpg' -size +15k`
do
    cp "$i" img
done
2.2. The C Compiler
For in-depth study we can recommend the book
'Deep C Secrets' (Peter van der Linden).
We also recommend:
'The C Programming Language' by Kernighan and Ritchie.
2.2.2. Options
The 'cc' command is merely a compiler 'driver' or 'front end'.
Its job is to collect command line arguments and pass them on to the four programs
which do the actual compilation (cpp, cc1, as and ld).
The -E option makes 'cc' call only 'cpp'.
The output of the preprocessing phase is displayed on the screen.
The -S option makes 'cc' invoke both 'cpp' and 'cc1'.
What you get is a file with extension '.s', an assembly language program.
The -c option makes 'cc' invoke the first three phases; the output is an object file
with extension '.o'. Typing
cc hello.c -o hello
will result in the output getting stored in a file called 'hello' instead of 'a.out'.
The -Wall option enables a large set of commonly useful warnings
(despite its name, not every warning gcc can produce).
You should always compile your code with -Wall and let the compiler
check your code as thoroughly as possible.
The -pedantic-errors option checks your code for strict ISO conformance.
You must be aware that GCC implements certain extensions to the C language;
if you wish your code to be strict ISO C, you must eliminate the possibility
of such extensions creeping into it.
Here is a small program which demonstrates the idea. We are using the named structure
field initialization extension here, which gcc allows unless -pedantic-errors is provided.
int main(void)
{
    struct complex {int re, im;};
    struct complex c = {im:4, re:5};
}
Here is what gcc says when we use the -pedantic-errors option:
a.c: In function 'main':
a.c:4: ISO C89 forbids specifying structure member to initialize
a.c:4: ISO C89 forbids specifying structure member to initialize
As GCC is the dominant compiler in the free software world, using GCC extensions
is not really a bad idea.
The compiler performs several levels of optimization, which are enabled by the options
-O, -O2 and -O3.
Read the gcc man page and find out which optimizations are enabled by each option.
The -I option is for the preprocessor. If you do
cc a.c -I/usr/proj/include
you are adding the directory /usr/proj/include to the standard preprocessor search path.
The -D option is useful for defining symbols on the command line.
#include <stdio.h>

int main(void)
{
#ifdef DEBUG
    printf("hello");
#endif
    return 0;
}
Try compiling the above program with the option -DDEBUG and without the option.
It is also instructive to do:
cc -E -DDEBUG a.c
cc -E a.c
to see what the preprocessor really does. Note that the Linux kernel code makes heavy use
of preprocessor tricks, so don't skip the part on the preprocessor in K&R.
The -L and -l options are for the linker. If you do
cc a.c -L/usr/X11R6/lib -lX11
the linker tries to combine the object code of your program with the object code contained
in a file called 'libX11.so'; this file will be searched for in the directory /usr/X11R6/lib
in addition to the standard directories like /lib and /usr/lib.
Chapter 3. The System Call Interface
The 'kernel' is the heart of the operating system.
Your Linux system will most probably have a directory called /boot under which
you will find a file whose name might look somewhat like 'vmlinuz'.
This file contains machine code (compiled from the source files under /usr/src/linux)
which gets loaded into memory when you boot your machine.
Once the kernel is loaded into memory, it stays there until you reboot the machine,
overseeing each and every activity going on in the system.
The kernel is responsible for managing hardware resources, scheduling processes,
controlling network communication and so on.
If a user program wants to, say, send data over the network,
it has to interact with the TCP/IP code present within the kernel.
This interaction takes place through special C functions which are called 'system calls'.
Understanding a few elementary system calls is the first step towards understanding Linux.
3.1. Files and Processes
3.1.1. File I/O
The Linux operating system, just like all Unices, takes the concept of a file to dizzying heights.
A file is not merely a few bytes of data residing on disk; it is an abstraction for anything
that can be read from or written to.
Files are manipulated using three fundamental system calls: open, read and write.
A system call is a C function which transfers control to a point within the operating system kernel.
This needs to be elaborated a little bit.
The Linux source tree is usually rooted at /usr/src/linux (though this is not guaranteed).
If you examine the file fs/open.c, you will see a function whose prototype looks like this:
asmlinkage long sys_open(const char *filename,
                         int flags, int mode);
Now, this function is compiled into the kernel and is as such resident in memory.
When the C program which you write calls 'open', control gets transferred
to this function within the operating system kernel.
It is possible to make alterations to this function (or any other),
recompile and install a new kernel; you just have to look through the 'README' file
under /usr/src/linux.
The availability of kernel source provides a multitude of opportunities to the student
and researcher: students can 'see' how abstract operating system principles are implemented
in practice and researchers can make their own enhancements.
Here is a small program which behaves like the copy command.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFLEN 1024

int main(int argc, char *argv[])
{
    int fdr, fdw, n;
    char buf[BUFLEN];

    assert(argc == 3);

    fdr = open(argv[1], O_RDONLY);
    assert(fdr >= 0);
    fdw = open(argv[2], O_WRONLY|O_CREAT|O_TRUNC, 0644);
    assert(fdw >= 0);
    while ((n = read(fdr, buf, sizeof(buf))) > 0)
        if (write(fdw, buf, n) != n) {
            fprintf(stderr, "write error\n");
            exit(1);
        }

    if (n < 0) {
        fprintf(stderr, "read error\n");
        exit(1);
    }

    return 0;
}
Let us look at the important points. We see that 'open' returns an integer 'file descriptor'
which is to be passed as an argument to all other file manipulation functions.
The first file is opened read-only. The second one is opened for writing; we also specify
that we wish to truncate the file (to zero length) if it exists.
We are going to create the file if it does not exist, and hence we pass a creation mode
(octal 644: user read/write, group and others read) as the last argument.
The read system call returns the actual number of bytes read;
the return value is 0 if EOF is reached and -1 in case of errors.
The write system call returns the number of bytes written, which should be equal to the number
of bytes which we have asked to write.
Note that there are subtleties with write.
The write system call simply 'schedules' data to be written: it returns without verifying
that the data has actually been transferred to the disk.
3.1.2. Process creation with 'fork'
The fork system call creates an exact replica (in memory) of the process which executes the call.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    fork();
    printf("hello\n");
    return 0;
}
You will see that the program prints hello twice. Why?
After the call to 'fork', we will have two processes in memory: the original process
which called 'fork' (the parent process) and the clone which fork has created (the child process).
Lines after the fork will be executed by both the parent and the child.
Fork is a peculiar function: it seems to return twice.
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pid;

    pid = fork();
    assert(pid >= 0);
    if (pid == 0) printf("I am child\n");
    else printf("I am parent\n");
    return 0;
}
This is quite an amazing program to anybody who is not familiar with the working of fork.
Both the 'if' part and the 'else' part seem to get executed.
The explanation is that the two parts are executed by two different processes.
Fork returns 0 in the child process and the process id of the child in the parent process.
It is important to note that the parent and the child are replicas: both the code and the data
of the parent get duplicated in the child. The only difference is that the parent takes the else branch
and the child takes the if branch.
3.1.3. Sharing files
It is important to understand how a fork affects open files.
Let us play with some simple programs.
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf1[] = "hello", buf2[] = "world";
    int fd1, fd2;

    fd1 = open("dat", O_WRONLY|O_CREAT, 0644);
    assert(fd1 >= 0);
    fd2 = open("dat", O_WRONLY|O_CREAT, 0644);
    assert(fd2 >= 0);

    write(fd1, buf1, strlen(buf1));
    write(fd2, buf2, strlen(buf2));
    return 0;
}
After running the program, we note that the file 'dat' contains the string 'world'.
This demonstrates that calling open twice lets us manipulate the file independently
through two descriptors. The behaviour is similar when we open and write to the file
from two independent programs.
Every running process has a per-process file descriptor table associated with it;
the value returned by open is simply an index into this table.
Each slot of the per-process file descriptor table contains a pointer to a kernel file table entry,
which contains:
1. the file status flags (read, write, append etc)
2. a pointer to the v-node table entry for the file
3. the current file offset
What does the v-node contain? It is a data structure which contains, amongst other things,
the information needed to locate the data blocks of the file on the disk.
Figure 3-1. Opening a file twice
Note that the two descriptors point to two different kernel file table entries,
but both file table entries point to the same v-node structure.
The consequence is that writes to both descriptors result in data getting written to the same file.
Because the offset is maintained in the kernel file table entry, the two offsets are completely independent:
the first write results in the offset field of the kernel file table entry pointed to by slot 3
of the file descriptor table getting changed to five (the length of the string 'hello').
The second write again starts at offset 0, because slot 4 of the file descriptor table
points to a different kernel file table entry.
What happens to open file descriptors after a fork? Let us look at another program.
#include "myhdr.h"   /* the author's header; assumed to pull in the standard headers used here */

int main(void)
{
    char buf1[] = "hello";
    char buf2[] = "world";
    int fd;

    fd = open("dat", O_WRONLY|O_CREAT|O_TRUNC, 0644);
    assert(fd >= 0);

    write(fd, buf1, strlen(buf1));
    if (fork() == 0) write(fd, buf2, strlen(buf2));
    return 0;
}
We note that 'open' is called only once.
The parent process writes 'hello' to the file. The child process uses the same descriptor
to write 'world'. We examine the contents of the file after the program exits.
We find that the file contains 'helloworld'.
The 'open' system call creates an entry in the kernel file table, stores the address of that entry
in the process file descriptor table and returns the index.
The 'fork' results in the child process inheriting the parent's file descriptor table.
The slot indexed by 'fd' in both the parent's and the child's file descriptor table contains a pointer
to the same file table entry, which means that the offset is shared by both processes.
This explains the behaviour of the program.
3.1.4. The 'exec' system call
Let's look at a small program:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    execlp("ls", "ls", (char *)NULL);
    printf("Hello\n");
    return 0;
}
The program executes the 'ls' command, but we see no trace of a 'Hello' anywhere on the screen.
What's up?
The 'exec' family of functions performs 'program loading'.
If exec succeeds, it replaces the memory image of the currently executing process
with the memory image of 'ls'; that is, exec has no place to return to if it succeeds!
The first argument to execlp is the name of the command to execute.
The subsequent arguments form the command line arguments of the execed program
(that is, they will be available as argv[0], argv[1] etc in the execed program).
The list should be terminated by a null pointer.
What happens to an open file descriptor after an exec?
That is what the following program tries to find out.
We first create a program called 't.c' and compile it into a file called 't'.
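#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char buf[] = "world";
    int fd;

    assert(argc == 2);
    fd = atoi(argv[1]);   /* descriptor number passed in by the parent */
    printf("got descriptor %d\n", fd);
    write(fd, buf, strlen(buf));
    return 0;
}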
The program receives a file descriptor as a command line argument and then executes
a write on that descriptor. We will now write another program, 'forkexec.c',
which will fork and exec this program.
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd;
    char buf[] = "hello";
    char s[10];

    fd = open("dat", O_WRONLY|O_CREAT|O_TRUNC, 0644);
    assert(fd >= 0);
    sprintf(s, "%d", fd);          /* pass the descriptor number as text */
    write(fd, buf, strlen(buf));
    if (fork() == 0) {
        execl("./t", "t", s, (char *)NULL);
        fprintf(stderr, "exec failed\n");
    }
    return 0;
}
What would be the contents of the file 'dat' after this program is executed?
We note that it is 'helloworld'. This demonstrates the fact that the file descriptor
is not closed during the exec.
3.1.5. The 'dup' system call
You might have observed that the value of the file descriptor returned by 'open' is at least 3.
Why?
The Unix shell, before forking and exec'ing your program, had opened the console three times,
on descriptors 0, 1 and 2.
Standard library functions which write to 'stdout' are guaranteed to invoke the 'write' system call
with a descriptor value of 1, while those functions which write to 'stderr' and read from 'stdin'
invoke 'write' and 'read' with descriptor values 2 and 0 respectively.
This behaviour is vital for the proper working of standard I/O redirection.
The 'dup' system call 'duplicates' the descriptor which it gets as its argument on the lowest
unused descriptor in the per-process file descriptor table.
#include "myhdr.h"

int main(void)
{
    int fd;

    fd = open("dat", O_WRONLY|O_CREAT|O_TRUNC, 0644);
    close(1);
    dup(fd);
    printf("hello\n");
    return 0;
}
Note that after the dup, file descriptor 1 refers to whatever 'fd' is referring to.
The 'printf' function invokes the write system call with descriptor value equal to 1,
with the result that the message gets 'redirected' to the file 'dat' and does not appear on the screen.
3.2. The 'process' file system
The /proc directory of your Linux system is very interesting.
The files (and directories) present under /proc are not really disk files.
Here is what you will see if you do a 'cat /proc/interrupts':
           CPU0
  0:     296077          XT-PIC  timer
  1:       3514          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  4:       6385          XT-PIC  serial
  5:         15          XT-PIC  usb-ohci, usb-ohci
  8:          1          XT-PIC  rtc
 11:     337670          XT-PIC  nvidia, NVIDIA nForce Audio
 14:      11765          XT-PIC  ide0
 15:     272508          XT-PIC  ide1
NMI:          0
LOC:          0
ERR:          0
MIS:          0
By reading from (or writing to) files under /proc, you are in fact accessing data structures
present within the Linux kernel: /proc exposes a part of the kernel to manipulation using standard
text processing tools.
You can try 'man proc' and learn more about the process information pseudo file system.
Chapter 4. Defining New System Calls
This will be our first kernel hack, mostly because it is extremely simple to implement.
We shall examine the process of adding new system calls to the Linux kernel.
Along the way, we will learn something about building new kernels
and one or two things about the very nature of the Linux kernel itself.
Note that we are dealing with Linux kernel version 2.4.
Please note that making modifications to the kernel and installing modified kernels
can lead to system hangs and data corruption and should not be attempted on production systems.
4.1. What happens during a system call?
In one word: magic. It is difficult to understand the actual sequence of events
which take place during a system call without having an intimate understanding
of the processor on which the kernel is running,
say the Intel 386+ family of CPUs. CPUs with built-in memory management units (MMUs)
implement various levels of 'protection' in hardware.
The body of code which interacts intimately with the machine hardware forms the OS kernel;
it runs at a very high privilege level.
The code which runs as part of the kernel has permission to do anything:
read from and write to I/O ports, manage interrupts, control Direct Memory Access (DMA) transfers,
execute 'privileged' CPU instructions and so on.
User programs run at a very low privilege level
and are not really capable of doing any 'low-level' stuff other than reading and writing I/O ports
(and on Linux even that requires the kernel to grant access first, for example through ioperm).
User programs have to 'enter' the kernel whenever they want service from hardware devices
(say read from disk, keyboard etc).
System calls form well defined 'entry points' through which user programs can get into the kernel.
Whenever a user program invokes a system call, a few lines of assembly code execute,
which take care of switching from low-privileged user mode to high-privileged kernel mode.
4.2. A simple system call
Let's go to the /usr/src/linux/fs subdirectory and create a file called 'mycall.c'.
/* /usr/src/linux/fs/mycall.c */
#include <linux/linkage.h>

asmlinkage void sys_zap(void)
{
    printk("This is Zap from kernel...\n");
}
The Linux kernel convention is that system calls are prefixed with sys_.
'asmlinkage' is a preprocessor macro which is defined
in /usr/src/linux/include/linux/linkage.h and seems to be essential for defining system calls.
The system call simply prints a message using the kernel function 'printk',
which is somewhat similar to the C library function 'printf'.
(Note that the kernel can't make use of the standard C library;
it has its own implementation of most simple C library functions.)
It is essential that this file gets compiled into the kernel,
so you have to make some alterations to the 'Makefile'.
obj-y:=open.o read_write.o devices.o file_table.o buffer.o \
super.o block_dev.o char_dev.o stat.o exec.o pipe.o namei.o \
fcntl.o ioctl.o readdir.o select.o fifo.o locks.o \
dcache.o inode.o attr.o bad_inode.o file.o iobuf.o dnotify.o \
filesystems.o namespace.o seq_file.o mycall.o
ifeq ($(CONFIG_QUOTA),y)
obj-y += dquot.o
Note the line containing 'mycall.o'.
Once this change is made, we have to examine the file /usr/src/linux/arch/i386/kernel/entry.S.
This file defines a table of system calls -
we add our own syscall at the end.
Each system call has a number of its own, which is basically an index into this table -
ours is numbered 239.
    .long SYMBOL_NAME(sys_ni_syscall)
    .long SYMBOL_NAME(sys_exit)
    .long SYMBOL_NAME(sys_fork)
    .long SYMBOL_NAME(sys_read)
    .long SYMBOL_NAME(sys_write)
    .long SYMBOL_NAME(sys_open)

    /* Lots of lines deleted */
    .long SYMBOL_NAME(sys_ni_syscall)
    .long SYMBOL_NAME(sys_tkill)
    .long SYMBOL_NAME(sys_zap)

    .rept NR_syscalls-(.-sys_call_table)/4
    .long SYMBOL_NAME(sys_ni_syscall)
    .endr
We will also add a line
#define __NR_zap 239
to /usr/src/linux/include/asm/unistd.h. We are now ready to go.
We have made all necessary modifications to our kernel.
We now have to rebuild it. This can be done by typing, in sequence:
make menuconfig
make dep
make bzImage
A new kernel called 'bzImage' will be available under /usr/src/linux/arch/i386/boot.
You have to copy this to a directory such as /boot. Remember not to overwrite the kernel
which you are currently running:
if there is some problem with your modified kernel, you should be able to fall back
to your functional kernel. You will have to add the name of this kernel to a boot loader
configuration file (if you are using lilo, then /etc/lilo.conf) and run some command like 'lilo'.
Here is the /etc/lilo.conf which we are using:
prompt
timeout=50
default=linux
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
message=/boot/message
lba32
vga=0xa

image=/boot/vmlinuz-2.4.18-3
    label=linux
    read-only
    append="hdd=ide-scsi"
    root=/dev/hda3

image=/boot/nov22-ker
    label=syscall-hack
    read-only
    root=/dev/hda3

other=/dev/hda1
    optional
    label=DOS

other=/dev/hda2
    optional
    label=FreeBSD
The default kernel is /boot/vmlinuz-2.4.18-3.
The modified kernel is called /boot/nov22-ker.
Note that you have to type 'lilo' after modifying /etc/lilo.conf.
If you are using something like 'Grub', consult the man pages and make the necessary modifications.
You can now reboot the system and load the new Linux kernel. You then write a C program:
#include <unistd.h>

int main(void)
{
    syscall(239);   /* 239 is the table index we assigned to sys_zap */
    return 0;
}
And you will see the message 'This is Zap from kernel...' on the screen.
(Note that if you are running something like an xterm, you may not see the message on the screen;
you can then use the 'dmesg' command. We will explore printk and message logging in detail later.)
You should try one experiment if you don't mind your machine hanging.
Place an infinite loop in the body of sys_zap; a 'while(1);' would do.
What happens when you invoke sys_zap? Is the Linux kernel capable of preempting itself?
Chapter 5. Module Programming Basics
The next few chapters will cover the basics of writing kernel modules.
Our discussion will be centred around the Linux kernel version 2.4.
As this is an 'introductory' look at Linux systems programming,
we shall skip material which might confuse a novice reader,
especially topics related to portability between various kernel versions and machine architectures,
SMP issues and error handling.
Please understand that these are vital issues which should be dealt with when writing professional
code. The reader who gets motivated to learn more should refer to the excellent book 'Linux Device Drivers'
by Alessandro Rubini and Jonathan Corbet.
5.1. What is a kernel module?
A kernel module is simply an object file which can be inserted into the running Linux kernel -
perhaps to support a particular piece of hardware or to implement new functionality.
The ability to dynamically add code to the kernel is very important -
it helps the driver writer to skip the install-new-kernel-and-reboot cycle;
it also helps to make the kernel lean and mean.
You can add a module to the kernel whenever you want certain functionality -
once that is over, you can remove the module from kernel space, freeing up memory.
5.2. Our First Module
#include <linux/module.h>

int init_module(void)
{
    printk("Module Initializing...\n");
    return 0;
}

void cleanup_module(void)
{
    printk("Cleaning up...\n");
}
Compile the program using the command line:
cc -c -O -DMODULE -D__KERNEL__ module.c -I/usr/src/linux/include
You will get a file called 'module.o'. You can now type:
insmod ./module.o
and your module gets loaded into kernel address space.
You can see that your module has been added, either by typing
lsmod
or by examining /proc/modules. The 'init_module' function is called after the module has been loaded;
you can use it for performing whatever initializations you want.
The 'cleanup_module' function is called when you type:
rmmod module
That is, when you attempt to remove the module from kernel space.
5.3. Accessing kernel data structures
The code which you write as a module runs as part of the Linux kernel,
and is capable of manipulating data structures defined in the kernel.
Here is a simple program which demonstrates the idea.
#include <linux/module.h>
#include <linux/sched.h>

int init_module(void)
{
    printk("hello\n");
    printk("name=%s\n", current->comm);
    printk("pid=%d\n", current->pid);
    return 0;
}

void cleanup_module(void) { printk("world\n"); }

/* Look at /usr/src/linux/include/asm/current.h,
 * especially, the macro implementation of current
 */
The init_module function is called by the 'insmod' command after the module is loaded into the kernel.
You can think of 'current' as a globally visible pointer to a structure.
The 'comm' and 'pid' fields of this structure give you the command name as well
as the process id of the 'currently executing' process (which, in this case, is 'insmod' itself).
Every now and then, it would be good to browse through the header files
which you are including in your program and look for 'creative' uses of preprocessor macros.
Here is /usr/src/linux/include/asm/current.h for your reading pleasure!
#ifndef _I386_CURRENT_H
#define _I386_CURRENT_H
struct task_struct;
static inline struct task_struct * get_current(void)
{
struct task_struct *current;
__asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
return current;
}
#define current get_current()
#endif /* !(_I386_CURRENT_H) */
'current' is in fact a function which, using some inline assembly magic, retrieves the address
of an object of type 'struct task_struct' and returns it to the caller.
5.4. Symbol Export
The global variables defined in your module are accessible from other parts of the kernel.
Let's compile and load the following module:
#include <linux/module.h>

int foo_baz = 101;

int init_module(void) { printk("hello\n"); return 0; }
void cleanup_module(void) { printk("world\n"); }
Now, either run the 'ksyms' command or look into the file /proc/ksyms.
This file contains all symbols which are 'exported' in the Linux kernel;
you should find 'foo_baz' in the list.
Once we remove the module, recompile it and reload it with foo_baz declared as a 'static' variable,
we won't be able to see foo_baz in the kernel symbol listing.
Modules may sometimes 'stack over' each other; that is, one module will make use of the functions
and variables defined in another module.
Let's check whether this works.
We compile and load another module, in which we try to print the value of the variable foo_baz.
#include <linux/module.h>

extern int foo_baz;

int init_module(void)
{
    printk("foo_baz=%d\n", foo_baz);
    return 0;
}

void cleanup_module(void) { printk("world\n"); }
The module gets loaded and the init_module function prints 101.
It would be interesting to try and delete the module in which foo_baz was defined.
The 'modprobe' command is used for automatically locating and loading all modules on which
a particular module depends -
it simplifies the job of the system administrator.
You may like to go through the file /lib/modules/2.4.18-3/modules.dep
(note that your kernel version number may be different).
5.5. Usage Count
#include <linux/module.h>

int init_module(void)
{
    MOD_INC_USE_COUNT;
    printk("hello\n");
    return 0;
}

void cleanup_module(void) { printk("world\n"); }
After loading the program as a module, what if you try to 'rmmod' it?
We get an error message. The output of 'lsmod' shows the use count to be 1.
A module should not be accidentally removed while it is being used by a process.
Modern kernels can automatically track the usage count,
but it will sometimes be necessary to adjust the count manually.
5.6. User defined names to initialization and cleanup functions
The initialization and cleanup functions need not be called init_module() and cleanup_module().
#include <linux/module.h>
#include <linux/init.h>

int foo_init(void) { printk("hello\n"); return 0; }
void foo_exit(void) { printk("world\n"); }

module_init(foo_init);
module_exit(foo_exit);
Note that the macros placed at the end of the source file, module_init() and module_exit(),
perform the 'magic' required to make foo_init and foo_exit act as the initialization and cleanup functions.
5.7. Reserving I/O Ports
A driver needs some way to tell the kernel that it is manipulating some I/O ports,
and well-behaved drivers need to check whether some other driver is using the I/O ports
which they intend to use. Note that what we are looking at is a pure software solution:
there is no way that you can reserve a range of I/O ports for a particular module in hardware.
Here is the content of the file /proc/ioports on my machine running Linux kernel 2.4.18:
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial(auto)
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0cf8-0cff : PCI conf1
5000-500f : PCI device 10de:01b4 (nVidia Corporation)
5100-511f : PCI device 10de:01b4 (nVidia Corporation)
5500-550f : PCI device 10de:01b4 (nVidia Corporation)
b800-b80f : PCI device 10de:01bc (nVidia Corporation)
  b800-b807 : ide0
  b808-b80f : ide1
e000-e07f : PCI device 10de:01b1 (nVidia Corporation)
e100-e1ff : PCI device 10de:01b1 (nVidia Corporation)
The content can be interpreted in this way: the serial driver is using ports
in the range 0x2f8 to 0x2ff, the hard disk driver is using 0x376 and 0x3f6, and so on.
Here is a program which checks whether a particular range of I/O ports is being used
by any other module, and if not reserves that range for itself.
#include <linux/module.h>
#include <linux/ioport.h>

int init_module(void)
{
    int err;

    /* Bail out if another driver already owns ports 0x300-0x304. */
    if ((err = check_region(0x300, 5)) < 0)
        return err;
    request_region(0x300, 5, "foobaz");
    return 0;
}

void cleanup_module(void)
{
    release_region(0x300, 5);
    printk("world\n");
}
You should examine /proc/ioports once again after loading this module.
5.8. Passing parameters at module load time
It may sometimes be necessary to set the value of certain variables within the module at load time.
Take the case of an old ISA network card: the module has to be told the I/O base of the network card.
We do it by typing:
insmod ne.o io=0x300
Here is an example module where we pass the value of the variable foo_dat at module load time.
#include <linux/module.h>

int foo_dat = 0;
MODULE_PARM(foo_dat, "i");

int init_module(void)
{
    printk("hello\n");
    printk("foo_dat = %d\n", foo_dat);
    return 0;
}

void cleanup_module(void) { printk("world\n"); }

/* Type insmod ./k.o foo_dat=10. If the parameter name
 * is misspelled, we get an error message.
 */
The MODULE_PARM macro announces that foo_dat is of type integer and can be provided a value
at module load time, on the command line.
Five types are currently supported: b for one byte, h for two bytes, i for integer,
l for long and s for string.
Chapter 6. Character Drivers
Device drivers are classified into character, block and network drivers.
The simplest to write and understand is the character driver,
so we shall start with that.
Note that we will not attempt any kind of actual hardware interfacing at this stage; we will do it later.
Before we proceed any further, you should refresh whatever you have learnt about
the file handling system calls (open, read, write etc) and the way file descriptors are shared
between parent and child processes.
6.1. Special Files
Go to the /dev directory and try 'ls -l'. Here is the output on our machine:
total 170
crw-------  1 root root  10,  10 Apr 11  2002 adbmouse
crw-r--r--  1 root root  10, 175 Apr 11  2002 agpgart
crw-------  1 root root  10,   4 Apr 11  2002 amigamouse
crw-------  1 root root  10,   7 Apr 11  2002 amigamouse1
crw-------  1 root root  10, 134 Apr 11  2002 apm_bios
drwxr-xr-x  2 root root     4096 Oct 14 20:16 ataraid
crw-------  1 root root  10,   5 Apr 11  2002 atarimouse
crw-------  1 root root  10,   3 Apr 11  2002 atibm
crw-------  1 root root  10,   3 Apr 11  2002 atimouse
crw-------  1 root root  14,   4 Apr 11  2002 audio
crw-------  1 root root  14,  20 Apr 11  2002 audio1
crw-------  1 root root  14,   7 Apr 11  2002 audioctl
brw-rw----  1 root disk  29,   0 Apr 11  2002 aztcd
crw-------  1 root root  10, 128 Apr 11  2002 beep
Note that the permissions field begins, in most cases, with the character 'c'.
We have a 'd' against one name and a 'b' against another.
A file whose permission field starts with a 'c' is called a character special file
and one which starts with 'b' is a block special file.
These files don't have sizes; instead they have what are called major and minor numbers.
They are not files in the sense that they don't represent streams of data on a disk;
they are mostly abstractions of peripheral devices.
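The same information that 'ls -l' prints can be obtained from a program through the stat system call. Here is a small user-space sketch; the helper name special_file_info is our own, and we assume a glibc-style system where the major() and minor() macros come from sys/sysmacros.h.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Classify a path: returns 'c' for a character special file, 'b' for a
 * block special file, '-' for anything else and '?' if stat() fails.
 * For special files, *maj_out and *min_out receive the device numbers. */
char special_file_info(const char *path, unsigned int *maj_out,
                       unsigned int *min_out)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return '?';
    if (S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode)) {
        *maj_out = major(st.st_rdev);   /* st_rdev holds both numbers */
        *min_out = minor(st.st_rdev);
        return S_ISCHR(st.st_mode) ? 'c' : 'b';
    }
    return '-';
}
```

Calling it on /dev/null on a typical Linux machine should report a character special file with major 1 and minor 3.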
Let's suppose that you execute the command
echo hello > /dev/lp0
Had lp0 been an ordinary file, the string 'hello' would have appeared within it.
But you observe that if you have a printer connected to your machine and if it is turned on,
'hello' gets printed on the paper.
Thus, lp0 is acting as some kind of 'access point' through which you can talk to your printer.
The choice of the file as a mechanism to define access points to peripheral devices
is perhaps one of the most significant (and powerful) ideas popularized by Unix.
How is it that a 'write' to /dev/lp0 results in characters getting printed on paper?
Let's think of it this way.
The kernel contains some routines (loaded as a module) for initializing a printer,
writing data to it, reading back error messages etc.
These routines form the 'printer device driver'. Let's suppose that these routines are called:
printer_open
printer_read
printer_write
Now, the device driver programmer loads these routines into kernel memory, either statically linked
with the kernel or dynamically as a module. Let's suppose that the driver programmer stores the addresses
of these routines in some kind of a structure (which has fields of type 'pointer to function',
whose names are, say, 'open', 'read' and 'write').
Let's also suppose that the address of this structure is 'registered' in a table within the kernel,
say at index 254. Now, the driver writer creates a 'special file' using the command:
mknod printer c 254 0
An 'ls -l printer' displays:
crw-r--r-- 1 root root 254, 0 Nov 26 08:15 printer
What happens when you attempt to write to this file?
The 'write' system call understands that 'printer' is a special file,
so it extracts the major number (which is 254) and indexes a table in kernel memory
(the very same table into which the driver programmer has stored the address of the structure
containing pointers to driver routines) from where it gets the address of a structure.
Write then simply calls the function whose address is stored in the 'write' field of this structure,
thereby invoking 'printer_write'. That's all there is to it, conceptually.
Before we write to a file, we will have to 'open' it.
The 'open' system call behaves in a similar manner, ultimately executing 'printer_open'.
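The lookup described above can be modelled in ordinary user-space C. This is only a sketch of the idea, not kernel code: the names fops_model, chrdev_table, model_register and model_write are made up for illustration, and the real kernel structures carry far more detail.

```c
#include <stddef.h>

/* Hypothetical stand-in for the structure of function pointers. */
struct fops_model {
    int (*open)(void);
    int (*read)(char *buf, int count);
    int (*write)(const char *buf, int count);
};

#define MAX_MAJOR 255
static struct fops_model *chrdev_table[MAX_MAJOR + 1];

/* Store the driver's structure in slot 'major'; -1 if the slot is taken. */
int model_register(int major, struct fops_model *f)
{
    if (major < 0 || major > MAX_MAJOR || chrdev_table[major] != NULL)
        return -1;
    chrdev_table[major] = f;
    return 0;
}

/* What a write on a special file boils down to: index by major, then call. */
int model_write(int major, const char *buf, int count)
{
    struct fops_model *f;

    if (major < 0 || major > MAX_MAJOR)
        return -1;
    f = chrdev_table[major];
    if (f == NULL || f->write == NULL)
        return -1;
    return f->write(buf, count);
}

int printer_written;    /* bytes "printed" so far */

int printer_write(const char *buf, int count)
{
    (void)buf;          /* a real driver would push these bytes to hardware */
    printer_written += count;
    return count;
}

struct fops_model printer_fops = { .write = printer_write };
```

After model_register(254, &printer_fops), a call to model_write(254, "hello", 5) ends up in printer_write, which is the whole trick in miniature.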
Let's put these ideas to the test. Look at the following program:
#include <linux/module.h>
#include <linux/fs.h>

static struct file_operations fops = {
    open: NULL,
    read: NULL,
    write: NULL,
};

static char *name = "foo";
static int major;

int init_module(void)
{
    /* Major number 0 asks the kernel to pick a free slot for us */
    major = register_chrdev(0, name, &fops);
    printk("Registered, got major = %d\n", major);
    return 0;
}

void cleanup_module(void)
{
    printk("Cleaning up...\n");
    unregister_chrdev(major, name);
}
We are not defining any device manipulation functions at this stage;
we simply create a variable of type 'struct file_operations' and initialize some of its fields to NULL.
Note that we are using the GCC structure initialization extension to the C language.
We then call the function
register_chrdev(0, name, &fops);
The first argument to register_chrdev is a major number (i.e., the slot of a table in kernel memory
where we are going to put the address of the structure).
We are using the special number 0 here, which asks register_chrdev to identify an unused slot
and put the address of our structure there;
the slot index is returned by register_chrdev. During cleanup, we 'unregister' our driver.
We compile this program into a file called 'a.o' and load it.
Here is what /proc/devices looks like after loading this module:
Character devices:
  1 mem
  2 pty
  3 ttyp
  4 ttyS
-----Many Lines Deleted---
140 pts
141 pts
142 pts
143 pts
162 raw
180 usb
195 nvidia
254 foo

Block devices:
  1 ramdisk
  2 fd
  3 ide0
  9 md
 12 unnamed
 14 unnamed
 22 ide1
 38 unnamed
 39 unnamed
Note that our driver has been registered with the name 'foo'; its major number is 254.
We will now create a special file called, say, 'foo' (the name can be anything,
what matters is the major number).
mknod foo c 254 0
Let's now write a small program to test our dummy driver.
#include "myhdr.h"

int main()
{
    int fd, retval;
    char buf[] = "hello";

    fd = open("foo", O_RDWR);
    if (fd < 0) {
        perror("");
        exit(1);
    }
    printf("fd = %d\n", fd);
    retval = write(fd, buf, sizeof(buf));
    printf("write retval=%d\n", retval);
    if (retval < 0) perror("");
    retval = read(fd, buf, sizeof(buf));
    printf("read retval=%d\n", retval);
    if (retval < 0) perror("");
    return 0;
}
Here is the output of running the above program (note that we are not showing the messages
coming from the kernel):
fd = 3
write retval=-1
Invalid argument
read retval=-1
Invalid argument
Let's try to interpret the output. The 'open' system call, upon realizing that our file is a special file,
looks up the table in which we have registered our driver routines (using the major number as an index).
It gets the address of a structure and sees that the 'open' field of the structure is NULL.
Open assumes that the device does not require any initialization sequence,
so it simply returns to the caller.
Open performs some other tricks too. It builds up a structure (of type 'file')
and stores certain information (like the current offset into the file, which would be zero initially)
in it. A field of this structure will be initialized with the address of the structure
which holds pointers to driver routines.
Open stores the address of this object (of type 'file') in a slot in the per-process
file descriptor table and returns the index of this slot as a 'file descriptor' back
to the calling program.
Now what happens during
write(fd, buf, sizeof(buf));
The write system call uses the value in fd to index the file descriptor table.
From there it gets the address of an object of type 'file';
one field of this object contains the address of a structure which contains pointers
to driver routines. Write examines this structure and realizes that the 'write' field
of the structure is NULL, so it immediately goes back to the caller with a negative return value,
the logic being that a driver which does not define a 'write' can't be written to.
The application program gets -1 as the return value;
calling perror() helps it find out the nature of the error
(there is a little bit of 'magic' here which we intentionally leave out of our discussion).
Similar is the case with read.
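The chain from file descriptor to driver routine can also be imitated in user space. The sketch below is a toy model under our own naming (drv_ops, open_file, sim_open and sim_write are all hypothetical); it only mirrors the two steps described above and the 'Invalid argument' outcome for a missing write method.

```c
#include <errno.h>
#include <stddef.h>

struct open_file;

/* Hypothetical, cut-down version of the driver's table of routines. */
struct drv_ops {
    int (*write)(struct open_file *f, const char *buf, int count);
};

/* Hypothetical counterpart of the kernel's per-open 'file' object. */
struct open_file {
    long f_pos;              /* current offset, zero after open */
    struct drv_ops *f_op;    /* points back at the driver routines */
};

#define NFILES 16
static struct open_file files[NFILES];
static struct open_file *fd_table[NFILES];   /* per-process table */

/* Build a file object, park it in the first free slot, return the slot. */
int sim_open(struct drv_ops *driver)
{
    int fd;

    for (fd = 0; fd < NFILES; fd++) {
        if (fd_table[fd] == NULL) {
            files[fd].f_pos = 0;
            files[fd].f_op = driver;
            fd_table[fd] = &files[fd];
            return fd;
        }
    }
    return -EMFILE;
}

/* Index the table with fd, follow f_op, refuse if there is no write. */
int sim_write(int fd, const char *buf, int count)
{
    struct open_file *f;

    if (fd < 0 || fd >= NFILES || fd_table[fd] == NULL)
        return -EBADF;
    f = fd_table[fd];
    if (f->f_op == NULL || f->f_op->write == NULL)
        return -EINVAL;      /* shows up as "Invalid argument" */
    return f->f_op->write(f, buf, count);
}
```

In this model the first descriptor handed out is 0 rather than 3, simply because nothing plays the role of stdin, stdout and stderr here.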
We will now change our module a little bit.
#include <linux/module.h>
#include <linux/fs.h>

static char *name = "foo";
static int major;

static int
foo_open(struct inode *inode, struct file *filp)
{
    printk("Major=%d, Minor=%d\n", MAJOR(inode->i_rdev),
           MINOR(inode->i_rdev));
    /* Perform whatever actions are
     * needed to physically open the
     * hardware device
     */
    printk("Offset=%d\n", (int)filp->f_pos);
    printk("filp->f_op->open=%p\n", filp->f_op->open);
    printk("address of foo_open=%p\n", foo_open);
    return 0; /* Success */
}

static int
foo_read(struct file *filp, char *buf,
         size_t count, loff_t *offp)
{
    printk("&filp->f_pos=%p\n", &filp->f_pos);
    printk("offp=%p\n", offp);
    /* As of now, dummy */
    return 0;
}

static int
foo_write(struct file *filp, const char *buf,
          size_t count, loff_t *offp)
{
    /* As of now, dummy */
    return 0;
}

static struct file_operations fops = {
    open: foo_open,
    read: foo_read,
    write: foo_write
};

int init_module(void)
{
    major = register_chrdev(0, name, &fops);
    printk("Registered, got major = %d\n", major);
    return 0;
}

void cleanup_module(void)
{
    printk("Cleaning up...\n");
    unregister_chrdev(major, name);
}
We are now filling up the structure with the addresses of three functions: foo_open, foo_read and foo_write.
What are the arguments to foo_open? When the 'open' system call ultimately gets to call
foo_open after several layers of indirection, it always passes two arguments, both of which are pointers.
Our foo_open should be prepared to access these arguments.
The first argument is a pointer to an object of type 'struct inode'.
An inode is a disk data structure which stores information about a file like its permissions,
ownership, date, size, location of data blocks (if it is a real disk file) and major and minor numbers
(in case of special files). An object of type 'struct inode' mirrors this information in kernel memory space.
Our foo_open function, by accessing the field i_rdev through certain macros, is capable of finding out
the major and minor numbers of the file on which the 'open' system call is acting.
The next argument is of type 'pointer to struct file'.
We had mentioned earlier that the per-process file descriptor table contains addresses of structures
which store information like the current file offset.
The second argument to foo_open is the address of this structure.
Note that this structure in turn contains the address of the structure which holds the addresses
of the driver routines (the field is called f_op), including foo_open! Does this make you crazy?
It should not. When you read the kernel source, you will realize that most of the complexity
of the code is in the way the data structures are organized.
The code which acts on these data structures is fairly straightforward.
This is the way large programs are (or should be) written:
most of the complexity should be confined to (or captured in) the data structures,
and the algorithms should be made as simple as possible.
It is comparatively easier for us to decode complex data structures than complex algorithms.
Of course, there will be places in the code where you will be forced to use complex algorithms.
If you are writing numerical programs, algorithmic complexity is almost unavoidable;
the same is the case with optimizing compilers, where many optimization techniques have strong mathematical
(read graph theoretic) foundations and are inherently complex.
Operating systems are fortunately not riddled with such algorithmic complexities.
What about the arguments to foo_read and foo_write?
We have a buffer and a count, together with an argument called 'offp', which we may interpret as the address
of the f_pos field in the structure pointed to by 'filp' (wonder why we need this argument?
Why don't we straightaway access filp->f_pos?).
Here is what gets printed on the screen when we run the test program (which calls open, read and write).
Again, note that we are not printing the kernel's response.
fd = 3
write retval=0
read retval=0
The response from the kernel is interesting.
We note that the address of foo_open does not change.
That is because the module stays in kernel memory;
every time we run our test program, we are calling the same foo_open.
But note that the '&filp->f_pos' and 'offp' values, though they are equal, may keep on changing.
This is because every time we call 'open', the kernel creates a new object of type 'struct file'.
6.2. Use of the 'release' method
The driver open method should be composed of initializations.
It is also preferable that the 'open' method increments the usage count.
If an application program calls open, it is necessary that the driver code stays in memory
till it calls 'close'. When there is a close on a file descriptor (either explicit or implicit:
when your program terminates, 'close' is invoked on all open file descriptors automatically),
the 'release' driver method gets called. The body of 'release' is the natural place
to decrement the usage count.
#include <linux/module.h>
#include <linux/fs.h>

static char *name = "foo";
static int major;

static int
foo_open(struct inode *inode, struct file *filp)
{
    MOD_INC_USE_COUNT;
    return 0; /* Success */
}

static int foo_close(struct inode *inode,
                     struct file *filp)
{
    printk("Closing device...\n");
    MOD_DEC_USE_COUNT;
    return 0;
}

static struct file_operations fops = {
    open: foo_open,
    release: foo_close
};

int init_module(void)
{
    major = register_chrdev(0, name, &fops);
    printk("Registered, got major = %d\n", major);
    return 0;
}

void cleanup_module(void)
{
    printk("Cleaning up...\n");
    unregister_chrdev(major, name);
}
Let's load this module and test it out with the following program:
#include "myhdr.h"

int main()
{
    int fd, retval;
    char buf[] = "hello";

    fd = open("foo", O_RDWR);
    if (fd < 0) {
        perror("");
        exit(1);
    }
    while (1);   /* keep the device open until the program is killed */
}
We see that as long as the program is running, the use count of the module would be 1 and rmmod would fail.
Once the program terminates, the use count becomes zero.
A file descriptor may be shared among many processes.
The release method does not get invoked every time a process calls close() on its copy
of the shared descriptor. Only when the last descriptor gets closed (that is, no more descriptors point
to the 'struct file' type object which has been allocated by open) does the release method get invoked.
Here is a small program which will make the idea clear:
#include "myhdr.h"

int main()
{
    int fd, retval;
    char buf[] = "hello";

    fd = open("foo", O_RDWR);
    if (fd < 0) {
        perror("");
        exit(1);
    }
    if (fork() == 0) {
        sleep(1);
        close(fd); /* Explicit close by child */
    } else {
        close(fd); /* Explicit close by parent */
    }
    return 0;
}
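The rule that release runs only on the last close is, in essence, reference counting. Here is a tiny user-space model of the bookkeeping; the structure shared_file and the sf_* functions are invented for this sketch and do not correspond to actual kernel names.

```c
/* Hypothetical model of one 'struct file' shared by several descriptors. */
struct shared_file {
    int refcount;   /* number of descriptors pointing at this object */
    int released;   /* becomes 1 once the release method has run */
};

/* Stand-in for the driver's release method. */
static void sf_release(struct shared_file *f)
{
    f->released = 1;
}

/* open() creates the object with a single reference. */
void sf_open(struct shared_file *f)
{
    f->refcount = 1;
    f->released = 0;
}

/* fork() gives the child its own descriptor to the same object. */
void sf_fork(struct shared_file *f)
{
    f->refcount++;
}

/* close() drops one reference; only the last one triggers release. */
void sf_close(struct shared_file *f)
{
    if (--f->refcount == 0)
        sf_release(f);
}
```

With one sf_open followed by one sf_fork, the first sf_close leaves 'released' at 0 and only the second sets it, which matches the single 'Closing device...' message you should see from the kernel when running the program above.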
6.3. Use of the 'read' method
Transferring data from kernel address space to user address space is the main job of the read function:
ssize_t read(struct file *filp, char *buf, size_t count, loff_t *offp);
Say we are defining the read method of a scanner device.
Using various hardware tricks, we acquire image data from the scanner device
and store it in an array. We now have to copy this array to user address space.
It is not possible to do this using standard functions like 'memcpy', because a pointer supplied
by user space cannot be trusted and the memory it refers to may not be accessible.
We have to make use of the functions:
unsigned long copy_to_user(void *to, const void *from, unsigned long count);
and
unsigned long copy_from_user(void *to, const void *from, unsigned long count);
These functions return the number of bytes which could not be copied,
so a return value of 0 means success (i.e., all bytes have been transferred).
We shall try out the simplest implementation of read: the device supports only read,
and we shall not pay attention to details of concurrency.
This is a bad approach; we shall examine concurrency issues later on.
Before we try to implement read, we should once again examine
how an application program uses the read syscall.
Read is invoked with a file descriptor, a buffer and a count.
Suppose that an application program is attempting to read a file in full, till EOF is reached,
trying to read N bytes at a time.
Read can return a value less than or equal to N. The application program should keep on reading
till read returns 0. This way, it will be able to read the file in full.
Here is a simple driver read method.
Trying to see the contents of this device by using a standard command like cat should give us
the output 'Hello, world\n'. Also, we should be able to get the same output from programs which attempt
to read from the file in several different block sizes.
static int
foo_read(struct file *filp, char *buf,
         size_t count, loff_t *f_pos)
{
    static char msg[] = "Hello, world\n";
    int data_len = strlen(msg);
    int curr_off = *f_pos, remaining;

    if (curr_off >= data_len) return 0;   /* nothing left: EOF */
    remaining = data_len - curr_off;
    if (count <= remaining) {
        /* Caller wants no more than what is left */
        if (copy_to_user(buf, msg + curr_off, count))
            return -EFAULT;
        *f_pos = *f_pos + count;
        return count;
    } else {
        /* Caller wants more than what is left: hand over the rest */
        if (copy_to_user(buf, msg + curr_off, remaining))
            return -EFAULT;
        *f_pos = *f_pos + remaining;
        return remaining;
    }
}
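The offset arithmetic in foo_read can be tried out without loading any module. Below is a user-space model in which memcpy stands in for copy_to_user; the names model_read and read_all are our own, and the model is only meant to show why every read quantum yields the same 13 bytes.

```c
#include <string.h>

static const char msg[] = "Hello, world\n";

/* Model of the driver read: copy at most 'count' bytes starting at *pos. */
long model_read(char *buf, size_t count, long *pos)
{
    size_t data_len = strlen(msg);
    size_t remaining;

    if (*pos < 0 || (size_t)*pos >= data_len)
        return 0;                       /* end of file */
    remaining = data_len - (size_t)*pos;
    if (count > remaining)
        count = remaining;              /* never copy past the data */
    memcpy(buf, msg + *pos, count);     /* stands in for copy_to_user */
    *pos += (long)count;
    return (long)count;
}

/* Read everything in chunks of 'quantum' bytes; returns total bytes read. */
size_t read_all(char *out, size_t cap, size_t quantum)
{
    long pos = 0, n;
    size_t total = 0;

    while (total + quantum <= cap &&
           (n = model_read(out + total, quantum, &pos)) > 0)
        total += (size_t)n;
    return total;
}
```

Whether the quantum is 1, 5 or 64, read_all collects the same string, because each call advances the offset by exactly the number of bytes it handed over.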
Here is a small application program which exercises the driver read function with different read counts:
#include "myhdr.h"
#define MAX 1024

int
main()
{
    char buf[MAX];
    int fd, n, ret;

    fd = open("foo", O_RDONLY);
    assert(fd >= 0);
    printf("Enter read quantum: ");
    scanf("%d", &n);
    assert(n > 0 && n <= MAX);   /* the quantum must fit in buf */
    while ((ret = read(fd, buf, n)) > 0)
        write(1, buf, ret);      /* Write to stdout */
    if (ret < 0) {
        fprintf(stderr, "Error in read\n");
        exit(1);
    }
    exit(0);
}
6.4. A simple 'ram disk'
Here is a simple ram disk device which behaves like this:
initially, the device is empty. If you write, say, 5 bytes and then perform a read
echo -n hello > foo
cat foo
you should be able to see 'hello'. If you now do
echo -n abc > foo
cat foo
you should be able to see only 'abc'.
If you attempt to write more than MAXSIZE characters, you should get a 'no space' error,
but as many characters as possible should be written. Here is the full source code:
#include <linux/module.h>
#include <linux/fs.h>
#include <asm/uaccess.h>

#define MAXSIZE 512

static char *name = "foo";
static int major;
static char msg[MAXSIZE];
static int curr_size = 0;

static int
foo_open(struct inode *inode, struct file *filp)
{
    MOD_INC_USE_COUNT;
    return 0; /* Success */
}

static int
foo_write(struct file *filp, const char *buf,
          size_t count, loff_t *f_pos)
{
    int curr_off = *f_pos;
    int remaining = MAXSIZE - curr_off;

    if (curr_off >= MAXSIZE) return -ENOSPC;
    if (count <= remaining) {
        if (copy_from_user(msg + curr_off, buf, count))
            return -EFAULT;
        *f_pos = *f_pos + count;
        curr_size = *f_pos;
        return count;
    } else {
        /* Not enough room: store as much as fits */
        if (copy_from_user(msg + curr_off, buf, remaining))
            return -EFAULT;
        *f_pos = *f_pos + remaining;
        curr_size = *f_pos;
        return remaining;
    }
}

static int
foo_read(struct file *filp, char *buf,
         size_t count, loff_t *f_pos)
{
    int data_len = curr_size;
    int curr_off = *f_pos, remaining;

    if (curr_off >= data_len) return 0;
    remaining = data_len - curr_off;
    if (count <= remaining) {
        if (copy_to_user(buf, msg + curr_off, count))
            return -EFAULT;
        *f_pos = *f_pos + count;
        return count;
    } else {
        /* Never copy more than what has been written so far */
        if (copy_to_user(buf, msg + curr_off, remaining))
            return -EFAULT;
        *f_pos = *f_pos + remaining;
        return remaining;
    }
}

static int foo_close(struct inode *inode,
                     struct file *filp)
{
    MOD_DEC_USE_COUNT;
    printk("Closing device...\n");
    return 0;
}

static struct file_operations fops = {
    open: foo_open,
    read: foo_read,
    write: foo_write,
    release: foo_close
};

int init_module(void)
{
    major = register_chrdev(0, name, &fops);
    printk("Registered, got major = %d\n", major);
    return 0;
}

void cleanup_module(void)
{
    printk("Cleaning up...\n");
    unregister_chrdev(major, name);
}
After compiling and loading the module and creating the necessary device file,
try redirecting the output of Unix commands.
See whether you get the 'no space' error (try ls -l > foo).
Write C programs and verify the behaviour of the module.
6.5. A simple pid retriever
A process opens the device file 'foo', performs a read and, magically, gets its own process id.
static int
foo_read(struct file *filp, char *buf,
         size_t count, loff_t *f_pos)
{
    static char msg[MAXSIZE];
    int data_len;
    int curr_off = *f_pos, remaining;

    /* 'current' points to the process which made the system call */
    sprintf(msg, "%u", current->pid);
    data_len = strlen(msg);
    if (curr_off >= data_len) return 0;
    remaining = data_len - curr_off;
    if (count <= remaining) {
        if (copy_to_user(buf, msg + curr_off, count))
            return -EFAULT;
        *f_pos = *f_pos + count;
        return count;
    } else {
        if (copy_to_user(buf, msg + curr_off, remaining))
            return -EFAULT;
        *f_pos = *f_pos + remaining;
        return remaining;
    }
}
Chapter 7. Ioctl and Blocking I/O
We discuss some more advanced character driver operations in this chapter.
7.1. Ioctl
It may sometimes be necessary to send 'commands' to your device,
especially when you are controlling a real physical device, say a serial port.
Let's say that you wish to set the baud rate (data transfer rate) of the device to 9600 bits per second.
One way to do this is to embed control sequences in the input stream of the device.
Let's send a string 'set baud: 9600'. The difficulty with this approach is that the input stream
of the device should now never contain a string of the form 'set baud: 9600' during normal operations.
Imposing special 'meaning' on symbols in the input stream is most often an ugly solution.
A better way is to use the 'ioctl' system call.
ioctl(int fd, int cmd, ...);
Associated with which we have a driver method:
foo_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg);
Here is a simple module which demonstrates the idea.
Let's first define a header file which will be included both by the module and by the application program.
#define FOO_IOCTL1 0xab01
#define FOO_IOCTL2 0xab02
We now create the module:
#include <linux/module.h>
#include <linux/fs.h>
#include <asm/uaccess.h>

#include "foo.h"

static int major;
char *name = "foo";

static int
foo_ioctl(struct inode *inode, struct file *filp,
          unsigned int cmd, unsigned long arg)
{
    printk("received ioctl number %x\n", cmd);
    return 0;
}

static struct file_operations fops = {
    ioctl: foo_ioctl,
};

int init_module(void)
{
    major = register_chrdev(0, name, &fops);
    printk("Registered, got major = %d\n", major);
    return 0;
}

void cleanup_module(void)
{
    printk("Cleaning up...\n");
    unregister_chrdev(major, name);
}
And a simple application program which exercises the ioctl:
#include "myhdr.h"
#include "foo.h"

int main()
{
    int r;
    int fd = open("foo", O_RDWR);
    assert(fd >= 0);

    r = ioctl(fd, FOO_IOCTL1);
    assert(r == 0);
    r = ioctl(fd, FOO_IOCTL2);
    assert(r == 0);
    return 0;
}
The kernel should respond with
received ioctl number ab01
received ioctl number ab02
The general form of the driver ioctl function could be somewhat like this:
static int
foo_ioctl(struct inode *inode, struct file *filp,
          unsigned int cmd, unsigned long arg)
{
    switch (cmd) {
    case FOO_IOCTL1: /* Do some action */
        break;
    case FOO_IOCTL2: /* Do some action */
        break;
    default:
        return -ENOTTY;
    }
    /* Do something else */
    return 0;
}
We note that the driver ioctl function has a final argument called 'arg'.
Also, the ioctl syscall is defined as:
ioctl(int fd, int cmd, ...);
This does not mean that ioctl accepts a variable number of arguments,
but only that type checking is disabled on the last argument.
Sometimes, it may be necessary to pass data to the ioctl routine
(e.g., to set the data transfer rate on a communication port) and sometimes it may be necessary
to receive back data (e.g., to get the current data transfer rate).
If your intention is to pass a small amount of data (a single number, say) to the driver as part of the ioctl,
you can pass the last argument as an integer.
If you wish to get back some data, you may think of passing a pointer to integer.
Whatever be the type which you are passing, the driver routine sees it as an unsigned long;
proper type casts should be done in the driver code.
static int speed;   /* current speed, kept by the module */

static int
foo_ioctl(struct inode *inode, struct file *filp,
          unsigned int cmd, unsigned long arg)
{
    printk("cmd=%x, arg=%lx\n", cmd, arg);
    switch (cmd) {
    case FOO_GETSPEED:
        /* arg is really a pointer to an int in user space */
        put_user(speed, (int *)arg);
        break;
    case FOO_SETSPEED:
        /* arg is the new speed itself */
        speed = arg;
        break;
    default:
        return -ENOTTY; /* Failure */
    }
    return 0; /* Success */
}
Here is the application program which tests this ioctl:
int main()
{
    int r, speed;
    int fd = open("foo", O_RDWR);
    assert(fd >= 0);

    r = ioctl(fd, FOO_SETSPEED, 9600);
    assert(r == 0);
    r = ioctl(fd, FOO_GETSPEED, &speed);
    assert(r == 0);
    printf("current speed = %d\n", speed);
    return 0;
}
When writing production code, it is necessary to use certain macros to generate the ioctl command numbers.
The reader should refer to Linux Device Drivers by Rubini for more information.
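To give a flavour of those macros, here is a small sketch using _IOR and _IOW from linux/ioctl.h. The magic number 0xab and the command numbers 1 and 2 are our own choice (echoing the 0xab01 and 0xab02 values used earlier), and foo_cmd_is_ours is a hypothetical helper; by convention a command built with _IOW carries a pointer to the data rather than the value itself.

```c
#include <linux/ioctl.h>

#define FOO_MAGIC 0xab   /* hypothetical magic number for our driver */

/* Direction and argument size are encoded into the command number. */
#define FOO_GETSPEED _IOR(FOO_MAGIC, 1, int)  /* driver hands an int back */
#define FOO_SETSPEED _IOW(FOO_MAGIC, 2, int)  /* caller hands an int in */

/* Returns 1 if cmd carries our magic number, 0 otherwise. */
int foo_cmd_is_ours(unsigned int cmd)
{
    return _IOC_TYPE(cmd) == FOO_MAGIC;
}
```

A driver can use a check like foo_cmd_is_ours at the top of its ioctl method and return -ENOTTY for anything that fails it, before it even looks at the switch.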
7.2. Blocking I/O
A user process which attempts to read from a device should 'block' till data becomes ready.
A blocked process is said to be in a 'sleeping' state; it does not consume CPU cycles.
Take the case of the 'scanf' function: if you don't type anything on the keyboard,
the program which calls it just keeps on sleeping
(this can be observed by running 'ps ax' on another console).
The terminal driver, when it receives an 'enter' (or as and when it receives a single character,
if the terminal is in raw mode), wakes up all processes which were deep in sleep waiting for input.
Let us see some of the functions used to implement sleep/wakeup mechanisms in Linux.
A fundamental data structure on which all these functions operate is the wait queue.
A wait queue is declared as:
wait_queue_head_t foo_queue;
We have to do some kind of initialization before we use foo_queue.
If it is a static (global) variable, we can instead declare and initialize it in one step with a macro:
DECLARE_WAIT_QUEUE_HEAD(foo_queue);
Otherwise, we may call:
init_waitqueue_head(&foo_queue);
Now, if the process wants to go to sleep, it can call one of many functions; we shall use:
interruptible_sleep_on(&foo_queue);
Let's look at an example module.
DECLARE_WAIT_QUEUE_HEAD(foo_queue);

static int
foo_open(struct inode *inode, struct file *filp)
{
    if (filp->f_flags == O_RDONLY) {
        printk("Reader going to sleep...\n");
        interruptible_sleep_on(&foo_queue);
    } else if (filp->f_flags == O_WRONLY) {
        printk("Writer waking up readers...\n");
        wake_up_interruptible(&foo_queue);
    }
    return 0; /* Success */
}
What happens to a process which tries to open the file 'foo' in read only mode?
It immediately goes to sleep. When does it wake up? Only when another process tries to open
the file in write only mode. You should experiment with this code by writing two C programs,
one which calls open with the O_RDONLY flag and another which calls open with the O_WRONLY flag
(don't try to use 'cat'; it seems that cat opens the file in O_RDONLY|O_LARGEFILE mode,
which fails the equality test in foo_open).
You should be able to take the first program out of its sleep either by hitting Ctrl-C or by running
the second program. What if you change 'interruptible_sleep_on' to 'sleep_on' and 'wake_up_interruptible'
to 'wake_up'? (wake_up_interruptible wakes up only those processes which have gone to sleep
using interruptible_sleep_on, whereas wake_up wakes up all processes.)
You will note that the first program goes to sleep, but you are not able to 'interrupt' it by typing Ctrl-C.
Only when you run the program which opens the file 'foo' in write only mode does the first program
come out of its sleep. Signals are not delivered to processes which are not in interruptible sleep.
This is somewhat dangerous, as there is a possibility of creating unkillable processes.
Driver writers most often use 'interruptible' sleeps.
7.2.1. wait_event_interruptible
This function is interesting. Let's see what it does through an example.
/* Template for a simple driver */

#include <linux/module.h>
#include <linux/fs.h>
#include <asm/uaccess.h>

#define BUFSIZE 1024

static char *name = "foo";
static int major;

static int foo_count = 0;

DECLARE_WAIT_QUEUE_HEAD(foo_queue);

static int
foo_read(struct file *filp, char *buf,
         size_t count, loff_t *f_pos)
{
    /* Sleep until foo_count becomes zero */
    wait_event_interruptible(foo_queue, (foo_count == 0));
    printk("Out of read-wait...\n");
    return count;
}

static int
foo_write(struct file *filp, const char *buf,
          size_t count, loff_t *f_pos)
{
    /* Shortcut: the user buffer is read directly here;
     * careful code would fetch the byte with get_user */
    if (buf[0] == 'I') foo_count++;
    else if (buf[0] == 'D') foo_count--;
    wake_up_interruptible(&foo_queue);
    return count;
}
The foo_read method calls wait_event_interruptible, a macro whose second parameter
is a C boolean expression.
If the expression is true, nothing happens and control comes to the next line.
Otherwise, the process is put to sleep on a wait queue. Upon receiving a wakeup signal,
the expression is evaluated once again. If found to be true, control comes to the next line;
otherwise, the process is again put to sleep. This continues till the expression becomes true.
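The check, sleep and re-check cycle can be written down as a plain loop. What follows is a single-threaded user-space model, not the kernel macro: model_wait_event, fake_sleep and foo_count_is_zero are names made up for this sketch, and fake_sleep merely pretends that a writer sent a 'D' and woke us up.

```c
/* State shared between the "reader" and the pretend writer. */
int foo_count = 2;
int sleeps = 0;     /* how many times the reader went to sleep */

/* The condition the reader is waiting for. */
int foo_count_is_zero(void)
{
    return foo_count == 0;
}

/* Stand-in for sleeping on the wait queue: pretend a writer sent 'D' and
 * woke us up. A nonzero return would mean a signal interrupted the sleep. */
int fake_sleep(void)
{
    sleeps++;
    foo_count--;
    return 0;
}

/* Model of the macro's control flow. */
int model_wait_event(int (*condition)(void), int (*sleep_once)(void))
{
    while (!condition()) {      /* re-checked after every wakeup */
        if (sleep_once())
            return -1;          /* the kernel macro returns -ERESTARTSYS */
    }
    return 0;                   /* condition is true, carry on */
}
```

Starting with foo_count at 2, model_wait_event(foo_count_is_zero, fake_sleep) sleeps twice and then returns 0; a second call returns at once because the condition already holds.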
We write two application programs, one which simply opens 'foo' and calls 'read'.
The other program reads a string from the keyboard and calls 'write' with that string as argument.
If the first character of the string is an upper case 'I', the driver routine increments foo_count;
if it is a 'D', foo_count is decremented. Here are the two programs:
int main()
{
    int fd;
    char buf[100];

    fd = open("foo", O_RDONLY);
    assert(fd >= 0);
    read(fd, buf, sizeof(buf));
    return 0;
}

/*------Here comes the writer----*/
int main()
{
    int fd;
    char buf[100];

    fd = open("foo", O_WRONLY);
    assert(fd >= 0);
    scanf("%99s", buf);   /* leave room for the terminating NUL */
    write(fd, buf, strlen(buf));
    return 0;
}
Load the module and experiment with the programs. It's real fun!
7.2.2. A pipe lookalike
Synchronizing the execution of multiple reader and writer processes is no trivial job,
and our experience in this area is very limited.
Here is a small 'pipe like' application which is sure to be full of race conditions.
The idea is that one process should be able to write to the device; if the buffer is full,
the write should block (until the whole buffer becomes free).
Another process keeps reading from the device; if the buffer is empty,
the read should block till some data is available.
#define BUFSIZE 1024

static char *name = "foo";
static int major;

static char msg[BUFSIZE];
static int readptr = 0, writeptr = 0;

DECLARE_WAIT_QUEUE_HEAD(foo_readq);
DECLARE_WAIT_QUEUE_HEAD(foo_writeq);

static int
foo_read(struct file *filp, char *buf,
         size_t count, loff_t *f_pos)
{
    int remaining;

    /* Sleep until there is unread data in the buffer */
    wait_event_interruptible(foo_readq, (readptr < writeptr));
    remaining = writeptr - readptr;
    if (count <= remaining) {
        if (copy_to_user(buf, msg + readptr, count))
            return -EFAULT;
        readptr = readptr + count;
        wake_up_interruptible(&foo_writeq);
        return count;
    } else {
        if (copy_to_user(buf, msg + readptr, remaining))
            return -EFAULT;
        readptr = readptr + remaining;
        wake_up_interruptible(&foo_writeq);
        return remaining;
    }
}

static int
foo_write(struct file *filp, const char *buf,
          size_t count, loff_t *f_pos)
{
    int remaining;

    if (writeptr == BUFSIZE-1) {
        /* Buffer full: wait till the reader has consumed everything */
        wait_event_interruptible(foo_writeq, (readptr == writeptr));
        readptr = writeptr = 0;
    }
    remaining = BUFSIZE-1-writeptr;
    if (count <= remaining) {
        if (copy_from_user(msg + writeptr, buf, count))
            return -EFAULT;
        writeptr = writeptr + count;
        wake_up_interruptible(&foo_readq);
        return count;
    } else {
        if (copy_from_user(msg + writeptr, buf, remaining))
            return -EFAULT;
        writeptr = writeptr + remaining;
        wake_up_interruptible(&foo_readq);
        return remaining;
    }
}
Chapter 8. Keeping Time
Drivers need to be aware of the flow of time.
This chapter looks at the kernel mechanisms available for timekeeping.
8.1. The timer interrupt
Try
cat /proc/interrupts
This is what we see on our system:
           CPU0
  0:     314000    XT-PIC  timer
  1:      12324    XT-PIC  keyboard
  2:          0    XT-PIC  cascade
  4:      15155    XT-PIC  serial
  5:         15    XT-PIC  usb-ohci, usb-ohci
  8:          1    XT-PIC  rtc
 11:     212598    XT-PIC  nvidia
 14:       9717    XT-PIC  ide0
 15:         22    XT-PIC  ide1
NMI:          0
LOC:          0
ERR:          0
MIS:          0
The first line shows that the 'timer' has generated 314000 interrupts since system boot.
The 'uptime' command shows us that the system has been alive for around 52 minutes,
which means the timer has interrupted at a rate of about 100 per second.
A constant called 'HZ', defined in /usr/src/linux/include/asm/param.h, defines this rate.
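The arithmetic behind the "about 100 per second" estimate is easy to check. Here is a small sketch;
the function name average_rate is hypothetical, and the uptime of 52 minutes is only approximate,
so the result should be read as an estimate of HZ and not as an exact measurement.

```c
#include <assert.h>

/* Average number of interrupts per second (integer division, fraction dropped). */
unsigned long average_rate(unsigned long interrupts, unsigned long uptime_seconds)
{
    assert(uptime_seconds > 0);
    return interrupts / uptime_seconds;
}
```

For the figures shown above, average_rate(314000, 52 * 60) divides 314000 by 3120 seconds and gives 100.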
Every time a timer interrupt occurs, the value of a globally visible kernel variable called 'jiffies'
gets incremented (jiffies is initialized to zero during bootup). You should write a simple module
which prints the value of this variable. Device drivers are most often satisfied with the granularity
which 'jiffies' provides.
Drivers seldom need to know the absolute time (that is, the number of seconds elapsed since the 'epoch',
which is 0:0:0 UTC, Jan 1 1970). If you so desire, you can call the
void do_gettimeofday(struct timeval *tv);
function from your module, which behaves like the 'gettimeofday' syscall.
Try grepping the kernel source for a variable called 'jiffies'. Why is it declared 'volatile'?
8.1.1. The perils of optimization
Let's move off track a little bit and try to understand the meaning of the keyword 'volatile'.
Let's write a program:
#include <stdio.h>
#include <signal.h>

int jiffies = 0;

void handler(int n)
{
    printf("called handler...\n");
    jiffies++;
}

int main(void)
{
    signal(SIGINT, handler);
    while (jiffies < 3);    /* spin until Ctrl-C has been pressed three times */
    return 0;
}
We define a variable called 'jiffies' and increment it in the handler of the 'interrupt signal'.
So, every time you press Ctrl-C, the handler function gets called and jiffies is incremented.
Ultimately, jiffies becomes equal to 3 and the loop terminates.
This is the behaviour which we observe when we compile and run the program without optimization.
Now what if we compile the program like this:
cc a.c -O2
We are enabling optimization. If we run the program, we observe that the while loop does not terminate.
Why? The compiler has optimized the access to 'jiffies'. The compiler sees that within the loop,
the value of 'jiffies' does not change (the compiler is not smart enough to understand that jiffies
will change asynchronously), so it stores the value of jiffies in a CPU register before it starts
the loop. Within the loop, this CPU register is constantly checked; the memory area associated
with jiffies is not accessed at all, which means the loop is completely unaware of jiffies becoming
equal to 3 (you should compile the above program with the -S option and look at the generated assembly
language code).
What is the solution to this problem? We want the compiler to produce optimized code,
but we don't want to mess things up. The idea is to tell the compiler that 'jiffies' should not be
involved in any optimization attempts. You can achieve this result by declaring jiffies as:
volatile int jiffies = 0;
The volatile keyword instructs the compiler to leave jiffies alone during optimization:
every access written in the source is performed as a real memory access.
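The effect of the volatile declaration can be tried without pressing Ctrl-C by hand. The following is a minimal
sketch and not the only way to do it: the names ticks, on_sigint and count_interrupts are hypothetical,
the program sends SIGINT to itself with raise(), and the counter is declared volatile sig_atomic_t,
which is the type standard C provides for objects shared with a signal handler.

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t ticks = 0;   /* hypothetical stand-in for jiffies */

static void on_sigint(int signo)
{
    (void)signo;
    ticks++;        /* the handler only touches a volatile sig_atomic_t object */
}

/* Delivers SIGINT to ourselves n times and returns the count the loop observed. */
int count_interrupts(int n)
{
    int i;

    assert(n >= 0);
    ticks = 0;
    for (i = 0; i < n; i++) {
        signal(SIGINT, on_sigint);  /* reinstall each time: some systems reset the handler */
        raise(SIGINT);
    }
    while (ticks < n)
        ;           /* volatile forces a fresh load of ticks on every iteration */
    return (int)ticks;
}
```

Calling count_interrupts(3) returns 3 whether or not the file is compiled with -O2, because the compiler
is not allowed to keep ticks cached in a register across loop iterations.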
8.1.2. Busy Looping
Let's test out this module:
static unsigned long end;

static int
foo_read(struct file* filp, char *buf,
         size_t count, loff_t *f_pos)
{
    static int nseconds = 2;
    char c = 'A';

    end = jiffies + nseconds*HZ;
    while (jiffies < end);      /* busy loop until the deadline */
    if (copy_to_user(buf, &c, 1))
        return -EFAULT;
    return 1;
}
We shall test out this module with the following program:
#include "myhdr.h"

int main(void)
{
    char buf[10];
    int fd = open("foo", O_RDONLY);
    assert(fd >= 0);
    while (1) {
        read(fd, buf, 1);
        write(1, buf, 1);
    }
}
When you run the program, you will see a sequence of 'A's getting printed at about 2 second intervals.
What about the response time of your system? It appears as if your whole system is stuck
during the two second delay. This is because the OS is unable to schedule any other job when one process
is executing a tight loop in kernel context. Increase the delay and see what effect it has;
this exercise should be pretty illuminating. Contrast this behaviour with that of a program
which simply executes a tight infinite loop in user mode.
Try timing the above program; run it as
time ./a.out
How do you interpret the three times shown by the command?
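One detail of the busy loop above deserves a closer look: the deadline jiffies + nseconds*HZ is computed
with unsigned arithmetic, so it can wrap around to a small number, and a plain "less than" test then gives
the wrong answer. The user-space sketch below shows the usual signed-difference comparison. The names
deadline_after, deadline_reached and MODEL_HZ are hypothetical; the kernel's own helper for this purpose
is the time_after() macro.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HZ 100u   /* assumed tick rate, matching the 100 per second seen earlier */

/* Deadline nseconds from now; unsigned arithmetic wraps around silently. */
uint32_t deadline_after(uint32_t now, uint32_t nseconds)
{
    assert(nseconds <= INT32_MAX / MODEL_HZ);   /* keep the distance below 2^31 ticks */
    return now + nseconds * MODEL_HZ;
}

/*
 * Nonzero once 'now' has reached or passed 'deadline', even across a wrap.
 * The cast to a signed type assumes the usual two's complement conversion.
 */
int deadline_reached(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}
```

If the counter is at 0xFFFFFFF0, a two second deadline wraps to 184. A naive loop of the form
while (now < deadline) would exit at once, whereas deadline_reached keeps reporting "not yet"
until the counter itself has wrapped and climbed to 184.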
8.2. interruptible_sleep_on_timeout
DECLARE_WAIT_QUEUE_HEAD(foo_queue);

static int
foo_read(struct file* filp, char *buf,
         size_t count, loff_t *f_pos)
{
    static int nseconds = 2;
    char c = 'A';

    interruptible_sleep_on_timeout(&foo_queue, nseconds*HZ);
    if (copy_to_user(buf, &c, 1))
        return -EFAULT;
    return 1;
}
We observe that the process which calls read sleeps for 2 seconds, then prints 'A',
again sleeps for 2 seconds and so on. The kernel wakes up the process either when somebody executes
an explicit wakeup function on foo_queue or when the specified timeout is over.
8.3. udelay, mdelay
These are busy waiting functions which can be called to implement delays shorter than one timer tick.
Even though udelay can be used to generate delays up to 1 second, the recommended maximum is 1 millisecond.
Here are the function prototypes:
#include <linux/delay.h>
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);
8.4. Kernel Timers
It is possible to 'register' a function so that it is called after a certain time interval.
This is made possible through a mechanism called 'kernel timers'. The idea is simple.
You create a variable of type 'struct timer_list':
struct timer_list {
    struct timer_list *next;
    struct timer_list *prev;
    unsigned long expires;            /* Absolute timeout in jiffies */
    void (*function)(unsigned long);  /* timeout function */
    unsigned long data;               /* argument to handler function */
    volatile int running;
};
The variable is initialized by calling init_timer(). The expires, data and timeout function fields are set.
The timer_list object is then added to a global list of timers.
The kernel keeps scanning this list 100 times a second; if the current value of 'jiffies' is equal
to the expiry time specified in any of the timer objects, the corresponding timeout function is invoked.
Here is an example program.
DECLARE_WAIT_QUEUE_HEAD(foo_queue);

void
timeout_handler(unsigned long data)
{
    wake_up_interruptible(&foo_queue);
}

static int
foo_read(struct file* filp, char *buf,
         size_t count, loff_t *f_pos)
{
    struct timer_list foo_timer;
    char c = 'B';

    init_timer(&foo_timer);
    foo_timer.function = timeout_handler;
    foo_timer.data = 10;
    foo_timer.expires = jiffies + 2*HZ;     /* 2 secs */
    add_timer(&foo_timer);
    interruptible_sleep_on(&foo_queue);
    del_timer_sync(&foo_timer);             /* Take timer off the list */
    if (copy_to_user(buf, &c, 1))
        return -EFAULT;
    return 1;                               /* exactly one byte was copied */
}
As usual, you have to test the working of the module by writing a simple application program.
Note that the timeout function may execute long after the process which caused it to be scheduled
has vanished. The timeout function therefore runs in 'interrupt mode' and there are many
restrictions on its behaviour (it shouldn't sleep, shouldn't access any user space memory etc).
It is very easy to lock up the system when you play with such functions (we are speaking from experience!)
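The "list of timers scanned on every tick" idea can be modelled in a few lines of user-space C. This is only
a sketch of the concept and not the kernel's data structure: the names model_timer, model_add_timer,
model_run_timers and MAX_TIMERS are hypothetical, the list is a plain array, and the model fires a timer
when its expiry time is less than or equal to the current tick (an assumption made here so that a timer is
not lost if a tick is skipped).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIMERS 8                  /* hypothetical fixed capacity for the model */

struct model_timer {
    unsigned long expires;            /* absolute expiry time in ticks */
    void (*function)(unsigned long);  /* timeout function */
    unsigned long data;               /* argument passed to the function */
    int pending;                      /* nonzero while the timer is on the list */
};

static struct model_timer *timer_list[MAX_TIMERS];
static size_t ntimers = 0;
static unsigned long fired_sum = 0;   /* updated by the sample callback below */

/* Sample timeout function: records the data value of each timer that fired. */
static void note_fired(unsigned long data)
{
    fired_sum += data;
}

void model_add_timer(struct model_timer *t)
{
    assert(ntimers < MAX_TIMERS);
    t->pending = 1;
    timer_list[ntimers++] = t;
}

/* Called once per tick: fire and drop every timer whose expiry time has come. */
void model_run_timers(unsigned long now)
{
    size_t i = 0;

    while (i < ntimers) {
        struct model_timer *t = timer_list[i];
        if (t->expires <= now) {
            timer_list[i] = timer_list[--ntimers];  /* unlink before calling */
            t->pending = 0;
            t->function(t->data);
        } else {
            i++;
        }
    }
}
```

A timer is removed from the list before its function is called, so each timer fires once.
The timeout function runs in whatever context calls model_run_timers, not in the context of the code
that added the timer, which mirrors the restriction on timeout functions mentioned above.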
8.5. Timing with special CPU Instructions
Modern CPUs have special purpose Model Specific Registers (MSRs) associated with them for performance
measurement, timing and debugging purposes. There are macros for accessing these MSRs,
but let's take this opportunity to learn a bit of GCC Inline Assembly Language.
8.5.1. GCC Inline Assembly
It may sometimes be convenient (and necessary) to mix assembly code with C.
We are not talking of C callable assembly language functions or assembly callable C functions;
we are talking of C code woven around assembly. An example will make the idea clear.
8.5.1.1. The CPUID Instruction
Modern Intel CPUs (as well as Intel clones) have an instruction called CPUID which is used
for gathering information regarding the processor, like, say, the vendor id (GenuineIntel or AuthenticAMD).
Let's think of writing a function:
char* vendor_id();
which uses the CPUID instruction to retrieve the vendor id. We will obviously have to call
the CPUID instruction and transfer the values which it stores in registers to C variables.
Let's first look at what Intel has to say about CPUID:
If the EAX register contains an input value of 0, CPUID returns the vendor identification string
in EBX, EDX and ECX registers. These registers will contain the ASCII string 'GenuineIntel'.
Here is a function which returns the vendor id:
#include <stdlib.h>

char* vendor_id()
{
    unsigned int p, q, r;
    int i, j;
    char* result = malloc(13 * sizeof(char));

    asm("movl $0, %%eax;"
        "cpuid"
        : "=b"(p), "=c"(q), "=d"(r)
        :
        : "%eax");

    /* The string is spread over ebx, edx, ecx, in that order */
    for (i = 0, j = 0; i < 4; i++, j++)
        result[j] = *((char*)&p + i);
    for (i = 0; i < 4; i++, j++)
        result[j] = *((char*)&r + i);
    for (i = 0; i < 4; i++, j++)
        result[j] = *((char*)&q + i);
    result[j] = 0;
    return result;
}
How does it work? The template of an inline assembler sequence is:
asm(instructions
    : output operands
    : input operands
    : clobbered register list)
Except the first (ie, instructions), everything is optional.
The real power of inline assembly lies in its ability to operate directly on C variables and expressions.
Let's take each line and understand what it does.
The first line is the instruction
movl $0, %eax
which means copy the immediate value 0 into register eax. The $ and % are merely part of the syntax.
Note that we have to write %%eax in the instruction part; it gets translated to %eax
(again, there is a reason for this, which we conveniently ignore).
The output operands specify a mapping between C variables (l-values) and CPU registers.
"=b"(p) means the C variable 'p' is bound to the ebx register,
"=c"(q) means variable 'q' is bound to the ecx register and
"=d"(r) means that the variable 'r' is bound to register edx.
We leave the input operands section empty. The clobber list specifies those registers,
other than those specified in the output list, which the execution of this sequence of instructions
would alter. If the compiler is storing some variable in register eax, it should not assume
that the value remains unchanged after execution of the instructions given within the 'asm';
the clobber list thus acts as a warning to the compiler.
So, after the execution of CPUID, the ebx, edx, and ecx registers (each 4 bytes long)
would contain the ASCII values of each character of the string AuthenticAMD (our system is an AMD Athlon).
Because the variables p, r, q are mapped to these registers, we can easily transfer the ASCII values
into a proper null terminated char array.
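The register-to-string step can be checked on any machine, with no CPUID instruction involved, by feeding in
the register values by hand. The sketch below assumes hypothetical helper names (unpack_reg, vendor_from_regs)
and uses shifts instead of pointer casts, so it does not depend on the byte order of the machine running it.

```c
#include <assert.h>
#include <stdint.h>

/* Unpack one 32-bit register into 4 characters, least significant byte first. */
static void unpack_reg(uint32_t reg, char *dst)
{
    int i;

    for (i = 0; i < 4; i++)
        dst[i] = (char)((reg >> (8 * i)) & 0xff);
}

/* Build the 12-character vendor string from EBX, EDX, ECX (in that order). */
void vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    assert(out != 0);
    unpack_reg(ebx, out);
    unpack_reg(edx, out + 4);
    unpack_reg(ecx, out + 8);
    out[12] = '\0';
}
```

For example, the register values 0x756e6547, 0x49656e69 and 0x6c65746e (EBX, EDX, ECX) unpack to
"Genu", "ineI" and "ntel", which join up as GenuineIntel.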
8.5.2. The Time Stamp Counter
The Intel Time Stamp Counter gets incremented every CPU clock cycle.
It's a 64 bit register and can be read using the 'rdtsc' assembly instruction, which stores the result
in eax (low) and edx (high).
#include <stdio.h>

int main(void)
{
    unsigned int low, high;

    asm("rdtsc"
        : "=a"(low), "=d"(high));

    printf("%u, %u\n", high, low);
    return 0;
}
You can look into /usr/src/linux/include/asm/msr.h to learn about the macros which manipulate MSRs.
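The two 32-bit halves are usually joined into a single 64-bit count before doing arithmetic on them.
A minimal sketch follows; the function names are hypothetical, and the conversion to microseconds assumes
that the counter ticks at a known, constant rate of cpu_mhz megahertz, which the caller has to supply.

```c
#include <assert.h>
#include <stdint.h>

/* Join the two 32-bit halves returned by rdtsc into one 64-bit count. */
uint64_t tsc_combine(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* Cycles elapsed between two readings; unsigned subtraction handles wraparound. */
uint64_t tsc_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a clock of cpu_mhz megahertz. */
uint64_t cycles_to_usecs(uint64_t cycles, unsigned int cpu_mhz)
{
    assert(cpu_mhz > 0);
    return cycles / cpu_mhz;
}
```

Reading the counter before and after a piece of code and passing both values to tsc_elapsed gives the number
of clock cycles the code took; on a 1000 MHz clock, 2000000 cycles correspond to 2000 microseconds.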
Chapter 9. Interrupt Handling
We examine how to use the PC parallel port to interface to real world devices.
The basics of interrupt handling too will be introduced.
9.1. User level access
The PC printer port is usually located at I/O Port address 0x378.
Using instructions like outb and inb it is possible to write/read data to/from the port.
#include <stdio.h>
#include <asm/io.h>     /* on newer glibc systems: <sys/io.h> */

#define LPT_DATA 0x378
#define LPT_STATUS 0x379
#define LPT_CONTROL 0x37a

int main(void)
{
    unsigned char c;

    iopl(3);
    outb(0xff, LPT_DATA);
    c = inb(LPT_DATA);
    printf("%x\n", c);
    return 0;
}
Before we call outb/inb on a port, we must set the I/O privilege level by calling iopl.
Only the superuser can execute iopl, so this program can be executed only by root.
We are writing hex ff to the data port of the parallel interface (there is a status as well as a control port
associated with the parallel interface). Pin numbers 2 to 9 of the parallel interface are output pins;
the result of executing this program will be 'visible' if you connect some LEDs between these pins
and pin 25 (ground) through a 1 KOhm current limiting resistor.
All the LEDs will light up! (The pattern which we are writing is, in binary, 11111111;
each bit controls one pin of the port:
bit D0 controls pin 2, bit D1 pin 3 and so on.)
Note that it may sometimes be necessary to compile the program with the -O flag to gcc.
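The mapping between data bits and connector pins is simple enough to capture in a few helper functions.
These are hypothetical helpers written for illustration (pin_for_data_bit, data_byte_for_pin, pin_is_high);
they only compute values and never touch the port, so they can be run without root privileges.

```c
#include <assert.h>

/* Data bit D0 drives pin 2, D1 pin 3, ... D7 pin 9. */
int pin_for_data_bit(int bit)
{
    assert(bit >= 0 && bit <= 7);
    return bit + 2;
}

/* Byte to write to the data port so that only the LED on 'pin' lights up. */
unsigned char data_byte_for_pin(int pin)
{
    assert(pin >= 2 && pin <= 9);
    return (unsigned char)(1u << (pin - 2));
}

/* Nonzero if writing 'value' to the data port drives 'pin' high. */
int pin_is_high(unsigned char value, int pin)
{
    return (value & data_byte_for_pin(pin)) != 0;
}
```

Writing data_byte_for_pin(9), that is 0x80, would light only the LED on pin 9, while 0xff makes
pin_is_high true for every pin from 2 to 9.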
9.2. Access through a driver
Here is a simple driver program which helps us play with the parallel port using Unix commands
like cat, echo, dd etc.
#define LPT_DATA 0x378
#define BUFLEN 1024

static int
foo_read(struct file* filp, char *buf,
         size_t count, loff_t *f_pos)
{
    unsigned char c;

    if (count == 0) return 0;
    if (*f_pos == 1) return 0;      /* one byte per open: report end of file */
    c = inb(LPT_DATA);
    if (copy_to_user(buf, &c, 1))
        return -EFAULT;
    *f_pos = *f_pos + 1;
    return 1;
}

static int
foo_write(struct file* filp, const char *buf,
          size_t count, loff_t *f_pos)
{
    unsigned char s[BUFLEN];
    int i;

    /* Ignore extra data */
    if (count > BUFLEN) count = BUFLEN;
    if (copy_from_user(s, buf, count))
        return -EFAULT;
    for (i = 0; i < count; i++)
        outb(s[i], LPT_DATA);
    return count;
}
We load the module and create a device file called 'led'. Now, if we try:
echo -n abcd > led
All the characters (ie, ASCII values) will be written to the port, one after the other.
If we read back, we should be able to see the effect of the last write, ie, the character 'd'.
9.3. Elementary interrupt handling
Pin 10 of the PC parallel port is an interrupt input pin. A low to high transition on this pin
will generate Interrupt number 7. But first, we have to enable interrupt processing by writing
a 1 to bit 4 of the parallel port control register (which is at BASE+2).
Our 'hardware' will consist of a piece of wire between pin 2 (output pin) and pin 10 (interrupt input).
It is easy for us to trigger a hardware interrupt by making pin 2 go from low to high.
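The two conditions just described, the enable bit in the control register and a low to high transition on the
pin, can be written down as a tiny model. This is a sketch only: the names LPT_CTRL_IRQ_ENABLE,
ctrl_enable_irq, ctrl_disable_irq and interrupt_raised are hypothetical, and the functions compute values
without touching the hardware. The enable and disable helpers change bit 4 alone and leave the other
control bits as they were.

```c
#include <assert.h>

#define LPT_CTRL_IRQ_ENABLE (1u << 4)   /* bit 4 of the control register, value 0x10 */

/* New control value with interrupt generation switched on, other bits untouched. */
unsigned char ctrl_enable_irq(unsigned char ctrl)
{
    return (unsigned char)(ctrl | LPT_CTRL_IRQ_ENABLE);
}

/* New control value with interrupt generation switched off, other bits untouched. */
unsigned char ctrl_disable_irq(unsigned char ctrl)
{
    return (unsigned char)(ctrl & ~LPT_CTRL_IRQ_ENABLE);
}

/* Nonzero if moving pin 10 from level 'before' to level 'after' raises the IRQ. */
int interrupt_raised(unsigned char ctrl, int before, int after)
{
    assert((before == 0 || before == 1) && (after == 0 || after == 1));
    return (ctrl & LPT_CTRL_IRQ_ENABLE) && before == 0 && after == 1;
}
```

In a real program the current control value would first be read with inb from BASE+2, passed through
ctrl_enable_irq, and written back with outb.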
#define LPT1_IRQ 7
#define LPT1_BASE 0x378

static char *name = "foo";
static int major;
DECLARE_WAIT_QUEUE_HEAD(foo_queue);

static int
foo_read(struct file* filp, char *buf,
         size_t count, loff_t *f_pos)
{
    static char c = 'a';

    if (count == 0) return 0;
    interruptible_sleep_on(&foo_queue);     /* sleep until the interrupt arrives */
    if (copy_to_user(buf, &c, 1))
        return -EFAULT;
    if (c == 'z') c = 'a';
    else c++;
    return 1;
}

void lpt1_irq_handler(int irq, void* data,
                      struct pt_regs *regs)
{
    printk("irq: %d triggered\n", irq);
    wake_up_interruptible(&foo_queue);
}

int init_module(void)
{
    int result;

    major = register_chrdev(0, name, &fops);
    printk("Registered, got major = %d\n", major);
    /* Enable parallel port interrupt */
    outb(0x10, LPT1_BASE+2);
    result = request_irq(LPT1_IRQ, lpt1_irq_handler,
                         SA_INTERRUPT, "foo", 0);
    if (result) {
        printk("Interrupt registration failed\n");
        return result;
    }
    return 0;
}

void cleanup_module(void)
{
    printk("Freeing irq...\n");
    free_irq(LPT1_IRQ, 0);
    printk("Freed...\n");
    unregister_chrdev(major, name);
}
Note the arguments to 'request_irq'. The first one is an IRQ number, the second is the address
of a handler function, the third is a flag (SA_INTERRUPT stands for fast interrupt;
we shall not go into the details), the fourth argument is a name and the fifth argument is 0.
The function basically registers a handler for IRQ 7. When the handler gets called,
its first argument will be the IRQ number of the interrupt which caused the handler to be called.
We are not using the second and third arguments. In cleanup_module, we tell the kernel that we are
no longer interested in IRQ 7. The registration of the interrupt handler should really be done
only in the foo_open function, and the freeing up done when the last process which had
the device file open closes it.
It is instructive to examine /proc/interrupts while the module is loaded.
You have to write a small application program to trigger the interrupt (make pin 2 low, then high).
#include <stdio.h>
#include <unistd.h>
#include <asm/io.h>

#define LPT1_BASE 0x378

void enable_int()
{
    outb(0x10, LPT1_BASE+2);
}

void low()
{
    outb(0x0, LPT1_BASE);
}

void high()
{
    outb(0x1, LPT1_BASE);
}

/* Low to high transition on pin 2, which is wired to pin 10 */
void trigger()
{
    low();
    usleep(1);
    high();
}

int main(void)
{
    iopl(3);
    enable_int();
    while (1) {
        trigger();
        getchar();      /* one interrupt per Enter key */
    }
}
9.3.1. Tasklets and Bottom Halves
The interrupt handler runs with interrupts disabled; if the handler takes too much time to execute,
it would affect the performance of the system as a whole. Linux solves the problem in this way:
the interrupt routine responds as fast as possible
(say, it copies data from a network card to a buffer in kernel memory)
and then schedules a job to be done later on.
This job takes care of processing the data,
and it runs with interrupts enabled.
Task queues and kernel timers can be used for scheduling jobs to be done at a later time,
but the preferred mechanism is a tasklet.
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/interrupt.h>
#include <asm/uaccess.h>
#include <asm/irq.h>
#include <asm/io.h>

#define LPT1_IRQ 7
#define LPT1_BASE 0x378

static char *name = "foo";
static int major;
static void foo_tasklet_handler(unsigned long data);

DECLARE_WAIT_QUEUE_HEAD(foo_queue);
DECLARE_TASKLET(foo_tasklet, foo_tasklet_handler, 0);

static int
foo_read(struct file* filp, char *buf,
         size_t count, loff_t *f_pos)
{
    static char c = 'a';

    if (count == 0) return 0;
    interruptible_sleep_on(&foo_queue);
    if (copy_to_user(buf, &c, 1))
        return -EFAULT;
    if (c == 'z') c = 'a';
    else c++;
    return 1;
}

/* Runs later, with interrupts enabled */
static void foo_tasklet_handler(unsigned long data)
{
    printk("In tasklet...\n");
    wake_up_interruptible(&foo_queue);
}

/* Runs with interrupts disabled: do the minimum and defer the rest */
void lpt1_irq_handler(int irq, void* data,
                      struct pt_regs *regs)
{
    printk("irq: %d triggered, scheduling tasklet\n", irq);
    tasklet_schedule(&foo_tasklet);
}

int init_module(void)
{
    int result;

    major = register_chrdev(0, name, &fops);
    printk("Registered, got major = %d\n", major);
    /* Enable parallel port interrupt */
    outb(0x10, LPT1_BASE+2);
    result = request_irq(LPT1_IRQ, lpt1_irq_handler,
                         SA_INTERRUPT, "foo", 0);
    if (result) {
        printk("Interrupt registration failed\n");
        return result;
    }
    return 0;
}

void cleanup_module(void)
{
    printk("Freeing irq...\n");
    free_irq(LPT1_IRQ, 0);
    printk("Freed...\n");
    unregister_chrdev(major, name);
}
The DECLARE_TASKLET macro takes a tasklet name, a tasklet function and a data value as arguments.
The tasklet_schedule function schedules the tasklet for future execution.
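The split between "mark the work as pending now" and "run it later" can be illustrated with a small user-space
model. This is a conceptual sketch and not how the kernel implements tasklets: the names model_tasklet,
model_tasklet_schedule, model_run_pending and sample_tasklet are hypothetical. It does reproduce one
property of real tasklets: scheduling a tasklet that is already waiting to run does not make it run twice.

```c
#include <assert.h>

struct model_tasklet {
    void (*func)(unsigned long);  /* work to run later */
    unsigned long data;           /* argument for func */
    int scheduled;                /* nonzero while waiting to run */
};

static int runs = 0;              /* updated by the sample tasklet function */

static void sample_tasklet(unsigned long data)
{
    runs += (int)data;
}

/* Called from the (modelled) interrupt handler: just mark the work as pending. */
void model_tasklet_schedule(struct model_tasklet *t)
{
    assert(t->func != 0);
    t->scheduled = 1;             /* scheduling again before it runs changes nothing */
}

/* Called later, outside the handler: run the tasklet once if it is pending. */
void model_run_pending(struct model_tasklet *t)
{
    if (t->scheduled) {
        t->scheduled = 0;
        t->func(t->data);
    }
}
```

The handler side does nothing more than set a flag, which is why it can return quickly; the expensive part
happens in model_run_pending, at a time chosen by whoever calls it.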
Chapter 11. A Simple Real Time Clock Driver
11.1. Introduction
How does the PC "remember" the date and time even when you power it off?
There is a small amount of battery powered RAM together with a simple oscillator circuit
which keeps on ticking always. The oscillator is called a real time clock (RTC)
and the battery powered RAM is called the CMOS RAM. Other than storing the date and time,
the CMOS RAM also stores the configuration details of your computer (for example,
which device to boot from).
The CMOS RAM as well as the RTC control and status registers are accessed via two ports,
an address port (0x70) and a data port (0x71).
Suppose we wish to access the 0th byte of the 64 byte CMOS RAM (RTC control and status registers
included in this range): we write the address 0 to the address port (only the lower 5 bits should be used)
and read a byte from the data port. The 0th byte stores the seconds part of system time in BCD format.
Here is an example program which does this.
Example 11-1. Reading from CMOS RAM
#include <stdio.h>
#include <asm/io.h>

#define ADDRESS_REG 0x70
#define DATA_REG 0x71
#define ADDRESS_REG_MASK 0xe0
#define SECOND 0x00

int main(void)
{
    unsigned char i, j;

    iopl(3);

    i = inb(ADDRESS_REG);
    i = i & ADDRESS_REG_MASK;   /* clear the lower 5 bits */
    i = i | SECOND;             /* select the seconds byte */
    outb(i, ADDRESS_REG);
    j = inb(DATA_REG);
    printf("j=%x\n", j);
    return 0;
}
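The two computations in this program, composing the value for the address port and interpreting the BCD byte
that comes back, can be tested separately from the hardware. The helpers below are a sketch with hypothetical
names (cmos_select, bcd_to_number); they perform no port I/O.

```c
#include <assert.h>

#define ADDRESS_REG_MASK 0xe0   /* keeps the upper 3 bits of the address port value */

/* Value to write to the address port to select CMOS location 'addr'. */
unsigned char cmos_select(unsigned char current, unsigned char addr)
{
    assert((addr & ADDRESS_REG_MASK) == 0);   /* address must fit in the lower 5 bits */
    return (unsigned char)((current & ADDRESS_REG_MASK) | addr);
}

/* Decode a BCD byte such as 0x59 into the number 59. */
unsigned int bcd_to_number(unsigned char bcd)
{
    assert((bcd >> 4) <= 9 && (bcd & 0x0f) <= 9);
    return (bcd >> 4) * 10u + (bcd & 0x0f);
}
```

If the program above prints j=59, the seconds value is fifty-nine, not 0x59 = 89: each hex digit of a BCD byte
is one decimal digit.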
11.2. Enabling periodic interrupts
The RTC is capable of generating periodic interrupts at rates from 2Hz to 8192Hz.
This is done by setting the PI bit of the RTC Status Register B (which is at address 0xb).
The frequency is selected by writing a 4 bit "rate" value to Status Register A (address 0xa) -
the rate can vary from 0011 to 1111 (binary). Frequency is derived from rate using
the formula f=65536/2^rate. RTC interrupts are reported via IRQ 8.
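The rate formula can be evaluated with a shift, since dividing 65536 by 2^rate is the same as shifting it right
by rate bits. A small sketch with hypothetical helper names (rtc_rate_to_hz, rtc_hz_to_rate):

```c
#include <assert.h>

/* Interrupt frequency in Hz for a 4-bit rate value, f = 65536 / 2^rate. */
unsigned int rtc_rate_to_hz(unsigned int rate)
{
    assert(rate >= 3 && rate <= 15);
    return 65536u >> rate;
}

/* Smallest valid rate whose frequency does not exceed 'hz'; 0 if there is none. */
unsigned int rtc_hz_to_rate(unsigned int hz)
{
    unsigned int rate;

    for (rate = 3; rate <= 15; rate++)
        if (rtc_rate_to_hz(rate) <= hz)
            return rate;
    return 0;
}
```

The two ends of the range give rtc_rate_to_hz(3) = 8192 and rtc_rate_to_hz(15) = 2, which matches the
2Hz to 8192Hz range quoted above.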
Here is a program which puts the RTC in periodic interrupt generation mode.
Example 11-2. rtc.c: generate periodic interrupts
#include <linux/config.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <asm/io.h>

#define ADDRESS_REG 0x70
#define DATA_REG 0x71
#define ADDRESS_REG_MASK 0xe0
#define STATUS_A 0x0a
#define STATUS_B 0x0b
#define STATUS_C 0x0c
#define SECOND 0x00

#include "rtc.h"
#define RTC_IRQ 8
#define MODULE_NAME "rtc"

unsigned char
rtc_inb(unsigned char addr)
{
    unsigned char i, j;

    i = inb(ADDRESS_REG);
    /* Clear lower 5 bits */
    i = i & ADDRESS_REG_MASK;
    i = i | addr;
    outb(i, ADDRESS_REG);
    j = inb(DATA_REG);
    return j;
}

void
rtc_outb(unsigned char data, unsigned char addr)
{
    unsigned char i;

    i = inb(ADDRESS_REG);
    /* Clear lower 5 bits */
    i = i & ADDRESS_REG_MASK;
    i = i | addr;
    outb(i, ADDRESS_REG);
    outb(data, DATA_REG);
}

void
enable_periodic_interrupt(void)
{
    unsigned char c;

    c = rtc_inb(STATUS_B);
    /* set Periodic Interrupt enable bit */
    c = c | (1 << 6);
    rtc_outb(c, STATUS_B);
    rtc_inb(STATUS_C);
}

void
disable_periodic_interrupt(void)
{
    unsigned char c;

    c = rtc_inb(STATUS_B);
    /* clear Periodic Interrupt enable bit */
    c = c & ~(1 << 6);
    rtc_outb(c, STATUS_B);
}

int
set_periodic_interrupt_rate(unsigned char rate)
{
    unsigned char c;

    /* valid rates are 3 to 15 */
    if ((rate < 3) || (rate > 15)) return -EINVAL;
    printk("setting rate %d\n", rate);
    c = rtc_inb(STATUS_A);
    c = c & ~0xf;   /* Clear 4 bits LSB */
    c = c | rate;
    rtc_outb(c, STATUS_A);
    printk("new rate = %d\n", rtc_inb(STATUS_A) & 0xf);
    return 0;
}

void
rtc_int_handler(int irq, void *devid, struct pt_regs *regs)
{
    printk("Handler called...\n");
    rtc_inb(STATUS_C);
}

int
rtc_init_module(void)
{
    int result;

    result = request_irq(RTC_IRQ, rtc_int_handler,
                         SA_INTERRUPT, MODULE_NAME, 0);
    if (result < 0) {
        printk("Unable to get IRQ %d\n", RTC_IRQ);
        return result;
    }
    disable_periodic_interrupt();
    set_periodic_interrupt_rate(15);
    enable_periodic_interrupt();
    return result;
}

void
rtc_cleanup(void)
{
    free_irq(RTC_IRQ, 0);
    return;
}

module_init(rtc_init_module);
module_exit(rtc_cleanup);
Your Linux kernel may already have an RTC driver compiled in; in that case you will have to compile
a new kernel without the RTC driver, otherwise the above program may fail to acquire the interrupt line.
11.3. Implementing a blocking read
The RTC helps us play with interrupts without using any external circuits.
Suppose we invoke "read" on a device driver. The read method of the driver will transfer data
to user space only if some data is available; otherwise, our process should be put to sleep
and woken up later (when data arrives).
Most peripheral devices generate interrupts when data is available, so the interrupt service routine
can be given the job of waking up processes which were put to sleep in the read method.
We try to simulate this situation using the RTC.
Our read method does not transfer any data; it simply goes to sleep and gets woken up
when an interrupt arrives.
Example 11-3. Implementing blocking read
#define ADDRESS_REG 0x70
#define DATA_REG 0x71
#define ADDRESS_REG_MASK 0xe0
#define STATUS_A 0x0a
#define STATUS_B 0x0b
#define STATUS_C 0x0c
#define SECOND 0x00

#define RTC_PIE_ON 0x10
#define RTC_IRQP_SET 0x20
#define RTC_PIE_OFF 0x30

#include <linux/config.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <asm/io.h>

#include "rtc.h"
#define RTC_IRQ 8
#define MODULE_NAME "rtc"
static int major;

DECLARE_WAIT_QUEUE_HEAD(rtc_queue);

unsigned char
rtc_inb(unsigned char addr)
{
    unsigned char i, j;

    i = inb(ADDRESS_REG);
    i = i & ADDRESS_REG_MASK;
    i = i | addr;
    outb(i, ADDRESS_REG);
    j = inb(DATA_REG);
    return j;
}

void
rtc_outb(unsigned char data, unsigned char addr)
{
    unsigned char i;

    i = inb(ADDRESS_REG);
    i = i & ADDRESS_REG_MASK;
    i = i | addr;
    outb(i, ADDRESS_REG);
    outb(data, DATA_REG);
}

void
enable_periodic_interrupt(void)
{
    unsigned char c;

    c = rtc_inb(STATUS_B);
    /* set Periodic Interrupt enable bit */
    c = c | (1 << 6);
    rtc_outb(c, STATUS_B);
    rtc_inb(STATUS_C);      /* Start interrupts! */
}

void
disable_periodic_interrupt(void)
{
    unsigned char c;

    c = rtc_inb(STATUS_B);
    /* clear Periodic Interrupt enable bit */
    c = c & ~(1 << 6);
    rtc_outb(c, STATUS_B);
}

int
set_periodic_interrupt_rate(unsigned char rate)
{
    unsigned char c;

    /* valid rates are 3 to 15 */
    if ((rate < 3) || (rate > 15)) return -EINVAL;
    printk("setting rate %d\n", rate);
    c = rtc_inb(STATUS_A);
    c = c & ~0xf;           /* Clear 4 bits LSB */
    c = c | rate;
    rtc_outb(c, STATUS_A);
    printk("new rate = %d\n", rtc_inb(STATUS_A) & 0xf);
    return 0;
}

void
rtc_int_handler(int irq, void *devid, struct pt_regs *regs)
{
    wake_up_interruptible(&rtc_queue);  /* wake up the sleeping reader */
    rtc_inb(STATUS_C);
}

int
rtc_open(struct inode* inode, struct file *filp)
{
    int result;

    result = request_irq(RTC_IRQ,
                         rtc_int_handler, SA_INTERRUPT, MODULE_NAME, 0);
    if (result < 0) {
        printk("Unable to get IRQ %d\n", RTC_IRQ);
        return result;
    }
    return result;
}

int
rtc_close(struct inode* inode, struct file *filp)
{
    free_irq(RTC_IRQ, 0);
    return 0;
}

int
rtc_ioctl(struct inode* inode, struct file* filp,
          unsigned int cmd, unsigned long val)
{
    int result = 0;

    switch (cmd) {
    case RTC_PIE_ON:
        enable_periodic_interrupt();
        break;
    case RTC_PIE_OFF:
        disable_periodic_interrupt();
        break;
    case RTC_IRQP_SET:
        result = set_periodic_interrupt_rate(val);
        break;
    }
    return result;
}

ssize_t
rtc_read(struct file *filp, char *buf,
         size_t len, loff_t *offp)
{
    /* No data is transferred: just sleep until the next interrupt */
    interruptible_sleep_on(&rtc_queue);
    return 0;
}

struct file_operations fops = {
    open: rtc_open,
    release: rtc_close,
    ioctl: rtc_ioctl,
    read: rtc_read,
};

int
rtc_init_module(void)
{
    major = register_chrdev(0, MODULE_NAME, &fops);
    if (major < 0) {
        printk("Error register char device\n");
        return major;
    }
    printk("major = %d\n", major);
    return 0;
}

void
rtc_cleanup(void)
{
    unregister_chrdev(major, MODULE_NAME);
}

module_init(rtc_init_module);
module_exit(rtc_cleanup);
Here is a user space program which tests the working of this driver.
Example 11-4. User space test program
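#include "rtc.h"
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>

int main(void)
{
    int fd, dat, i, r;
    /* Assumption: the rate argument is missing in the source text;
       any value from 3 to 15 is valid, and 15 selects 2Hz */
    int rate = 15;

    fd = open("rtc", O_RDONLY);
    assert(fd >= 0);
    r = ioctl(fd, RTC_PIE_ON, 0);
    assert(r == 0);
    r = ioctl(fd, RTC_IRQP_SET, rate);
    assert(r == 0);
    for (i = 0; i < 20; i++) {
        read(fd, &dat, sizeof(dat));    /* blocks until the next interrupt */
        printf("i = %d\n", i);
    }
    return 0;
}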
11.4. Generating Alarm Interrupts
The RTC can be instructed to generate an interrupt after a specified period.
The idea is simple. Locations 0x1, 0x3 and 0x5 should store the second, minute and hour
at which the alarm should occur. If the Alarm Interrupt (AI) bit of Status Register B is set,
then the RTC will compare the current time (second, minute and hour) with the alarm time
each instant the time gets updated. If they match, an interrupt is raised on IRQ 8.
Example 11-5. Generating Alarm Interrupts
#define ADDRESS_REG 0x70
#define DATA_REG 0x71
#define ADDRESS_REG_MASK 0xe0
#define STATUS_A 0x0a
#define STATUS_B 0x0b
#define STATUS_C 0x0c

#define SECOND 0x00
#define ALRM_SECOND 0x01
#define MINUTE 0x02
#define ALRM_MINUTE 0x03
#define HOUR 0x04
#define ALRM_HOUR 0x05

#define RTC_PIE_ON 0x10     /* Enable Periodic Interrupt */
#define RTC_IRQP_SET 0x20   /* Set periodic interrupt rate */
#define RTC_PIE_OFF 0x30    /* Disable Periodic Interrupt */
#define RTC_AIE_ON 0x40     /* Enable Alarm Interrupt */
#define RTC_AIE_OFF 0x50    /* Disable Alarm Interrupt */

/* Set seconds after which alarm should be raised */
#define RTC_ALRMSECOND_SET 0x60

#include <linux/config.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <asm/io.h>

#include "rtc.h"
#define RTC_IRQ 8
#define MODULE_NAME "rtc"
static int major;

DECLARE_WAIT_QUEUE_HEAD(rtc_queue);

/* rtc_inb(), rtc_outb() and the periodic interrupt functions
   are the same as in Example 11-3 and are not repeated here */

int
bin_to_bcd(unsigned char c)
{
    return ((c/10) << 4) | (c % 10);
}

int
bcd_to_bin(unsigned char c)
{
    return (c >> 4)*10 + (c & 0xf);
}

void
enable_alarm_interrupt(void)
{
    unsigned char c;

    printk("Enabling alarm interrupts\n");
    c = rtc_inb(STATUS_B);
    c = c | (1 << 5);
    rtc_outb(c, STATUS_B);
    printk("STATUS_B = %x\n", rtc_inb(STATUS_B));
    rtc_inb(STATUS_C);
}

void
disable_alarm_interrupt(void)
{
    unsigned char c;

    c = rtc_inb(STATUS_B);
    c = c & ~(1 << 5);
    rtc_outb(c, STATUS_B);
}

/* Raise an alarm after nseconds (nseconds <= 59) */
void
alarm_after_nseconds(int nseconds)
{
    unsigned char second, minute, hour;
    int total;

    second = rtc_inb(SECOND);
    minute = rtc_inb(MINUTE);
    hour = rtc_inb(HOUR);

    /* Carry into the minute only when the seconds overflow,
       and into the hour only when the minutes wrap around */
    total = bcd_to_bin(second) + nseconds;
    second = bin_to_bcd(total % 60);
    if (total >= 60) {
        minute = bin_to_bcd((bcd_to_bin(minute)+1) % 60);
        if (minute == 0)
            hour = bin_to_bcd((bcd_to_bin(hour)+1) % 24);
    }

    rtc_outb(second, ALRM_SECOND);
    rtc_outb(minute, ALRM_MINUTE);
    rtc_outb(hour, ALRM_HOUR);
}

int
rtc_ioctl(struct inode* inode, struct file* filp,
          unsigned int cmd, unsigned long val)
{
    int result = 0;

    switch (cmd) {
    case RTC_PIE_ON:
        enable_periodic_interrupt();
        break;
    case RTC_PIE_OFF:
        disable_periodic_interrupt();
        break;
    case RTC_IRQP_SET:
        result = set_periodic_interrupt_rate(val);
        break;
    case RTC_AIE_ON:
        enable_alarm_interrupt();
        break;
    case RTC_AIE_OFF:
        disable_alarm_interrupt();
        break;
    case RTC_ALRMSECOND_SET:
        alarm_after_nseconds(val);
        break;
    }
    return result;
}
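The alarm time arithmetic is the part of this driver most likely to go wrong, and it can be checked in user
space. The sketch below uses hypothetical names (struct hms, alarm_time) and plain binary values instead of
BCD. It carries into the minute whenever the seconds sum passes 59, and into the hour only when that carry
makes the minute wrap to 0.

```c
#include <assert.h>

struct hms {
    unsigned int hour, minute, second;   /* plain binary values, not BCD */
};

/* Time of day nseconds after 'now', with carries into minute and hour. */
struct hms alarm_time(struct hms now, unsigned int nseconds)
{
    struct hms t = now;
    unsigned int total;

    assert(nseconds <= 59);
    assert(now.hour < 24 && now.minute < 60 && now.second < 60);

    total = now.second + nseconds;
    t.second = total % 60;
    if (total >= 60) {                   /* seconds overflowed: carry into minutes */
        t.minute = (now.minute + 1) % 60;
        if (t.minute == 0)               /* minutes wrapped: carry into hours */
            t.hour = (now.hour + 1) % 24;
    }
    return t;
}
```

Two cases are worth trying by hand: 23:59:50 plus 20 seconds must give 00:00:10 (both carries taken),
and 10:00:05 plus 10 seconds must stay at 10:00:15 (no carry, even though the minute is 0).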