
Deeper into the Source

This section covers the concepts of system calls and device drivers (also called modules) under Linux. System calls give the programmer a way to communicate with the operating system, and adding a new system call is one way to add a new kernel service. Chapter 3 describes the internal implementation of system calls; this section covers the practical aspects of adding your own system calls to the kernel.

A device driver is the interface through which the kernel lets a programmer control input/output devices. Entire books have been written on Linux device drivers. We first look at how a driver is represented in the filesystem, then build up the functionality of a character driver, and finally create a system call and build it into the kernel.

The Filesystem

Devices in Linux are accessed through the /dev directory. For example, the command ls -l /dev/random prints the following:

 crw-rw-rw- 1 root  root  1, 8 Oct 2 08:08 /dev/random
 

The leading "c" indicates a character device, whereas "b" would indicate a block device. After the owner and group come two numbers (1, 8): 1 is the driver's major number and 8 is the minor number. A driver registers with the kernel under its major number, and the kernel uses that major number to find the driver when the device file is opened. The minor number is passed to the driver as a parameter identifying the specific device. For example, /dev/urandom has major number 1 and minor number 9, so the driver registered with major number 1 manages both /dev/random and /dev/urandom.
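The same two numbers can be read from a program rather than from ls output. The following is a minimal user-space sketch, assuming a glibc system where major() and minor() come from <sys/sysmacros.h>; device_numbers() is a hypothetical helper name, not a kernel or libc function.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() on glibc */
#include <sys/types.h>

/*
 * device_numbers() is a hypothetical helper, not a kernel API.
 * Returns 0 and fills in *maj and *mnr if path names a character
 * or block device file; returns -1 otherwise.
 */
int device_numbers(const char *path, unsigned int *maj, unsigned int *mnr)
{
    struct stat st;

    assert(path != NULL && maj != NULL && mnr != NULL);

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;

    /* st_rdev holds the device number that the device file refers to. */
    *maj = major(st.st_rdev);
    *mnr = minor(st.st_rdev);
    return 0;
}
```

On the system shown in the listing above, calling device_numbers("/dev/random", &maj, &mnr) would be expected to yield the pair (1, 8).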

[1] mknod creates both block and character device files.

To generate a random number, we simply read from /dev/random. The following reads 4 bytes (using /dev/urandom, the non-blocking variant served by the same driver):[2]

 
 lkp@lkp:~$ head -c4 /dev/urandom | od -x
 0000000 823a 3be5
 0000004
 
 

If you repeat the command, you will see that the 4 bytes [823a 3be5] change.
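The shell pipeline above can also be written as a short C routine. This is a sketch only: read_urandom() is a hypothetical helper name, and for brevity it treats an interrupted read as a failure rather than retrying.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * read_urandom() is a hypothetical helper: it fills buf with n bytes
 * read from /dev/urandom. Returns n on success and -1 on any error.
 */
ssize_t read_urandom(unsigned char *buf, size_t n)
{
    size_t done = 0;
    int fd;

    assert(buf != NULL);

    fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0)
        return -1;

    /* read() may return fewer bytes than requested, so loop. */
    while (done < n) {
        ssize_t got = read(fd, buf + done, n - done);
        if (got <= 0) {          /* error or unexpected end of file */
            close(fd);
            return -1;
        }
        done += (size_t)got;
    }

    close(fd);
    return (ssize_t)done;
}
```

Each read() on the device file ends up in the driver's read operation, which is the path the rest of this section traces.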

We know that /dev/random has major number 1. Let's look it up in /proc/devices:

 lkp@lkp:~$ less /proc/devices
 Character devices:
  1 mem
 

Let's search the mem driver for references to "random":

 
 -----------------------------------------------------------------------
 drivers/char/mem.c
 653 static int memory_open(struct inode * inode, struct file * filp)
 654 {
 655   switch (iminor(inode)) {
 656     case 1:
 ...
 676     case 8:
 677       filp->f_op = &random_fops;
 678       break;
 679     case 9:
 680       filp->f_op = &urandom_fops;
 681       break;
 -----------------------------------------------------------------------
 
 

Lines 655-681

This switch statement initializes driver structures based on the minor number of the device being opened. Specifically, filp->f_op is set to the file_operations structure (the fops) that matches the device.

"So what is a filp? And what is a fop?"

Filps and Fops

A filp is simply a file struct pointer, and a fop is a file_operations struct pointer. The kernel uses the file_operations structure to determine what functions to call when the file is operated on. Here are selected sections of the structures that are used in the random device driver:

 
 -----------------------------------------------------------------------
 include/linux/fs.h
 556 struct file {
 557   struct list_head  f_list;
 558   struct dentry   *f_dentry;
 559   struct vfsmount   *f_vfsmnt;
 560   struct file_operations *f_op;
 561   atomic_t    f_count;
 562   unsigned int   f_flags;
 ...
 581   struct address_space *f_mapping;
 582 };
 -----------------------------------------------------------------------
 -----------------------------------------------------------------------
 include/linux/fs.h
 863 struct file_operations {
  864   struct module *owner;
  865   loff_t (*llseek) (struct file *, loff_t, int);
  866   ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
  867   ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
  868   ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
  869   ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
  870   int (*readdir) (struct file *, void *, filldir_t);
  871   unsigned int (*poll) (struct file *, struct poll_table_struct *);
  872   int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
 
 ...
 888 };
 -----------------------------------------------------------------------
 
 

Functions that a driver implements must conform to the prototypes listed in the file_operations structure. The random device driver declares which file operations it provides in the following way:

 
 -----------------------------------------------------------------------
 drivers/char/random.c
 1824 struct file_operations random_fops = {
  1825   .read   = random_read,
  1826   .write   = random_write,
  1827   .poll   = random_poll,
  1828   .ioctl   = random_ioctl,
  1829 };
  1830 
  1831 struct file_operations urandom_fops = {
  1832   .read   = urandom_read,
  1833   .write   = random_write,
  1834   .ioctl   = random_ioctl,
 1835 };
 -----------------------------------------------------------------------
 
 

Lines 1824-1829

The random device provides the operations of read, write, poll, and ioctl.

Lines 1831-1835

The urandom device provides the operations of read, write, and ioctl.

The poll operation allows a programmer to check, before performing an operation, whether that operation would block. This suggests, and is indeed the case, that /dev/random blocks if a request is made for more bytes of entropy than are in its entropy pool.[3] /dev/urandom does not block, but might not return completely random data if the entropy pool is too small. For more information, consult your system's man pages, specifically man 4 random.

[3] In the random device driver, entropy refers to system data that cannot be predicted. Typically, it is harvested from keystroke timing, mouse movements, and other irregular input.
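The dispatch described above (a switch on the minor number selects a table of function pointers, and an operation a device does not provide is simply left unset) can be modelled in user space. The sketch below is not kernel code: struct model_fops and every model_* name are hypothetical stand-ins for file_operations, random_fops, urandom_fops and memory_open(), and the read functions return placeholder values.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only: every name here is a hypothetical stand-in. */
struct model_fops {
    const char *name;
    int (*read)(void);
    int (*poll)(void);   /* NULL when the device does not implement poll */
};

static int model_random_read(void)  { return 8; }   /* placeholder result */
static int model_urandom_read(void) { return 9; }   /* placeholder result */
static int model_random_poll(void)  { return 1; }

static const struct model_fops model_random_fops = {
    .name = "random",
    .read = model_random_read,
    .poll = model_random_poll,
};

static const struct model_fops model_urandom_fops = {
    .name = "urandom",
    .read = model_urandom_read,   /* .poll is left NULL, as in urandom_fops */
};

/* Mirrors the switch in memory_open(): choose a table by minor number. */
const struct model_fops *model_open(unsigned int minor_nr)
{
    switch (minor_nr) {
    case 8:
        return &model_random_fops;
    case 9:
        return &model_urandom_fops;
    default:
        return NULL;   /* other minors are not modelled here */
    }
}

/* Callers go through the table and never name a specific device. */
int model_read(const struct model_fops *ops)
{
    assert(ops != NULL && ops->read != NULL);
    return ops->read();
}
```

The point of the design is that the code calling model_read() stays the same for every device; only the table chosen at open time differs.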

Digging deeper into the code, notice that when a read operation is performed on /dev/random, the kernel passes control to the function random_read() (see line 1825). random_read() is defined as follows:

 
 -----------------------------------------------------------------------
 drivers/char/random.c
 1588 static ssize_t
  1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)
 -----------------------------------------------------------------------
 
 

The function parameters are as follows:

  • file. Points to the file structure of the device.

  • buf. Points to an area of user memory where the result is to be stored.

  • nbytes. The size of data requested.

  • ppos. Points to a position within the file that the user is accessing.

This brings up an interesting issue: If the driver executes in kernel space, but the buffer is memory in user space, how do we safely get access to the data in buf? The next section explains the process of moving data between user and kernel memory.

10.1.3. User Memory and Kernel Memory

If we were to simply use memcpy() to copy the buffer from kernel space to user space, the copy might fail because the user space pages could be swapped out when memcpy() runs. Linux provides the functions copy_to_user() and copy_from_user(), which allow drivers to move data safely between kernel space and user space. In random_read(), this is done in the function extract_entropy(), but there is an additional twist:

 
 -----------------------------------------------------------------------
 drivers/char/random.c
 1349 static ssize_t extract_entropy(struct entropy_store *r, void * buf,
  1350        size_t nbytes, int flags)
  1351 {
 ...
 1452     /* Copy data to destination buffer */
 1453     i = min(nbytes, HASH_BUFFER_SIZE*sizeof(__u32)/2);
 1454     if (flags & EXTRACT_ENTROPY_USER) {
 1455       i -= copy_to_user(buf, (__u8 const *)tmp, i);
 1456       if (!i) {
 1457         ret = -EFAULT;
 1458         break;
 1459       }
 1460     } else
 1461       memcpy(buf, (__u8 const *)tmp, i);
 -----------------------------------------------------------------------
 
 

extract_entropy() has the following parameters:

  • r. A pointer to an internal store of entropy; it is ignored for the purposes of our discussion.

  • buf. A pointer to an area of memory that should be filled with data.

  • nbytes. The amount of data to write to buf.

  • flags. Informs the function whether buf is in kernel or user memory.

extract_entropy() returns ssize_t, which is the size, in bytes, of the random data generated.

Lines 1454-1455

If flags tells us that buf points to a location in user memory, we use copy_to_user() to copy the kernel memory pointed to by tmp to the user memory pointed to by buf.

Lines 1460-1461

If buf points to a location in kernel memory, we simply use memcpy() to copy the data.

Obtaining random bytes is something that both kernel space and user space programs are likely to use; a kernel space program can avoid the overhead of copy_to_user() by not setting the flag. For example, the kernel can implement an encrypted filesystem and can avoid the overhead of copying to user space.
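One detail of lines 1454-1459 is easy to miss: copy_to_user() returns the number of bytes it could not copy, so subtracting its result from i leaves the number of bytes actually delivered. The user-space model below illustrates only that arithmetic. It is a sketch under stated assumptions: model_copy_to_user(), model_deliver(), MODEL_EXTRACT_USER and the fail_after parameter (which simulates a fault part-way through the copy) are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_EXTRACT_USER 1   /* hypothetical stand-in for EXTRACT_ENTROPY_USER */

/*
 * Hypothetical stand-in for copy_to_user(). Like the real function it
 * returns the number of bytes NOT copied. fail_after simulates a fault
 * after that many bytes have been written.
 */
size_t model_copy_to_user(void *dst, const void *src, size_t n, size_t fail_after)
{
    size_t ok = n < fail_after ? n : fail_after;

    memcpy(dst, src, ok);
    return n - ok;
}

/* Returns the number of bytes delivered, or -EFAULT if none were. */
long model_deliver(void *buf, const void *tmp, size_t n, int flags, size_t fail_after)
{
    size_t i = n;

    assert(buf != NULL && tmp != NULL);

    if (flags & MODEL_EXTRACT_USER) {
        /* i becomes "requested minus not copied", that is, bytes copied. */
        i -= model_copy_to_user(buf, tmp, i, fail_after);
        if (!i)
            return -EFAULT;
    } else {
        /* Kernel-to-kernel copy: a plain memcpy() is enough. */
        memcpy(buf, tmp, i);
    }
    return (long)i;
}
```

A partial copy is therefore reported as a short count rather than as an error; only a copy that delivers nothing maps to -EFAULT.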

10.1.4. Wait Queues

We detoured slightly to explain how to move data between user and kernel memory. Let's return to random_read() and examine how it uses wait queues.

Occasionally, a driver might need to wait for some condition to be true, perhaps access to a system resource. In this case, we don't want the kernel to wait for the access to complete. It is problematic to cause the kernel to wait because all other system processing halts while the wait occurs.[4] By declaring a wait queue, you can postpone processing until a later time when the condition you are waiting on has occurred.

[4] Actually, the CPU running the kernel task will wait. On a multi-CPU system, other CPUs can continue to run.

Two structures are used for this process of waiting: a wait queue and a wait queue head. A module should create a wait queue head and have parts of the module that use sleep_on and wake_up macros to manage things. This is precisely what occurs in random_read():

 
 -----------------------------------------------------------------------
 drivers/char/random.c
 1588 static ssize_t
  1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)
  1590 {
  1591   DECLARE_WAITQUEUE(wait, current);
 ...
 1597   while (nbytes > 0) {
 ...
 1608     n = extract_entropy(sec_random_state, buf, n,
  1609          EXTRACT_ENTROPY_USER |
  1610          EXTRACT_ENTROPY_LIMIT |
  1611          EXTRACT_ENTROPY_SECONDARY);
 ...
 1618     if (n == 0) {
  1619       if (file->f_flags & O_NONBLOCK) {
  1620         retval = -EAGAIN;
  1621         break;
  1622       }
  1623       if (signal_pending(current)) {
  1624         retval = -ERESTARTSYS;
  1625         break;
  1626       }
 ...
 1632       set_current_state(TASK_INTERRUPTIBLE);
  1633       add_wait_queue(&random_read_wait, &wait);
  1634 
  1635       if (sec_random_state->entropy_count / 8 == 0)
  1636         schedule();
  1637 
  1638       set_current_state(TASK_RUNNING);
  1639       remove_wait_queue(&random_read_wait, &wait);
 ...
 1645       continue;
 1646  }
 -----------------------------------------------------------------------
 
 

Line 1591

The wait queue wait is initialized on the current task. The macro current refers to a pointer to the current task's task_struct.

Lines 1608-1611

We extract a chunk of random data from the device.

Lines 1618-1626

If we could not extract the necessary amount of entropy from the entropy pool and we are non-blocking or there is a signal pending, we return an error to the caller.

Lines 1631-1633

Set up the wait queue. random_read() uses its own wait queue, random_read_wait, instead of the system wait queue.

Lines 1635-1636

At this point, we are on a blocking read and if we don't have 1 byte worth of entropy, we release control of the processor by calling schedule(). (The entropy_count variables hold bits and not bytes; thus, the division by 8 to determine whether we have a full byte of entropy.)

Lines 1638-1639

When we are eventually restarted, we clean up our wait queue.

NOTE

The random device in Linux requires the entropy queue to be full before returning. The urandom device does not have this requirement and returns regardless of the size of data available in the entropy pool.
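The control flow of the read loop can be followed more easily in a simplified user-space model. This is a sketch, not the kernel function: it omits signal handling and the steps elided from the excerpt, the data it returns is a fixed placeholder rather than random, and struct model_pool, model_extract(), model_refill() and model_random_read() are hypothetical names. A callback stands in for sleeping on the wait queue until the entropy gatherers wake the reader.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space model only; every name here is hypothetical. */
struct model_pool {
    unsigned int entropy_bits;                      /* bits, like entropy_count */
    void (*wait_for_entropy)(struct model_pool *);  /* stands in for schedule() */
};

/* Example wake-up source: pretend 32 bits of entropy were gathered. */
void model_refill(struct model_pool *p)
{
    p->entropy_bits += 32;
}

/* Hand out up to n bytes, limited by the whole bytes of entropy available. */
static size_t model_extract(struct model_pool *p, unsigned char *buf, size_t n)
{
    size_t avail = p->entropy_bits / 8;
    size_t take = n < avail ? n : avail;

    memset(buf, 0xAA, take);                 /* placeholder data, not random */
    p->entropy_bits -= (unsigned int)(take * 8);
    return take;
}

/* Returns bytes read, or -EAGAIN if non-blocking and nothing was available. */
long model_random_read(struct model_pool *p, unsigned char *buf,
                       size_t nbytes, int nonblock)
{
    size_t count = 0;

    assert(p != NULL && buf != NULL && p->wait_for_entropy != NULL);

    while (nbytes > 0) {
        size_t n = model_extract(p, buf + count, nbytes);

        if (n == 0) {
            if (nonblock)
                return count ? (long)count : -EAGAIN;
            /* Sleep only if there is still less than one byte of entropy. */
            if (p->entropy_bits / 8 == 0)
                p->wait_for_entropy(p);
            continue;                        /* loop again and retry */
        }
        count += n;
        nbytes -= n;
    }
    return (long)count;
}
```

In this model a blocking read returns only when the full request is satisfied, while a non-blocking read returns whatever it could get, or -EAGAIN if that was nothing.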


Let's look closely at what happens when a task calls schedule():

 
 -----------------------------------------------------------------------
 kernel/sched.c
 2184 asmlinkage void __sched schedule(void)
 2185 {
 ...
 2209   prev = current;
 ...
 2233   switch_count = &prev->nivcsw;
 2234   if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 2235     switch_count = &prev->nvcsw;
 2236     if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
 2237         unlikely(signal_pending(prev))))
 2238       prev->state = TASK_RUNNING;
 2239     else
 2240       deactivate_task(prev, rq);
 2241   } 
 2242 ...
 -----------------------------------------------------------------------
 
 

Line 2209

A pointer to the current task's task structure is stored in the prev variable. In cases where the task itself called schedule(), current points to that task.

Line 2233

We store the task's context switch counter, nivcsw, in switch_count. This is incremented later if the switch is successful.[5]

[5] See Chapters 4 and 7 for more information on how context switch counters are used.

Line 2234

We only enter this if statement when the task's state, prev->state, is non-zero and there is not a kernel preemption. In other words, we enter this statement when a task's state is not TASK_RUNNING, and the kernel has not preempted the task.

Lines 2235-2241

If the task is interruptible, we're fairly certain that it wanted to release control. If a signal is pending for the task that wanted to release control, we set the task's state to TASK_RUNNING so that it has the opportunity to be chosen for execution by the scheduler when control is passed to another task. If no signal is pending, which is the common case, we deactivate the task. In both cases switch_count now points to nvcsw. The scheduler increments switch_count later. Thus, nvcsw or nivcsw is incremented.
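The choice between the two counters on lines 2233-2241 can be restated as a small user-space model. This is a sketch rather than scheduler code: struct model_task, the on_runqueue flag (which stands in for the effect of deactivate_task()), the preempted argument (which stands in for the PREEMPT_ACTIVE test) and the function name are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's task state values. */
enum {
    MODEL_TASK_RUNNING = 0,
    MODEL_TASK_INTERRUPTIBLE = 1,
    MODEL_TASK_UNINTERRUPTIBLE = 2
};

struct model_task {
    long state;
    int signal_pending;
    int on_runqueue;        /* cleared where the kernel calls deactivate_task() */
    unsigned long nvcsw;    /* voluntary context switches */
    unsigned long nivcsw;   /* involuntary context switches */
};

/* Returns the counter that the scheduler would increment later. */
unsigned long *model_pick_switch_count(struct model_task *prev, int preempted)
{
    unsigned long *switch_count;

    assert(prev != NULL);

    /* Default: the task was switched out without asking for it. */
    switch_count = &prev->nivcsw;

    if (prev->state && !preempted) {
        /* The task is not runnable, so it gave up the CPU voluntarily. */
        switch_count = &prev->nvcsw;
        if ((prev->state & MODEL_TASK_INTERRUPTIBLE) && prev->signal_pending)
            prev->state = MODEL_TASK_RUNNING;   /* stay eligible to run */
        else
            prev->on_runqueue = 0;              /* taken off the run queue */
    }
    return switch_count;
}
```

A task that calls schedule() after setting TASK_INTERRUPTIBLE, as random_read() does, therefore counts as a voluntary switch.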

The schedule() function then picks the next task in the scheduler's run queue and switches control to that task.[6]

[6] For detailed information, see the "switch_to()" section in Chapter 7.

By calling schedule(), we allow a task to yield control of the processor to another kernel task when the current task knows it will be waiting for some reason. Other tasks in the kernel can make use of this time and, hopefully, when control returns to the function that called schedule(), the reason for waiting will have been removed.

Returning from our digression on the scheduler to the random_read() function, eventually, the kernel gives control back to random_read() and we clean up our wait queue and continue. This repeats the loop and, if the system has generated enough entropy, we should be able to return with the requested number of random bytes.

random_read() sets its state to TASK_INTERRUPTIBLE before calling schedule() so that it can be woken, including by signals, while it is on a wait queue. The driver's own code issues the wake-up when extra entropy is collected by calling wake_up_interruptible() in batch_entropy_process() and random_ioctl(). TASK_UNINTERRUPTIBLE is usually used when the task is waiting for hardware to respond, whereas TASK_INTERRUPTIBLE is normally used when waiting on software.

The code that random_read() uses to pass control to another task (see lines 1632-1639, drivers/char/random.c) is a variant of interruptible_sleep_on() from the scheduler code.

 
 -----------------------------------------------------------------------
 kernel/sched.c
 2489 #define SLEEP_ON_VAR         \
  2490   unsigned long flags;       \
  2491   wait_queue_t wait;        \
  2492   init_waitqueue_entry(&wait, current);
  2493
  2494 #define SLEEP_ON_HEAD         \
  2495   spin_lock_irqsave(&q->lock,flags);    \
  2496   __add_wait_queue(q, &wait);      \
  2497   spin_unlock(&q->lock);
  2498
  2499 #define SLEEP_ON_TAIL         \
  2500   spin_lock_irq(&q->lock);      \
  2501   __remove_wait_queue(q, &wait);     \
  2502   spin_unlock_irqrestore(&q->lock, flags);
 2503
  2504 void fastcall __sched interruptible_sleep_on(wait_queue_head_t *q)
  2505 {
  2506   SLEEP_ON_VAR
  2507
  2508   current->state = TASK_INTERRUPTIBLE;
  2509
  2510   SLEEP_ON_HEAD
  2511   schedule();
  2512   SLEEP_ON_TAIL
 2513 }
 -----------------------------------------------------------------------
 
 

q is a wait_queue_head structure that coordinates the module's sleeping and waiting.

Lines 2494-2497

Atomically add our task to a wait queue q.

Lines 2499-2502

Atomically remove the task from the wait queue q.

Lines 2504-2513

Add to the wait queue. Cede control of the processor to another task. When we are given control, remove ourselves from the wait queue.

random_read() uses its own wait queue code instead of the standard macros, but essentially does an interruptible_sleep_on() with the exception that, if we have more than a full byte's worth of entropy, we don't yield control but loop again to try and get all the requested entropy. If there isn't enough entropy, random_read() waits until it's awoken with wake_up_interruptible() from entropy-gathering processes of the driver.

10.1.5. Work Queues and Interrupts

Device drivers in Linux routinely have to deal with interrupts generated by the devices with which they are interfacing. Interrupts trigger an interrupt handler in the device driver and cause all currently executing code (both user space and kernel space) to cease execution. Clearly, it is desirable to have the driver's interrupt handler execute as quickly as possible to prevent long waits in kernel processing.

However, this leads us to the standard dilemma of interrupt handling: How do we handle an interrupt that requires a significant amount of work? The standard answer is to use top-half and bottom-half routines. The top-half routine quickly handles accepting the interrupt and schedules a bottom-half routine, which has the code to do the majority of the work and is executed when possible. Normally, the top-half routine runs with interrupts disabled to ensure that an interrupt handler isn't interrupted by the same interrupt. Thus, the device driver does not have to handle recursive interrupts. The bottom-half routine normally runs with interrupts enabled so that other interrupts can be handled while it continues the bulk of the work.

In prior Linux kernels, this division of top-half and bottom-half, also known as fast and slow interrupts, was handled by task queues. New to the 2.6 Linux kernel is the concept of a work queue, which is now the standard way to deal with bottom-half interrupts.

When the kernel receives an interrupt, the processor stops executing the current task and immediately handles the interrupt. When the CPU enters this mode, it is commonly referred to as being in interrupt context. The kernel, in interrupt context, then determines which interrupt handler to pass control to. When a device driver wants to handle an interrupt, it uses request_irq() to request the interrupt number and register the handler function to be called when this interrupt is seen. This registration is normally done at module initialization time. The top-half interrupt function registered with request_irq() does minimal management and then schedules the appropriate work to be done upon a work queue.

Like request_irq() in the top half, work queues are normally registered at module initialization. They can be initialized statically with the DECLARE_WORK() macro or the work structure can be allocated and initialized dynamically by calling INIT_WORK(). Here are the definitions of those macros:

 
 -----------------------------------------------------------------------
 include/linux/workqueue.h
 30 #define DECLARE_WORK(n, f, d)         \
 31   struct work_struct n = __WORK_INITIALIZER(n, f, d)
 ...
 45 #define INIT_WORK(_work, _func, _data)       \
 46   do {             \
 47     INIT_LIST_HEAD(&(_work)->entry);    \
 48     (_work)->pending = 0;       \
 49     PREPARE_WORK((_work), (_func), (_data));  \
 50     init_timer(&(_work)->timer);     \
 51   } while (0)
 -----------------------------------------------------------------------
 
 

Both macros take the following arguments:

  • n or work. The name of the work structure to create or initialize.

  • f or func. The function to run when the work structure is removed from a work queue.

  • d or data. Holds the data to pass to the function f, or func, when it is run.

The interrupt handler function registered with request_irq() would then accept an interrupt and send the relevant data from the top half of the interrupt handler to the bottom half by setting the work_struct data section and calling schedule_work() on the work queue.

The code present in the work queue function operates in process context and can thus perform work that is impossible to do in interrupt context, such as copying to and from user space or sleeping.

Tasklets are similar to work queues but operate entirely in interrupt context. This is useful when you have little to do in the bottom half and want to save the overhead of a top-half and bottom-half interrupt handler. Tasklets are initialized with the DECLARE_TASKLET() macro:

 
 -----------------------------------------------------------------------
 include/linux/interrupt.h
 136 #define DECLARE_TASKLET(name, func, data) \
 137 struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
 -----------------------------------------------------------------------
 
 

  • name. The name of the tasklet structure to create.

  • func. The function to call when the tasklet is scheduled.

  • data. Holds the data to pass to the func function when the tasklet executes.

To schedule a tasklet, use tasklet_schedule():

 
 -----------------------------------------------------------------------
 include/linux/interrupt.h
 171 extern void FASTCALL(__tasklet_schedule(struct tasklet_struct *t));
 172 
 173 static inline void tasklet_schedule(struct tasklet_struct *t)
 174 {
 175   if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
 176     __tasklet_schedule(t);
 177 }
 -----------------------------------------------------------------------
 
 

  • tasklet_struct. The name of the tasklet created with DECLARE_TASKLET().

In the top-half interrupt handler, you can call tasklet_schedule() and be guaranteed that, sometime in the future, the function declared in the tasklet is executed. Tasklets differ from work queues in that different tasklets can run simultaneously on different CPUs. If a tasklet is already scheduled, and scheduled again before the tasklet executes, it is only executed once. As tasklets run in interrupt context, they cannot sleep or copy data to user space. Because of running in interrupt context, if different tasklets need to communicate, the only safe way to synchronize is by using spinlocks.
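The "scheduled twice, executed once" behavior follows directly from the test-and-set in tasklet_schedule(). The single-threaded user-space model below shows the idea with a C11 atomic_flag in place of the TASKLET_STATE_SCHED bit. It is a sketch only: struct model_tasklet, the fixed-size pending list and all model_* names are hypothetical, and the cast between unsigned long and a pointer follows kernel habit and assumes the two have the same size.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_PENDING 16

/* User-space model only; every name here is hypothetical. */
struct model_tasklet {
    atomic_flag scheduled;        /* plays the role of TASKLET_STATE_SCHED */
    void (*func)(unsigned long);
    unsigned long data;
};

static struct model_tasklet *model_pending[MODEL_MAX_PENDING];
static size_t model_npending;

/* Example tasklet function: data carries the address of a counter. */
void model_bump(unsigned long data)
{
    ++*(unsigned long *)data;
}

void model_tasklet_schedule(struct model_tasklet *t)
{
    assert(t != NULL && t->func != NULL);

    /* test-and-set returns the old value: queue only if it was clear. */
    if (!atomic_flag_test_and_set(&t->scheduled)) {
        assert(model_npending < MODEL_MAX_PENDING);
        model_pending[model_npending++] = t;
    }
}

/* Runs everything queued so far, once each. */
void model_run_tasklets(void)
{
    struct model_tasklet *batch[MODEL_MAX_PENDING];
    size_t n = model_npending;

    /* Snapshot the list so a tasklet may safely reschedule itself. */
    memcpy(batch, model_pending, n * sizeof batch[0]);
    model_npending = 0;

    for (size_t i = 0; i < n; i++) {
        atomic_flag_clear(&batch[i]->scheduled);  /* can be scheduled again */
        batch[i]->func(batch[i]->data);
    }
}
```

Clearing the flag before calling the function means a tasklet scheduled again while it runs is queued for the next pass instead of being lost.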

10.1.6. System Calls

There are other ways to add code to the kernel besides device drivers. Linux kernel system calls (syscalls) are the method by which user space programs can access kernel services and system hardware. Many of the C library routines available to user mode programs bundle code and one or more system calls to accomplish a single function. In fact, syscalls can also be accessed from kernel code.

By its nature, syscall implementation is hardware specific. In the Intel architecture, all syscalls use software interrupt 0x80. Parameters of the syscall are passed in the general registers. The implementation of syscall on the x86 architecture limits the number of parameters to 5. If more than 5 are required, a pointer to a block of parameters can be passed. Upon execution of the assembler instruction int 0x80, a specific kernel mode routine is called by way of the exception-handling capabilities of the processor.
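From user space, a system call can be issued by number without a dedicated C library wrapper by using the generic syscall() function. The sketch below assumes a Linux system with glibc; raw_getpid() is a hypothetical helper name. Note that syscall() hides the entry mechanism, which on current x86 systems may be a faster instruction than int 0x80.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/*
 * raw_getpid() is a hypothetical helper that invokes the getpid
 * system call by number instead of through its libc wrapper.
 */
pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);       /* getpid has no failure case */
    return (pid_t)ret;
}
```

Both raw_getpid() and the libc function getpid() should report the same process ID, because they reach the same kernel service.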

10.1.7. Other Types of Drivers

Until now, all the device drivers we dealt with have been character drivers. These are usually the easiest to understand, but you might want to write other drivers that interface with the kernel in different ways.

Block devices are similar to character devices in that they can be accessed via the filesystem. /dev/hda is the device file for the primary IDE hard drive on the system. Block devices are registered and unregistered in similar ways to character devices by using the functions register_blkdev() and unregister_blkdev().

A major difference between block drivers and character drivers is that block drivers do not provide their own read and write functionality; instead, they use a request method.

The 2.6 kernel has undergone major changes in the block device subsystem. Old functions, such as block_read() and block_write() and kernel structures like blk_size and blksize_size, have been removed. This section focuses solely on the 2.6 block device implementation.

If you need the Linux kernel to work with a disk (or a disk-like) device, you need to write a block device driver. The driver must inform the kernel what kind of disk it's interfacing with. It does this by using the gendisk structure:

 
 -----------------------------------------------------------------------
 include/linux/genhd.h
 82 struct gendisk {
 83   int major;      /* major number of driver */
 84   int first_minor;
 85   int minors;
 86   char disk_name[32];    /* name of major driver */
 87   struct hd_struct **part;  /* [indexed by minor] */
 88   struct block_device_operations *fops;
 89   struct request_queue *queue;
 90   void *private_data;
 91   sector_t capacity;
 ...
 -----------------------------------------------------------------------
 
 

Line 83

major is the major number for the block device. This can be either statically set or dynamically generated by using register_blkdev(), as it was in character devices.

Lines 84-85

first_minor and minors are used to determine the number of partitions within the block device. minors contains the maximum number of minor numbers the device can have. first_minor contains the first minor device number of the block device.

Line 86

disk_name is a 32-character name for the block device. It appears in the /dev filesystem, sysfs and /proc/partitions.

Line 87

part points to the set of hd_struct partition entries associated with the block device, indexed by minor number.

Line 88

fops is a pointer to a block_device_operations structure that contains the operations open, release, ioctl, media_changed, and revalidate_disk. (See include/linux/fs.h.) In the 2.6 kernel, each device has its own set of operations.

Line 89

queue is a pointer to the request queue that helps manage the device's pending operations.

Line 90

private_data points to information that will not be accessed by the kernel's block subsystem. Typically, this is used to store data that is used in low-level, device-specific operations.

Line 91

capacity is the size of the block device in 512-byte sectors. If the device is removable, such as a floppy disk or CD, a capacity of 0 signifies that no disk is present. If your device doesn't use 512-byte sectors, you need to set this value as if it did. For example, if your device has 1,000 256-byte sectors, that's equivalent to 500 512-byte sectors.
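The conversion in the example above is a single multiplication and division. A minimal sketch, in which capacity_in_512_sectors() is a hypothetical helper name and overflow of very large devices is not considered:

```c
#include <assert.h>
#include <stdint.h>

/*
 * capacity_in_512_sectors() is a hypothetical helper: it converts a
 * device size given in native sectors into 512-byte sectors, the unit
 * the capacity field expects. A trailing partial sector is dropped.
 */
uint64_t capacity_in_512_sectors(uint64_t native_sectors, uint32_t native_sector_size)
{
    assert(native_sector_size > 0);
    return native_sectors * native_sector_size / 512;
}
```

For the device in the text, capacity_in_512_sectors(1000, 256) gives 500.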

In addition to having a gendisk structure, a block device also needs a spinlock structure for use with its request queue.

Both the spinlock and fields in the gendisk structure must be initialized by the device driver. (Go to http://en.wikipedia.org/wiki/Ram_disk for a demonstration of initializing a RAM disk block device driver.) After the device is initialized and ready to handle requests, the add_disk() function should be called to add the block device to the system.

Finally, if the block device can be used as a source of entropy for the system, the module initialization can also call add_disk_randomness(). (For more information, see drivers/char/random.c.)

Now that we covered the basics of block device initialization, we can examine its complement, exiting and cleaning up the block device driver. This is easy in the 2.6 version of Linux.

del_gendisk(struct gendisk *) removes the gendisk from the system and cleans up its partition information. This call should be followed by put_disk(struct gendisk *), which releases kernel references to the gendisk. The block device is unregistered via a call to unregister_blkdev(int major, char[16] device_name), which then allows us to free the gendisk structure.

We also need to clean up the request queue associated with the block device driver. This is done by using blk_cleanup_queue(struct request_queue *). Note: If you can only reference the request queue via the gendisk structure, be sure to call blk_cleanup_queue() before freeing gendisk.

In the block device initialization and shutdown overview, we could easily avoid talking about the specifics of request queues. But now that the driver is set up, it has to actually do something, and request queues are how a block device accomplishes its major functions of reading and writing.

 
 -----------------------------------------------------------------------
 include/linux/blkdev.h
 576 extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
 ...
 -----------------------------------------------------------------------
 
 

Line 576

To create a request queue, we use blk_init_queue and pass it a pointer to a spinlock to control queue access and a pointer to a request function that is called whenever the device is accessed. The request function should have the following prototype:

 static void my_request_function( request_queue_t *q );
 

The body of the request function typically relies on a number of helper functions. To determine the next request to be processed, it calls elv_next_request(), which returns a pointer to a request structure, or NULL if there is no next request.

In the 2.6 kernel, the block device driver iterates through BIO structures in the request structure. BIO stands for Block I/O and is fully defined in include/linux/bio.h.

The BIO structure contains a pointer to a list of biovec structures, which are defined as follows:

 
 -----------------------------------------------------------------------
 include/linux/bio.h
 47 struct bio_vec {
 48   struct page  *bv_page;
 49   unsigned int bv_len;
 50   unsigned int bv_offset;
 51 };
 -----------------------------------------------------------------------
 
 

Each biovec uses its page structure to hold data buffers that are eventually written to or read from disk. The 2.6 kernel has numerous bio helpers to iterate over the data contained within bio structures.

To determine the size of a BIO operation, you can either consult the bi_size field within the BIO struct to get a result in bytes or use the bio_sectors() macro to get the size in sectors. The block operation type, READ or WRITE, can be determined by using bio_data_dir().

To iterate over the biovec list in a BIO structure, use the bio_for_each_segment() macro. Within that loop, further macros can be used to delve into each biovec: bio_page(), bio_offset(), bio_curr_sectors(), and bio_data(). More information can be found in include/linux/bio.h and Documentation/block/biodoc.txt.

Some combination of the information contained in the biovec and the page structures allows you to determine what data to read or write to the block device. The low-level details of how to read and write the device are tied to the hardware the block device driver is using.

Now that we know how to iterate over a BIO structure, we just have to figure out how to iterate over a request structure's list of BIO structures. This is done using another macro: rq_for_each_bio:

 
 -----------------------------------------------------------------------
 include/linux/blkdev.h
 495 #define rq_for_each_bio(_bio, rq)  \
 496   if ((rq->bio))     \
 497     for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)
 -----------------------------------------------------------------------
 
 

Line 495

_bio is the current BIO structure and rq is the request to iterate over.
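The two levels of iteration can be tried out in user space. The following is only a sketch: the sim_ structures are simplified stand-ins invented for this example (they are not the kernel's definitions), and the inner loop is written by hand rather than with bio_for_each_segment().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-ins; only the fields discussed in the
 * text are modeled. All sim_ names are hypothetical. */
struct sim_bio_vec {
    unsigned int bv_len;     /* bytes in this segment */
    unsigned int bv_offset;  /* offset of the data within its page */
};

struct sim_bio {
    struct sim_bio_vec *bi_io_vec; /* array of segments */
    unsigned short bi_vcnt;        /* number of segments in the array */
    struct sim_bio *bi_next;       /* next BIO in the same request */
};

struct sim_request {
    struct sim_bio *bio;           /* head of the BIO list */
};

/* Same shape as rq_for_each_bio(): follow bi_next until it is NULL. */
#define sim_rq_for_each_bio(_bio, rq) \
    for ((_bio) = (rq)->bio; (_bio) != NULL; (_bio) = (_bio)->bi_next)

/* Total payload of a request in bytes: the outer loop walks the BIOs,
 * the inner loop walks each BIO's segments. */
static unsigned int sim_request_bytes(const struct sim_request *rq)
{
    const struct sim_bio *bio;
    unsigned int total = 0;

    assert(rq != NULL);
    sim_rq_for_each_bio(bio, rq) {
        for (unsigned short i = 0; i < bio->bi_vcnt; i++)
            total += bio->bi_io_vec[i].bv_len;
    }
    return total;
}

/* Bytes to 512-byte sectors, the unit the block layer counts in. */
static unsigned int sim_bytes_to_sectors(unsigned int bytes)
{
    return bytes >> 9;
}
```

A real driver would replace the byte counting with the actual transfer of each segment to or from the hardware.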

After each BIO is processed, the driver should update the kernel on its progress. This is done by using end_that_request_first().

 
 -----------------------------------------------------------------------
 include/linux/blkdev.h
 557 extern int end_that_request_first(struct request *, int, int); 
 -----------------------------------------------------------------------
 
 

Line 557

The first int argument should be non-zero unless an error has occurred, and the second int argument represents the number of sectors that the device processed.

When end_that_request_first() returns 0, the entire request has been processed and the cleanup needs to begin. This is done by calling blkdev_dequeue_request() and end_that_request_last(), in that order, both of which take the request as the sole argument.
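The control flow of that completion sequence can be modeled in user space. This is a sketch under stated assumptions: struct sim_rq and both functions are invented for illustration, and the model only mirrors the return-value contract described above (nonzero while sectors remain, 0 when the request is complete).

```c
#include <assert.h>

/* Hypothetical stand-in for struct request: it only tracks how many
 * sectors are still outstanding and which cleanup steps have run. */
struct sim_rq {
    unsigned int sectors_left;
    int dequeued;
    int finished;
};

/* Models the contract of end_that_request_first(): returns 0 once every
 * sector has been accounted for, nonzero while work remains. */
static int sim_end_request_first(struct sim_rq *rq, int uptodate,
                                 unsigned int nr_sectors)
{
    (void)uptodate; /* a real driver passes 0 here to report an I/O error */
    if (nr_sectors > rq->sectors_left)
        nr_sectors = rq->sectors_left;
    rq->sectors_left -= nr_sectors;
    return rq->sectors_left != 0;
}

/* Process a request `chunk` sectors at a time, then clean up.
 * Returns the number of progress reports that were needed. */
static int sim_complete_request(struct sim_rq *rq, unsigned int chunk)
{
    int calls = 0;
    int more;

    assert(chunk > 0); /* a zero-sized chunk would never finish */
    do {
        /* ...transfer `chunk` sectors to or from the device here... */
        more = sim_end_request_first(rq, 1, chunk);
        calls++;
    } while (more);

    rq->dequeued = 1; /* blkdev_dequeue_request(rq) in a real driver */
    rq->finished = 1; /* end_that_request_last(rq) in a real driver */
    return calls;
}
```

With a 20-sector request and 8-sector chunks, the loop reports progress three times (8, 8, and 4 sectors) before the cleanup runs.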

After this, the request function has done its job and the block subsystem uses the block device driver's request queue function to perform disk operations. The device might also need to handle certain ioctl functions, as our RAM disk handles partitioning, but those, again, depend on the type of block device.

This section has only touched on the basics of block devices. There are Linux hooks for DMA operations, clustering, request queue command preparation, and many other features of more advanced block devices. For further reading, refer to the Documentation/block directory.

10.1.8. Device Model and sysfs

New in the 2.6 kernel is the Linux device model, to which sysfs is intimately related. The device model stores a set of internal data related to the devices and drivers on a system. The system tracks what devices exist and breaks them down into classes: block, input, bus, etc. The system also keeps track of what drivers exist and how they relate to the devices they manage. The device model exists within the kernel, and sysfs is a window into this model. Because some devices and drivers do not expose themselves through sysfs, a good way of thinking of sysfs is the public view of the kernel's device model.

Certain devices have multiple entries within sysfs.

Only one copy of the data is stored within the device model, but there are various ways of accessing that piece of data, as the symbolic links in the sysfs tree show.

The sysfs hierarchy relates to the kernel's kobject and kset structures. This model is fairly complex, but most driver writers don't have to delve too far into the details to accomplish many useful tasks.[7] By using the sysfs concept of attributes, you work with kobjects, but in an abstracted way. Attributes are parts of the device or driver model that can be accessed or changed via the sysfs filesystem. They could be internal module variables controlling how the module manages tasks or they could be directly linked to various hardware settings. For example, an RF transmitter could have a base frequency it operates upon and individual tuners implemented as offsets from this base frequency. Changing the base frequency can be accomplished by exposing a module attribute of the RF driver to sysfs.

[7] See Documentation/filesystems/sysfs.txt in the kernel source.

When an attribute is accessed, sysfs calls a function to handle that access, show() for read and store() for write. There is a one-page limit on the size of data that can be passed to show() or store() functions.

With this outline of how sysfs works, we can now get into the specifics of how a driver registers with sysfs, exposes some attributes, and registers specific show() and store() functions to operate when those attributes are accessed.

The first task is to determine what device class your new device and driver should fall under (for example, usb_device, net_device, pci_device, sys_device, and so on). All these structures have a char *name field within them. sysfs uses this name field to display the new device within the sysfs hierarchy.

After a device structure is allocated and named, you must create and initialize a device_driver structure:

 
 -----------------------------------------------------------------------
 include/linux/device.h
 102 struct device_driver {
 103   char     * name;
 104   struct bus_type   * bus;
 105
 106   struct semaphore  unload_sem;
 107   struct kobject   kobj;
 108   struct list_head  devices;
 109
 110   int  (*probe)  (struct device * dev);
 111   int  (*remove)  (struct device * dev);
 112   void (*shutdown)  (struct device * dev);
 113   int  (*suspend)  (struct device * dev, u32 state, u32 level);
 114   int  (*resume)  (struct device * dev, u32 level);
 115 };
 -----------------------------------------------------------------------
 
 

Line 103

name refers to the name of the driver that is displayed in the sysfs hierarchy.

Line 104

bus is usually filled in automatically; a driver writer need not worry about it.

Lines 105-115

The programmer does not need to set the rest of the fields. They should be automatically initialized at the bus level.

We can register our driver during initialization by calling driver_register(), which passes most of the work to bus_add_driver(). Similarly upon driver exit, be sure to add a call to driver_unregister().

 
 -----------------------------------------------------------------------
 drivers/base/driver.c
 int driver_register(struct device_driver * drv)
 {
   INIT_LIST_HEAD(&drv->devices);
   init_MUTEX_LOCKED(&drv->unload_sem);
   return bus_add_driver(drv);
 }
 -----------------------------------------------------------------------
 
 

After driver registration, driver attributes can be created via driver_attribute structures and a helpful macro, DRIVER_ATTR:

 
 -----------------------------------------------------------------------
 include/linux/device.h
 133 #define DRIVER_ATTR(_name,_mode,_show,_store) \
 134 struct driver_attribute driver_attr_##_name = {     \
 135   .attr = {.name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE },  \
 136   .show = _show,        \
 137   .store = _store,        \
 138 };
 -----------------------------------------------------------------------
 
 

Line 135

name is the name of the attribute for the driver. mode is the bitmap describing the level of protection of the attribute. include/linux/stat.h contains many of these modes; two examples are S_IRUGO (read-only for everyone) and S_IWUSR (write access for the file's owner, which is normally root for sysfs files).

Line 136

show is the name of the driver function to use when the attribute is read via sysfs. If reads are not allowed, NULL should be used.

Line 137

store is the name of the driver function to use when the attribute is written via sysfs. If writes are not allowed, NULL should be used.

The driver functions that implement show() and store() for a specific driver must adhere to the prototypes shown here:

 
 -----------------------------------------------------------------------
 include/linux/sysfs.h
 34 struct sysfs_ops {
 35   ssize_t (*show)(struct kobject *, struct attribute *,char *);
 36   ssize_t (*store)(struct kobject *,struct attribute *,const char *, size_t);
 37 };
 -----------------------------------------------------------------------
 
 

Recall that the size of data read and written to sysfs attributes is limited to PAGE_SIZE bytes. The show() and store() driver attribute functions should ensure that this limit is enforced.
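One way to respect the one-page limit can be sketched in user space. The code below is an illustration only: the function names, the simplified signatures (no kobject or attribute arguments), the base_frequency variable, and the fixed SIM_PAGE_SIZE of 4096 bytes are all assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>   /* ssize_t */

#define SIM_PAGE_SIZE 4096  /* assumption: 4 KB pages, as on i386 */

static long base_frequency = 100; /* hypothetical driver setting */

/* show(): format the attribute into buf, which is assumed to be one
 * page long. snprintf() never writes more than SIM_PAGE_SIZE bytes. */
static ssize_t sim_show_freq(char *buf)
{
    assert(buf != NULL);
    return snprintf(buf, SIM_PAGE_SIZE, "%ld\n", base_frequency);
}

/* store(): parse a NUL-terminated decimal string. Oversized or
 * malformed input is rejected and leaves the setting unchanged. */
static ssize_t sim_store_freq(const char *buf, size_t count)
{
    char *end;
    long value;

    if (count >= SIM_PAGE_SIZE)
        return -EINVAL;
    errno = 0;
    value = strtol(buf, &end, 10);
    if (end == buf || errno != 0)
        return -EINVAL;
    base_frequency = value;
    return (ssize_t)count; /* report the whole write as consumed */
}
```

Returning the byte count from store() follows the usual write() convention: the caller learns that all of its input was accepted.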

This information should allow you to add basic sysfs functionality to kernel device drivers. For further sysfs and kobject reading, see the Documentation/driver-model directory.

Another type of device driver is a network device driver. Network devices send and receive packets of data and are not necessarily hardware devices; the loopback device, for example, is a software network device.


10.2. Writing the Code

10.2.1. Device Basics

When you create a device driver, it is tied to the operating system through an entry in the filesystem. This entry has a major number that indicates to the kernel which driver to use when the file is referenced as well as a minor number that the driver itself can use for greater granularity. When the device driver is loaded, it registers its major number. This registration can be viewed by examining /proc/devices:

 
 -----------------------------------------------------------------------
 lkp# less /proc/devices
 Character devices:
  1 mem
  2 pty
  3 ttyp
  4 ttyS
  5 cua
  6 lp
  7 vcs
  10 misc
  29 fb
 128 ptm
 136 pts
 
 Block devices:
  1 ramdisk
  2 fd
  3 ide0
  7 loop
  22 ide1
 -----------------------------------------------------------------------
 
 

This number is entered in /proc/devices when the device driver registers itself with the kernel; for character devices, it calls the function register_chrdev().

 
 -----------------------------------------------------------------------
 include/linux/fs.h
 int register_chrdev(unsigned int major, const char *name,
      struct file_operations *fops)
 -----------------------------------------------------------------------
 
 

  • major. The major number of the device being registered. If major is 0, the kernel dynamically assigns it a major number that doesn't conflict with any other module currently loaded.

  • name. The name of the device as it appears in /proc/devices.

  • fops. A pointer to file-operations structure that defines what operations can be performed on the device being registered.

Using 0 as the major number is the preferred method for creating a device number for those devices that do not have set major numbers (IDE drivers always use 3; SCSI, 8; floppy, 2). By dynamically assigning a device's major number, we avoid the problem of choosing a major number that some other device driver might have chosen.[8] The consequence is that creating the filesystem node is slightly more complicated because after module loading, we must check what major number was assigned to the device. For example, while testing a device, you might need to do the following:

[8] The register_chrdev() function returns the major number assigned. It might be useful to capture this information when dynamically assigning major numbers.

 
 -----------------------------------------------------------------------
 lkp@lkp# insmod my_module.ko
 lkp@lkp# less /proc/devices
 1 mem
 ...
 233 my_module
 lkp@lkp# mknod /dev/my_module0 c 233 0
 lkp@lkp# mknod /dev/my_module1 c 233 1
 -----------------------------------------------------------------------
 
 

This code shows how we can insert our module using the command insmod. insmod installs a loadable module in the running kernel. Our module code contains these lines:

 
 -----------------------------------------------------------------------
 static int my_module_major=0;
 ...
 module_param(my_module_major, int, 0);
 ...
 result = register_chrdev(my_module_major, "my_module", &my_module_fops);
 -----------------------------------------------------------------------
 
 

The first two lines show how we create a default major number of 0 for dynamic assignment but allow the user to override that assignment by using the my_module_major variable as a module parameter:

 
 -----------------------------------------------------------------------
 include/linux/moduleparam.h
 /* This is the fundamental function for registering boot/module
  parameters. perm sets the visibility in driverfs: 000 means it's
  not there, read bits mean it's readable, write bits mean it's
  writable. */
 ...
 /* Helper functions: type is byte, short, ushort, int, uint, long,
  ulong, charp, bool or invbool, or XXX if you define param_get_XXX,
  param_set_XXX and param_check_XXX. */
 ...
 #define module_param(name, type, perm)
 -----------------------------------------------------------------------
 
 

In previous versions of Linux, the module_param macro was MODULE_PARM; this is deprecated in version 2.6 and module_param must be used.

  • name. The name of the parameter, which is also the name of the module variable that holds its value.

  • type. The type of value that is stored in the parameter name.

  • perm. The visibility of the module parameter name in sysfs. If you don't know what sysfs is, use a value of 0, which means the parameter is not accessible via sysfs.

Recall that we pass into register_chrdev() a pointer to a fops structure. This tells the kernel what functions the driver handles. We declare only those functions that the module handles. To declare that read, write, ioctl, and open are valid operations upon the device that we are registering, we add code like the following:

 
 -----------------------------------------------------------------------
 struct file_operations my_mod_fops = {
  .read = my_mod_read,
  .write = my_mod_write,
  .ioctl = my_mod_ioctl,
  .open = my_mod_open,
 };
 -----------------------------------------------------------------------
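The effect of declaring only some operations can be seen in a small user-space model. This is a sketch, not kernel code: struct sim_file_operations is a two-method stand-in invented for the example, and the choice of -EINVAL for a missing method is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced stand-in for struct file_operations with just two methods. */
struct sim_file_operations {
    long (*read)(char *buf, size_t len);
    long (*write)(const char *buf, size_t len);
};

/* A trivial read method that fills the caller's buffer. */
static long my_mod_read(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = 'x';
    return (long)len;
}

/* Designated initializers name only the supported operations; every
 * member that is not mentioned (.write here) is initialized to NULL. */
static const struct sim_file_operations my_mod_fops = {
    .read = my_mod_read,
};

/* The caller of the table must check for NULL before dispatching. */
static long sim_vfs_read(const struct sim_file_operations *fops,
                         char *buf, size_t len)
{
    assert(fops != NULL);
    if (fops->read == NULL)
        return -EINVAL;
    return fops->read(buf, len);
}

static long sim_vfs_write(const struct sim_file_operations *fops,
                          const char *buf, size_t len)
{
    assert(fops != NULL);
    if (fops->write == NULL)
        return -EINVAL;
    return fops->write(buf, len);
}
```

Because unlisted members default to NULL, a driver never has to write stub functions for operations it does not support.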
 
 
 

10.2.2. Symbol Exporting

In the course of writing a complex device driver, there might be reasons to export some of the symbols defined in the driver for use by other kernel modules. This is commonly used in low-level drivers that expect higher-level drivers to build upon their basic functionality.

When a device driver is loaded, any exported symbol is placed into the kernel symbol table. Drivers that are loaded subsequently can use any symbols exported by prior drivers. When modules depend on each other, the order in which they are loaded becomes important; insmod fails if the symbols that a high-level module depends on aren't present.

In the 2.6 Linux kernel, two macros are available to a device programmer to export symbols:

 
 -----------------------------------------------------------------------
 include/linux/module.h
 187 #define EXPORT_SYMBOL(sym)          \
 188   __EXPORT_SYMBOL(sym, "")
 189 
 190 #define EXPORT_SYMBOL_GPL(sym)         \
 191   __EXPORT_SYMBOL(sym, "_gpl")
 -----------------------------------------------------------------------
 
 

The EXPORT_SYMBOL macro allows the given symbol to be seen by other pieces of the kernel by placing it into the kernel's symbol table. EXPORT_SYMBOL_GPL makes the symbol visible only to modules that have declared a GPL-compatible license in their MODULE_LICENSE attribute. (See include/linux/module.h for a complete list of licenses.)

10.2.3. IOCTL

Until now, we have primarily dealt with device drivers that take actions of their own accord or read and write data to their device. What happens when you have a device that can do more than just read and write? Or you have a device that can do different kinds of reads and writes? Or your device requires some kind of hardware control interface? In Linux, device drivers typically use the ioctl method to solve these problems.

ioctl is a system call that allows the device driver to handle specific commands that can be used to control the I/O channel. A device driver's ioctl call must follow the declaration inside of the file_operations structure:

 
 -----------------------------------------------------------------------
 include/linux/fs.h
 863 struct file_operations {
  ...
 872 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
 -----------------------------------------------------------------------
 
 

From user space, the ioctl function call is defined as follows:

 int ioctl (int d, int request, ...);
 

The third argument in the user space definition is an untyped pointer to memory. This is how data passes from user space to the device driver's ioctl implementation. It might sound complex, but to actually use ioctl within a driver is fairly simple.

First, we want to declare what IOCTL numbers are valid for our device. We should consult the file Documentation/ioctl-number.txt and choose a code that is not already assigned. By consulting the current 2.6 file, we see that the ioctl code of 'g' is not currently in use. In our driver, we claim it with the following code:

 #define MYDRIVER_IOC_MAGIC 'g'
 

For each distinct control message the driver receives, we need to declare a unique ioctl number. This is based off of the magic number just defined:

 
 -----------------------------------------------------------------------
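 /* _IOW, _IOR, and _IOWR take the argument's type as a third parameter;
    int and struct mystruct are illustrative choices */
 #define MYDRIVER_IOC_OP1 _IO(MYDRIVER_IOC_MAGIC, 0)
 #define MYDRIVER_IOC_OP2 _IOW(MYDRIVER_IOC_MAGIC, 1, int)
 #define MYDRIVER_IOC_OP3 _IOW(MYDRIVER_IOC_MAGIC, 2, int)
 #define MYDRIVER_IOC_OP4 _IOWR(MYDRIVER_IOC_MAGIC, 3, struct mystruct)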
 -----------------------------------------------------------------------
 
 

The four operations just listed (op1, op2, op3, and op4) have been given unique ioctl numbers using the macros defined in include/asm/ioctl.h together with MYDRIVER_IOC_MAGIC, which is our ioctl magic number. The documentation file explains what each macro means:

 
 -----------------------------------------------------------------------
 Documentation/ioctl-number.txt
 6 If you are adding new ioctls to the kernel, you should use the _IO
 7 macros defined in <linux/ioctl.h>:
 8
 9  _IO an ioctl with no parameters
 10  _IOW an ioctl with write parameters (copy_from_user)
 11  _IOR an ioctl with read parameters (copy_to_user)
 12  _IOWR an ioctl with both write and read parameters.
 13
 14 'Write' and 'read' are from the user's point of view, just like the
 15 system calls 'write' and 'read'. For example, a SET_FOO ioctl would
 16 be _IOW, although the kernel would actually read data from user space;
 17 a GET_FOO ioctl would be _IOR, although the kernel would actually write
 18 data to user space.
 -----------------------------------------------------------------------
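What these macros encode can be inspected from user space on Linux, because <linux/ioctl.h> also provides decoding macros (_IOC_TYPE, _IOC_NR, _IOC_DIR, _IOC_SIZE). The following sketch assumes the kernel headers are installed; the DEMO_ names and struct demo_args are invented for this example and do not belong to any real driver.

```c
#include <assert.h>
#include <linux/ioctl.h>  /* _IO, _IOW, _IOR, _IOWR and _IOC_* decoders */

#define DEMO_IOC_MAGIC 'g'

struct demo_args {        /* hypothetical payload for the _IOWR command */
    int in;
    int out;
};

#define DEMO_IOC_RESET _IO(DEMO_IOC_MAGIC, 0)
#define DEMO_IOC_SET   _IOW(DEMO_IOC_MAGIC, 1, int)
#define DEMO_IOC_GET   _IOR(DEMO_IOC_MAGIC, 2, int)
#define DEMO_IOC_XCHG  _IOWR(DEMO_IOC_MAGIC, 3, struct demo_args)
#define DEMO_IOC_MAXNR 3

/* A driver can reject foreign commands before touching any argument. */
static int demo_cmd_is_ours(unsigned int cmd)
{
    return _IOC_TYPE(cmd) == DEMO_IOC_MAGIC && _IOC_NR(cmd) <= DEMO_IOC_MAXNR;
}

/* "Write" from the user's point of view: the kernel copies data in. */
static int demo_cmd_copies_in(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_WRITE) != 0;
}

/* "Read" from the user's point of view: the kernel copies data out. */
static int demo_cmd_copies_out(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}
```

Checking the magic number and the command index first is a common defensive pattern; the direction bits then tell the handler which way the data must be copied.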
 
 

From user space, we could call the ioctl commands like this:

 
 -----------------------------------------------------------------------
 ioctl(fd, MYDRIVER_IOC_OP1, NULL);
 ioctl(fd, MYDRIVER_IOC_OP2, &mydata);
 ioctl(fd, MYDRIVER_IOC_OP3, mydata);
 ioctl(fd, MYDRIVER_IOC_OP4, &mystruct);
 -----------------------------------------------------------------------
 
 

The user space program needs to know what the ioctl commands are (in this case, MYDRIVER_IOC_OP1 through MYDRIVER_IOC_OP4) and the type of arguments the commands expect. We could return a value by using the return code of the ioctl system call or we could interpret the parameter as a pointer to be set or read. In the latter case, remember that the pointer references a section of user space memory that must be copied into, or out of, the kernel.

The cleanest way to move memory between user space and kernel space in an ioctl function is by using the routines put_user() and get_user(), which are defined here:

 
 -----------------------------------------------------------------------
 include/asm-i386/uaccess.h
 * get_user: - Get a simple variable from user space.
 * @x: Variable to store result.
 * @ptr: Source address, in user space.
  ...
 * put_user: - Write a simple value into user space.
 * @x: Value to copy to user space.
 * @ptr: Destination address, in user space.
 -----------------------------------------------------------------------
 
 

put_user() and get_user() check that the user space address being read or written is valid and accessible at the time of the call; they return 0 on success and -EFAULT on failure.
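The overall shape of an ioctl handler built on these routines can be sketched in user space. Everything here is a simulation: sim_get_user() and sim_put_user() are invented stand-ins that only treat NULL as a bad address, and the command values are placeholders rather than numbers built with the _IO macros.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Invented stand-ins for get_user()/put_user(). The real routines
 * validate the user-space address; this model only rejects NULL. */
static int sim_get_user(int *dst, const int *user_ptr)
{
    if (user_ptr == NULL)
        return -EFAULT;
    *dst = *user_ptr;
    return 0;
}

static int sim_put_user(int value, int *user_ptr)
{
    if (user_ptr == NULL)
        return -EFAULT;
    *user_ptr = value;
    return 0;
}

enum { SIM_OP_SET = 1, SIM_OP_GET = 2 }; /* placeholder command numbers */

static int sim_setting; /* hypothetical piece of driver state */

/* Dispatch on the command and move data across the "boundary". */
static int sim_ioctl(unsigned int cmd, int *user_arg)
{
    int value;

    switch (cmd) {
    case SIM_OP_SET:                 /* user -> driver */
        if (sim_get_user(&value, user_arg) != 0)
            return -EFAULT;
        sim_setting = value;
        return 0;
    case SIM_OP_GET:                 /* driver -> user */
        return sim_put_user(sim_setting, user_arg);
    default:
        return -ENOTTY;              /* conventional "unknown ioctl" error */
    }
}
```

Note that the errors are returned as negative errno values, and that the driver state is only updated after the copy from user space has succeeded.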

There is an additional constraint that you might want to add to the ioctl functions of your device driver: authorization.

One way to test whether the process calling your ioctl function is authorized to call ioctl is by using capabilities. A common capability used for driver authorization is CAP_SYS_ADMIN:

 
 -----------------------------------------------------------------------
 include/linux/capability.h
 202 /* Allow configuration of the secure attention key */
  203 /* Allow administration of the random device */
  204 /* Allow examination and configuration of disk quotas */
  205 /* Allow configuring the kernel's syslog (printk behavior) */
  206 /* Allow setting the domainname */
  207 /* Allow setting the hostname */
  208 /* Allow calling bdflush() */
  209 /* Allow mount() and umount(), setting up new smb connection */
  210 /* Allow some autofs root ioctls */
  211 /* Allow nfsservctl */
  212 /* Allow VM86_REQUEST_IRQ */
  213 /* Allow to read/write pci config on alpha */
  214 /* Allow irix_prctl on mips (setstacksize) */
  215 /* Allow flushing all cache on m68k (sys_cacheflush) */
  216 /* Allow removing semaphores */
  217 /* Used instead of CAP_CHOWN to "chown" IPC message queues, semaphores
  218 and shared memory */
  219 /* Allow locking/unlocking of shared memory segment */
  220 /* Allow turning swap on/off */
  221 /* Allow forged pids on socket credentials passing */
  222 /* Allow setting readahead and flushing buffers on block devices */
  223 /* Allow setting geometry in floppy driver */
  224 /* Allow turning DMA on/off in xd driver */
  225 /* Allow administration of md devices (mostly the above, but some
  226 extra ioctls) */
  227 /* Allow tuning the ide driver */
  228 /* Allow access to the nvram device */
  229 /* Allow administration of apm_bios, serial and bttv (TV) device */
  230 /* Allow manufacturer commands in isdn CAPI support driver */
  231 /* Allow reading non-standardized portions of pci configuration space */
  232 /* Allow DDI debug ioctl on sbpcd driver */
  233 /* Allow setting up serial ports */
  234 /* Allow sending raw qic-117 commands */
  235 /* Allow enabling/disabling tagged queuing on SCSI controllers and sending
  236 arbitrary SCSI commands */
  237 /* Allow setting encryption key on loopback filesystem */
  238
 239 #define CAP_SYS_ADMIN 21
 -----------------------------------------------------------------------
 
 

Many other more specific capabilities in include/linux/capability.h might be more appropriate for a more restricted device driver, but CAP_SYS_ADMIN is a good catch-all.

To check the capability of the calling process within your driver, add something similar to the following code:

 
 if (!capable(CAP_SYS_ADMIN)) {
  return -EPERM; /* kernel functions return negative errno values */
 }
 
 

10.2.4. Polling and Interrupts

When a device driver sends a command to the device it is controlling, there are two ways it can determine whether the command was successful: It can poll the device or it can use device interrupts.

When a device is polled, the device driver periodically checks the device to ensure that the command it delivered succeeded. Because device drivers are part of the kernel, if they were to poll directly, they risk causing the kernel to wait until the device completes the poll operation. The way device drivers that poll get around this is by using system timers. When the device driver wants to poll a device, it schedules the kernel to call a routine within the device driver at a later time. This routine performs the device check without pausing the kernel.

Before we get further into the details of how kernel interrupts work, we must explain the main method of locking access to critical sections of code in the kernel: spinlocks. A spinlock works by setting a special flag to a certain value before a task enters the critical section of code and resetting the value after the task leaves the critical section. Spinlocks should be used when the task context cannot block, which is the case in interrupt handlers and other atomic kernel code. Let's look at the spinlock code for x86 and PPC architectures:

 
 -----------------------------------------------------------------------
 include/asm-i386/spinlock.h
 #define SPIN_LOCK_UNLOCKED (spinlock_t) { 1 SPINLOCK_MAGIC_INIT }
 
 #define spin_lock_init(x)  do { *(x) = SPIN_LOCK_UNLOCKED; } while(0)
 ...
 #define spin_is_locked(x)  (*(volatile signed char *)(&(x)->lock) <= 0)
 #define spin_unlock_wait(x)  do { barrier(); } while(spin_is_locked(x))
 
 include/asm-ppc/spinlock.h
 #define SPIN_LOCK_UNLOCKED  (spinlock_t) { 0 SPINLOCK_DEBUG_INIT }
 
 #define spin_lock_init(x)  do { *(x) = SPIN_LOCK_UNLOCKED; } while(0)
 #define spin_is_locked(x)  ((x)->lock != 0)
 #define spin_unlock_wait(x)  do { barrier(); } while(spin_is_locked(x))
 -----------------------------------------------------------------------
 
 

In the x86 architecture, the actual spinlock's flag value is 1 if unlocked whereas on the PPC, it's 0. This illustrates that in writing a driver, you need to use the supplied macros instead of raw values to ensure cross-platform compatibility.

Tasks that want to gain the lock continuously check the value of the flag in a tight loop until it indicates that the lock is free (greater than 0 on x86, equal to 0 on PPC); hence, waiting tasks spin. (See spin_is_locked() and spin_unlock_wait() in the two code blocks.)

Spinlocks for drivers are normally used during interrupt handling when the kernel code needs to execute a critical section without being interrupted by other interrupts. In prior versions of the Linux kernel, the functions cli() and sti() were used to disable and enable interrupts. As of 2.5.28, cli() and sti() are being phased out and replaced with spinlocks. The new way to execute a section of kernel code that cannot be interrupted is by the following:

 
 -----------------------------------------------------------------------
 Documentation/cli-sti-removal.txt
  1: spinlock_t driver_lock = SPIN_LOCK_UNLOCKED;
  2: struct driver_data;
  3:
  4: irq_handler (...)
  5: {
  6: unsigned long flags;
  7: ....
  8: spin_lock_irqsave(&driver_lock, flags);
  9: ....
 10: driver_data.finish = 1;
 11: driver_data.new_work = 0;
 12: ....
 13: spin_unlock_irqrestore(&driver_lock, flags);
 14: ....
 15: }
 16:
 17: ...
 18:
 19: ioctl_func (...)
 20: {
 21: ...
 22: spin_lock_irq(&driver_lock);
 23: ...
 24: driver_data.finish = 0;
 25: driver_data.new_work = 2;
 26: ...
 27: spin_unlock_irq(&driver_lock);
 28: ...
 29: }
 -----------------------------------------------------------------------
 
 

Line 8

Before starting the critical section of code, save the interrupts in flags and lock driver_lock.

Lines 9-12

This critical section of code can be executed by only one task at a time.

Line 13

This line finishes the critical section of code. Restore the state of the interrupts and unlock driver_lock.

By using spin_lock_irqsave() (and spin_unlock_irqrestore()), we ensure that interrupts that were disabled before the interrupt handler ran remain disabled after it finishes.

When ioctl_func() has locked driver_lock, other calls of irq_handler() will spin. Thus, we need to ensure the critical section in ioctl_func() finishes as fast as possible to guarantee the irq_handler(), which is our top-half interrupt handler, waits for an extremely short time.

Let's examine the sequence of creating an interrupt handler and its top-half handler (see Section 10.2.5 for the bottom half, which uses a work queue):

 
 -----------------------------------------------------------------------
 #define mod_num_tries 3
 static int irq = 0;
 ...
 int count = 0;
 unsigned int irqs = 0;
 while ((count < mod_num_tries) && (irq <= 0)) {
  irqs = probe_irq_on();
  /* Cause device to trigger an interrupt.
   Some delay may be required to ensure receipt 
   of the interrupt */
  irq = probe_irq_off(irqs);
  /* If irq < 0 multiple interrupts were received.
   If irq == 0 no interrupts were received. */
  count++;
 }
 if ((count == mod_num_tries) && (irq <=0)) {
  printk("Couldn't determine interrupt for %s\n",
    MODULE_NAME);
 }
 -----------------------------------------------------------------------
 
 

This code would be part of the initialization section of the device driver and would likely fail if no interrupts could be found. Now that we have an interrupt, we can register that interrupt and our top-half interrupt handler with the kernel:

 
 -----------------------------------------------------------------------
 retval = request_irq(irq, irq_handler, SA_INTERRUPT,
       DEVICE_NAME, NULL);
 if (retval < 0) {
  printk("Request of IRQ %d failed for %s\n", 
    irq, MODULE_NAME);
  return retval;
 }
 -----------------------------------------------------------------------
 
 

request_irq() has the following prototype:

 
 -----------------------------------------------------------------------
 arch/i386/kernel/irq.c
 /**
  *  request_irq - allocate an interrupt line
  *  @irq: Interrupt line to allocate
  *  @handler: Function to be called when the IRQ occurs
  *  @irqflags: Interrupt type flags
  *  @devname: An ascii name for the claiming device
  *  @dev_id: A cookie passed back to the handler function
 ...
 int request_irq(unsigned int irq,
     irqreturn_t (*handler)(int, void *, struct pt_regs *),
     unsigned long irqflags,
     const char * devname,
     void *dev_id)
 -----------------------------------------------------------------------
 
 

The irqflags parameter can be the bitwise OR of the following macros:

  • SA_SHIRQ for a shared interrupt

  • SA_INTERRUPT to disable local interrupts while running handler

  • SA_SAMPLE_RANDOM if the interrupt is a source of entropy

dev_id can be NULL if the interrupt is not shared. If the interrupt is shared, dev_id must be non-NULL and is usually the address of the device data structure because handler receives this value.

At this point, it is useful to remember that every requested interrupt needs to be freed when the module exits by using free_irq():

 
 -----------------------------------------------------------------------
 arch/i386/kernel/irq.c
 /**
  *  free_irq - free an interrupt
  *  @irq: Interrupt line to free
  *  @dev_id: Device identity to free
 ...
  */
 
 void free_irq(unsigned int irq, void *dev_id)
 -----------------------------------------------------------------------
 
 

If the irq is shared, the module should ensure that the interrupt is disabled on its device before calling this function. In addition, free_irq() should never be called from interrupt context. Calling free_irq() in the module cleanup routine is standard. (See spin_lock_irq() and spin_unlock_irq().)
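The role of dev_id on a shared line can be illustrated with a small user-space model. All names are invented for this sketch; it shows only how a cookie distinguishes several handlers registered on the same interrupt line, not how the kernel stores them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SIM_MAX_SHARED 4

typedef int (*sim_handler_t)(int irq, void *dev_id);

/* One interrupt line with room for several sharing devices. */
struct sim_irq_line {
    sim_handler_t handler[SIM_MAX_SHARED];
    void *dev_id[SIM_MAX_SHARED];
};

/* Register a handler; a shared line needs a non-NULL cookie. */
static int sim_request_irq(struct sim_irq_line *line,
                           sim_handler_t handler, void *dev_id)
{
    assert(line != NULL);
    if (handler == NULL || dev_id == NULL)
        return -EINVAL;
    for (int i = 0; i < SIM_MAX_SHARED; i++) {
        if (line->handler[i] == NULL) {
            line->handler[i] = handler;
            line->dev_id[i] = dev_id;
            return 0;
        }
    }
    return -EBUSY;
}

/* Remove the one registration that matches the cookie. */
static void sim_free_irq(struct sim_irq_line *line, void *dev_id)
{
    for (int i = 0; i < SIM_MAX_SHARED; i++) {
        if (line->handler[i] != NULL && line->dev_id[i] == dev_id) {
            line->handler[i] = NULL;
            line->dev_id[i] = NULL;
            return;
        }
    }
}

/* Deliver an interrupt: every registered handler gets its own cookie. */
static int sim_raise_irq(const struct sim_irq_line *line, int irq)
{
    int handled = 0;

    for (int i = 0; i < SIM_MAX_SHARED; i++)
        if (line->handler[i] != NULL)
            handled += line->handler[i](irq, line->dev_id[i]);
    return handled;
}

struct sim_dev { int hits; };  /* hypothetical per-device data */

static int sim_dev_handler(int irq, void *dev_id)
{
    struct sim_dev *dev = dev_id;

    (void)irq;
    dev->hits++;
    return 1;
}
```

Two devices can register the same handler function on one line; the cookie is what lets the handler find its own device data and lets the free routine remove the right registration.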

At this point, we have registered our interrupt handler and the irq it is linked to. Now, we have to write the actual top-half handler, what we defined as irq_handler():

 
 -----------------------------------------------------------------------
 irqreturn_t irq_handler(int irq, void *dev_id, struct pt_regs *regs)
 {
  /* See above for spin lock code */
  /* Copy interrupt data to work queue data for handling in
   bottom-half */
  schedule_work( WORK_QUEUE );
  /* Release spin_lock */
  return IRQ_HANDLED;
 }
 -----------------------------------------------------------------------
 
 

If you just need a fast interrupt handler, you can use a tasklet instead of a work queue:

 
 -----------------------------------------------------------------------
 irqreturn_t irq_handler(int irq, void *dev_id, struct pt_regs *regs)
 {
  /* See above for spin lock code */
  /* Copy interrupt data to tasklet data */
  tasklet_schedule( TASKLET_QUEUE );
  /* Release spin_lock */
  return IRQ_HANDLED;
 }
 -----------------------------------------------------------------------
 
 

10.2.5. Work Queues and Tasklets

The bulk of the work in an interrupt handler is usually done in a work queue. In the last section, we've seen that the top half of the interrupt handler copies pertinent data from the interrupt to a data structure and then calls schedule_work().

To have tasks run from a work queue, they must be packaged in a work_struct. To declare a work structure at compile time, use the DECLARE_WORK() macro. For example, the following code could be placed in our module to initialize a work structure with an associated function and data:

 
 -----------------------------------------------------------------------
 ...
 struct bh_data_struct {
  int data_one;
  int *data_array;
  char *data_text;
 };
 ...
 static struct bh_data_struct bh_data;
 ...
 /* Forward declaration so DECLARE_WORK can reference the handler */
 static void my_mod_bh(void *data);
 static DECLARE_WORK(my_mod_work, my_mod_bh, &bh_data);
 ...
 static void my_mod_bh(void *data)
 {
  struct bh_data_struct *bh_data = data;
 
  /* all the wonderful bottom half code */
 }
 -----------------------------------------------------------------------
 
 

The top-half handler would set all the data required by my_mod_bh in bh_data and then call schedule_work(&my_mod_work).

schedule_work() is a function that is available to any module; however, this means that the work is put on the generic work queue "events." Some modules might want to make their own work queues, but the functions required to do so are only exported to GPL-compatible modules. Thus, if you want to keep your module proprietary, you must use the generic work queue.

A work queue is created by using the create_workqueue() macro, which calls __create_workqueue() with a second parameter of 0:

 
 -----------------------------------------------------------------------
 kernel/workqueue.c
 struct workqueue_struct *__create_workqueue(const char *name,
            int singlethread)
 -----------------------------------------------------------------------
 
 

name can be up to 10 characters long.

If singlethread is 0, the kernel creates a workqueue thread per CPU; if singlethread is 1, the kernel creates a single workqueue thread for the entire system.

Work structures are created in the same way as what's been previously described, but they are placed on your custom work queue using queue_work() instead of schedule_work().

 
 -----------------------------------------------------------------------
 kernel/workqueue.c
 int fastcall queue_work(struct workqueue_struct *wq, struct work_struct *work)
 {
 -----------------------------------------------------------------------
 
 

wq is the custom work queue created with create_workqueue().

work is the work structure to be placed on wq.

Other work queue functions, found in kernel/workqueue.c, include the following:

  • queue_delayed_work(). Ensures the work structure function is not called until a specified number of jiffies has passed.

  • flush_workqueue(). Causes the caller to wait until all scheduled work on the queue has finished. This is commonly used when a device driver exits.

  • destroy_workqueue(). Flushes and then frees the work queue.

Similar functions, schedule_delayed_work() and flush_scheduled_work(), exist for the generic work queue.
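The split between a fast top half and deferred work can be illustrated outside the kernel. The following is only a user-space sketch, not kernel code: struct work_item, struct work_queue, sim_queue_work(), sim_flush(), and sim_top_half() are hypothetical stand-ins invented here to mimic the roles of work_struct, a work queue, queue_work(), flush_workqueue(), and the top-half handler.

```c
#include <stddef.h>

/* Hypothetical stand-in for work_struct: a function plus its data. */
struct work_item {
    void (*func)(void *data);
    void *data;
    int pending;                 /* 1 while the item sits on the queue */
    struct work_item *next;
};

/* Hypothetical stand-in for a work queue: a FIFO list of work items. */
struct work_queue {
    struct work_item *head;
    struct work_item *tail;
};

/* Mimics queue_work(): returns 1 if queued, 0 if already pending. */
int sim_queue_work(struct work_queue *wq, struct work_item *w)
{
    if (w->pending)
        return 0;
    w->pending = 1;
    w->next = NULL;
    if (wq->tail)
        wq->tail->next = w;
    else
        wq->head = w;
    wq->tail = w;
    return 1;
}

/* Mimics flush_workqueue(): runs everything queued so far and
 * returns the number of work functions that were executed. */
int sim_flush(struct work_queue *wq)
{
    int ran = 0;

    while (wq->head) {
        struct work_item *w = wq->head;

        wq->head = w->next;
        if (!wq->head)
            wq->tail = NULL;
        w->pending = 0;
        w->func(w->data);
        ran++;
    }
    return ran;
}

/* Data shared between the two halves, like bh_data in the text. */
struct bh_data_struct {
    int data_one;
    int handled;
};

/* Bottom half: the slow part, run later from the queue. */
void my_mod_bh(void *data)
{
    struct bh_data_struct *bh = data;

    bh->handled += bh->data_one;
}

/* Top half: copy the interrupt data, queue the work, return quickly. */
void sim_top_half(struct work_queue *wq, struct work_item *w, int value)
{
    struct bh_data_struct *bh = w->data;

    bh->data_one = value;
    sim_queue_work(wq, w);
}
```

In this sketch nothing happens to bh->handled until sim_flush() runs, which is the point of deferring work: the top half only records what needs doing.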

10.2.6. Adding Code for a System Call

We could edit the Makefile in /kernel to include a file with our function, but an easier method is to include our function code in an already existing file in the source tree. The file /kernel/sys.c contains the kernel functions for the system calls and the file arch/i386/kernel/sys_i386.c contains x86 system calls with a nonstandard calling sequence. The former is where we add the source code for our syscall function written in C. This code runs in kernel mode and does all the work. Everything else in this procedure is in support of getting us to this function. It is dispatched through the x86 exception handler:

 
 -----------------------------------------------------------------------
 kernel/sys.c
 ...
 /* somewhere after last function */

 /* simple function to demonstrate a syscall. */
 /* take in a number, print it out, return the number+1 */

 asmlinkage long sys_ourcall(long num)
 {
  /* num is a long, so it is printed with %ld */
  printk("Inside our syscall num =%ld \n", num);
  return(num+1);
 }
 -----------------------------------------------------------------------
 

When the exception handler processes the int 0x80, it indexes into the system call table. The file /arch/i386/kernel/entry.S contains low-level interrupt handling routines and the system call table, sys_call_table. The table is an assembly code implementation of an array in C with each element being 4 bytes. Each element or entry in this table is initialized to the address of a function. By convention, we must prefix the name of our function with sys_. Because the position in the table determines the syscall number, we must add the name of our function to the end of the list. See the following code for the table changes:

 -----------------------------------------------------------------------
 arch/i386/kernel/entry.S
 .data
 ENTRY(sys_call_table)
   .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting*/
 ...
   .long sys_tgkill  /* 270 */
   .long sys_utimes
   .long sys_fadvise64_64
   .long sys_ni_syscall /* sys_vserver */
   .long sys_ourcall  /* our syscall will be 274 */
 nr_syscalls=(.-sys_call_table)/4
 -----------------------------------------------------------------------
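The idea of a table whose index is the syscall number can be mimicked in plain C. This is a user-space sketch only: sim_call_table, sim_dispatch(), and the sim_* functions are hypothetical names invented for illustration, and the three-entry table does not correspond to the real numbering.

```c
#include <errno.h>
#include <stddef.h>

/* Every entry in the table is the address of a function. */
typedef long (*syscall_fn)(long arg);

/* Placeholder for an unimplemented slot, like sys_ni_syscall. */
long sim_ni_syscall(long arg)
{
    (void)arg;
    return -ENOSYS;
}

/* A trivial existing "call" that returns its argument. */
long sim_echo(long arg)
{
    return arg;
}

/* Counterpart of sys_ourcall from the text: return the number + 1. */
long sim_ourcall(long num)
{
    return num + 1;
}

/* The position in the array is the call number. */
const syscall_fn sim_call_table[] = {
    sim_echo,        /* 0 */
    sim_ni_syscall,  /* 1 */
    sim_ourcall,     /* 2 - appended last, so it gets the next number */
};

/* Same arithmetic as nr_syscalls=(.-sys_call_table)/4:
 * table size divided by the size of one entry. */
#define SIM_NR_SYSCALLS (sizeof(sim_call_table) / sizeof(sim_call_table[0]))

/* Index into the table after checking the number is in range. */
long sim_dispatch(unsigned long nr, long arg)
{
    if (nr >= SIM_NR_SYSCALLS)
        return -ENOSYS;
    return sim_call_table[nr](arg);
}
```

Appending at the end matters for the same reason in the sketch as in entry.S: inserting an entry in the middle would shift the number of every call after it.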
 
 

The file include/asm/unistd.h associates the system calls with their positional numbers in the sys_call_table. Also in this file are macro routines to assist the user program (written in C) in loading the registers with parameters. Here are the changes to unistd.h to insert our system call:

 
 -----------------------------------------------------------------------
 include/asm/unistd.h
 /*
  * This file contains the system call numbers.
  */

 #define __NR_restart_syscall 0
 #define __NR_exit    1
 #define __NR_fork    2
 ...
 #define __NR_utimes   271
 #define __NR_fadvise64_64  272
 #define __NR_vserver   273
 #define __NR_ourcall   274

 /* #define NR_syscalls 274 this is the old value before our syscall */
 #define NR_syscalls   275
 -----------------------------------------------------------------------
 
 

Finally, we want to create a user program to test the new syscall. As previously mentioned in this section, a set of macros exists to assist the programmer in loading the parameters from C code into the x86 registers. In /usr/include/asm/unistd.h, there are seven macros: _syscallx(type, name,..), where x is the number of parameters. Each macro is dedicated to loading the proper number of parameters from 0 to 5, and _syscall6(...) allows for passing a pointer to more parameters. The following example program takes in one parameter. For this example (on line 5), we use the _syscall1(type, name, type1, name1) macro from unistd.h, which resolves to a call to int 0x80 with the proper parameters:

 
 -----------------------------------------------------------------------
 mytest.c
  1: #include <stdio.h>
  2: #include <stdlib.h>
  3: #include "/usr/include/asm/unistd.h"
  4:
  5: _syscall1(long, ourcall, long, num);
  6:
  7: int main(void)
  8: {
  9:  printf("our syscall --> num in=5, num out = %ld\n", ourcall(5));
 10:  return 0;
 11: }
 -----------------------------------------------------------------------
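On systems whose headers do not provide the _syscallx macros, the C library's syscall() function can invoke a system call by number instead. The sketch below is an alternative to the macro approach, not the method used in the text. The wrappers call0() and call1() are hypothetical names, and the sketch exercises existing calls (getpid and close) because number 274 means our call only on a kernel patched as described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Invoke a system call that takes no parameters, by number. */
long call0(long nr)
{
    return syscall(nr);
}

/* Invoke a system call that takes one parameter, by number.
 * On failure, syscall() returns -1 and sets errno. */
long call1(long nr, long arg)
{
    return syscall(nr, arg);
}
```

On the patched kernel, the test program would amount to call1(__NR_ourcall, 5), with __NR_ourcall taken from the modified unistd.h.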
 
 


10.3. Building and Debugging

Adding your code to the kernel typically involves cycles of programming and bug fixing. In this section, we describe how to debug the kernel code you've written and how to build debugging-related tools.

10.3.1. Debugging Device Drivers

In previous sections, we used the /proc filesystem to gather information about the kernel. We can also make information about our device driver accessible to users via /proc, and it is an excellent way to debug parts of your device driver. Every node in the /proc filesystem connects to a kernel function when it is read or written to. In the 2.6 kernel, most writes to part of the kernel, devices included, are done through sysfs instead of /proc. The operations modify specific kernel object attributes while the kernel is running. /proc remains a useful tool for read-only operations that require a larger amount of data than an attribute-value pair, and this section deals only with reading from /proc entries.

The first step in allowing read access to your device is to create an entry in the /proc filesystem, which is done by create_proc_read_entry():

 
 -----------------------------------------------------------------------
 include/linux/proc_fs.h
 static inline struct proc_dir_entry *create_proc_read_entry(const char *name,
   mode_t mode, struct proc_dir_entry *base,
   read_proc_t *read_proc, void * data)
 -----------------------------------------------------------------------
 
 

*name is the name of the node that appears under /proc. A mode of 0 allows the file to be world-readable. If you are creating many different proc files for a single device driver, it could be advantageous to first create a proc directory by using proc_mkdir(), and then base each file under that. *base is the directory under /proc in which to place the file; a value of NULL places the file directly under /proc. The *read_proc function is called when the file is read, and *data is a pointer that is passed back into *read_proc:

 
 -----------------------------------------------------------------------
 include/linux/proc_fs.h
 typedef int (read_proc_t)(char *page, char **start, off_t off,
       int count, int *eof, void *data);
 -----------------------------------------------------------------------
 
 

This is the prototype for functions that want to be read via the /proc filesystem. *page is a pointer to the buffer where the function writes its data for the process reading the /proc file. The function should start writing at off bytes into *page and write no more than count bytes. As most reads return only a small amount of information, many implementations ignore both off and count. In addition, **start is normally ignored and is rarely used anywhere in the kernel. If you implement a read function that returns a vast amount of data, **start, off, and count can be used to manage reading small chunks at a time. When the read is finished, the function should write 1 to *eof. Finally, *data is the parameter passed to the read function defined in create_proc_read_entry().
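The simple case described above, where a read function ignores off, count, and start, can be sketched in user space. This is an illustration only: sim_read_proc() and SIM_PAGE_SIZE are hypothetical names, the function is called directly rather than through /proc, and the buffer size is an assumption made for the sketch.

```c
#include <stdio.h>
#include <sys/types.h>

#define SIM_PAGE_SIZE 4096   /* assumed size of the page buffer */

/* Same parameter list as read_proc_t. Writes a short text report
 * into page, signals end of file, and returns the number of bytes
 * written. data is assumed to point at an int to report. */
int sim_read_proc(char *page, char **start, off_t off,
                  int count, int *eof, void *data)
{
    int *value = data;
    int len;

    /* Small reads: off, count, and start are deliberately ignored. */
    (void)start;
    (void)off;
    (void)count;

    len = snprintf(page, SIM_PAGE_SIZE, "value = %d\n", *value);
    *eof = 1;   /* everything was delivered in one call */
    return len;
}
```

A function that returns more data than fits in one buffer would need to honor off and count and use start, which this sketch leaves out.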
