Digging into the Source
This section describes the concepts of system calls and of Linux device drivers (often packaged as modules).
System calls let a programmer communicate with the operating system.
Adding a new system call is one way to add a new kernel service.
Chapter 3 describes the internal implementation of system calls.
This section covers the practical aspects of adding your own system calls to the kernel.
A device driver is the interface through which the kernel lets a programmer control input/output devices.
Entire books have been written about Linux driver development.
We first look at how a driver is represented in the filesystem.
We then build up the functionality of a character driver.
Finally, we create a system call and build it into the kernel.
The Filesystem
Devices in Linux are accessed through the /dev directory.
For example, the command ls -l /dev/random prints the following:
crw-rw-rw- 1 root root 1, 8 Oct 2 08:08 /dev/random
The leading "c" indicates a character device, whereas a "b" would indicate a block device.
Next come the owner, the group, and two numbers (1, 8).
The 1 is the driver's major number and the 8 is the minor number.
When a driver registers with the kernel, it does so under a major number.
The kernel uses the major number to find the driver that handles a given device file.
The minor number is passed to the driver as a parameter that identifies the specific device.
For example, /dev/urandom has major number 1 and minor number 9.
The driver registered with major number 1 therefore manages both /dev/random and /dev/urandom.
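The same numbers can be read from user space. The following is a minimal sketch, not part of the original text: it assumes a Linux system with glibc, uses stat(2) together with the major() and minor() macros from <sys/sysmacros.h>, and the helper name device_numbers() is hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper: fetch the major/minor numbers of a device file.
 * Returns 0 on success, -1 if stat() fails or the path is not a device. */
int device_numbers(const char *path, unsigned int *maj_out, unsigned int *min_out)
{
    struct stat st;

    assert(path && maj_out && min_out);
    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;
    *maj_out = major(st.st_rdev);   /* identifies the driver */
    *min_out = minor(st.st_rdev);   /* identifies the device within the driver */
    return 0;
}
```

Called on /dev/urandom, this should report the same pair that ls -l shows.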
To generate a random number, we simply read from the device.
The following reads 4 bytes from /dev/urandom:
lkp@lkp:~$ head -c4 /dev/urandom | od -x
0000000 823a 3be5
0000004
If you repeat the command, you will see that the 4 bytes [823a 3be5] change.
We know that /dev/random has major number 1.
Let's look that number up in /proc/devices:
lkp@lkp:~$ less /proc/devices
Character devices:
1 mem
Let's search the mem driver for references to "random":
-----------------------------------------------------------------------
drivers/char/mem.c
653 static int memory_open(struct inode * inode, struct file * filp)
654 {
655 switch (iminor(inode)) {
656 case 1:
...
676 case 8:
677 filp->f_op = &random_fops;
678 break;
679 case 9:
680 filp->f_op = &urandom_fops;
681 break;
-----------------------------------------------------------------------
Lines 655-681
This switch statement initializes driver structures based on the minor number of the device.
In particular, filp and its file operations (fops) are set up.
"So what is a filp? And what is a fop?"
Filps and Fops
A filp is simply a file struct pointer, and a fop is a file_operations struct pointer. The kernel uses the file_operations structure to determine what functions to call when the file is operated on. Here are selected sections of the structures that are used in the random device driver:
-----------------------------------------------------------------------
include/linux/fs.h
556 struct file {
557 struct list_head f_list;
558 struct dentry *f_dentry;
559 struct vfsmount *f_vfsmnt;
560 struct file_operations *f_op;
561 atomic_t f_count;
562 unsigned int f_flags;
...
581 struct address_space *f_mapping;
582 };
-----------------------------------------------------------------------
-----------------------------------------------------------------------
include/linux/fs.h
863 struct file_operations {
864 struct module *owner;
865 loff_t (*llseek) (struct file *, loff_t, int);
866 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
867 ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
868 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
869 ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
870 int (*readdir) (struct file *, void *, filldir_t);
871 unsigned int (*poll) (struct file *, struct poll_table_struct *);
872 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
...
888 };
-----------------------------------------------------------------------
The random device driver declares which file operations it provides in the following way. The functions that a driver implements must conform to the prototypes listed in the file_operations structure:
-----------------------------------------------------------------------
drivers/char/random.c
1824 struct file_operations random_fops = {
1825 .read = random_read,
1826 .write = random_write,
1827 .poll = random_poll,
1828 .ioctl = random_ioctl,
1829 };
1830
1831 struct file_operations urandom_fops = {
1832 .read = urandom_read,
1833 .write = random_write,
1834 .ioctl = random_ioctl,
1835 };
-----------------------------------------------------------------------
Lines 1824-1829
The random device provides the operations of read, write, poll, and ioctl.
Lines 1831-1835
The urandom device provides the operations of read, write, and ioctl.
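The pattern of selecting an operations table at open time and then calling through it can be modelled in user space. This is only an illustrative sketch: struct file_ops, demo_open(), demo_read() and the two fill functions are hypothetical stand-ins, not kernel APIs, and the structure has far fewer slots than the real file_operations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file;

/* Simplified stand-in for struct file_operations. */
struct file_ops {
    long (*read)(struct file *f, char *buf, size_t n);
    long (*write)(struct file *f, const char *buf, size_t n);
};

struct file {
    const struct file_ops *f_op;
};

static long fill_a(struct file *f, char *buf, size_t n)
{
    (void)f;
    memset(buf, 'A', n);
    return (long)n;
}

static long fill_b(struct file *f, char *buf, size_t n)
{
    (void)f;
    memset(buf, 'B', n);
    return (long)n;
}

/* Slots that are not named (here: write) are left NULL. */
static const struct file_ops ops_a = { .read = fill_a };
static const struct file_ops ops_b = { .read = fill_b };

/* Like memory_open(): pick the operations table from the minor number. */
int demo_open(struct file *f, unsigned int minor_nr)
{
    assert(f);
    switch (minor_nr) {
    case 8: f->f_op = &ops_a; return 0;
    case 9: f->f_op = &ops_b; return 0;
    default: return -1;
    }
}

/* Dispatch a read through whatever table open() installed. */
long demo_read(struct file *f, char *buf, size_t n)
{
    if (!f->f_op || !f->f_op->read)
        return -1;
    return f->f_op->read(f, buf, n);
}
```

The caller of demo_read() never needs to know which device it is talking to, which is the point of the indirection.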
The poll operation allows a programmer to check, before performing an operation, whether that operation would block. This suggests, and it is indeed the case, that /dev/random blocks if a request is made for more bytes of entropy than are in its entropy pool. /dev/urandom does not block, but it might not return completely random data if the entropy pool is too small. For more information, consult your system's man pages, specifically man 4 random.
Digging deeper into the code, notice that when a read operation is performed on /dev/random, the kernel passes control to the function random_read() (see line 1825). random_read() is defined as follows:
-----------------------------------------------------------------------
drivers/char/random.c
1588 static ssize_t
1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)
-----------------------------------------------------------------------
The function parameters are as follows:
file.
Points to the file structure of the device.
buf.
Points to an area of user memory where the result is to be stored.
nbytes.
The size of data requested.
ppos.
Points to a position within the file that the user is accessing.
This brings up an interesting issue: If the driver executes in kernel space, but the buffer is memory in user space, how do we safely get access to the data in buf? The next section explains the process of moving data between user and kernel memory.
10.1.3. User Memory and Kernel Memory
If we were to simply use memcpy() to copy the buffer from kernel space to user space, the copy operation might not work because the user space addresses could be swapped out when memcpy() occurs. Linux has the functions copy_to_user() and copy_from_user(), which allow drivers to move data between kernel space and user space. In random_read(), this is done in the function extract_entropy(), but there is an additional twist:
-----------------------------------------------------------------------
drivers/char/random.c
1349 static ssize_t extract_entropy(struct entropy_store *r, void * buf,
1350 size_t nbytes, int flags)
1351 {
...
1452 /* Copy data to destination buffer */
1453 i = min(nbytes, HASH_BUFFER_SIZE*sizeof(__u32)/2);
1454 if (flags & EXTRACT_ENTROPY_USER) {
1455 i -= copy_to_user(buf, (__u8 const *)tmp, i);
1456 if (!i) {
1457 ret = -EFAULT;
1458 break;
1459 }
1460 } else
1461 memcpy(buf, (__u8 const *)tmp, i);
-----------------------------------------------------------------------
extract_entropy() has the following parameters:
r.
A pointer to an internal store of entropy. It is ignored for the purposes of our discussion.
buf.
A pointer to an area of memory that should be filled with data.
nbytes.
The amount of data to write to buf.
flags.
Informs the function whether buf is in kernel or user memory.
extract_entropy() returns ssize_t, which is the size, in bytes, of the random data generated.
Lines 1454-1455
If flags tells us that buf points to a location in user memory, we use copy_to_user() to copy the kernel memory pointed to by tmp to the user memory pointed to by buf.
Lines 1460-1461
If buf points to a location in kernel memory, we simply use memcpy() to copy the data.
Obtaining random bytes is something that both kernel space and user space programs are likely to use; a kernel space program can avoid the overhead of copy_to_user() by not setting the flag. For example, the kernel can implement an encrypted filesystem and can avoid the overhead of copying to user space.
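The expression i -= copy_to_user(...) works because copy_to_user() returns the number of bytes it could not copy. The following user-space model sketches that convention; it is an assumption-laden illustration, not kernel code. model_copy_to_user(), model_deliver(), MODEL_USER and the writable parameter (which pretends that only a prefix of the destination is accessible) are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_USER 0x1  /* stand-in for EXTRACT_ENTROPY_USER */

/* Model of copy_to_user(): only the first `writable` bytes of dst are
 * accessible. Returns the number of bytes that could NOT be copied. */
static size_t model_copy_to_user(void *dst, size_t writable,
                                 const void *src, size_t n)
{
    size_t ok = n < writable ? n : writable;

    memcpy(dst, src, ok);
    return n - ok;
}

/* Deliver n bytes from tmp to buf; returns bytes delivered or -EFAULT. */
long model_deliver(void *buf, size_t writable, const void *tmp,
                   size_t n, int flags)
{
    size_t i = n;

    assert(buf && tmp);
    if (flags & MODEL_USER) {
        i -= model_copy_to_user(buf, writable, tmp, i);
        if (!i)
            return -EFAULT;   /* nothing at all could be copied */
    } else {
        memcpy(buf, tmp, i);  /* kernel destination: plain copy */
    }
    return (long)i;
}
```

A partial copy is reported as a short count, and only a copy that delivers nothing is treated as a fault.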
10.1.4. Wait Queues
We detoured slightly to explain how to move data between user and kernel memory. Let's return to random_read() and examine how it uses wait queues.
Occasionally, a driver might need to wait for some condition to be true, perhaps access to a system resource. In this case, we don't want the kernel to wait for the access to complete. It is problematic to cause the kernel to wait because all other system processing halts while the wait occurs. By declaring a wait queue, you can postpone processing until a later time when the condition you are waiting on has occurred.
Two structures are used for this process of waiting: a wait queue and a wait queue head. A module should create a wait queue head and use the sleep_on and wake_up macros in the parts of the module that need to wait or to wake waiters. This is essentially what occurs in random_read():
-----------------------------------------------------------------------
drivers/char/random.c
1588 static ssize_t
1589 random_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos)
1590 {
1591 DECLARE_WAITQUEUE(wait, current);
...
1597 while (nbytes > 0) {
...
1608 n = extract_entropy(sec_random_state, buf, n,
1609 EXTRACT_ENTROPY_USER |
1610 EXTRACT_ENTROPY_LIMIT |
1611 EXTRACT_ENTROPY_SECONDARY);
...
1618 if (n == 0) {
1619 if (file->f_flags & O_NONBLOCK) {
1620 retval = -EAGAIN;
1621 break;
1622 }
1623 if (signal_pending(current)) {
1624 retval = -ERESTARTSYS;
1625 break;
1626 }
...
1632 set_current_state(TASK_INTERRUPTIBLE);
1633 add_wait_queue(&random_read_wait, &wait);
1634
1635 if (sec_random_state->entropy_count / 8 == 0)
1636 schedule();
1637
1638 set_current_state(TASK_RUNNING);
1639 remove_wait_queue(&random_read_wait, &wait);
...
1645 continue;
1646 }
-----------------------------------------------------------------------
Line 1591
The wait queue wait is initialized on the current task. The macro current refers to a pointer to the current task's task_struct.
Lines 1608-1611
We extract a chunk of random data from the device.
Lines 1618-1626
If we could not extract the necessary amount of entropy from the entropy pool and we are non-blocking or there is a signal pending, we return an error to the caller.
Lines 1631-1633
Set up the wait queue. random_read() uses its own wait queue, random_read_wait, instead of the system wait queue.
Lines 1635-1636
At this point, we are on a blocking read and if we don't have 1 byte worth of entropy, we release control of the processor by calling schedule(). (The entropy_count variables hold bits and not bytes; thus, the division by 8 to determine whether we have a full byte of entropy.)
Lines 1638-1639
When we are eventually restarted, we clean up our wait queue.
NOTE
The random device in Linux requires the entropy queue to be full before returning. The urandom device does not have this requirement and returns regardless of the size of data available in the entropy pool.
Let's closely look at what happens when a task calls schedule():
-----------------------------------------------------------------------
kernel/sched.c
2184 asmlinkage void __sched schedule(void)
2185 {
...
2209 prev = current;
...
2233 switch_count = &prev->nivcsw;
2234 if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
2235 switch_count = &prev->nvcsw;
2236 if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
2237 unlikely(signal_pending(prev))))
2238 prev->state = TASK_RUNNING;
2239 else
2240 deactivate_task(prev, rq);
2241 }
2242 ...
-----------------------------------------------------------------------
Line 2209
A pointer to the current task's task structure is stored in the prev variable. In cases where the task itself called schedule(), current points to that task.
Line 2233
We store the task's context switch counter, nivcsw, in switch_count. This is incremented later if the switch is successful.
Line 2234
We only enter this if statement when the task's state, prev->state, is non-zero and there is not a kernel preemption. In other words, we enter this statement when a task's state is not TASK_RUNNING, and the kernel has not preempted the task.
Lines 2235-2241
If the task is interruptible, we're fairly certain that it wanted to release control. If a signal is pending for the task that wanted to release control, we set the task's state to TASK_RUNNING so that it has the opportunity to be chosen for execution by the scheduler when control is passed to another task. If no signal is pending, which is the common case, we deactivate the task. In both cases switch_count is set to nvcsw. The scheduler increments switch_count later. Thus, nvcsw or nivcsw is incremented.
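The decision made on lines 2233-2241 can be restated as a small pure function. This is a user-space model written for illustration only: struct model_task, the M_TASK_* constants, the on_runqueue field (standing in for the effect of deactivate_task()) and model_schedule_prologue() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

enum { M_TASK_RUNNING = 0, M_TASK_INTERRUPTIBLE = 1, M_TASK_UNINTERRUPTIBLE = 2 };

struct model_task {
    long state;
    bool signal_pending;
    unsigned long nvcsw;    /* voluntary context switches */
    unsigned long nivcsw;   /* involuntary context switches */
    bool on_runqueue;
};

/* Returns the counter that would be incremented after a successful switch. */
unsigned long *model_schedule_prologue(struct model_task *prev, bool preempt_active)
{
    unsigned long *switch_count = &prev->nivcsw;

    assert(prev);
    if (prev->state && !preempt_active) {
        /* The task is not runnable and was not preempted: it gave up the CPU. */
        switch_count = &prev->nvcsw;
        if ((prev->state & M_TASK_INTERRUPTIBLE) && prev->signal_pending)
            prev->state = M_TASK_RUNNING;   /* keep it eligible to run */
        else
            prev->on_runqueue = false;      /* models deactivate_task() */
    }
    return switch_count;
}
```

A task that is still in TASK_RUNNING, or that was preempted by the kernel, leaves the function with the involuntary counter selected and stays on the run queue.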
The schedule() function then picks the next task in the scheduler's run queue and switches control to that task.
By calling schedule(), we allow a task to yield control of the processor to another kernel task when the current task knows it will be waiting for some reason. Other tasks in the kernel can make use of this time and, hopefully, when control returns to the function that called schedule(), the reason for waiting will have been removed.
Returning from our digression on the scheduler to the random_read() function, eventually, the kernel gives control back to random_read() and we clean up our wait queue and continue. This repeats the loop and, if the system has generated enough entropy, we should be able to return with the requested number of random bytes.
random_read() sets its state to TASK_INTERRUPTIBLE before calling schedule() to allow itself to be interrupted by signals while it is on a wait queue. The driver's own code wakes the task when extra entropy is collected, by calling wake_up_interruptible() in batch_entropy_process() and random_ioctl(). TASK_UNINTERRUPTIBLE is usually used when the task is waiting for hardware to respond, whereas TASK_INTERRUPTIBLE is normally used when waiting on software.
The code that random_read() uses to pass control to another task (see lines 1632-1639, drivers/char/random.c) is a variant of interruptible_sleep_on() from the scheduler code.
-----------------------------------------------------------------------
kernel/sched.c
2489 #define SLEEP_ON_VAR \
2490 unsigned long flags; \
2491 wait_queue_t wait; \
2492 init_waitqueue_entry(&wait, current);
2493
2494 #define SLEEP_ON_HEAD \
2495 spin_lock_irqsave(&q->lock,flags); \
2496 __add_wait_queue(q, &wait); \
2497 spin_unlock(&q->lock);
2498
2499 #define SLEEP_ON_TAIL \
2500 spin_lock_irq(&q->lock); \
2501 __remove_wait_queue(q, &wait); \
2502 spin_unlock_irqrestore(&q->lock, flags);
2503
2504 void fastcall __sched interruptible_sleep_on(wait_queue_head_t *q)
2505 {
2506 SLEEP_ON_VAR
2507
2508 current->state = TASK_INTERRUPTIBLE;
2509
2510 SLEEP_ON_HEAD
2511 schedule();
2512 SLEEP_ON_TAIL
2513 }
-----------------------------------------------------------------------
q is a wait_queue_head structure that coordinates the module's sleeping and waiting.
Lines 2494-2497
Atomically add our task to a wait queue q.
Lines 2499-2502
Atomically remove the task from the wait queue q.
Lines 2504-2513
Add to the wait queue. Cede control of the processor to another task. When we are given control, remove ourselves from the wait queue.
random_read() uses its own wait queue code instead of the standard macros, but essentially does an interruptible_sleep_on() with the exception that, if we have more than a full byte's worth of entropy, we don't yield control but loop again to try and get all the requested entropy. If there isn't enough entropy, random_read() waits until it's awoken with wake_up_interruptible() from entropy-gathering processes of the driver.
10.1.5. Work Queues and Interrupts
Device drivers in Linux routinely have to deal with interrupts generated by the devices with which they are interfacing. Interrupts trigger an interrupt handler in the device driver and cause all currently executing code, both user space and kernel space, to cease execution. Clearly, it is desirable to have the driver's interrupt handler execute as quickly as possible to prevent long waits in kernel processing.
However, this leads us to the standard dilemma of interrupt handling: How do we handle an interrupt that requires a significant amount of work? The standard answer is to use top-half and bottom-half routines. The top-half routine quickly handles accepting the interrupt and schedules a bottom-half routine, which has the code to do the majority of the work and is executed when possible. Normally, the top-half routine runs with interrupts disabled to ensure that an interrupt handler isn't interrupted by the same interrupt. Thus, the device driver does not have to handle recursive interrupts. The bottom-half routine normally runs with interrupts enabled so that other interrupts can be handled while it continues the bulk of the work.
In prior Linux kernels, this division of top-half and bottom-half, also known as fast and slow interrupts, was handled by task queues. New to the 2.6 Linux kernel is the concept of a work queue, which is now the standard way to deal with bottom-half interrupts.
When the kernel receives an interrupt, the processor stops executing the current task and immediately handles the interrupt. When the CPU enters this mode, it is commonly referred to as being in interrupt context. The kernel, in interrupt context, then determines which interrupt handler to pass control to. When a device driver wants to handle an interrupt, it uses request_irq() to request the interrupt number and register the handler function to be called when this interrupt is seen. This registration is normally done at module initialization time. The top-half interrupt function registered with request_irq() does minimal management and then schedules the appropriate work to be done upon a work queue.
Like request_irq() in the top half, work queues are normally registered at module initialization. They can be initialized statically with the DECLARE_WORK() macro or the work structure can be allocated and initialized dynamically by calling INIT_WORK(). Here are the definitions of those macros:
-----------------------------------------------------------------------
include/linux/workqueue.h
30 #define DECLARE_WORK(n, f, d) \
31 struct work_struct n = __WORK_INITIALIZER(n, f, d)
...
45 #define INIT_WORK(_work, _func, _data) \
46 do { \
47 INIT_LIST_HEAD(&(_work)->entry); \
48 (_work)->pending = 0; \
49 PREPARE_WORK((_work), (_func), (_data)); \
50 init_timer(&(_work)->timer); \
51 } while (0)
-----------------------------------------------------------------------
Both macros take the following arguments:
n or work.
The name of the work structure to create or initialize.
f or func.
The function to run when the work structure is removed from a work queue.
d or data.
Holds the data to pass to the function f, or func, when it is run.
The interrupt handler function registered with request_irq() would then accept an interrupt and send the relevant data from the top half of the interrupt handler to the bottom half by setting the work_struct data section and calling schedule_work() on the work queue.
The code present in the work queue function operates in process context and can thus perform work that is impossible to do in interrupt context, such as copying to and from user space or sleeping.
Tasklets are similar to work queues but operate entirely in interrupt context. This is useful when you have little to do in the bottom half and want to save the overhead of a top-half and bottom-half interrupt handler. Tasklets are initialized with the DECLARE_TASKLET() macro:
-----------------------------------------------------------------------
include/linux/interrupt.h
136 #define DECLARE_TASKLET(name, func, data) \
137 struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
-----------------------------------------------------------------------
name.
The name of the tasklet structure to create.
func.
The function to call when the tasklet is scheduled.
data.
Holds the data to pass to the func function when the tasklet executes.
To schedule a tasklet, use tasklet_schedule():
-----------------------------------------------------------------------
include/linux/interrupt.h
171 extern void FASTCALL(__tasklet_schedule(struct tasklet_struct *t));
172
173 static inline void tasklet_schedule(struct tasklet_struct *t)
174 {
175 if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
176 __tasklet_schedule(t);
177 }
-----------------------------------------------------------------------
In the top-half interrupt handler, you can call tasklet_schedule() and be guaranteed that, sometime in the future, the function declared in the tasklet is executed. Tasklets differ from work queues in that different tasklets can run simultaneously on different CPUs. If a tasklet is already scheduled and is scheduled again before it executes, it is executed only once. Because tasklets run in interrupt context, they cannot sleep or copy data to user space. For the same reason, if different tasklets need to communicate, the only safe way to synchronize is by using spinlocks.
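The "scheduled twice, executed once" behaviour follows from the test-and-set on the scheduled bit. The sketch below models it in user space with a C11 atomic flag; it assumes a C11 compiler, and struct model_tasklet, model_tasklet_schedule(), model_tasklet_run() and the counter runs are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_tasklet {
    atomic_bool scheduled;            /* plays the role of TASKLET_STATE_SCHED */
    void (*func)(unsigned long);
    unsigned long data;
};

static int runs;                      /* counts executions, for the demo only */

static void count_run(unsigned long data)
{
    runs += (int)data;
}

/* Returns true only if this call moved the tasklet from idle to pending. */
bool model_tasklet_schedule(struct model_tasklet *t)
{
    assert(t && t->func);
    return !atomic_exchange(&t->scheduled, true);
}

/* Run the tasklet if it is pending; clearing the flag re-arms it. */
void model_tasklet_run(struct model_tasklet *t)
{
    if (atomic_exchange(&t->scheduled, false))
        t->func(t->data);
}
```

Scheduling a pending tasklet again is a no-op, so a burst of interrupts collapses into a single run of the bottom half.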
10.1.6. System Calls
There are other ways to add code to the kernel besides device drivers. Linux kernel system calls (syscalls) are the method by which user space programs can access kernel services and system hardware. Many of the C library routines available to user mode programs bundle code and one or more system calls to accomplish a single function. In fact, syscalls can also be accessed from kernel code.
By its nature, syscall implementation is hardware specific. In the Intel architecture, all syscalls use software interrupt 0x80. Parameters of the syscall are passed in the general registers. The implementation of syscall on the x86 architecture limits the number of parameters to 5. If more than 5 are required, a pointer to a block of parameters can be passed. Upon execution of the assembler instruction int 0x80, a specific kernel mode routine is called by way of the exception-handling capabilities of the processor.
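From user space, a system call can be issued without going through its usual C library wrapper by using the generic syscall() function. This is a minimal sketch assuming Linux with glibc; raw_getpid() is a hypothetical helper name. Note that syscall() uses whatever entry mechanism the C library selects for the platform, so it does not demonstrate int 0x80 specifically.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke the getpid system call by number instead of calling getpid(). */
pid_t raw_getpid(void)
{
    pid_t pid = (pid_t)syscall(SYS_getpid);

    assert(pid > 0);   /* getpid cannot fail */
    return pid;
}
```

The result should match what the regular getpid() wrapper returns for the same process.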
10.1.7. Other Types of Drivers
Until now, all the device drivers we dealt with have been character drivers. These are usually the easiest to understand, but you might want to write other drivers that interface with the kernel in different ways.
Block devices are similar to character devices in that they can be accessed via the filesystem. /dev/hda is the device file for the primary IDE hard drive on the system. Block devices are registered and unregistered in similar ways to character devices by using the functions register_blkdev() and unregister_blkdev().
A major difference between block drivers and character drivers is that block drivers do not provide their own read and write functionality; instead, they use a request method.
The 2.6 kernel has undergone major changes in the block device subsystem. Old functions, such as block_read() and block_write() and kernel structures like blk_size and blksize_size, have been removed. This section focuses solely on the 2.6 block device implementation.
If you need the Linux kernel to work with a disk (or a disk-like) device, you need to write a block device driver. The driver must inform the kernel what kind of disk it's interfacing with. It does this by using the gendisk structure:
-----------------------------------------------------------------------
include/linux/genhd.h
82 struct gendisk {
83 int major; /* major number of driver */
84 int first_minor;
85 int minors;
86 char disk_name[32]; /* name of major driver */
87 struct hd_struct **part; /* [indexed by minor] */
88 struct block_device_operations *fops;
89 struct request_queue *queue;
90 void *private_data;
91 sector_t capacity;
...
-----------------------------------------------------------------------
Line 83
major is the major number for the block device. This can be either statically set or dynamically generated by using register_blkdev(), as it was in character devices.
Lines 84-85
first_minor and minors are used to determine the number of partitions within the block device. minors contains the maximum number of minor numbers the device can have. first_minor contains the first minor device number of the block device.
Line 86
disk_name is a 32-character name for the block device. It appears in the /dev filesystem, sysfs and /proc/partitions.
Line 87
part points to the set of hd_struct partition structures associated with the block device.
Line 88
fops is a pointer to a block_device_operations structure that contains the operations open, release, ioctl, media_changed, and revalidate_disk. (See include/linux/fs.h.) In the 2.6 kernel, each device has its own set of operations.
Line 89
queue is a pointer to a request_queue that helps manage the device's pending operations.
Line 90
private_data points to information that will not be accessed by the kernel's block subsystem. Typically, this is used to store data that is used in low-level, device-specific operations.
Line 91
capacity is the size of the block device in 512-byte sectors. If the device is removable, such as a floppy disk or CD, a capacity of 0 signifies that no disk is present. If your device doesn't use 512-byte sectors, you need to set this value as if it did. For example, if your device has 1,000 256-byte sectors, that's equivalent to 500 512-byte sectors.
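The conversion in that example is plain arithmetic. As a small sketch (capacity_in_512() and MODEL_KERNEL_SECTOR are hypothetical names, and rounding down for sizes that are not a multiple of 512 bytes is an assumption of this sketch):

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_KERNEL_SECTOR 512u

/* Convert nsect sectors of sect_size bytes into 512-byte units, rounding down. */
uint64_t capacity_in_512(uint64_t nsect, uint32_t sect_size)
{
    assert(sect_size > 0);
    return nsect * sect_size / MODEL_KERNEL_SECTOR;
}
```

With the numbers from the text, capacity_in_512(1000, 256) gives 500.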
In addition to having a gendisk structure, a block device also needs a spinlock structure for use with its request queue.
Both the spinlock and fields in the gendisk structure must be initialized by the device driver. (Go to http://en.wikipedia.org/wiki/Ram_disk for a demonstration of initializing a RAM disk block device driver.) After the device is initialized and ready to handle requests, the add_disk() function should be called to add the block device to the system.
Finally, if the block device can be used as a source of entropy for the system, the module initialization can also call add_disk_randomness(). (For more information, see drivers/char/random.c.)
Now that we covered the basics of block device initialization, we can examine its complement, exiting and cleaning up the block device driver. This is easy in the 2.6 version of Linux.
del_gendisk(struct gendisk *) removes the gendisk from the system and cleans up its partition information. This call should be followed by put_disk(struct gendisk *), which releases kernel references to the gendisk. The block device is unregistered via a call to unregister_blkdev(int major, char[16] device_name), which then allows us to free the gendisk structure.
We also need to clean up the request queue associated with the block device driver. This is done by using blk_cleanup_queue(struct request_queue *). Note: If you can only reference the request queue via the gendisk structure, be sure to call blk_cleanup_queue() before freeing gendisk.
In the block device initialization and shutdown overview, we could easily avoid talking about the specifics of request queues. But now that the driver is set up, it has to actually do something, and request queues are how a block device accomplishes its major functions of reading and writing.
-----------------------------------------------------------------------
include/linux/blkdev.h
576 extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
...
-----------------------------------------------------------------------
Line 576
To create a request queue, we use blk_init_queue and pass it a pointer to a spinlock to control queue access and a pointer to a request function that is called whenever the device is accessed. The request function should have the following prototype:
static void my_request_function( request_queue_t *q );
The guts of the request function usually rely on a number of helper functions. To determine the next request to be processed, call elv_next_request(); it returns a pointer to a request structure, or NULL if there is no next request.
In the 2.6 kernel, the block device driver iterates through BIO structures in the request structure. BIO stands for Block I/O and is fully defined in include/linux/bio.h.
The BIO structure contains a pointer to a list of bio_vec structures, which are defined as follows:
-----------------------------------------------------------------------
include/linux/bio.h
47 struct bio_vec {
48 struct page *bv_page;
49 unsigned int bv_len;
50 unsigned int bv_offset;
51 };
-----------------------------------------------------------------------
Each bio_vec uses its page structure to hold data buffers that are eventually written to or read from disk. The 2.6 kernel has numerous bio helpers to iterate over the data contained within bio structures.
To determine the size of a BIO operation, you can either consult the bi_size field within the BIO struct to get a result in bytes or use the bio_sectors() macro to get the size in sectors. The block operation type, READ or WRITE, can be determined by using bio_data_dir().
To iterate over the bio_vec list in a BIO structure, use the bio_for_each_segment() macro. Within that loop, even more macros can be used to delve further into the bio_vec: bio_page(), bio_offset(), bio_curr_sectors(), and bio_data(). More information can be found in include/linux/bio.h and Documentation/block/biodoc.txt.
Some combination of the information contained in the bio_vec and the page structures allows you to determine what data to read or write to the block device. The low-level details of how to read and write the device are tied to the hardware the block device driver is using.
Now that we know how to iterate over a BIO structure, we just have to figure out how to iterate over a request structure's list of BIO structures. This is done using another macro: rq_for_each_bio:
-----------------------------------------------------------------------
include/linux/blkdev.h
495 #define rq_for_each_bio(_bio, rq) \
496 if ((rq->bio)) \
497 for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)
-----------------------------------------------------------------------
Line 495
_bio is the current BIO structure and rq is the request to iterate over.
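The two levels of iteration, over the BIOs of a request and over the segments of each BIO, can be sketched with simplified user-space structures. Everything prefixed with model_ is a hypothetical stand-in that keeps only the fields needed here; the real structures carry far more state, and a real driver would use bio_for_each_segment() for the inner loop.

```c
#include <assert.h>
#include <stddef.h>

struct model_bio_vec {
    unsigned int bv_len;
    unsigned int bv_offset;
};

struct model_bio {
    struct model_bio *bi_next;         /* next BIO in the request */
    struct model_bio_vec *bi_io_vec;   /* array of segments */
    unsigned short bi_vcnt;            /* number of segments */
};

struct model_request {
    struct model_bio *bio;
};

/* Same shape as rq_for_each_bio, with every argument parenthesized. */
#define model_rq_for_each_bio(_bio, rq) \
    for ((_bio) = (rq)->bio; (_bio); (_bio) = (_bio)->bi_next)

/* Total payload of a request in bytes: walk the BIOs, then their segments. */
size_t model_request_bytes(const struct model_request *rq)
{
    struct model_bio *bio;
    size_t total = 0;

    assert(rq);
    model_rq_for_each_bio(bio, rq) {
        for (unsigned short i = 0; i < bio->bi_vcnt; i++)
            total += bio->bi_io_vec[i].bv_len;
    }
    return total;
}
```

An empty request (a NULL bio pointer) simply skips the loop and yields 0.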
After each BIO is processed, the driver should update the kernel on its progress. This is done by using end_that_request_first().
-----------------------------------------------------------------------
include/linux/blkdev.h
557 extern int end_that_request_first(struct request *, int, int);
-----------------------------------------------------------------------
Line 557
The first int argument should be non-zero unless an error has occurred, and the second int argument represents the number of sectors that the device processed.
When end_that_request_first() returns 0, the entire request has been processed and the cleanup needs to begin. This is done by calling blkdev_dequeue_request() and end_that_request_last() in that order, both of which take the request as the sole argument.
After this, the request function has done its job and the block subsystem uses the block device driver's request queue function to perform disk operations. The device might also need to handle certain ioctl functions, as our RAM disk handles partitioning, but those, again, depend on the type of block device.
This section has only touched on the basics of block devices. There are Linux hooks for DMA operations, clustering, request queue command preparation, and many other features of more advanced block devices. For further reading, refer to the Documentation/block directory.
10.1.8. Device Model and sysfs
New in the 2.6 kernel is the Linux device model, to which sysfs is intimately related. The device model stores a set of internal data related to the devices and drivers on a system. The system tracks what devices exist and breaks them down into classes: block, input, bus, etc. The system also keeps track of what drivers exist and how they relate to the devices they manage. The device model exists within the kernel, and sysfs is a window into this model. Because some devices and drivers do not expose themselves through sysfs, a good way of thinking of sysfs is the public view of the kernel's device model.
Certain devices have multiple entries within sysfs.
Only one copy of the data is stored within the device model, but there are various ways of accessing that piece of data, as the symbolic links in the sysfs tree show.
The sysfs hierarchy relates to the kernel's kobject and kset structures. This model is fairly complex, but most driver writers don't have to delve too far into the details to accomplish many useful tasks. By using the sysfs concept of attributes, you work with kobjects, but in an abstracted way. Attributes are parts of the device or driver model that can be accessed or changed via the sysfs filesystem. They could be internal module variables controlling how the module manages tasks or they could be directly linked to various hardware settings. For example, an RF transmitter could have a base frequency it operates upon and individual tuners implemented as offsets from this base frequency. Changing the base frequency can be accomplished by exposing a module attribute of the RF driver to sysfs.
When an attribute is accessed, sysfs calls a function to handle that access, show() for read and store() for write. There is a one-page limit on the size of data that can be passed to show() or store() functions.
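From user space, an attribute looks like an ordinary file, so a read ends up in show() and a write ends up in store(). The following is a minimal sketch of that user-space side, assuming a 4096-byte page; read_attr(), write_attr(), and ATTR_BUF_SIZE are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define ATTR_BUF_SIZE 4096 /* assumption: one page is 4096 bytes */

/* Reads an attribute file into buf, which must hold ATTR_BUF_SIZE bytes.
 * Returns the number of bytes read (buf is NUL-terminated) or -1 on error. */
ssize_t read_attr(const char *path, char *buf)
{
    int fd;
    ssize_t n;

    assert(buf != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, ATTR_BUF_SIZE - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}

/* Writes the string val to an attribute file in a single write() call.
 * Refuses values larger than one page. Returns bytes written or -1. */
ssize_t write_attr(const char *path, const char *val)
{
    size_t len = strlen(val);
    int fd;
    ssize_t n;

    if (len > ATTR_BUF_SIZE)
        return -1;
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    n = write(fd, val, len);
    close(fd);
    return n;
}
```

Because at most one page is transferred per access, a single read() or write() call is enough for each attribute.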
With this outline of how sysfs works, we can now get into the specifics of how a driver registers with sysfs, exposes some attributes, and registers specific show() and store() functions to operate when those attributes are accessed.
The first task is to determine what device class your new device and driver should fall under (for example, usb_device, net_device, pci_device, sys_device, and so on). All these structures have a char *name field within them. sysfs uses this name field to display the new device within the sysfs hierarchy.
After a device structure is allocated and named, you must create and initialize a device_driver structure:
-----------------------------------------------------------------------
include/linux/device.h
102 struct device_driver {
103 char * name;
104 struct bus_type * bus;
105
106 struct semaphore unload_sem;
107 struct kobject kobj;
108 struct list_head devices;
109
110 int (*probe) (struct device * dev);
111 int (*remove) (struct device * dev);
112 void (*shutdown) (struct device * dev);
113 int (*suspend) (struct device * dev, u32 state, u32 level);
114 int (*resume) (struct device * dev, u32 level);
115 };
-----------------------------------------------------------------------
Line 103
name refers to the name of the driver that is displayed in the sysfs hierarchy.
Line 104
bus is usually filled in automatically; a driver writer need not worry about it.
Lines 105-115
The programmer does not need to set the rest of the fields. They should be automatically initialized at the bus level.
We can register our driver during initialization by calling driver_register(), which passes most of the work to bus_add_driver(). Similarly, upon driver exit, be sure to call driver_unregister().
-----------------------------------------------------------------------
drivers/base/driver.c
86 int driver_register(struct device_driver * drv)
87 {
88 INIT_LIST_HEAD(&drv->devices);
89 init_MUTEX_LOCKED(&drv->unload_sem);
90 return bus_add_driver(drv);
91 }
-----------------------------------------------------------------------
After driver registration, driver attributes can be created via driver_attribute structures and a helpful macro, DRIVER_ATTR:
-----------------------------------------------------------------------
include/linux/device.h
133 #define DRIVER_ATTR(_name,_mode,_show,_store) \
134 struct driver_attribute driver_attr_##_name = { \
135 .attr = {.name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE }, \
136 .show = _show, \
137 .store = _store, \
138 };
-----------------------------------------------------------------------
Line 135
name is the name of the attribute for the driver. mode is the bitmap describing the level of protection of the attribute. include/linux/stat.h contains many of these modes; two examples are S_IRUGO (readable by user, group, and others) and S_IWUSR (writable by the owner, typically root).
Line 136
show is the name of the driver function to use when the attribute is read via sysfs. If reads are not allowed, NULL should be used.
Line 137
store is the name of the driver function to use when the attribute is written via sysfs. If writes are not allowed, NULL should be used.
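The way the macro builds a named structure can be tried out in user space. This sketch is a simplified mock, not the kernel macro: MOCK_DRIVER_ATTR, struct mock_driver_attribute, MOCK_STRINGIFY, and version_show() are hypothetical names, and the show()/store() signatures are reduced to the buffer arguments only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h> /* ssize_t */

#define MOCK_PAGE_SIZE 4096 /* assumption: one page is 4096 bytes */

/* Single-level stringification; enough for plain identifiers. */
#define MOCK_STRINGIFY(x) #x

/* Simplified stand-in for struct driver_attribute. */
struct mock_driver_attribute {
    const char *name;
    unsigned int mode;
    ssize_t (*show)(char *buf);
    ssize_t (*store)(const char *buf, size_t count);
};

/* Same pattern as DRIVER_ATTR: ## pastes the attribute name into the
 * variable name, and # turns it into the string seen by the user. */
#define MOCK_DRIVER_ATTR(_name, _mode, _show, _store)         \
    struct mock_driver_attribute mock_driver_attr_##_name = { \
        .name = MOCK_STRINGIFY(_name),                        \
        .mode = (_mode),                                      \
        .show = (_show),                                      \
        .store = (_store),                                    \
    }

/* Hypothetical read handler: buf is assumed to hold MOCK_PAGE_SIZE bytes. */
ssize_t version_show(char *buf)
{
    assert(buf != NULL);
    return (ssize_t)snprintf(buf, MOCK_PAGE_SIZE, "1.0\n");
}

/* Read-only attribute: 0444 is the value of S_IRUGO, and store is NULL
 * because writes are not allowed. */
MOCK_DRIVER_ATTR(version, 0444, version_show, NULL);
```

The macro invocation defines a variable named mock_driver_attr_version whose name field is the string "version".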
The driver functions that implement show() and store() for a specific driver must adhere to the prototypes shown here:
-----------------------------------------------------------------------
include/linux/sysfs.h
34 struct sysfs_ops {
35 ssize_t (*show)(struct kobject *, struct attribute *,char *);
36 ssize_t (*store)(struct kobject *,struct attribute *,const char *, size_t);
37 };
-----------------------------------------------------------------------
Recall that the size of data read and written to sysfs attributes is limited to PAGE_SIZE bytes. The show() and store() driver attribute functions should ensure that this limit is enforced.
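One way to enforce the limit is sketched below as a user-space mock, reusing the RF transmitter base frequency as the attribute. Everything here is an assumption made for illustration: MOCK_PAGE_SIZE stands in for PAGE_SIZE, base_freq_khz is a hypothetical module variable, and the handler signatures are reduced to the buffer arguments.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define MOCK_PAGE_SIZE 4096 /* assumption: stands in for PAGE_SIZE */

static unsigned long base_freq_khz = 433000; /* hypothetical module variable */

/* Read handler: formats the value into buf (one page long) and never
 * reports more than fits in that page. */
ssize_t base_freq_show(char *buf)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, MOCK_PAGE_SIZE, "%lu\n", base_freq_khz);
    if (n < 0)
        return -EIO;
    if (n >= MOCK_PAGE_SIZE)
        n = MOCK_PAGE_SIZE - 1; /* output was truncated by snprintf */
    return n;
}

/* Write handler: rejects oversized or malformed input, otherwise stores
 * the new value and reports all count bytes as consumed. */
ssize_t base_freq_store(const char *buf, size_t count)
{
    char tmp[32];
    char *end;
    unsigned long val;

    if (count == 0 || count > MOCK_PAGE_SIZE || count >= sizeof(tmp))
        return -EINVAL;
    memcpy(tmp, buf, count);
    tmp[count] = '\0';
    if (!isdigit((unsigned char)tmp[0]))
        return -EINVAL;
    errno = 0;
    val = strtoul(tmp, &end, 10);
    if (errno == ERANGE)
        return -EINVAL;
    if (*end == '\n') /* tolerate the newline that echo appends */
        end++;
    if (*end != '\0')
        return -EINVAL;
    base_freq_khz = val;
    return (ssize_t)count;
}
```

Checking count before copying means an oversized write is rejected without touching the input buffer.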
This information should allow you to add basic sysfs functionality to kernel device drivers. For further sysfs and kobject reading, see the Documentation/driver-model directory.
Another type of device driver is a network device driver. Network devices send and receive packets of data and are not necessarily hardware devices; the loopback device, for example, is a software network device.