Serg Iakovlev

Input/Output

The kernel communicates with devices at the level of assembly instructions. This section covers the kernel's support for hardware and for file I/O.

We begin by looking at how the processor connects to the rest of the hardware and at the memory buses. The discussion covers chipsets and controllers for both x86 and PowerPC.

We then analyze the layers of code from the application down to the operating system, and look at block devices as used for disks and their controllers.

The link between disks and the operating system is examined in terms of I/O scheduling. To understand disk geometry, we look at how the system divides disks into partitions.

We see how a block driver makes it possible to control various devices. At the application level, we try out examples that use I/O to work with a disk controller.

We cover the differences between character devices, block devices, and network devices. We also touch on the DMA controller, the clock, and terminal devices.

Buses, Bridges, Ports, and Interfaces

The processor interacts with devices by means of buses. Buses are used to send data to devices, to receive data from them, and to control them. Buses fall into three types: address buses, control buses, and data buses.

The main function of the processor is to execute instructions. Instructions reside in memory, and the processor accesses this memory over buses.

Controllers can be divided into two groups: the Northbridge and the Southbridge. A bridge is a device that connects two buses. The following figure shows how the Northbridge and Southbridge interact with other devices. Taken together, these controllers are called the chipset.

Figure 5.1. Old Intel

The Northbridge connects to the memory controller and the PCI controller.

The Southbridge connects to low-speed devices, such as the IDE controller and USB.

Peripheral hardware such as the keyboard is reached through special serial or parallel ports. These are attached to the so-called I/O bus, which is a kind of control bus and provides a slow form of communication. On x86 there are dedicated instructions for it, such as inb (read a byte) and outb (write a byte). To avoid conflicts between the I/O bus and the memory bus, I/O accesses are issued through these dedicated I/O instructions. PowerPC uses a different I/O mechanism called memory-mapped I/O, in which a region of memory addresses is set aside for I/O.

On x86, the register of the first parallel port is located at I/O port 0x378, whereas on PPC it could, depending on the implementation, be located in memory at address 0xf0000300. To read data from this port on x86, we load the port number into dx and execute in al, dx (the immediate form of in accepts only 8-bit port numbers, so 0x378 cannot be encoded directly). On PPC the equivalent is a load such as lbz r3, 0(r4), with r4 holding 0xf0000300.
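A minimal sketch of the memory-mapped style of access, assuming the device register is reachable through an ordinary pointer. The helper names mmio_read8 and mmio_write8 are hypothetical. In a real driver the pointer would come from mapping a physical address such as 0xf0000300; here any byte of memory can stand in for the register.

```c
#include <assert.h>
#include <stdint.h>

/* Read one byte from a memory-mapped device register.
 * 'volatile' forces a real load on every call, so the compiler
 * cannot cache or drop the access. */
static uint8_t mmio_read8(const volatile uint8_t *reg)
{
    return *reg;
}

/* Write one byte to a memory-mapped device register. */
static void mmio_write8(volatile uint8_t *reg, uint8_t value)
{
    *reg = value;
}
```

With memory-mapped I/O the access is an ordinary load or store, so no special instruction is needed. Port I/O on x86 instead goes through inb and outb, which user programs can only use after being granted access to the port range.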

Intel's system architecture is moving toward a hub concept. The Northbridge is known as the Graphics and Memory Controller Hub (GMCH). It supports a high-performance AGP graphics controller and a DDR memory controller. With PCI Express, Intel chipsets are moving to a Memory Controller Hub (MCH) for graphics and a DDR2 memory controller. The Southbridge is known as the I/O Controller Hub (ICH). It is attached through a point-to-point bus called the Intel Hub Architecture (IHA). More information can be found on the Intel website.^[3] ^[4] Figure 5.2 illustrates the ICH.

^[3] http://www.intel.com/design/chipsets/datashts/25251405.pdf.

^[4] http://www.intel.com/design/chipsets/datashts/30146403.pdf.

Figure 5.2. New Intel Hub

AMD is moving from the Intel-style Northbridge/Southbridge design toward its packet-based HyperTransport technology. HyperTransport is compatible with PCI at the operating-system level.^[5] See the AMD website for details. Figure 5.3 illustrates the HyperTransport technology.

^[5] See AMD chipset datasheets for the 8000 series: http://www.amd.com/us-en/Processors/ProductInformation/0,30_118_6291_4886,00.html.

Figure 5.3. AMD HyperTransport

Apple has its own architecture for PowerPC, called the Universal Motherboard Architecture (UMA).

Devices

Two kinds of device files exist: block device files and character device files. Block devices transfer data in chunks, and character devices (as the name implies) transfer data one character at a time. A third device type, the network device, is a special case that exhibits attributes of both block and character devices. However, network devices are not represented by files.

The old method of assigned device numbers, in which the major number usually referred to a device driver or controller and the minor number to a particular device within that controller, is giving way to a new dynamic method called devfs. The history behind this change is that the major and minor numbers are both 8-bit values; this allows for little more than 200 statically allocated major devices for the entire planet. (Block and character devices each have their own list of 256 entries.) You can find the official listing of the allocated major and minor device numbers in /Documentation/devices.txt.
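As an illustration of the 8-bit major and 8-bit minor scheme, the sketch below packs the pair into one 16-bit device number, with the major number in the high byte. The names old_mkdev, old_major, and old_minor are hypothetical and chosen only for this example; they are not the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 16-bit device number: high 8 bits major, low 8 bits minor.
 * All names here are hypothetical, chosen for this sketch only. */
#define OLD_MINOR_BITS 8
#define OLD_MINOR_MASK ((1u << OLD_MINOR_BITS) - 1)

/* Pack a major/minor pair into one 16-bit value. */
static uint16_t old_mkdev(unsigned major, unsigned minor)
{
    return (uint16_t)(((major & 0xffu) << OLD_MINOR_BITS) |
                      (minor & OLD_MINOR_MASK));
}

/* Extract the major number (driver or controller). */
static unsigned old_major(uint16_t dev) { return dev >> OLD_MINOR_BITS; }

/* Extract the minor number (device within that controller). */
static unsigned old_minor(uint16_t dev) { return dev & OLD_MINOR_MASK; }
```

For example, old_mkdev(3, 0) yields 0x0300. Because each field is a single byte, neither the major nor the minor number can exceed 255, which is the limit the paragraph above describes.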

The Linux Device Filesystem (devfs) has been in the kernel since version 2.3.46. devfs is not included by default in the 2.6.7 kernel build, but it can be enabled by setting CONFIG_DEVFS_FS=y in the configuration file. With devfs, a module can register a device by name rather than by a major/minor number pair. For compatibility, devfs allows the use of old major/minor numbers or generates a unique 16-bit device number on any given system.

5.2.1. Block Device Overview

As previously mentioned, the Linux operating system sees all devices as files. Any given element in a block device can be randomly referenced. A good example of a block device is the disk drive. The filesystem name for the first IDE disk is /dev/hda. The associated major number of /dev/hda is 3, and the minor number is 0. The disk drive itself usually has a controller and is electro-mechanical by nature (that is, it has moving parts). The "General System File" section in Chapter 6, "Filesystems," discusses the basic construction of a hard disk.

5.2.1.1. Generic Block Device Layer

The device driver registers itself at driver initialization time. This adds the driver to the kernel's driver table, mapping the device number to the block_device_operations structure. The block_device_operations structure contains the functions for starting and stopping a given block device in the system:

 
 -------------------------------------------------------------------------
 include/linux/fs.h
 760  struct block_device_operations {
 761   int (*open) (struct inode *, struct file *);
 762   int (*release) (struct inode *, struct file *);
 763   int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
 764   int (*media_changed) (struct gendisk *);
 765   int (*revalidate_disk) (struct gendisk *);
 766   struct module *owner;
 767  };
 -------------------------------------------------------------------------

The interfaces to the block device are similar to other devices. The functions open() (on line 761) and release() (on line 762) are synchronous (that is, they run to completion when called). The most important functions, read() and write(), are implemented differently with block devices because of their mechanical nature. Consider accessing a block of data from a disk drive. The amount of time it takes to position the head on the proper track and for the disk to rotate to the desired block can take a long time, from the processor's point of view. This latency is the driving force for the implementation of the system request queue. When the filesystem requests a block (or more) of data, and it is not in the local page cache, it places the request on a request queue and passes this queue on to the generic block device layer. The generic block device layer then determines the most efficient way to mechanically retrieve (or store) the information, and passes this on to the hard disk driver.

Most importantly, at initialization time, the block device driver registers a request queue handler with the kernel (specifically with the block device manager) to facilitate the read/write operations for the block device. The generic block device layer acts as an interface between the filesystem and the register-level interface of the device and allows for per-queue tuning of the read and write queues to make better use of the new and smarter devices available. This is accomplished through the tagged command queuing helper utilities. For example, if a device on a given queue supports command queuing, read and write operations can be optimized to exploit the underlying hardware by reordering requests. An example of per-queue tuning in this case would be the ability to set how many requests are allowed to be pending. See Figure 5.4 for an illustration of how the application layer, the filesystem layer, the generic block device layer, and the device driver interrelate. The file biodoc.txt under /Documentation/block has more helpful information on this layer and information regarding changes from earlier kernels.

Figure 5.4. Block Read/Write

5.2.2. Request Queues and Scheduling I/O

When a read or write request traverses the layers from VFS, through the filesystem drivers and page cache,^[6] it eventually ends up entering the block device driver to perform the actual I/O on the device that holds the data requested.

^[6] This traversal is described in Chapter 6.

As previously mentioned, the block device driver creates and initializes a request queue upon initialization. This initialization also determines the I/O scheduling algorithm to use when a read or write is attempted on the block device. The I/O scheduling algorithm is also known as the elevator algorithm.

The default I/O scheduling algorithm is determined by the kernel at boot time with the default being the anticipatory I/O scheduler.^[7] By setting the kernel parameter elevator to the following values, you can change the type of I/O scheduler:

^[7] Some block device drivers can change their I/O scheduler during runtime, if it's visible in sysfs.

deadline. For the deadline I/O scheduler
noop. For the no-operation I/O scheduler
as. For the anticipatory I/O scheduler

As of this writing, a patch exists that makes the I/O schedulers fully modular. Using modprobe, the user can load the modules and switch between them on the fly.^[8] With this patch, at least one scheduler must be compiled into the kernel to begin with.

^[8] For more information, do a Web search on "Jens Axboe" and "Modular IO Schedulers."

Before we can describe how these I/O schedulers work, we need to touch on the basics of request queues.

Block devices use request queues to order the many block I/O requests the devices are given. Certain block devices, such as a RAM disk, might have little need for ordering requests because the I/O requests to the device have little overhead. Other block devices, like hard drives, need to order requests because there is a great overhead in reading and writing. As previously mentioned, the head of the hard drive has to move from track to track, and each movement is agonizingly slow from the CPU's perspective.

Request queues solve this problem by attempting to order block I/O read and write requests in a manner that optimizes throughput but does not indefinitely postpone requests. A common and useful analogy of I/O scheduling is to look at how elevators work.^[9] If you were to order the stops an elevator took by the order of the requests, you would have the elevator moving inefficiently from floor to floor; it could go from the penthouse to the ground floor without ever stopping for anyone in between. By responding to requests that occur while the elevator travels in the same direction, it increases the elevator's efficiency and the riders' happiness. Similarly, I/O requests to a hard disk should be grouped together to avoid the high overhead of repeatedly moving the disk head back and forth. All the I/O schedulers mentioned (no-op, deadline, and anticipatory) implement this basic elevator functionality. The following sections look at these elevators in more detail.

^[9] This analogy is why I/O schedulers are also referred to as elevators.
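The benefit of elevator ordering can be checked with a small calculation. The sketch below, with hypothetical helper names, adds up how far a disk head travels when requests are served in arrival order and again after they are sorted into one upward sweep. Sector numbers stand in for head positions, which is a simplification.

```c
#include <assert.h>
#include <stdlib.h>

/* Total distance the head travels when servicing 'n' requests in
 * array order, starting from sector 'start'. */
static long head_travel(long start, const long *sectors, size_t n)
{
    long pos = start, total = 0;
    for (size_t i = 0; i < n; i++) {
        total += labs(sectors[i] - pos);
        pos = sectors[i];
    }
    return total;
}

/* Comparison callback for qsort: ascending sector order. */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Reorder the requests into a single upward sweep. */
static void sort_requests(long *sectors, size_t n)
{
    qsort(sectors, n, sizeof sectors[0], cmp_long);
}
```

Starting at sector 0 with requests for sectors 90, 10, 80, and 20, arrival order costs 90 + 80 + 70 + 60 = 300 sectors of head movement, while the sorted order 10, 20, 80, 90 costs 10 + 10 + 60 + 10 = 90.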

5.2.2.1. No-Op I/O Scheduler

The no-op I/O scheduler^[10] takes a request and scans through its queue to determine if it can be merged with an existing request. This occurs if the new request is close to an existing request. If the new request is for I/O blocks before an existing request, it is merged on the front of the existing request. If the new request is for I/O blocks after an existing request, it is merged on the back of the existing request. In normal I/O, we read the beginning of a file before the end, and thus, most requests are merged onto the back of existing requests.

^[10] The code for the no-op I/O scheduler is located in drivers/block/noop-iosched.c.

If the no-op I/O scheduler finds that the new request cannot be merged into the existing request because it is not near enough, the scheduler looks for a place within the queue between existing requests. If the new request calls for I/O to sectors between existing requests it is inserted into the queue at the determined position. If there are no places the request can be inserted, it is placed on the tail of the request queue.
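The three steps just described (merge, insert between neighbours, append to the tail) can be sketched as follows. This is a simplified model under stated assumptions, not the code in noop-iosched.c: the queue is a small fixed array, "close" means exactly adjacent, and every type and function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define QUEUE_MAX 16

struct req   { long start; long nsect; };   /* sectors [start, start + nsect) */
struct queue { struct req r[QUEUE_MAX]; int n; };

enum add_result { BACK_MERGE, FRONT_MERGE, INSERTED, APPENDED, QUEUE_FULL };

static enum add_result noop_add(struct queue *q, long start, long nsect)
{
    /* 1. Try to merge with an adjacent existing request. */
    for (int i = 0; i < q->n; i++) {
        struct req *r = &q->r[i];
        if (start == r->start + r->nsect) {   /* new blocks follow r */
            r->nsect += nsect;
            return BACK_MERGE;
        }
        if (start + nsect == r->start) {      /* new blocks precede r */
            r->start = start;
            r->nsect += nsect;
            return FRONT_MERGE;
        }
    }
    if (q->n == QUEUE_MAX)
        return QUEUE_FULL;

    /* 2. Look for two neighbours whose sectors bracket the new request. */
    for (int i = 1; i < q->n; i++) {
        if (q->r[i - 1].start < start && start < q->r[i].start) {
            memmove(&q->r[i + 1], &q->r[i],
                    (size_t)(q->n - i) * sizeof q->r[0]);
            q->r[i] = (struct req){ start, nsect };
            q->n++;
            return INSERTED;
        }
    }

    /* 3. No suitable place: put the request on the tail. */
    q->r[q->n++] = (struct req){ start, nsect };
    return APPENDED;
}
```

A request for sectors 108 to 115 arriving after one for sectors 100 to 107 is merged onto its back, which matches the common case of reading a file from beginning to end.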

5.2.2.2. Deadline I/O Scheduler

The no-op I/O scheduler suffers from a major problem: with enough close requests, new requests are never handled. Many new requests that are close to existing ones would be either merged or inserted between existing elements, and other new requests would pile up at the tail of the request queue. The deadline scheduler^[11] attempts to solve this problem by assigning each request an expiration time. It uses two additional queues to manage time efficiency, as well as a queue similar to that of the no-op algorithm to model disk efficiency.

^[11] The code for the deadline I/O scheduler is located in drivers/block/deadline-iosched.c.

When an application makes a read request, it typically waits until that request is fulfilled before continuing. Write requests, on the other hand, will not normally cause an application to wait; the write can execute in the background while the application continues on to other tasks. The deadline I/O scheduler uses this information to favor read requests over write requests. A read queue and write queue are kept in addition to the queue sorted by a request's sector proximity. In the read and write queue, requests are ordered by time (FIFO).

When a new request comes in, it is placed on the sorted queue as in the no-op scheduler. The request is also placed on either the read queue or write queue depending on its I/O request. When the deadline I/O scheduler handles a request, it first checks the head of the read queue to see if that request has expired. If that request's expiration time has been reached, it is immediately handled. Similarly, if no read request has expired, the scheduler checks the write queue to see if the request at its head has expired; if so, it is immediately handled. The standard queue is checked only when no reads or writes have expired, and requests are handled in nearly the same way as the no-op algorithm.

Read requests also expire faster than write requests: half a second versus 5 seconds in the default case. This expiration difference and the preference of handling read requests over write requests can lead to write requests being starved by numerous read requests. As such, a parameter tells the deadline I/O scheduler the maximum number of times reads can starve a write; the default is 2, but because sequential requests can be treated as a single request, 32 sequential read requests could pass before a write request is considered starved.^[12]

^[12] See lines 24-27 of deadline-iosched.c for parameter definitions.
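The order in which the deadline scheduler consults its queues can be sketched as a single decision function. This is an illustrative model only: it looks at the head of each queue and leaves out batching and the write-starvation counter. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct dl_req { long sector; long deadline; };   /* deadline in arbitrary ticks */

/* Hypothetical view of the three queues: only the head of each matters
 * for the decision. A NULL head means that queue is empty. */
struct dl_queues {
    const struct dl_req *read_fifo_head;
    const struct dl_req *write_fifo_head;
    const struct dl_req *sorted_head;
};

/* Choose the next request to service at time 'now'. */
static const struct dl_req *deadline_pick(const struct dl_queues *q, long now)
{
    if (q->read_fifo_head && q->read_fifo_head->deadline <= now)
        return q->read_fifo_head;     /* an expired read goes first */
    if (q->write_fifo_head && q->write_fifo_head->deadline <= now)
        return q->write_fifo_head;    /* then an expired write */
    return q->sorted_head;            /* otherwise follow sector order */
}
```

While nothing has expired the function returns the head of the sorted queue, so the scheduler behaves like a plain elevator until a deadline forces it to jump.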

5.2.2.3. Anticipatory I/O Scheduling

One of the problems with the deadline I/O scheduling algorithm occurs during intensive write operations. Because of the emphasis on maximizing read efficiency, a write request can be preempted by a read, have the disk head seek to a new location, and then return to the write request and have the disk head seek back to its original location. Anticipatory I/O scheduling^[13] attempts to anticipate what the next operation is and aims to improve I/O throughput in doing so.

^[13] The code for anticipatory I/O scheduling is located in drivers/block/as-iosched.c.

Structurally, the anticipatory I/O scheduler is similar to the deadline I/O scheduler. There exist a read and write queue each ordered by time (FIFO) and a default queue that is ordered by sector proximity. The main difference is that after a read request, the scheduler does not immediately proceed to handling other requests. It does nothing for 6 milliseconds in anticipation of an additional read. If another read request does occur to an adjacent area, it is immediately handled. After the anticipation period, the scheduler returns to its normal operation as described under the deadline I/O scheduler.

This anticipation period helps minimize the I/O delay associated with moving the disk head from sector to sector across the block device.
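The anticipation step can be pictured as a small decision taken after each completed read. The sketch below is a simplified model under stated assumptions: the 6 ms window comes from the text, while the enum, the function name, and the idea of passing in a "nearby read pending" flag are hypothetical.

```c
#include <assert.h>

#define AS_ANTIC_MS 6   /* anticipation window from the text, in milliseconds */

enum as_action {
    AS_DISPATCH_NEARBY_READ,   /* a close read arrived in time: serve it now */
    AS_KEEP_WAITING,           /* still inside the window: leave the disk idle */
    AS_RESUME_NORMAL           /* window over: fall back to deadline-style rules */
};

/* Decide what to do 'ms_since_read' milliseconds after a read completed. */
static enum as_action as_decide(long ms_since_read, int nearby_read_pending)
{
    if (ms_since_read >= AS_ANTIC_MS)
        return AS_RESUME_NORMAL;
    return nearby_read_pending ? AS_DISPATCH_NEARBY_READ : AS_KEEP_WAITING;
}
```

The cost of this design is a short idle period after every read; the bet is that a nearby read usually arrives within it and saves a pair of long seeks.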

Like the deadline I/O scheduler, a number of parameters control the anticipatory I/O scheduling algorithm. The default time for reads to expire is 1/8 second and the default time for writes to expire is 1/4 second. Two parameters control when to check to switch between streams of reads and writes.^[14] A stream of reads checks for expired writes after 1/2 second and a stream of writes checks for expired reads after 1/8 second.

^[14] See lines 30-60 of as-iosched.c for parameter definitions.

The default I/O scheduler is the anticipatory I/O scheduler because it optimizes throughput for most applications and block devices. The deadline I/O scheduler is sometimes better for database applications or those with high disk performance requirements. The no-op I/O scheduler is usually used in systems where I/O seek time is nearly negligible, such as embedded systems running from RAM.

We now turn our attention from the various I/O schedulers in the Linux kernel to the request queue itself and the manner in which block devices initialize request queues.

5.2.2.4. Request Queue

In Linux 2.6, each block device has its own request queue that manages I/O requests to that device. A process can only update a device's request queue if it has obtained the lock of the request queue. Let's examine the request_queue structure:

 
 -------------------------------------------------------------------------
 include/linux/blkdev.h
 270 struct request_queue
 271 {
 272   /*
 273   * Together with queue_head for cacheline sharing
 274   */
 275   struct list_head  queue_head;
 276   struct request   *last_merge;
 277   elevator_t    elevator;
 278
 279   /*
 280   * the queue request freelist, one for reads and one for writes
 281   */
 282   struct request_list  rq;
 ------------------------------------------------------------------------

Line 275

This line is a pointer to the head of the request queue.

Line 276

This is the last request placed into the request queue.

Line 277

The scheduling function (elevator) used to manage the request queue. This can be one of the standard I/O schedulers (noop, deadline, or anticipatory) or a new type of scheduler specifically designed for the block device.

Line 282

The request_list is a structure composed of two wait_queues: one for queuing reads to the block device and one for queuing writes.

 
 -------------------------------------------------------------------------
 include/linux/blkdev.h
 283
 284   request_fn_proc   *request_fn;
 285   merge_request_fn  *back_merge_fn;
 286   merge_request_fn  *front_merge_fn;
 287   merge_requests_fn  *merge_requests_fn;
 288   make_request_fn   *make_request_fn;
 289   prep_rq_fn    *prep_rq_fn;
 290   unplug_fn    *unplug_fn;
 291   merge_bvec_fn   *merge_bvec_fn;
 292   activity_fn    *activity_fn;
 293
 -------------------------------------------------------------------------

Lines 283-293

These scheduler- (or elevator-) specific functions can be defined to control how requests are managed for the block device.

 
 -------------------------------------------------------------------------
 include/linux/blkdev.h
 294   /*
 295   * Auto-unplugging state
 296   */
 297   struct timer_list  unplug_timer;
 298   int      unplug_thresh; /* After this many requests */
 299   unsigned long   unplug_delay; /* After this many jiffies*/
 300   struct work_struct  unplug_work;
 301
 302   struct backing_dev_info backing_dev_info;
 303
 -------------------------------------------------------------------------

Lines 294-303

These fields are used to unplug the I/O scheduling function used on the block device. Plugging refers to the practice of waiting for more requests to fill the request queue, with the expectation that more requests allow the scheduling algorithm to order and sort I/O requests in a way that reduces the time it takes to perform them. For example, a hard drive "plugs" a certain number of read requests with the expectation that it moves the disk head less when more reads exist. It's more likely that the reads can be arranged sequentially or even clustered together into a single large read. Unplugging refers to the method in which a device decides that it can wait no longer and must service the requests it has, regardless of possible future optimizations. See Documentation/block/biodoc.txt for more information.
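Plugging and unplugging can be modelled with a counter and a timestamp. In the sketch below, the names unplug_thresh and unplug_delay mirror the fields in the listing; everything else is hypothetical. The delay is checked only when a request arrives, whereas the listing shows the kernel keeping a timer (unplug_timer) for that purpose.

```c
#include <assert.h>
#include <stdbool.h>

struct plug_state {
    bool plugged;
    int  pending;         /* requests collected while plugged */
    int  unplug_thresh;   /* unplug after this many requests */
    long unplug_delay;    /* ...or after this many ticks */
    long plugged_at;      /* tick at which the queue was plugged */
};

/* Queue one request at time 'now'. Returns true if the queue must be
 * unplugged, that is, the collected requests handed to the driver. */
static bool plug_add_request(struct plug_state *p, long now)
{
    if (!p->plugged) {            /* the first request plugs the queue */
        p->plugged = true;
        p->plugged_at = now;
        p->pending = 0;
    }
    p->pending++;
    if (p->pending >= p->unplug_thresh ||
        now - p->plugged_at >= p->unplug_delay) {
        p->plugged = false;       /* stop waiting and service the batch */
        return true;
    }
    return false;
}
```

With a threshold of 3, the third request in quick succession triggers an unplug; with a delay of 100 ticks, a request arriving that long after the queue was plugged does the same even if fewer requests are waiting.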

 
 -------------------------------------------------------------------------
 include/linux/blkdev.h
 304   /*
 305   * The queue owner gets to use this for whatever they like.
 306   * ll_rw_blk doesn't touch it.
 307   */
 308   void     *queuedata;
 309
 310   void     *activity_data;
 311
 -------------------------------------------------------------------------

Lines 304-311

As the inline comments suggest, these fields hold request queue management data that is specific to the device and/or device driver.

 
 -------------------------------------------------------------------------
 include/linux/blkdev.h
 312   /*
 313   * queue needs bounce pages for pages above this limit
 314   */
 315   unsigned long   bounce_pfn;
 316   int      bounce_gfp;
 317
 -------------------------------------------------------------------------

Lines 312-317

Bouncing refers to the practice of the kernel copying high-memory buffer I/O requests to low-memory buffers. In Linux 2.6, the kernel allows the device itself to manage high-memory buffers if it wants. Bouncing now typically occurs only if the device cannot handle high-memory buffers.

 
 -------------------------------------------------------------------------
 include/linux/blkdev.h
 318   /*
 319   * various queue flags, see QUEUE_* below
 320   */
 321   unsigned long   queue_flags;
 322
 -------------------------------------------------------------------------

Lines 318-321

The queue_flags variable stores one or more of the queue flags shown in Table 5.1 (see include/linux/blkdev.h, lines 368-375).

Table 5.1. queue_flags

Flag Name               Flag Function
QUEUE_FLAG_CLUSTER      cluster several segments into 1
QUEUE_FLAG_QUEUED       uses generic tag queuing
QUEUE_FLAG_STOPPED      queue is stopped
QUEUE_FLAG_READFULL     read queue has been filled
QUEUE_FLAG_WRITEFULL    write queue has been filled
QUEUE_FLAG_DEAD         queue being torn down
QUEUE_FLAG_REENTER      re-entrancy avoidance
QUEUE_FLAG_PLUGGED      queue is plugged
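Each flag is a bit number rather than a mask, which is why the queue initialization code sets the field with an expression of the form (1 << QUEUE_FLAG_CLUSTER). The sketch below shows that usage. The bit values are an assumption that simply follows the order of Table 5.1, and the helper names are hypothetical; consult include/linux/blkdev.h for the real definitions.

```c
#include <assert.h>

/* Bit numbers assumed to follow the order of Table 5.1; the real
 * values are defined in include/linux/blkdev.h. */
enum {
    QUEUE_FLAG_CLUSTER,
    QUEUE_FLAG_QUEUED,
    QUEUE_FLAG_STOPPED,
    QUEUE_FLAG_READFULL,
    QUEUE_FLAG_WRITEFULL,
    QUEUE_FLAG_DEAD,
    QUEUE_FLAG_REENTER,
    QUEUE_FLAG_PLUGGED
};

/* Set, clear, and test one flag in a queue_flags-style word. */
static void flag_set(unsigned long *flags, int bit)   { *flags |=  (1UL << bit); }
static void flag_clear(unsigned long *flags, int bit) { *flags &= ~(1UL << bit); }
static int  flag_test(unsigned long flags, int bit)   { return (int)((flags >> bit) & 1UL); }
```

Because the flags are independent bits, several can be set at once, for example a queue that clusters segments and is currently plugged.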

 
 -------------------------------------------------------------------------
 include/linux/blkdev.h
 323   /*
 324   * protects queue structures from reentrancy
 325   */
 326   spinlock_t    *queue_lock;
 327
 328   /*
 329   * queue kobject
 330   */
 331   struct kobject kobj;
 332
 333   /*
 334   * queue settings
 335   */
 336   unsigned long   nr_requests; /* Max # of requests */
 337   unsigned int   nr_congestion_on;
 338   unsigned int   nr_congestion_off;
 339
 340   unsigned short   max_sectors;
 341   unsigned short   max_phys_segments;
 342   unsigned short   max_hw_segments;
 343   unsigned short   hardsect_size;
 344   unsigned int   max_segment_size;
 345
 346   unsigned long   seg_boundary_mask;
 347   unsigned int   dma_alignment;
 348
 349   struct blk_queue_tag *queue_tags;
 350
 351   atomic_t    refcnt;
 352
 353   unsigned int   in_flight;
 354
 355   /*
 356   * sg stuff
 357   */
 358   unsigned int   sg_timeout;
 359   unsigned int   sg_reserved_size;
 360 };
 -------------------------------------------------------------------------

Lines 323-360

These variables define manageable resources of the request queue, such as locks (line 326) and kernel objects (line 331). Specific request queue settings, such as the maximum number of requests (line 336) and the physical constraints of the block device (lines 340-347), are also provided. SCSI attributes (lines 355-359) can also be defined, if they're applicable to the block device. If you want to use tagged command queuing, use the queue_tags structure (on line 349). The refcnt and in_flight fields (on lines 351 and 353) count the number of references to the queue (commonly used in locking) and the number of requests that are in process ("in flight").

In the 2.6 Linux kernel, a block device initializes its request queue simply by calling the following function from the device's __init function. Within this function, we can see the anatomy of a request queue and its associated helper routines. In the 2.6 Linux kernel, each block device controls its own locking, which is contrary to some earlier versions of Linux, and passes a spinlock as the second argument. The first argument is a request function that the block device driver provides.

 
 -------------------------------------------------------------------------
 drivers/block/ll_rw_blk.c
 1397 request_queue_t *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
 1398  {
 1399   request_queue_t *q;
 1400   static int printed;
 1401 
 1402   q = blk_alloc_queue(GFP_KERNEL);
 1403   if (!q)
 1404   return NULL;
 1405
 1406   if (blk_init_free_list(q))
 1407    goto out_init;
 1408   
 1409   if (!printed) {
 1410    printed = 1;
 1411    printk("Using %s io scheduler\n", chosen_elevator->elevator_name);
 1412   }
 1413  
 1414   if (elevator_init(q, chosen_elevator))
 1415    goto out_elv;
 1416  
 1417   q->request_fn   = rfn;
 1418   q->back_merge_fn   = ll_back_merge_fn;
 1419   q->front_merge_fn   = ll_front_merge_fn;
 1420   q->merge_requests_fn  = ll_merge_requests_fn;
 1421   q->prep_rq_fn   = NULL;
 1422   q->unplug_fn   = generic_unplug_device;
 1423   q->queue_flags   = (1 << QUEUE_FLAG_CLUSTER);
 1424   q->queue_lock   = lock;
 1425  
 1426   blk_queue_segment_boundary(q, 0xffffffff);
 1427  
 1428   blk_queue_make_request(q, __make_request);
 1429   blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);
 1430  
 1431   blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 1432   blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 1433  
 1434   return q;
 1435  out_elv:
 1436   blk_cleanup_queue(q);
 1437  out_init:
 1438   kmem_cache_free(requestq_cachep, q);
 1439   return NULL;
 1440  }
 -------------------------------------------------------------------------

Line 1402

Allocate the queue from kernel memory and zero its contents.

Line 1406

Initialize the request list that contains a read queue and a write queue.

Line 1414

Associate the chosen elevator with this queue and initialize.

Lines 1417-1424

Associate the elevator-specific functions with this queue.

Line 1426

This function sets the boundary for segment merging and checks that it is at least a minimum size.

Line 1428

This function sets the function used to get requests off the queue by the driver. It allows an alternate function to be used to bypass the queue.

Line 1429

Initialize the upper-size limit on a combined segment.

Line 1431

Initialize the maximum segments the physical device can handle.

Line 1432

Initialize the maximum number of physical segments per request.

The values for lines 1429-1432 are set in include/linux/blkdev.h.

Line 1434

Return the initialized queue.

Lines 1435-1439

Routines to clean up memory in the event of an error.

We now have the request queue in place and initialized.

Before we explore the generic device layer and the generic block driver, let's quickly trace the layers of software it takes to get to the manipulation of I/O in the block device. (Refer to Figure 5.4.)

At the application level, an application initiates a file operation such as fread(). This request is taken by the virtual filesystem (VFS) layer (covered in Chapter 6), which finds the file's dentry structure and, through the inode structure, calls the file's read() function. The VFS layer tries to find the requested page in its buffer cache; on a miss, the filesystem handler is called to acquire the appropriate physical blocks. The inode is linked to the filesystem handler, which is associated with the correct filesystem. The filesystem handler calls on the request queue utilities, which are part of the generic block device layer, to create a request with the correct physical blocks and device. The request is put on the request queue, which is maintained by the generic block device layer.

5.2.3. Example: "Generic" Block Driver

We now look at the generic block device layer. Referring to Figure 5.4, it resides above the physical device layer and just below the filesystem layer. The most important job of the generic block layer is to maintain request queues and their related helper routines.

We first register our device with register_blkdev(major, dev_name, fops). This function takes in the requested major number, the name of this block device (this appears in the /dev directory), and a pointer to the file operations structure. If successful, it returns the desired major number.

Next, we create the gendisk structure.

The function alloc_disk(int minors) in include/linux/genhd.h takes in the number of partitions and returns a pointer to the gendisk structure. We now look at the gendisk structure:

 
 -------------------------------------------------------------------------
 include/linux/genhd.h
 081  struct gendisk {
 082   int major;    /* major number of driver */
 083   int first_minor;
 084   int minors;
 085   char disk_name[16];   /* name of major driver */
 086   struct hd_struct **part;  /* [indexed by minor] */
 087   struct block_device_operations *fops;
 088   struct request_queue *queue;
 089   void *private_data;
 090   sector_t capacity;
 091
 092   int flags;
 093   char devfs_name[64];   /* devfs crap */
 094   int number;    /* more of the same */
 095   struct device *driverfs_dev;
 096   struct kobject kobj;
 097
 098   struct timer_rand_state *random;
 099   int policy;
 100
 101   unsigned sync_io;   /* RAID */
 102   unsigned long stamp, stamp_idle;
 103   int in_flight;
 104  #ifdef  CONFIG_SMP
 105   struct disk_stats *dkstats;
 106  #else
 107   struct disk_stats dkstats;
 108  #endif
 109  };  
 -------------------------------------------------------------------------

Line 82

The major field is filled in from the result of register_blkdev().

Line 83

A block device for a hard drive could handle several physical drives. Although it is driver dependent, the minor number usually labels each physical drive. The first_minor field is the first of the physical drives.

Line 85

The disk_name, such as hda or sdb, is the text name for an entire disk. (Partitions within a disk are named hda1, hda2, and so on.) These are logical disks within a physical disk device.

Line 87

The fops field is the block_device_operations initialized to the file operations structure. The file operations structure contains pointers to the helper functions in the low-level device driver. These functions are driver dependent in that they are not all implemented in every driver. Commonly implemented file operations are open, close, read, and write. Chapter 6, "Filesystems," discusses the file operations structure.

Line 88

The queue field points to the list of requested operations that the driver must perform. Initialization of the request queue is discussed shortly.

Line 89

The private_data field is for driver-dependent data.

Line 90

The capacity field is to be set with the drive size (in 512-byte sectors). A call to set_capacity() should furnish this value.
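As a quick illustration of the unit involved, the following user-space sketch converts a drive size in bytes into the sector count a driver would hand to set_capacity(). The demo_ names are ours and are not part of the kernel.

```c
#include <assert.h>

#define DEMO_SECTOR_SIZE 512ULL  /* bytes per sector used for the capacity field */

/* Convert a drive size in bytes to a count of 512-byte sectors.
 * Any trailing partial sector is dropped by the integer division. */
static unsigned long long demo_bytes_to_sectors(unsigned long long bytes)
{
    assert(DEMO_SECTOR_SIZE != 0);
    return bytes / DEMO_SECTOR_SIZE;
}
```

For example, a 1MB RAM disk (1,048,576 bytes) would be reported as 2,048 sectors.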

Line 92

The flags field indicates device attributes. In case of a disk drive, it is the type of media, such as CD, removable, and so on.

Now, we look at what is involved with initializing the request queue. With the queue already declared, we call blk_init_queue(request_fn_proc, spinlock_t). This function takes, as its first parameter, the transfer function to be called on behalf of the filesystem. The function blk_init_queue() allocates the queue with blk_alloc_queue() and then initializes the queue structure. The second parameter to blk_init_queue() is a lock to be associated with the queue for all operations.

Finally, to make this block device visible to the kernel, the driver must call add_disk():

 
 -------------------------------------------------------------------------
 drivers/block/genhd.c
 193  void add_disk(struct gendisk *disk)
 194  {
 195   disk->flags |= GENHD_FL_UP;
 196   blk_register_region(MKDEV(disk->major, disk->first_minor),
 197     disk->minors, NULL, exact_match, exact_lock, disk);
 198   register_disk(disk);
 199   blk_register_queue(disk);
 200  }
 -------------------------------------------------------------------------

Line 196

This device is mapped into the kernel based on size and number of partitions.

The call to blk_register_region() has the following six parameters:

The disk major number and first minor number are built into this parameter.
This is the range of minor numbers after the first (if this driver handles multiple minor numbers).
This is the loadable module containing the driver (if any).
exact_match is a routine to find the proper disk.
exact_lock is a locking function for this code once the exact_match routine finds the proper disk.
Disk is the handle used for the exact_match and exact_lock functions to identify a specific disk.
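To make the first two parameters more concrete, here is a simplified user-space model of how a major and minor number are packed into one device number, and of what it means for a device to fall inside a registered range of minors. It assumes the 20-bit minor field used by the 2.6 kernels; the demo_ names are hypothetical and stand in for MKDEV(), MAJOR(), and MINOR().

```c
#include <assert.h>

#define DEMO_MINORBITS 20                          /* assumed width of the minor field */
#define DEMO_MINORMASK ((1U << DEMO_MINORBITS) - 1)

typedef unsigned int demo_dev_t;

/* Pack a major and minor number into a single device number. */
static demo_dev_t demo_mkdev(unsigned int major, unsigned int minor)
{
    assert(minor <= DEMO_MINORMASK);
    return (major << DEMO_MINORBITS) | minor;
}

/* Recover the two halves again. */
static unsigned int demo_major(demo_dev_t dev) { return dev >> DEMO_MINORBITS; }
static unsigned int demo_minor(demo_dev_t dev) { return dev & DEMO_MINORMASK; }

/* Does dev fall inside the region of `range` minors starting at `first`? */
static int demo_in_region(demo_dev_t first, unsigned int range, demo_dev_t dev)
{
    return dev >= first && dev - first < range;
}
```

With a region of 64 minors starting at major 3, minor 0, the device (3, 63) is inside the region and (3, 64) is not.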

Line 198

register_disk checks for partitions and adds them to the filesystem.

Line 199

Register the request queue associated with this disk.

5.2.4. Device Operations

The basic generic block device has open, close (release), ioctl, and most important, the request function. At the least, the open and close functions could be simple usage counters. The ioctl() interface can be used for debug and performance measurements by bypassing the various software layers. The request function, which is called when a request is put on the queue by the filesystem, extracts the request structure and acts upon its contents. Depending on whether the request is a read or write, the device takes the appropriate action.

The request queue is not accessed directly, but by a set of helper routines. (These can be found in drivers/block/elevator.c and include/linux/blkdev.h.) In keeping with our basic device model, we want to include the ability to act on the next request in our request function:

 
 -------------------------------------------------------------------------
 drivers/block/elevator.c
 186  struct request *elv_next_request(request_queue_t *q)
 -------------------------------------------------------------------------

This helper function returns a pointer to the next request structure. By examining the elements, the driver can glean all the information needed to determine the size, direction, and any other custom operations associated with this request.

When the driver finishes this request, it indicates this to the kernel by using the end_request() helper function:

 
 -------------------------------------------------------------------------
 drivers/block/ll_rw_blk.c
 2599  void end_request(struct request *req, int uptodate)
 2600  {
 2601  if (!end_that_request_first(req, uptodate, req->hard_cur_sectors)) {
 2602  add_disk_randomness(req->rq_disk);
 2603  blkdev_dequeue_request(req);
 2604  end_that_request_last(req);
 2605   }
 2606  }
 -------------------------------------------------------------------------

Line 2599

Pass in the request acquired from elv_next_request().

Line 2601

end_that_request_first() transfers the proper number of sectors. (If sectors are pending, end_request() simply returns.)

Line 2602

Add to the system entropy pool. The entropy pool is the system method for generating random numbers from a function fast enough to be called at interrupt time. The basic idea is to collect bytes of data from various drivers in the system and generate a random number from them. Chapter 10, "Adding Your Code to the Kernel," discusses this. Another explanation is at the head of the file /drivers/char/random.c.

Line 2603

Remove request structure from the queue.

Line 2604

Collect statistics and make the structure available to be freed.

From this point on, the generic driver services requests until it is released.
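The cycle of fetching the next request, acting on it, and ending it can be modeled in user space. The sketch below is not kernel code: the demo_ types and functions are hypothetical stand-ins for the request queue, elv_next_request(), and end_request(), and the "transfer" is reduced to counting sectors.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QUEUE_LEN 8

struct demo_request {
    int write;                /* 0 = read, 1 = write */
    unsigned long sector;     /* starting sector */
    unsigned int nr_sectors;  /* number of sectors to transfer */
};

struct demo_queue {
    struct demo_request slots[DEMO_QUEUE_LEN];
    size_t head, count;
};

/* Filesystem side: put a request on the queue. Returns 0 on success. */
static int demo_enqueue(struct demo_queue *q, struct demo_request rq)
{
    if (q->count == DEMO_QUEUE_LEN)
        return -1;
    q->slots[(q->head + q->count) % DEMO_QUEUE_LEN] = rq;
    q->count++;
    return 0;
}

/* Stand-in for elv_next_request(): look at the head without removing it. */
static struct demo_request *demo_next_request(struct demo_queue *q)
{
    return q->count ? &q->slots[q->head] : NULL;
}

/* Stand-in for end_request(): remove the finished request from the queue. */
static void demo_end_request(struct demo_queue *q)
{
    assert(q->count > 0);
    q->head = (q->head + 1) % DEMO_QUEUE_LEN;
    q->count--;
}

/* Driver request function: drain the queue and return the sectors handled. */
static unsigned long demo_request_fn(struct demo_queue *q)
{
    unsigned long done = 0;
    struct demo_request *rq;

    while ((rq = demo_next_request(q)) != NULL) {
        /* A real driver would program the hardware here,
         * choosing the direction from rq->write. */
        done += rq->nr_sectors;
        demo_end_request(q);
    }
    return done;
}
```

Queuing an 8-sector read and a 16-sector write and then calling demo_request_fn() services both requests and leaves the queue empty.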

Referring to Figure 5.4, we now have the generic block device layer constructing and maintaining the request queue. The final layer in the block I/O system is the hardware (or specific) device driver. The hardware device driver uses the request queue helper routines from the generic layer to service requests from its registered request queue and send notifications when the request is complete.

The hardware device driver has intimate knowledge of the underlying hardware with regard to register locations, I/O, timing, interrupts, and DMA (discussed in the "Direct Memory Access [DMA]" section of this chapter). The complexities of a complete driver for IDE or SCSI are beyond the scope of this chapter. We offer more on hardware device drivers in Chapter 10 and a series of projects to help you produce a skeleton driver to build on.

5.2.5. Character Device Overview

Unlike the block device, the character device sends a stream of data. All serial devices are character devices. When we use the classic examples of a keyboard controller or a serial terminal as a character stream device, it is intuitively clear we cannot (nor would we want to) access the data from these devices out of order. This introduces the gray area for packetized data transmission. The Ethernet medium at the physical transmission layer is a serial device, but at the bus level, it uses DMA to transfer large chunks of data to and from memory.

As device driver writers, we can make anything happen in the hardware, but real-time practicality is the governing force keeping us from randomly accessing an audio stream or streaming data to our IDE drive. Although both sound like attractive challenges, we still have two simple rules we must follow:

All Linux device I/O is based on files.
All Linux device I/O is either character or block.

The parallel port driver at the end of this chapter is a character device driver. What character and block drivers have in common is the file I/O-based interface. Externally, both types use file operations such as open, close, read, and write. Internally, the most obvious difference between a character device driver and a block device driver is that the character device does not have the block device system of request queues for read and write operations (as previously discussed). It is often the case that for a non-buffered character device, an interrupt is asserted for each element (character) received. In contrast, a block device retrieves one or more chunks of data and then asserts an interrupt.

5.2.6. A Note on Network Devices

Network devices have attributes of both block and character devices and are often thought of as a special set of devices. Like a character device, at the physical level, data is transmitted serially. Like a block device, data is packetized and moved to and from the network controller via direct memory access (discussed in the "Direct Memory Access [DMA]" section).

Network devices need to be mentioned as I/O in this chapter, but because of their complexity, they are beyond the scope of this book.

5.2.7. Clock Devices

Clocks are I/O devices that count the hardware heartbeat of the system. Without the concept of elapsed time, Linux would cease to function. Chapter 7, "Scheduling and Kernel Synchronization," covers the system and real-time clocks.

5.2.8. Terminal Devices

The earliest terminals were teletype machines (hence the name tty for the serial port driver). The teletypewriter had been in development since the turn of the century with the desire to send and read real text over telegraph lines. By the early 1960s, the teletype had matured with the early RS-232 standard, and it seemed to be a match for the growing number of the day's minicomputers. For communicating with computers, the teletype gave way to the terminal of the 1970s. True terminals are becoming a rare breed. Popular with mainframe and minicomputers in the 1970s to the mid 1980s, they have given way to PCs running terminal-emulator software packages. The terminal itself (often called a "dumb" terminal) was simply a monitor and keyboard device that communicated with a mainframe by using serial communications. Unlike the PC, it had only enough "smarts" to send and receive text data.

The main console (configurable at boot time) is the first terminal to come up on a Linux system. Often, a graphical interface is launched, and terminal emulator windows are used thereafter.

5.2.9. Direct Memory Access (DMA)

The DMA controller is a hardware device that is situated between an I/O device and (usually) the high-performance bus in the system. The purpose of the DMA controller is to move large amounts of data without processor intervention. The DMA controller can be thought of as a dedicated processor programmed to move blocks of data to and from main memory. At the register level, the DMA controller takes a source and destination address and length to complete its task. Then, while the main processor is idle, it can send a burst of data from a device to memory, or from memory to memory or from memory to a device.

Many controllers (disk, network, and graphics) have a DMA engine built-in and can therefore transfer large amounts of data without using precious processor cycles.
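The register-level view described above (source, destination, and length) can be sketched as a small user-space model. The register layout and the demo_ names are hypothetical; a real controller's registers come from its datasheet, and memcpy() merely stands in for the hardware burst.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical register set the driver programs before a transfer. */
struct demo_dma_regs {
    const unsigned char *src;  /* source address */
    unsigned char *dst;        /* destination address */
    size_t len;                /* number of bytes to move */
    int done;                  /* set by the "controller" on completion */
};

/* Driver side: load the source, destination, and length. */
static void demo_dma_setup(struct demo_dma_regs *r,
                           const void *src, void *dst, size_t len)
{
    assert(src != NULL && dst != NULL);
    r->src = src;
    r->dst = dst;
    r->len = len;
    r->done = 0;
}

/* "Hardware" side: move the block and flag completion.
 * The two buffers must not overlap. */
static void demo_dma_run(struct demo_dma_regs *r)
{
    memcpy(r->dst, r->src, r->len);
    r->done = 1;
}
```

In a real system the completion flag would typically be delivered as an interrupt, and the processor would be free to run other code between setup and completion.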

Project: Building a Parallel Port Driver

This project introduces a basic parallel port controller, which demonstrates how the I/O routines previously discussed coalesce. The parallel port, usually integrated into the Superio section of a chipset, is a good example for a character device-driver skeleton. This driver, or dynamically loaded module, is not extremely useful, but you can build upon and improve it. Because we address the device at the register level, this module can be used in a PowerPC system for accessing I/O as long as the register I/O mapping is documented.

Our parallel port device driver uses the standard open(), close(), and most importantly, the ioctl() interface to illustrate the architecture and inner workings of the device driver. We won't be using the read() or write() functions in this project because the ioctl() call returns register values. (Because our device driver is a dynamically loadable module, we simply refer to it as a module.)

We begin with a brief description on how to talk to the parallel port and then proceed to investigate our basic character device-driver module operations. We use the ioctl() interface to reference the individual registers in the device, and create an application to interface with our module.

Parallel Port Hardware

Any Web search of the parallel port yields a massive amount of information. Because our goal for this section is to describe a Linux module, we touch only on the basics of this device.

For this project, we use an x86 system for the experiment. This driver skeleton is easily ported to PowerPC; it just needs to talk to another device at the I/O level. Although the parallel port exists in many embedded PowerPC implementations, it is not widely used in desktops (such as the G4 and G5).

For the actual communication with the parallel port registers, we use inb() and outb(). We could have just as easily used readb() and writeb(), which are available in the file io.h for both x86 and PPC architectures. The readb() and writeb() macros are a good choice for architecture independence because they each resolve to the low-level I/O routines that are used for x86 and PPC.

The parallel port in x86 systems is usually included as a part of the Superio device or it could be a separate (PCI) card added to the system. If you go to your BIOS setup screen, you can see where the parallel port(s) is mapped in the system I/O space. For x86 systems, the parallel port will be at hex address 0x278, 0x378, or 0x3bc using IRQ 7. This is the base address of the device. The parallel port has three 8-bit registers, starting at the base address shown in Table 5.2. For this example, we use a base address of 0x378.

Table 5.2. Parallel Port Registers

Register                   Bit 7    Bit 6  Bit 5      Bit 4   Bit 3      Bit 2  Bit 1         Bit 0      I/O Port Address
Data register (output)     D7       D6     D5         D4      D3         D2     D1            D0         0x378 (base+0)
Status register (input)    Busy[*]  ACK    Paper end  Select  Error      -      -             -          0x379 (base+1)
Control register (output)  -        -      -          -       Select[*]  Init   Auto feed[*]  Strobe[*]  0x37A (base+2)

[*] Active low

The data register contains the 8 bits to write out to the pins on the connector.

The status register contains the input signals from the connector.

The control register sends specific control signals to the connector.
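As a sketch of how a program might interpret a byte read from the status register, the following helper splits it into the five signals of Table 5.2. The bit masks follow the table; the demo_ names are ours, and the only subtlety is that Busy is active low, so a cleared bit 7 means the signal is asserted.

```c
#include <assert.h>

#define DEMO_ST_BUSY_N    0x80  /* bit 7: Busy, active low */
#define DEMO_ST_ACK       0x40  /* bit 6: Acknowledge */
#define DEMO_ST_PAPER_END 0x20  /* bit 5: Paper end */
#define DEMO_ST_SELECT    0x10  /* bit 4: Select */
#define DEMO_ST_ERROR     0x08  /* bit 3: Error */

struct demo_status {
    int busy, ack, paper_end, select, error;
};

/* Decode one byte read from the status register (base+1). */
static struct demo_status demo_decode_status(unsigned char reg)
{
    struct demo_status s;

    s.busy      = !(reg & DEMO_ST_BUSY_N);     /* inverted: 0 means busy */
    s.ack       = (reg & DEMO_ST_ACK) != 0;
    s.paper_end = (reg & DEMO_ST_PAPER_END) != 0;
    s.select    = (reg & DEMO_ST_SELECT) != 0;
    s.error     = (reg & DEMO_ST_ERROR) != 0;
    return s;
}
```

A status byte of 0x40, for instance, decodes as both Busy and Acknowledge asserted, because bit 7 is clear and bit 6 is set.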

The connector for the parallel port is a 25-pin D-shell (DB-25). Table 5.3 shows how these signals map to the specific pins of the connector.

Table 5.3. Association of Signals to Pins of the Parallel Connector

Signal Name    Pin Number
Strobe         1
D0             2
D1             3
D2             4
D3             5
D4             6
D5             7
D6             8
D7             9
Acknowledge    10
Busy           11
Paper end      12
Select in      13
Auto feed      14
Error          15
Initialize     16
Select         17
Ground         18-25

CAUTION!

The parallel port can be sensitive to static electricity and overcurrent. Do not use your integrated (built in to the motherboard) parallel port unless

You are certain of your hardware skills.
You have no problem destroying your port or, worse, your motherboard.

We strongly suggest that you use a parallel-port adapter card for these, and all, experiments.

For input operations, we will jumper D7 (pin 9) to Acknowledge (pin 10) and D6 (pin 8) to Busy (pin 11) with 470 ohm resistors. To monitor output, we drive LEDs with data pins D0 through D4 by using a 470 ohm current limiting resistor. We can do this by using an old printer cable or a 25-pin male D-Shell connector from a local electronics store.

NOTE

A good register-level programmer should always know as much about the underlying hardware as possible. This includes finding the datasheet for your particular parallel port I/O device. In the datasheet, you can find the sink/source current limitations for your device. Many Web sites feature interface methods to the parallel port, including isolation, expanding the number of signals, and pull-up and pull-down resistors. They are a must read for any I/O controller work beyond the scope of this example.

This module addresses the parallel port by way of the outb() and inb() functions. Recall from Chapter 2, "Exploration Toolkit," that, depending on the platform compilation, these functions correctly implement the in and out instructions for x86 and the lbz and stb instructions for the memory-mapped I/O of the PowerPC. This inline code can be found in the /io.h file under the appropriate platform.

Parallel Port Software

The following discussion focuses on the pertinent driver functions for this project. The complete program listing for parll.c, along with Make and parll.h files, is included at the end of this book.

1. Setting Up the File Operations (fops)

As previously mentioned, this module uses open(), close(), and ioctl(), as well as the init and cleanup operations discussed in previous projects.

The first step is to set up our file operations structure. This structure defined in /linux/fs.h lists the possible functions we can choose to implement in our module. We do not have to itemize each operation, only the ones we want. A Web search of C99 and linux module furnishes more information on this methodology. By using this structure, we inform the kernel of the location of our implementation (or entry points) of open, release, and ioctl.

 -------------------------------------------------------------------------
 parll.c
 struct file_operations parlport_fops = { 
      .open =   parlport_open,
      .ioctl =  parlport_ioctl,
      .release =  parlport_close };
 
 -------------------------------------------------------------------------

Next, we create the functions open() and close(). These are essentially dummy functions used to flag when we have opened and closed:

 -------------------------------------------------------------------------
 parll.c
 
 static int parlport_open(struct inode *ino, struct file *filp)
 {
   printk("\n parlport open function");
   return 0;
 }
 
 static int parlport_close(struct inode *ino, struct file *filp)
 {
   printk("\n parlport close function");
   return 0;
 }
 
 -------------------------------------------------------------------------

Create the ioctl() function. Note the following declarations were made at the beginning of parll.c:

 -------------------------------------------------------------------------
 #define MODULE_NAME  "parll"
 static int base = 0x378;
 
 
 parll.c
 static int parlport_ioctl(struct inode *ino, struct file *filp,
      unsigned int ioctl_cmd, unsigned long parm)
 {
   printk("\n parlport ioctl function");
   if(_IOC_TYPE(ioctl_cmd) != IOCTL_TYPE)
   {
    printk("\n%s wrong ioctl type",MODULE_NAME);
    return -1;
   }
   switch(ioctl_cmd)
   {
    case DATA_OUT:
     printk("\n%s ioctl data out=%x",MODULE_NAME,(unsigned int)parm);
     outb(parm & 0xff, base+0);
     return (parm & 0xff);
 
    case GET_STATUS:
     parm = inb(base+1);
     printk("\n%s ioctl get status=%x",MODULE_NAME,(unsigned int)parm);
     return parm;
 
    case CTRL_OUT:
     printk("\n%s ioctl ctrl out=%x",MODULE_NAME,(unsigned int)parm);
     outb(parm & 0xff, base+2);  /* bitwise AND keeps the low 8 bits */
     return 0;
 
   }  //end switch
   return 0;
 } //end ioctl
 
 -------------------------------------------------------------------------

The ioctl() function is made available to handle any user-defined command. In our module, we surface the three registers associated with the parallel port to the user. The DATA_OUT command sends a value to the data register, the GET_STATUS command reads from the status register, and finally, the CTRL_OUT command is available to set the control signals to the port. Although a better methodology would be to hide the device specifics behind the read() and write() routines, this module is mainly for experimentation with I/O, not data encapsulation.

The three commands just used are defined in the header file parll.h. They are created by using the IOCTL helper routines for type checking. Rather than using a bare integer to represent an IOCTL function, we use the IOCTL type-checking macro _IO(type,number), where the type is defined as p (for parallel port) and number is the actual IOCTL number used in the case statement. At the beginning of parlport_ioctl(), we check the type, which should be p. Because the application code uses the same header file as the driver, the interface will be consistent.
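Because parll.h is not reproduced in this section, the following user-space sketch shows one way such definitions could look and how the type check works. The DEMO_ names and the command numbers 1 through 3 are assumptions made for illustration; only the _IO(), _IOC_TYPE(), and _IOC_NR() macros come from the kernel headers.

```c
#include <assert.h>
#include <linux/ioctl.h>  /* _IO, _IOC_TYPE, _IOC_NR */

/* Hypothetical stand-ins for the definitions in parll.h. */
#define DEMO_IOCTL_TYPE  'p'                       /* p for parallel port */
#define DEMO_DATA_OUT    _IO(DEMO_IOCTL_TYPE, 1)
#define DEMO_GET_STATUS  _IO(DEMO_IOCTL_TYPE, 2)
#define DEMO_CTRL_OUT    _IO(DEMO_IOCTL_TYPE, 3)

/* The same kind of check the driver makes at the top of parlport_ioctl():
 * the type field encoded in the command must be ours. */
static int demo_is_parll_cmd(unsigned int cmd)
{
    return _IOC_TYPE(cmd) == DEMO_IOCTL_TYPE;
}

/* Extract the command number that the switch statement dispatches on. */
static unsigned int demo_cmd_number(unsigned int cmd)
{
    assert(demo_is_parll_cmd(cmd));
    return _IOC_NR(cmd);
}
```

A command built with a different type letter, such as _IO('q', 1), fails the check even though its number matches, which is the point of encoding the type in the command.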

2. Setting Up the Module Initialization Routine

The initialization routine is used to associate the module with the operating system. It can also be used for early initialization of any data structures if desired. Since the parallel port driver requires no complex data structures, we simply register the module.

 -------------------------------------------------------------------------
 parll.c
 static int parll_init(void)
 {
   int retval;
 
   retval= register_chrdev(Major, MODULE_NAME, &parlport_fops);
   if(retval < 0)
   {
    printk("\n%s: can't register",MODULE_NAME);
    return retval;
   }
   else
   {
    Major=retval;
    printk("\n%s:registered, Major=%d",MODULE_NAME,Major);
 
    if(!request_region(base,3,MODULE_NAME))  /* NULL means the range is taken */
     printk("\n%s:I/O region busy.",MODULE_NAME);
 
   }
   return 0;
 }
 -------------------------------------------------------------------------

The init_module() function is responsible for registering the module with the kernel. The register_chrdev() function takes in the requested major number (discussed in Section 5.2 and later in Chapter 10; if 0, the kernel assigns one to the module). Recall that the major number is kept in the inode structure, which is pointed to by the dentry structure, which is pointed to by a file struct. The second parameter is the name of the device as it will appear in /proc/devices. The third parameter is the file operations structure that was just shown.

Upon successfully registering, our init routine calls request_region() with the base address of the parallel port and the length (in bytes) of the range of registers we are interested in.

The init_module() function returns a negative number upon failure.

3. Setting Up the Module Cleanup Routine

The cleanup_module() function is responsible for unregistering the module and releasing the I/O range that we requested earlier:

 -------------------------------------------------------------------------
 parll.c
  
 static void parll_cleanup( void )
 {
    printk("\n%s:cleanup ",MODULE_NAME);
    release_region(base,3);
    unregister_chrdev(Major,MODULE_NAME);
 }
 -------------------------------------------------------------------------

Finally, we include the required init and cleanup entry points.

 -----------------------------------------------------------------------
 parll.c
 module_init(parll_init);
 module_exit(parll_cleanup);
 -----------------------------------------------------------------------

4. Inserting the Module

We can now insert our module into the kernel, as in the previous projects, by using

 Lkp:~# insmod parll.ko

Looking at /var/log/messages shows us our init() routine output as before, but make specific note of the major number returned.

In previous projects, we simply inserted and removed our module from the kernel. We now need to associate our module with the filesystem with the mknod command. From the command line, enter the following:

 Lkp:~# mknod /dev/parll c <XXX> 0

The parameters:

c. Create a character special file (as opposed to block)
/dev/parll. The path to our device (for the open call)
XXX. The major number returned at init time (from /var/log/messages)
0. The minor number of our device (not used in this example)

For example, if you saw a major number of 254 in /var/log/messages, the command would look like this:

   Lkp:~# mknod /dev/parll c 254 0

5. Application Code

Here, we created a simple application that opens our module and starts a binary count on the D0 through D7 output pins.

Compile this code with gcc app.c. The executable output defaults to a.out:

 -------------------------------------------------------------------------
 app.c
 000  //application to use parallel port driver
 #include <stdio.h>
 #include <fcntl.h>
 #include <sys/ioctl.h>
 004  #include "parll.h"
 #include <stdlib.h>
 #include <unistd.h>
 int main(void)
 {
   int fptr, i, retval, parm = 0;
 
   printf("\nopening driver now");
 012   if((fptr = open("/dev/parll",O_WRONLY))<0)
   {
    printf("\nopen failed, returned=%d",fptr);
    exit(1);
   }
 
 
 018   for(i=0;i<0xff;i++)
   {
 020    system("sleep .2");
 021    retval=ioctl(fptr,DATA_OUT,parm);
 022    retval=ioctl(fptr,GET_STATUS,parm);
 
 024    if(!(retval & 0x80))
     printf("\nBusy signal count=%x",parm);
    if(retval & 0x40)
 027     printf("\nAck signal count=%x",parm);
 028  //   if(retval & 0x20)
 //    printf("\nPaper end signal count=%x",parm);
 //   if(retval & 0x10)
 //    printf("\nSelect signal count=%x",parm);
 //   if(retval & 0x08) 
 033  //    printf("\nError signal count=%x",parm);
 
    parm++;
   }
 
 038   close(fptr);
 
 }
 -------------------------------------------------------------------------

Line 4

The header file common to both the application and the driver contains the new IOCTL helper macros for type checking.

Line 12

Open the driver to get a file handle for our module.

Line 18

Enter the loop.

Line 20

Slow down the loop so we can watch the lights/count.

Line 21

Using the file descriptor, send a DATA_OUT command to the module, which in turn uses outb() to write the least significant 8 bits of the parameter to the data port.

Line 22

Read the status byte by way of the ioctl with a GET_STATUS command. This uses inb() and returns the value.

Lines 24-27

Watch for our particular bits of interest. Note that Busy* is an active low signal, so when the I/O is off, we read this as true.

Lines 28-33

Uncomment these as you improve on the design.

Line 38

Close our module.

If you have built the connector as outlined in Figure 5.5, the busy and ack signals come on when the two most significant bits of the count are on. The application code reads these bits and outputs accordingly.
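The expected behavior can be checked on paper with a small model of the loopback wiring (D7 to Acknowledge, D6 to Busy). This is a sketch under two assumptions: the status register reports Busy inverted in bit 7 and Acknowledge directly in bit 6, and the remaining status inputs are ignored. The demo_ names are hypothetical.

```c
#include <assert.h>

/* Predict the status bits of interest for a given count on D0-D7. */
static unsigned char demo_status_for_count(unsigned char count)
{
    unsigned char status = 0;

    if (count & 0x80)        /* D7 high drives the Acknowledge pin */
        status |= 0x40;
    if (!(count & 0x40))     /* D6 low leaves Busy low, which reads back as 1 */
        status |= 0x80;
    return status;
}

/* The two tests the application applies to the GET_STATUS result. */
static int demo_app_sees_busy(unsigned char status)
{
    return !(status & 0x80);
}

static int demo_app_sees_ack(unsigned char status)
{
    return (status & 0x40) != 0;
}

/* True when the application would print both messages for this count. */
static int demo_both_reported(unsigned char count)
{
    unsigned char status = demo_status_for_count(count);

    assert((status & 0x3f) == 0);  /* the model sets only bits 7 and 6 */
    return demo_app_sees_busy(status) && demo_app_sees_ack(status);
}
```

Under this model, both messages appear only for counts of 0xC0 and above, where D7 and D6 are both set.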

Figure 5.5. Built Connector

We just outlined the major elements for a character device driver. By knowing these functions, it is easier to trace through working code or create your own driver. Adding an interrupt handler to this module involves a call to request_irq() and passing in the desired IRQ and the name of the handler. This would be included in the init_module().

Here are some suggested additions to the driver:

- Make the parallel port module service timer interrupts to poll input.
- How can we multiplex 8 bits of I/O into 16, 32, 64? What is sacrificed?
- Send a character out the serial port from the write routine within the module.
- Add an interrupt routine by using the ack signal.
