
Processes: The Execution Model

The term process, defined here as the basic notion of a program in execution, is one of the most important concepts for understanding how an operating system works. It is essential to understand the difference between a program and a process. A program is an executable file containing a set of functions, whereas a process is a running instance of that program. The operating system manages resources according to the requests made by processes.

A process passes through several states. It has a lifecycle: after it is created, it goes through a series of states and then dies:

Figure 3.1. Process Lifecycle


A process is created with the fork() system call. The "forked" process becomes a child of the process that spawned it. The child and parent processes then continue to run in parallel. The parent process may go on spawning further children. This is how a hierarchical chain of processes arises.

After a process is created, the kernel initializes all the structures it needs so that the processor can execute it. This state is called the ready state. The process then becomes a running process, at which point:

  • the scheduler may return it to the waiting state

  • it may be interrupted and placed in a waiting or blocked state

  • it becomes a zombie or dies via the exit() system call.

The scheduler determines which process is executed by the processor at any given moment.

Every program includes a code segment holding the instructions for the processor. There is a data segment containing the program's variables, a stack segment used for function calls, and a so-called heap, which holds dynamically allocated memory.

We create an instance of the program using Bash, which acts as its parent.

We use the words process and task to mean the same thing. When we refer to a running process, we mean the process currently being executed by the CPU.

User Mode - Kernel Mode

What does it mean when we say that a program runs in user mode or in kernel mode? During its lifetime, a process executes either its own code or kernel code. By kernel code we mean the times when the process makes system calls or when interrupts occur. Everything else is user mode.
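
To make the distinction concrete, here is a small user-space sketch (not from the original text) that uses the standard times() call to show how much CPU time a process has spent in each mode; the busy loop accumulates user time, while the repeated getppid() system calls accumulate kernel (system) time:

 #include <stdio.h>
 #include <unistd.h>
 #include <sys/times.h>

 int main(void)
 {
     struct tms t;
     long ticks = sysconf(_SC_CLK_TCK);    /* clock ticks per second */
     volatile long spin;
     int i;

     /* Burn CPU in user mode... */
     for (spin = 0; spin < 100000000; spin++)
         ;
     /* ...then enter kernel mode repeatedly through a system call. */
     for (i = 0; i < 1000000; i++)
         getppid();

     times(&t);
     printf("user time:   %.2f s\n", (double)t.tms_utime / ticks);
     printf("kernel time: %.2f s\n", (double)t.tms_stime / ticks);
     return 0;
 }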


An Example Program

Consider an example program called create_process. This program illustrates the different phases of a process's state, the system calls involved, and the manipulation of the kernel objects that support the execution of processes.

 
 -----------------------------------------------------------------------
 create_process.c
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/wait.h>
 #include <fcntl.h>

 int main(int argc, char *argv[])
 {
   int fd;
   int pid;
   int status;
   char *envp[] = { NULL };        /* empty environment for execle() */

   /* First child: replace its image with /bin/ls. */
   pid = fork();
   if (pid == 0)
   {
      execle("/bin/ls", "ls", (char *)NULL, envp);
      exit(2);                     /* reached only if the exec fails */
   }

   if (waitpid(pid, &status, 0) < 0)
      printf("wait error\n");

   /* Second child: open a file, close it, and fall through to exit. */
   pid = fork();
   if (pid == 0){
      fd = open("Chapter_03.txt", O_RDONLY);
      close(fd);
   }

   if (waitpid(pid, &status, 0) < 0)
      printf("wait error\n");

   exit(0);
 }
 --------------------------------------------------------------------
 
 
 


The Process Descriptor

In the kernel, the process descriptor is a structure called task_struct, which holds a process's attributes and the information about it. A process interacts with many parts of the kernel, such as memory management and the scheduler. The kernel keeps the task_struct of every running process in a linked list called task_list. There is also a reference to the currently running process and its task_struct, called current.

A process may have one or more threads. Each thread has its own task_struct containing that thread's unique ID. Threads live within the address space of the process.
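
As a quick illustration (not part of the original text), the sketch below creates two POSIX threads; both report the same PID but different thread IDs (obtained here through the raw gettid system call), and both can touch the same global variable because they share one address space. Compile with -pthread.

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <unistd.h>
 #include <pthread.h>
 #include <sys/syscall.h>

 static int shared = 0;          /* one copy, visible to every thread */

 static void *worker(void *arg)
 {
     (void)arg;
     shared++;                   /* unsynchronized: for illustration only */
     printf("pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
     return NULL;
 }

 int main(void)
 {
     pthread_t a, b;

     pthread_create(&a, NULL, worker, NULL);
     pthread_create(&b, NULL, worker, NULL);
     pthread_join(a, NULL);
     pthread_join(b, NULL);
     printf("main: pid=%d tid=%ld shared=%d\n",
            getpid(), (long)syscall(SYS_gettid), shared);
     return 0;
 }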

The most important categories of the process descriptor are:

  • Process attributes

  • Process relationships

  • Process address space

  • File management

  • Signal management

  • Process credentials

  • Resource limits

  • Scheduling-related information

Let's look at the task_struct itself. It is defined in the file include/linux/sched.h:

 
 -----------------------------------------------------------------------
 include/linux/sched.h
 384   struct task_struct {
 385     volatile long state;
 386     struct thread_info *thread_info;
 387     atomic_t usage;
 388     unsigned long flags;  
 389     unsigned long ptrace;
 390
 391     int lock_depth;
 392
 393     int prio, static_prio;
 394     struct list_head run_list;
 395     prio_array_t *array;
 396
 397     unsigned long sleep_avg;
 398     long interactive_credit;
 399     unsigned long long timestamp;
 400     int activated;
 401
 402     unsigned long policy;
 403     cpumask_t cpus_allowed;
 404     unsigned int time_slice, first_time_slice;
 405
 406     struct list_head tasks;
 407     struct list_head ptrace_children;
 408     struct list_head ptrace_list;
 409
 410     struct mm_struct *mm, *active_mm;
 ...
 413     struct linux_binfmt *binfmt;
 414     int exit_code, exit_signal;
 415     int pdeath_signal;
 ...
 419     pid_t pid;
 420     pid_t tgid;
 ...
 426     struct task_struct *real_parent;
 427     struct task_struct *parent;
 428     struct list_head children;
 429     struct list_head sibling;
 430     struct task_struct *group_leader;
 ...
 433     struct pid_link pids[PIDTYPE_MAX];
 434
 435     wait_queue_head_t wait_chldexit;
 436     struct completion *vfork_done;
 437     int __user *set_child_tid;
 438     int __user *clear_child_tid;
 439
 440     unsigned long rt_priority;
 441     unsigned long it_real_value, it_prof_value, it_virt_value;
 442     unsigned long it_real_incr, it_prof_incr, it_virt_incr;
 443     struct timer_list real_timer;
 444     unsigned long utime, stime, cutime, cstime;
 445     unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 446     u64 start_time;
 ...
 450     uid_t uid,euid,suid,fsuid;
 451     gid_t gid,egid,sgid,fsgid;
 452     struct group_info *group_info;
 453     kernel_cap_t cap_effective, cap_inheritable, cap_permitted;
 454     int keep_capabilities:1;
 455     struct user_struct *user;
 ...
 457     struct rlimit rlim[RLIM_NLIMITS];
 458     unsigned short used_math;
 459     char comm[16];
 ...
 461     int link_count, total_link_count;
 ...
 467     struct fs_struct *fs;
 ...
 469     struct files_struct *files;
 ...
 509     unsigned long ptrace_message;
 510     siginfo_t *last_siginfo;
 ...
 516   };
 -----------------------------------------------------------------------
 
 

Атрибуты процесса

The process attribute category is a catch-all category we defined for task characteristics related to the state and identification of a task. Examining these fields' values at any time gives the kernel hacker an idea of the current status of a process. Figure 3.2 illustrates the process attribute-related fields of the task_struct.

Figure 3.2. Process Attribute-Related Fields


3.2.1.1. state

The state field keeps track of the state a process finds itself in during its execution lifecycle. Possible values it can hold are TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_ZOMBIE, TASK_STOPPED, and TASK_DEAD (see the "Process Lifespan" section in this chapter for more detail).

3.2.1.2. pid

In Linux, each process has a unique process identifier (pid). This pid is stored in the task_struct as a type pid_t. Although this type can be traced back to an integer type, the default maximum value of a pid is 32,768 (the value pertaining to a short int).

3.2.1.3. flags

Flags define special attributes that belong to the task. Per process flags are defined in include/linux/sched.h and include those flags listed in Table 3.1. The flag's value provides the kernel hacker with more information regarding what the task is undergoing.

Table 3.1. Selected task_struct Flags Field Values

Flag Name        When Set
PF_STARTING      Set when the process is being created.
PF_EXITING       Set during the call to do_exit().
PF_DEAD          Set during the call to exit_notify() in the process of exiting.
                 At this point, the state of the process is either TASK_ZOMBIE
                 or TASK_DEAD.
PF_FORKNOEXEC    The parent sets this flag upon forking.


3.2.1.4. binfmt

Linux supports a number of executable formats. An executable format is what defines the structure of how your program code is to be loaded into memory. Figure 3.2 illustrates the association between the task_struct and the linux_binfmt struct, the structure that contains all the information related to a particular binary format (see Chapter 9 for more detail).

3.2.1.5. exit_code and exit_signal

The exit_code and exit_signal fields hold a task's exit value and the terminating signal (if one was used). This is the way a child's exit value is passed to its parent.

3.2.1.6. pdeath_signal

pdeath_signal is a signal sent upon the parent's death.

3.2.1.7. comm

A process is often created by means of a command-line call to an executable. The comm field holds the name of the executable as it is called on the command line.

3.2.1.8. ptrace

ptrace is set when the ptrace() system call is called on the process, for example by a debugger or a tracing tool. Possible ptrace() flags are defined in include/linux/ptrace.h.

3.2.2. Scheduling Related Fields

A process operates as though it has its own virtual CPU. However, in reality, it shares the CPU with other processes. To sustain the switching between process executions, each process closely interrelates with the scheduler (see Chapter 7 for more detail).

However, to understand some of these fields, you need to understand a few basic scheduling concepts. When more than one process is ready to run, the scheduler decides which one runs first and for how long. The scheduler achieves fairness and efficiency by allotting each process a timeslice and a priority. The timeslice defines the amount of time the process is allowed to execute before it is switched off for another process. The priority of a process is a value that defines the relative order in which it will be allowed to be executed with respect to other waiting processes; the higher the priority, the sooner it is scheduled to run. The fields shown in Figure 3.3 keep track of the values necessary for scheduling purposes.

Figure 3.3. Scheduling Related Fields


3.2.2.1. prio

In Chapter 7, we see that the dynamic priority of a process is a value that depends on the process's scheduling history and the specified nice value. (See the following sidebar for more information about nice values.) It is updated at sleep time (when the process is not being executed) and when the process uses up its timeslice. This value, prio, is related to the value of the static_prio field described next. The prio field holds +/- 5 of the value of static_prio, depending on the process's history; it will get a +5 bonus if it has slept a lot and a -5 handicap if it has been a processing hog and used up its timeslice.

3.2.2.2. static_prio

static_prio is equivalent to the nice value. The default value of static_prio is MAX_PRIO-20. In our kernel, MAX_PRIO defaults to 140.

Nice

The nice() system call allows a user to modify the static scheduling priority of a process. The nice value can range from -20 to 19. The nice() function then calls set_user_nice() to set the static_prio field of the task_struct. The static_prio value is computed from the nice value by way of the NICE_TO_PRIO macro. Likewise, the nice value is computed from the static_prio value by means of a call to PRIO_TO_NICE.

 
 ---------------------------------------kernel/sched.c
 #define NICE_TO_PRIO(nice) (MAX_RT_PRIO + nice + 20)
 #define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
 -----------------------------------------------------
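
A quick arithmetic sketch (assuming MAX_RT_PRIO is 100, its usual value in this kernel) shows how the nice range maps onto static_prio; with the default nice of 0 this yields the default static_prio of 120 mentioned above:

 #include <stdio.h>

 /* User-space mirror of the kernel macros, assuming MAX_RT_PRIO == 100. */
 #define MAX_RT_PRIO 100
 #define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
 #define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)

 int main(void)
 {
     printf("nice -20 -> static_prio %d\n", NICE_TO_PRIO(-20));  /* 100 */
     printf("nice   0 -> static_prio %d\n", NICE_TO_PRIO(0));    /* 120 */
     printf("nice  19 -> static_prio %d\n", NICE_TO_PRIO(19));   /* 139 */
     printf("static_prio 120 -> nice %d\n", PRIO_TO_NICE(120));  /*   0 */
     return 0;
 }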
 
 


3.2.2.3. run_list

The run_list field points to the runqueue. A runqueue holds a list of all the processes to run. See the "Basic Structure" section for more information on the runqueue struct.

3.2.2.4. array

The array field points to the priority array of a runqueue. The "Keeping Track of Processes: Basic Scheduler Construction" section in this chapter explains this array in detail.

3.2.2.5. sleep_avg

The sleep_avg field holds the average amount of clock ticks the task has spent sleeping; it is used to calculate the effective priority of the task.

3.2.2.6. timestamp

The timestamp field is used in calculating sleep_avg when a task sleeps or yields.

3.2.2.7. interactive_credit

The interactive_credit field is used along with the sleep_avg and activated fields to calculate sleep_avg.

3.2.2.8. policy

The policy determines the type of process (for example, time sharing or real time). The type of a process heavily influences the priority scheduling. For more information on this field, see Chapter 7.

3.2.2.9. cpus_allowed

The cpus_allowed field specifies which CPUs might handle a task. This is one way in which we can specify which CPU a particular task can run on when in a multiprocessor system.
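
From user space, the usual way to influence cpus_allowed is the sched_setaffinity() system call; the following sketch (an illustration of typical usage, not text from the book) binds the calling task to CPU 0:

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <sched.h>

 int main(void)
 {
     cpu_set_t mask;

     CPU_ZERO(&mask);
     CPU_SET(0, &mask);                   /* allow this task to run on CPU 0 only */

     /* pid 0 means "the calling task"; the kernel updates its cpus_allowed. */
     if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
         perror("sched_setaffinity");
         return 1;
     }

     if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
         printf("bound to CPU 0: %s\n", CPU_ISSET(0, &mask) ? "yes" : "no");
     return 0;
 }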

3.2.2.10. time_slice

The time_slice field defines the maximum amount of time the task is allowed to run.

3.2.2.11. first_time_slice

The first_time_slice field is repeatedly set to 0 and keeps track of the scheduling time.

3.2.2.12. activated

The activated field keeps track of the incrementing and decrementing of sleep averages. If an uninterruptible task gets woken, this field gets set to -1.

3.2.2.13. rt_priority

rt_priority is a static value that can only be updated through schedule(). This value is necessary to support real-time tasks.

3.2.2.14. nivcsw and nvcsw

Different kinds of context switches exist. The kernel keeps track of these for profiling reasons. A global switch count gets set to one of the four different context switch counts, depending on the kind of transition involved in the context switch (see Chapter 7 for more information on context switch). These are the counters for the basic context switch:

  • The nivcsw field (number of involuntary context switches) keeps count of kernel preemptions applied on the task. It gets incremented only upon a task's return from a kernel preemption where the switch count is set to nivcsw.

  • The nvcsw field (number of voluntary context switches) keeps count of context switches that are not based on kernel preemption. The switch count gets set to nvcsw if the previous state was not an active preemption.

3.2.3. Process Relations-Related Fields

The following fields of the task_struct are those related to process relationships. Each task or process p has a parent that created it. Process p can also create processes and, therefore, might have children. Because p's parent could have created more than one process, it is possible that process p might have siblings. Figure 3.4 illustrates how the task_structs of all these processes relate.

Figure 3.4. Process Relations-Related Fields


3.2.3.1. real_parent

real_parent points to the descriptor of the current process's parent. It will point to the process descriptor of init() if the original parent of our current process has been destroyed. In previous kernels, this was known as p_opptr.

3.2.3.2. parent

parent is a pointer to the descriptor of the parent process. In Figure 3.4, we see that this points to the ptrace task_struct. When ptrace is run on a process, the parent field of task_struct points to the ptrace process.

3.2.3.3. children

children is the struct that points to the list of our current process' children.

3.2.3.4. sibling

sibling is the struct that points to the list of the current process' siblings.

3.2.3.5. group_leader

A process can be a member of a group of processes, and each group has one process defined as the group leader. If our process is a member of a group, group_leader is a pointer to the descriptor of the leader of that group. A group leader generally owns the tty from which the process was created, called the controlling terminal.

3.2.4. Process Credentials-Related Fields

In multiuser systems, it is necessary to distinguish among processes that are created by different users. This is necessary for the security and protection of user data. To this end, each process has credentials that help the system determine what it can and cannot access. Figure 3.5 illustrates the fields in the task_struct related to process credentials.

Figure 3.5. Process Credentials-Related Fields


3.2.4.1. uid and gid

The uid field holds the user ID number of the user who created the process. This field is used for protection and security purposes. Likewise, the gid field holds the group ID of the group who owns the process. A uid or gid of 0 corresponds to the root user and group.

3.2.4.2. euid and egid

The effective user ID usually holds the same value as the user ID field. This changes if the executed program has the set UID (SUID) bit on. In this case, the effective user ID is that of the owner of the program file. Generally, this is used to allow any user to run a particular program with the same permissions as another user (for example, root). The effective group ID works in much the same way, holding a value different from the gid field only if the set group ID (SGID) bit is on.
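
A small sketch (not from the original text) makes the real/effective distinction visible; for a normal binary both pairs match, but if the file is owned by another user and has the set-UID bit (chmod u+s), the effective IDs become those of the file's owner:

 #include <stdio.h>
 #include <unistd.h>
 #include <sys/types.h>

 int main(void)
 {
     /* For a normal executable these pairs match; with the set-UID or
      * set-GID bit on, euid/egid become those of the file's owner. */
     printf("uid=%d  euid=%d\n", (int)getuid(), (int)geteuid());
     printf("gid=%d  egid=%d\n", (int)getgid(), (int)getegid());
     return 0;
 }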

3.2.4.3. suid and sgid

suid (saved user ID) and sgid (saved group ID) are used in the setuid() system calls.

3.2.4.4. fsuid and fsgid

The fsuid and fsgid values are checked specifically for filesystem checks. They generally hold the same values as uid and gid except for when a setuid() system call is made.

3.2.4.5. group_info

In Linux, a user may be part of more than one group. These groups may have varying permissions with respect to system and data accesses. For this reason, the processes need to inherit this credential. The group_info field is a pointer to a structure of type group_info, which holds all the information regarding the various groups of which the process can be a member.

The group_info structure allows a process to associate with a number of groups that is bound by available memory. In Figure 3.5, you can see that a field of group_info called small_block is an array of NGROUPS_SMALL (in our case, 32) gid_t units. If a task belongs to more than 32 groups, the kernel can allocate blocks or pages that hold the necessary number of gid_ts beyond NGROUPS_SMALL. The field nblocks holds the number of blocks allocated, while ngroups holds the value of units in the small_block array that hold a gid_t value.

3.2.5. Process Capabilities-Related Fields

Traditionally, UNIX systems offer process-related protection of certain accesses and actions by defining any given process as privileged (super user or UID = 0) or unprivileged (any other process). In Linux, capabilities were introduced to partition the activities previously available only to the superuser; that is, capabilities are individual "privileges" that may be conferred upon a process independently of each other and of its UID. In this manner, particular processes can be given permission to perform particular administrative tasks without necessarily getting all the privileges or having to be owned by the superuser. A capability is thus defined as a given administrative operation. Figure 3.6 shows the fields that are related to process capabilities.

Figure 3.6. Process Capabilities-Related Fields


3.2.5.1. cap_effective, cap_inheritable, cap_permitted, and keep_capabilities

The structure used to support the capabilities model is defined in include/linux/security.h as an unsigned 32-bit value. Each 32-bit mask corresponds to a capability set; each capability is assigned a bit in each of:

  • cap_effective. The capabilities that can be currently used by the process.

  • cap_inheritable. The capabilities that are passed through a call to execve.

  • cap_permitted. The capabilities that can be made either effective or inheritable.

    One way to understand the distinction between these three types is to consider the permitted capabilities to be similar to a trivialized gene pool made available by one's parents. Of the genetic qualities made available by one's parents, we can display a subset of them (effective qualities) and/or pass them on (inheritable). Permitted capabilities constitute more of a potentiality whereas effective capabilities are an actuality.

    Therefore, cap_effective and cap_inheritable are always subsets of cap_permitted.

  • keep_capabilities. Keeps track of whether the process will drop or maintain its capabilities on a call to setuid().

Table 3.2 lists some of the supported capabilities that are defined in include/linux/capability.h.

Table 3.2. Selected Capabilities

Capability     Description
CAP_CHOWN      Ignores the restrictions imposed by chown()
CAP_FOWNER     Ignores file-permission restrictions
CAP_FSETID     Ignores setuid and setgid restrictions on files
CAP_KILL       Ignores ruid and euids when sending signals
CAP_SETGID     Ignores group-related permissions checks
CAP_SETUID     Ignores uid-related permissions checks
CAP_SETCAP     Allows a process to set its capabilities


The kernel checks if a particular capability is set with a call to capable(), passing as a parameter the capability variable. Generally, the function checks to see whether the capability bit is set in the cap_effective set; if so, it sets current->flags to PF_SUPERPRIV, which indicates that the capability is granted. The function returns 1 if the capability is granted and 0 if it is not.

Three system calls are associated with the manipulation of capabilities: capget(), capset(), and prctl(). The first two allow a process to get and set its capabilities, while the prctl() system call allows manipulation of current->keep_capabilities.
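
One simple way to look at a task's cap_effective from user space, assuming a mounted /proc filesystem, is to read the CapEff line of /proc/self/status (on the 2.6 kernels described here it is a 32-bit mask; newer kernels print 64 bits). The sketch below checks bit 0, which corresponds to CAP_CHOWN:

 #include <stdio.h>

 int main(void)
 {
     char line[256];
     unsigned long long cap_eff = 0;
     FILE *f = fopen("/proc/self/status", "r");

     if (!f) {
         perror("fopen");
         return 1;
     }
     while (fgets(line, sizeof(line), f)) {
         if (sscanf(line, "CapEff: %llx", &cap_eff) == 1)
             break;                       /* found the effective capability mask */
     }
     fclose(f);

     printf("cap_effective = %llx\n", cap_eff);
     printf("CAP_CHOWN %s\n", (cap_eff & 1ULL) ? "granted" : "not granted");
     return 0;
 }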

3.2.6. Process Limitations-Related Fields

A task uses a number of the resources made available by hardware and the scheduler. To keep track of how they are used and any limitations that might be applied to a process, we have the following fields.

3.2.6.1. rlim

The rlim field holds an array that provides for resource control and accounting by maintaining resource limit values. Figure 3.7 illustrates the rlim field of the task_struct.

Figure 3.7. task_struct Resource Limits


Linux recognizes the need to limit the amount of certain resources that a process is allowed to use. Because the kinds and amounts of resources processes might use varies from process to process, it is necessary to keep this information on a per process basis. What better place than to keep a reference to it in the process descriptor?

The rlimit descriptor (include/linux/resource.h) has the fields rlim_cur and rlim_max, which are the current and maximum limits that apply to that resource. The limit "units" vary by the kind of resource to which the structure refers.

 
 -----------------------------------------------------------------------
 include/linux/resource.h
 struct rlimit {
    unsigned long   rlim_cur;
    unsigned long   rlim_max;
 };
 -----------------------------------------------------------------------
 
 

Table 3.3 lists the resources whose limits are defined in include/asm/resource.h; both x86 and PPC use the same list of resource limits and default values.

Table 3.3. Resource Limits Values

RL Name          Description                               Default rlim_cur  Default rlim_max
RLIMIT_CPU       The amount of CPU time in seconds this    RLIM_INFINITY     RLIM_INFINITY
                 process may run.
RLIMIT_FSIZE     The size of a file in 1KB blocks.         RLIM_INFINITY     RLIM_INFINITY
RLIMIT_DATA      The size of the heap in bytes.            RLIM_INFINITY     RLIM_INFINITY
RLIMIT_STACK     The size of the stack in bytes.           _STK_LIM          RLIM_INFINITY
RLIMIT_CORE      The size of the core dump file.           0                 RLIM_INFINITY
RLIMIT_RSS       The maximum resident set size             RLIM_INFINITY     RLIM_INFINITY
                 (real memory).
RLIMIT_NPROC     The number of processes owned by          0                 0
                 this process.
RLIMIT_NOFILE    The number of open files this process     INR_OPEN          INR_OPEN
                 may have at one time.
RLIMIT_MEMLOCK   Physical memory that can be locked        RLIM_INFINITY     RLIM_INFINITY
                 (not swapped).
RLIMIT_AS        Size of process address space in bytes.   RLIM_INFINITY     RLIM_INFINITY
RLIMIT_LOCKS     Number of file locks.                     RLIM_INFINITY     RLIM_INFINITY


When a value is set to RLIM_INFINITY, the resource is unlimited for that process.

The current limit (rlim_cur) is a soft limit that can be changed via a call to setrlimit(). The maximum limit is defined by rlim_max and cannot be exceeded by an unprivileged process. The getrlimit() system call returns the value of the resource limits. Both setrlimit() and getrlimit() take as parameters the resource name and a pointer to a structure of type rlimit.
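
A minimal sketch of the user-space side (assuming the standard <sys/resource.h> interface): it reads the RLIMIT_NOFILE limits with getrlimit() and then raises the soft limit to the hard limit with setrlimit(), which an unprivileged process is allowed to do:

 #include <stdio.h>
 #include <sys/resource.h>

 int main(void)
 {
     struct rlimit rl;

     /* Read the current and maximum number of open files (RLIMIT_NOFILE). */
     if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
         perror("getrlimit");
         return 1;
     }
     printf("soft=%lu hard=%lu\n",
            (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

     /* An unprivileged process may raise the soft limit up to the hard limit. */
     rl.rlim_cur = rl.rlim_max;
     if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
         perror("setrlimit");
     return 0;
 }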

3.2.7. Filesystem- and Address Space-Related Fields

Processes can be heavily involved with files throughout their lifecycle, performing tasks such as opening, closing, reading, and writing. The task_struct has two fields that are associated with file- and filesystem-related data: fs and files (see Chapter 6, "Filesystems," for more detail). The two fields related to address space are active_mm and mm (see Chapter 4, "Memory Management," for more detail on mm_struct). Figure 3.8 shows the filesystem- and address space-related fields of the task_struct.

Figure 3.8. Filesystem- and Address Space-Related Fields


3.2.7.1. fs

The fs field holds a pointer to filesystem information.

3.2.7.2. files

The files field holds a pointer to the file descriptor table for the task. This table holds pointers to the files (more specifically, to their descriptors) that the task has open.

3.2.7.3. mm

mm points to address-space and memory management-related information.

3.2.7.4. active_mm

active_mm is a pointer to the most recently accessed address space. Both the mm and active_mm fields start pointing at the same mm_struct.

Evaluating the process descriptor gives us an idea of the type of data that a process is involved with throughout its lifetime. Now, we can look at what happens throughout the lifespan of a process. The following sections explain the various stages and states of a process and go through the sample program line by line to explain what happens in the kernel.


3.3. Process Creation: fork(), vfork(), and clone() System Calls

After the sample code is compiled into a file (in our case, an ELF executable[2]), we call it from the command line. Look at what happens when we press the Return key. We already mentioned that any given process is created by another process. The operating system provides the functionality to do this by means of the fork(), vfork(), and clone() system calls.

[2] ELF executable is an executable format that Linux supports. Chapter 9 discusses the ELF executable format.

The C library provides three functions that issue these three system calls. The prototypes of these functions are declared in <unistd.h>. Figure 3.9 shows how a process that calls fork() executes the system call sys_fork(). This figure describes how kernel code performs the actual process creation. In a similar manner, vfork() calls sys_vfork(), and clone() calls sys_clone().

Figure 3.9. Process Creation System Calls


All three of these system calls eventually call do_fork(), which is a kernel function that performs the bulk of the actions related to process creation. You might wonder why three different functions are available to create a process. Each function slightly differs in how it creates a process, and there are specific reasons why one would be chosen over the other.

When we press Return at the shell prompt, the shell creates the new process that executes our program by means of a call to fork(). In fact, if we type the command ls at the shell and press Return, the pseudocode of the shell at that moment looks something like this:

 if( (pid = fork()) == 0 )
    execve("foo");
 else
    waitpid(pid);
 

We can now look at the functions and trace them down to the system call. Although our program calls fork(), it could just as easily have called vfork() or clone(), which is why we introduced all three functions in this section. The first function we look at is fork(). We delve through the calls fork(), sys_fork(), and do_fork(). We follow that with vfork() and finally look at clone() and trace them down to the do_fork() call.

3.3.1. fork() Function

The fork() function returns twice: once in the parent and once in the child process. If it returns in the child process, fork() returns 0. If it returns in the parent, fork() returns the child's PID. When the fork() function is called, the function places the necessary information in the appropriate registers, including the index into the system call table where the pointer to the system call resides. The processor we are running on determines the registers into which this information is placed.

At this point, if you want to continue the sequential ordering of events, look at the "Interrupts" section in this chapter to see how sys_fork() is called. However, doing so is not necessary to understand how a new process gets created.

Let's now look at the sys_fork() function. This function does little else than call the do_fork() function. Notice that the sys_fork() function is architecture dependent because it accesses function parameters passed in through the system registers.

 
 -----------------------------------------------------------------------
 arch/i386/kernel/process.c
 asmlinkage int sys_fork(struct pt_regs regs)
 {
    return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------
 
 -----------------------------------------------------------------------
 arch/ppc/kernel/process.c
 int sys_fork(int p1, int p2, int p3, int p4, int p5, int p6,
              struct pt_regs *regs)
 {
         CHECK_FULL_REGS(regs);
         return do_fork(SIGCHLD, regs->gpr[1], regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------
 
 

The two architectures take in different parameters to the system call. The structure pt_regs holds information such as the stack pointer. The fact that gpr[1] holds the stack pointer in PPC, whereas %esp [3] holds the stack pointer in x86, is known by convention.

[3] Recall that in code produced in "AT&T" format, registers are prefixed with a %.

3.3.2. vfork() Function

The vfork() function is similar to the fork() function with the exception that the parent process is blocked until the child calls exit() or exec().
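
A minimal usage sketch (an illustration, not book code): because the parent is suspended and the two share an address space until the exec or exit, the child should do nothing but exec a new image or call _exit():

 #include <stdio.h>
 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/wait.h>

 int main(void)
 {
     pid_t pid = vfork();

     if (pid == 0) {
         /* Child: the parent is blocked and shares our address space,
          * so do nothing here except exec (or _exit on failure). */
         execl("/bin/ls", "ls", (char *)NULL);
         _exit(127);
     }

     /* Parent: resumes only after the child has called execl() or _exit(). */
     waitpid(pid, NULL, 0);
     return 0;
 }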

 
 sys_vfork()
 arch/i386/kernel/process.c
 asmlinkage int sys_vfork(struct pt_regs regs)
 {
    return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------
 arch/ppc/kernel/process.c
 int sys_vfork(int p1, int p2, int p3, int p4, int p5, int p6,
               struct pt_regs *regs)
 {
         CHECK_FULL_REGS(regs);
         return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->gpr[1],
                         regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------
 
 

The only difference between the call to do_fork() in sys_vfork() and the one in sys_fork() is the flags that do_fork() is passed. The presence of these flags is used later to determine whether the added behavior just described (blocking the parent) will be executed.

3.3.3. clone() Function

The clone() library function, unlike fork() and vfork(), takes in a pointer to a function along with its argument. The child process created by do_fork() calls this function as soon as it gets created.
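
Before looking at the kernel side, here is a user-space sketch of the glibc clone() wrapper (the stack size and flag choice are illustrative assumptions): the child starts life executing child_fn() on a caller-supplied stack and, with CLONE_VM, shares the parent's address space much as a thread would:

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <sched.h>
 #include <sys/wait.h>

 #define STACK_SIZE (1024 * 1024)

 /* The function the child starts executing as soon as it is created. */
 static int child_fn(void *arg)
 {
     printf("child running, got argument: %s\n", (char *)arg);
     return 0;
 }

 int main(void)
 {
     char *stack = malloc(STACK_SIZE);
     if (!stack)
         return 1;

     /* The child shares the parent's address space (CLONE_VM) and raises
      * SIGCHLD on exit; the stack grows down, so pass its top. */
     pid_t pid = clone(child_fn, stack + STACK_SIZE,
                       CLONE_VM | SIGCHLD, (void *)"hello");
     if (pid < 0) {
         perror("clone");
         return 1;
     }
     waitpid(pid, NULL, 0);
     free(stack);
     return 0;
 }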

 -----------------------------------------------------------------------
 sys_clone()
 arch/i386/kernel/process.c
 asmlinkage int sys_clone(struct pt_regs regs)
 {
    unsigned long clone_flags;
    unsigned long newsp;
    int __user *parent_tidptr, *child_tidptr;

    clone_flags = regs.ebx;
    newsp = regs.ecx;
    parent_tidptr = (int __user *)regs.edx;
    child_tidptr = (int __user *)regs.edi;
    if (!newsp)
       newsp = regs.esp;
    return do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0,
                   parent_tidptr, child_tidptr);
 }
 -----------------------------------------------------------------------
 arch/ppc/kernel/process.c
 int sys_clone(unsigned long clone_flags, unsigned long usp,
               int __user *parent_tidp, void __user *child_threadptr,
               int __user *child_tidp, int p6, struct pt_regs *regs)
 {
         CHECK_FULL_REGS(regs);
         if (usp == 0)
                 usp = regs->gpr[1];  /* stack pointer for child */
         return do_fork(clone_flags & ~CLONE_IDLETASK, usp, regs,
                         0, parent_tidp, child_tidp);
 }
 -----------------------------------------------------------------------

As Table 3.4 shows, the only difference between fork(), vfork(), and clone() is which flags are set in the subsequent calls to do_fork().

Table 3.4. Flags Passed to do_fork() by fork(), vfork(), and clone()

Flag           fork()   vfork()   clone()
SIGCHLD        X        X
CLONE_VFORK             X
CLONE_VM                X

Finally, we get to do_fork(), which performs the real process creation. Recall that up to this point, we only have the parent executing the call to fork(), which then enables the system call sys_fork(); we still do not have a new process. Our program foo still exists as an executable file on disk. It is not running or in memory.

3.3.4. do_fork() Function

We follow the kernel side execution of do_fork() line by line as we describe the details behind the creation of a new process.

 -----------------------------------------------------------------------
 kernel/fork.c
 1167  long do_fork(unsigned long clone_flags,
 1168           unsigned long stack_start,
 1169           struct pt_regs *regs,
 1170           unsigned long stack_size,
 1171           int __user *parent_tidptr,
 1172           int __user *child_tidptr)
 1173  {
 1174   struct task_struct *p;
 1175   int trace = 0;
 1176   long pid;
 1177
 1178   if (unlikely(current->ptrace)) {
 1179    trace = fork_traceflag (clone_flags);
 1180    if (trace)
 1181      clone_flags |= CLONE_PTRACE;
 1182   }
 1183
 1184   p = copy_process(clone_flags, stack_start, regs, stack_size,
            parent_tidptr, child_tidptr);
 -----------------------------------------------------------------------

Lines 1178-1183

The code begins by verifying if the parent wants the new process ptraced. ptracing references are prevalent within functions dealing with processes. This book explains only the ptrace references at a high level. To determine whether a child can be traced, fork_traceflag() must verify the value of clone_flags. If CLONE_VFORK is set in clone_flags, if SIGCHLD is not to be caught by the parent, or if the current process also has PT_TRACE_FORK set, the child is traced, unless the CLONE_UNTRACED or CLONE_IDLETASK flags have also been set.

Line 1184

This line is where a new process is created and where the values in the registers are copied out. The copy_process() function performs the bulk of the new process space creation and descriptor field definition. However, the start of the new process does not take place until later. The details of copy_process() make more sense when the explanation is scheduler-centric. See the "Keeping Track of Processes: Basic Scheduler Construction" section in this chapter for more detail on what happens here.

 
 -----------------------------------------------------------------------
 kernel/fork.c
 ...
 1189     pid = IS_ERR(p) ? PTR_ERR(p) : p->pid;
 1190
 1191     if (!IS_ERR(p)) {
 1192        struct completion vfork;
 1193
 1194        if (clone_flags & CLONE_VFORK) {
 1195          p->vfork_done = &vfork;
 1196          init_completion(&vfork);
 1197        }
 1198
 1199        if ((p->ptrace & PT_PTRACED) || (clone_flags & CLONE_STOPPED)) {
 ...
 1203          sigaddset(&p->pending.signal, SIGSTOP);
 1204          set_tsk_thread_flag(p, TIF_SIGPENDING);
 1205        }
 ...
 -----------------------------------------------------------------------
 
 

Line 1189

This is a check for pointer errors. If we find a pointer error, we return the pointer error without further ado.

Lines 1194-1197

At this point, check if do_fork() was called from vfork(). If it was, enable the wait queue involved with vfork().

Lines 1199-1205

If the parent is being traced or the clone is set to CLONE_STOPPED, the child is issued a SIGSTOP signal upon startup, thus starting in a stopped state.

 
 -----------------------------------------------------------------------
 kernel/fork.c
 1207     if (!(clone_flags & CLONE_STOPPED)) {
 ...
 1222          wake_up_forked_process(p);
 1223     } else {
 1224        int cpu = get_cpu();
 1225
 1226        p->state = TASK_STOPPED;
 1227        if (!(clone_flags & CLONE_STOPPED))
 1228          wake_up_forked_process(p);   /* do this last */
 1229        ++total_forks;
 1230
 1231        if (unlikely (trace)) {
 1232          current->ptrace_message = pid;
 1233          ptrace_notify ((trace << 8) | SIGTRAP);
 1234        }
 1235
 1236        if (clone_flags & CLONE_VFORK) {
 1237          wait_for_completion(&vfork);
 1238          if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE)) 
 1239             ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
 1240        } else
 ...
 1248          set_need_resched();
 1249     }
 1250     return pid;
 1251   }
 -----------------------------------------------------------------------
 
 

Lines 1226-1229

In this block, we set the state of the task to TASK_STOPPED. If the CLONE_STOPPED flag was not set in clone_flags, we wake up the child process; otherwise, we leave it waiting for its wakeup signal.

Lines 1231-1234

If ptracing has been enabled on the parent, we send a notification.

Lines 1236-1239

If this was originally a call to vfork(), this is where we set the parent to blocking and send a notification to the trace if enabled. This is implemented by the parent being placed in a wait queue and remaining there in a TASK_UNINTERRUPTIBLE state until the child calls exit() or execve().

Line 1248

We set need_resched in the current task (the parent). This allows the child process to run first.


3.4. Process Lifespan

Now that we have seen how a process is created, we need to look at what happens during the course of its lifespan. During this time, a process can find itself in various states. The transition between these states depends on the actions that the process performs and on the nature of the signals sent to it. Our example program has found itself in the TASK_INTERRUPTIBLE state and in TASK_RUNNING (its current state).

The first state a process is set to is TASK_INTERRUPTIBLE. This occurs during process creation in the copy_process() routine called by do_fork(). The second state a process finds itself in is TASK_RUNNING, which is set prior to exiting do_fork(). These two states are guaranteed in the life of the process. Following those two states, many variables come into play that determine what states the process will find itself in. The last state a process is set to is TASK_ZOMBIE, during the call to do_exit(). Let's look at the various process states and the manner in which the transitions from one state to the next occur. We point out how our process proceeds from one state to another.

3.4.1. Process States

When a process is running, it means that its context has been loaded into the CPU registers and memory and that the program that defines this context is being executed. At any particular time, a process might be unable to run for a number of reasons. A process might be unable to continue running because it is waiting for input that is not present, or the scheduler may have decided it has run for the maximum amount of time units allotted and that it must yield to another process. A process is considered ready when it is not running but is able to run (as after being switched out by the scheduler), and blocked when it is waiting for input.

Figure 3.10 shows the abstract process states and, beneath each, the possible Linux task states that correspond to it. Table 3.5 outlines the four transitions and how each is brought about. Table 3.6 associates the abstract states with the values used in the Linux kernel to identify those states.

Figure 3.10. Process State Transition


Table 3.5. Summary of Transitions

Transition                Agent of Transition
Ready to Running (A)      Selected by scheduler
Running to Ready (B)      Timeslice ends (inactive)
                          Process yields (active)
Blocked to Ready (C)      Signal comes in
                          Resource becomes available
Running to Blocked (D)    Process sleeps or waits on something


Table 3.6. Association of Linux Flags with Abstract Process States

Abstract State    Linux Task States
Ready             TASK_RUNNING
Running           TASK_RUNNING
Blocked           TASK_INTERRUPTIBLE
                  TASK_UNINTERRUPTIBLE
                  TASK_ZOMBIE
                  TASK_STOPPED


NOTE

The process state can be set by direct assignment, such as current->state = TASK_INTERRUPTIBLE, if access to the task_struct is available. A call to set_current_state(TASK_INTERRUPTIBLE) has the same effect.


3.4.2. Process State Transitions

We now look at the kinds of events that would cause a process to go from one state to another. The abstract process transitions (refer to Table 3.5) include the transition from the ready state to the running state, the transition from the running state to the ready state, the transitions from the blocked state to the ready state, and the transition from the running state to the blocked state. Each transition can translate into more than one transition between different Linux task states. For example, going from blocked to running could translate to going from any one of TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_ZOMBIE, or TASK_STOPPED to TASK_RUNNING. Figure 3.11 and Table 3.7 describe these transitions.

Figure 3.11. Task State Transitions


Table 3.7. Summary of Task Transitions

Start Linux Task State    End Linux Task State    Agent of Transition
TASK_RUNNING              TASK_UNINTERRUPTIBLE    Process enters wait queue.
TASK_RUNNING              TASK_INTERRUPTIBLE      Process enters wait queue.
TASK_RUNNING              TASK_STOPPED            Process receives SIGSTOP signal or process
                                                  is being traced.
TASK_RUNNING              TASK_ZOMBIE             Process is killed but parent has not called
                                                  sys_wait4().
TASK_INTERRUPTIBLE        TASK_STOPPED            During signal receipt.
TASK_UNINTERRUPTIBLE      TASK_STOPPED            During waking up.
TASK_UNINTERRUPTIBLE      TASK_RUNNING            Process has received the resource it was
                                                  waiting for.
TASK_INTERRUPTIBLE        TASK_RUNNING            Process has received the resource it was
                                                  waiting for or has been set to running as
                                                  a result of a signal it received.
TASK_RUNNING              TASK_RUNNING            Moved in and out by the scheduler.


We now explain the various state transitions detailing the Linux task state transitions under the general process transition categories.

3.4.2.1. Ready to Running

The abstract process state transition of "ready to running" does not correspond to an actual Linux task state transition because the state does not actually change (it stays as TASK_RUNNING). However, the process goes from being in a queue of ready to run tasks (or run queue) to actually being run by the CPU.

TASK_RUNNING to TASK_RUNNING

Linux does not have a specific state for the task that is currently using the CPU, and the task retains the state of TASK_RUNNING even though the task moves out of a queue and its context is now executing. The scheduler selects the task from the run queue. Chapter 7 discusses how the scheduler selects the next task to set to running.

3.4.2.2. Running to Ready

In this situation, the task state does not change even though the task itself undergoes a change. The abstract process state transition helps us understand what is happening. As previously stated, a process goes from running to being ready to run when it transitions from being run by the CPU to being placed in the run queue.

TASK_RUNNING to TASK_RUNNING

Because Linux does not have a separate state for the task whose context is being executed by the CPU, the task does not suffer an explicit Linux task state transition when this occurs and stays in the TASK_RUNNING state. The scheduler selects when to switch out a task from being run to being placed in the run queue according to the time it has spent executing and the task's priority ( Chapter 7 covers this in detail).

3.4.2.3. Running to Blocked

When a process gets blocked, it can be in one of the following states: TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_ZOMBIE, or TASK_STOPPED. We now describe how a task gets to be in each of these states from TASK_RUNNING, as detailed in Table 3.7.

TASK_RUNNING to TASK_INTERRUPTIBLE

A task usually enters this state when it calls a blocking I/O function that has to wait on an event or resource. What does it mean for a task to be in the TASK_INTERRUPTIBLE state? Simply that it is not on the run queue because it is not ready to run. A task in TASK_INTERRUPTIBLE wakes up if its resource becomes available (time or hardware) or if a signal comes in. The completion of the original system call depends on the implementation of the interrupt handler. In the code example, the child process accesses a file that is on disk. The disk driver is in charge of knowing when the device is ready for the data to be accessed. Hence, the driver will have code that looks something like this:

 
 while (1)
 {
    if (resource_available)
       break;
    set_current_state(TASK_INTERRUPTIBLE);
    schedule();
 }
 set_current_state(TASK_RUNNING);
 
 

The example process enters the TASK_INTERRUPTIBLE state at the time it performs the call to open(). At this point, it is removed from being the running process by the call to schedule(), and another process selected from the run queue becomes the running process. After the resource becomes available, the process breaks out of the loop and sets its state to TASK_RUNNING, which puts it back on the run queue. It then waits until the scheduler determines that it is the process's turn to run.

The following listing shows the function interruptible_sleep_on(), which can set a task in the TASK_INTERRUPTIBLE state:

 
 -----------------------------------------------------------------------
 kernel/sched.c
 2504  void interruptible_sleep_on(wait_queue_head_t *q)
 2505  {
 2506   SLEEP_ON_VAR
 2507
 2508   current->state = TASK_INTERRUPTIBLE;  
 2509
 2510   SLEEP_ON_HEAD
 2511   schedule();
 2512   SLEEP_ON_TAIL
 2513  }
 -----------------------------------------------------------------------
 
 

The SLEEP_ON_HEAD and the SLEEP_ON_TAIL macros take care of adding and removing the task from the wait queue (see the "Wait Queues" section in this chapter). The SLEEP_ON_VAR macro initializes the task's wait queue entry for addition to the wait queue.

TASK_RUNNING to TASK_UNINTERRUPTIBLE

The TASK_UNINTERRUPTIBLE state is similar to TASK_INTERRUPTIBLE with the exception that the process does not heed signals that come in while it is in kernel mode. This state is also the default state into which a task is set when it is initialized during creation in do_fork(). The sleep_on() function is called to set a task in the TASK_UNINTERRUPTIBLE state.

 
 -----------------------------------------------------------------------
 kernel/sched.c
 2545  long fastcall __sched sleep_on(wait_queue_head_t *q)
 2546  {
 2547   SLEEP_ON_VAR
 2548
 2549   current->state = TASK_UNINTERRUPTIBLE;
 2550
 2551   SLEEP_ON_HEAD
 2552   schedule();
 2553   SLEEP_ON_TAIL
 2554
 2555   return timeout;
 2556  }
 -----------------------------------------------------------------------
 
 

This function sets the task on the wait queue, sets its state, and calls the scheduler.

TASK_RUNNING to TASK_ZOMBIE

A process in the TASK_ZOMBIE state is called a zombie process. Each process goes through this state in its lifecycle. The length of time a process stays in this state depends on its parent. To understand this, realize that in UNIX systems, any process may retrieve the exit status of a child process by means of a call to wait() or waitpid() (see the "Parent Notification and sys_wait4()" section). Hence, minimal information needs to be available to the parent, even once the child terminates. It is costly to keep the process alive just because the parent needs to know its state; hence, the zombie state is one in which the process' resources are freed and returned but the process descriptor is retained.

This temporary state is set during a process' call to sys_exit() (see the "Process Termination" section for more information). Processes in this state will never run again. The only state they can go to is the TASK_STOPPED state.

If a task stays in this state for too long, the parent task is not reaping its children. A zombie task cannot be killed because it is not actually alive. This means that no task exists to kill, merely the task descriptor that is waiting to be released.

TASK_RUNNING to TASK_STOPPED

This transition will be seen in two cases. The first case is processes that a debugger or a trace utility is manipulating. The second is if a process receives SIGSTOP or one of the stop signals.

TASK_UNINTERRUPTIBLE or TASK_INTERRUPTIBLE to TASK_STOPPED

TASK_STOPPED manages processes in SMP systems or during signal handling. A process is set to the TASK_STOPPED state when the process receives a wake-up signal or if the kernel specifically needs the process to not respond to anything (as it would if it were set to TASK_INTERRUPTIBLE, for example).

Unlike a task in state TASK_ZOMBIE, a process in state TASK_STOPPED is still able to receive a SIGKILL signal.

3.4.2.4. Blocked to Ready

The transition of a process from blocked to ready occurs upon acquisition of the data or hardware on which the process was waiting. The two Linux-specific transitions that occur under this category are TASK_INTERRUPTIBLE to TASK_RUNNING and TASK_UNINTERRUPTIBLE to TASK_RUNNING.


3.5. Process Termination

A process can terminate voluntarily and explicitly, voluntarily and implicitly, or involuntarily. Voluntary termination can be attained in two ways:

  1. Returning from the main() function (implicit)

  2. Calling exit() (explicit)

Executing a return from the main() function literally translates into a call to exit(). The linker introduces the call to exit() under these circumstances.
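
The following small sketch (illustrative, not from the text) makes that visible: a handler registered with atexit() runs both when main() returns and when exit() is called, because the return path is routed through exit(); calling _exit() instead bypasses it:

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>

 static void handler(void)
 {
     printf("exit handler ran\n");
 }

 int main(int argc, char *argv[])
 {
     atexit(handler);

     if (argc > 1)
         _exit(0);   /* immediate termination: the handler never runs */

     return 0;       /* routed through exit(): the handler runs */
 }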

Involuntary termination can be attained in three ways:

  1. The process might receive a signal that it cannot handle.

  2. An exception might be raised during its kernel mode execution.

  3. The process might have received SIGABRT or another termination signal.

The termination of a process is handled differently depending on whether the parent is alive or dead. A process can

  • Terminate before its parent

  • Terminate after its parent

In the first case, the child is turned into a zombie process until the parent makes the call to wait/waitpid(). In the second case, the init() process will have inherited the parent role for the child. When any process terminates, the kernel reviews all the active processes and verifies whether the terminating process is parent to any process that is still alive and active. If so, it changes that child's parent PID to 1.

Let's look at the example again and follow it through its demise. The process explicitly calls exit(0). (Note that it could have just as well called _exit(), return(0), or fallen off the end of main with neither call.) The exit() C library function then calls the sys_exit() system call. We can review the following code to see what happens to the process from here onward.

We now look at the functions that terminate a process. As previously mentioned, our process foo calls exit(), which calls the first function we look at, sys_exit(). We delve through the call to sys_exit() and into the details of do_exit().

3.5.1. sys_exit() Function

 
 -----------------------------------------------------------------------
 kernel/exit.c
 asmlinkage long sys_exit(int error_code)
 {
   do_exit((error_code&0xff)<<8);
 }
 -----------------------------------------------------------------------
 
 

sys_exit() does not vary between architectures, and its job is fairly straightforward; all it does is call do_exit() and convert the exit code into the format required by the kernel.

3.5.2. do_exit() Function

 
 -----------------------------------------------------------------------
 kernel/exit.c
 707  NORET_TYPE void do_exit(long code)
 708  {
 709   struct task_struct *tsk = current;
 710
 711   if (unlikely(in_interrupt()))
 712    panic("Aiee, killing interrupt handler!");
 713   if (unlikely(!tsk->pid))
 714    panic("Attempted to kill the idle task!");
 715   if (unlikely(tsk->pid == 1))
 716    panic("Attempted to kill init!");
 717   if (tsk->io_context)
 718    exit_io_context();
 719   tsk->flags |= PF_EXITING;
 720   del_timer_sync(&tsk->real_timer);
 721
 722   if (unlikely(in_atomic()))
 723    printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
 724       current->comm, current->pid,
 725       preempt_count());
 -----------------------------------------------------------------------
 
 

Line 707

The parameter code comprises the exit code that the process returns to its parent.

Lines 711-716

Verify against unlikely, but possible, invalid circumstances. These include the following:

  1. Making sure we are not inside an interrupt handler.

  2. Ensure we are not the idle task (PID = 0) or the init task (PID = 1). Note that the only time the init process is killed is upon system shutdown.

Line 719

Here, we set PF_EXITING in the flags field of the process's task_struct. This indicates that the process is shutting down. For example, this is used when creating interval timers for a given process. The process flags are checked to see if this flag is set, which helps prevent wasteful processing.

 
 -----------------------------------------------------------------------
 kernel/exit.c
 ...
 727   profile_exit_task(tsk);
 728
 729   if (unlikely(current->ptrace & PT_TRACE_EXIT)) {
 730    current->ptrace_message = code;
 731    ptrace_notify((PTRACE_EVENT_EXIT << 8) | SIGTRAP);
 732   }
 733
 734   acct_process(code);
 735   __exit_mm(tsk);
 736
 737   exit_sem(tsk);
 738   __exit_files(tsk);
 739   __exit_fs(tsk);
 740   exit_namespace(tsk);
 741   exit_thread();
 ...
 -----------------------------------------------------------------------
 
 

Lines 729-732

If the process is being ptraced and the PT_TRACE_EXIT flag is set, we pass the exit code and notify the parent process.

Lines 735-742

These lines comprise the cleaning up and reclaiming of resources that the task has been using and will no longer need. __exit_mm() frees the memory allocated to the process and releases the mm_struct associated with this process. exit_sem() disassociates the task from any IPC semaphores. __exit_files() releases any files the task allocated and decrements the file descriptor counts. __exit_fs() releases all file system data.

 
 -----------------------------------------------------------------------
 kernel/exit.c
 ...
 744   if (tsk->leader)
 745    disassociate_ctty(1);
 746
 747   module_put(tsk->thread_info->exec_domain->module);
 748   if (tsk->binfmt)
 749    module_put(tsk->binfmt->module);
 ...
 -----------------------------------------------------------------------
 
 

Lines 744-745

If the process is a session leader, it is expected to have a controlling terminal or tty. This function disassociates the task leader from its controlling tty.

Lines 747-749

In these blocks, we decrement the reference counts for the module:

 
 -----------------------------------------------------------------------
 kernel/exit.c
 ...
 751   tsk->exit_code = code;
 752   exit_notify(tsk);
 753
 754   if (tsk->exit_signal == -1 && tsk->ptrace == 0)
 755    release_task(tsk);
 756
 757   schedule();
 758   BUG();
 759   /* Avoid "noreturn function does return". */
 760   for (;;) ;
 761  }
 ...
 -----------------------------------------------------------------------
 
 

Line 751

Set the task's exit code in the task_struct field exit_code.

Line 752

Send the SIGCHLD signal to the parent and set the task state to TASK_ZOMBIE. exit_notify() notifies the relations of the impending task's death. The parent is informed of the exit code while the task's children have their parent set to the init process. The only exception to this is if another process exists within the same process group: in this case, that process is used as a surrogate parent.

Line 754

If exit_signal is -1 (meaning no signal is to be delivered to the parent) and the process is not being ptraced, the kernel calls release_task() to free the process descriptor of this task and to reclaim its timeslice.

Line 757

Yield the processor to a new process. As we see in Chapter 7, the call to schedule() will not return. All code past this point catches impossible circumstances or avoids compiler warnings.

3.5.3. Parent Notification and sys_wait4()

When a process is terminated, its parent is notified. Prior to this, the process is in a zombie state where all its resources have been returned to the kernel, but the process descriptor remains. The parent task (for example, the Bash shell) receives the SIGCHLD signal that the kernel sends to it when the child process terminates. In the example, the shell calls wait() when it wants to be notified. A parent process can ignore the signal by not installing a signal handler and can instead choose to call wait() (or waitpid()) at any point.

The wait family of functions serves two general roles:

  • Mortician. Getting task death information.

  • Grave digger. Getting rid of all traces of a process.

Our parent program can choose to call one of the four functions in the wait family:

  • pid_t wait(int *status)

  • pid_t waitpid(pid_t pid, int *status, int options)

  • pid_t wait3(int *status, int options, struct rusage *rusage)

  • pid_t wait4(pid_t pid, int *status, int options, struct rusage *rusage)

Each function will in turn call sys_wait4(), which is where the bulk of the notification occurs.
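Before we descend into the kernel side, a minimal user-space sketch shows the usual calling pattern; the program below is illustrative only: it forks a child and reaps it with waitpid(), decoding the exit status with the standard macros:

 -----------------------------------------------------------------------
 /* Illustrative user-space example: reaping a child with waitpid(). */
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>

 int main(void)
 {
    pid_t pid = fork();
    int status;

    if (pid == 0)
       exit(42);                           /* child: terminate with a known code */

    if (waitpid(pid, &status, 0) < 0) {    /* parent blocks here (mortician) */
       perror("waitpid");
       return 1;
    }
    if (WIFEXITED(status))                 /* child's descriptor is now reaped */
       printf("child %d exited with %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
 }
 -----------------------------------------------------------------------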

A process that calls one of the wait functions blocks until one of its children terminates, or returns immediately if a child has already terminated (if the caller has no children at all, the call fails with -ECHILD). The sys_wait4() function shows us how the kernel manages this notification:

 
 -----------------------------------------------------------------------
 kernel/exit.c
 1031  asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, 
 int options, struct rusage * ru)
 1032  {
 1033   DECLARE_WAITQUEUE(wait, current);
 1034   struct task_struct *tsk;
 1035   int flag, retval;
 1036
 1037   if (options & ~(WNOHANG|WUNTRACED|__WNOTHREAD|__WCLONE|__WALL))
 1038    return -EINVAL;
 1039
 1040   add_wait_queue(&current->wait_chldexit,&wait);
 1041  repeat:
 1042   flag = 0;
 1043   current->state = TASK_INTERRUPTIBLE;
 1044   read_lock(&tasklist_lock);
 ...
 -----------------------------------------------------------------------
 
 

Line 1031

The parameters include the PID of the target process, the address in which the exit status of the child should be placed, flags for sys_wait4(), and the address in which the resource usage information of the child should be placed.

Lines 1033 and 1040

Declare a wait queue and add the process to it. (This is covered in more detail in the " Wait Queues" section.)

Lines 1037-1038

This code checks for error conditions: if the system call is passed invalid options, it fails and the error -EINVAL is returned.

Line 1042

The flag variable is set to 0 as an initial value. This variable is changed once the pid argument is found to match one of the calling task's children.

Line 1043

This code is where the calling process is set to blocking. The state of the task is moved from TASK_RUNNING to TASK_INTERRUPTIBLE.

 
 -----------------------------------------------------------------------
 kernel/exit.c
 ...
 1045   tsk = current;
 1046   do {
 1047    struct task_struct *p;
 1048    struct list_head *_p;
 1049    int ret;
 1050
 1051    list_for_each(_p,&tsk->children) {
 1052     p = list_entry(_p,struct task_struct,sibling);
 1053
 1054     ret = eligible_child(pid, options, p);
 1055     if (!ret)
 1056       continue;
 1057     flag = 1;
 1058     switch (p->state) {
 1059     case TASK_STOPPED:
 1060       if (!(options & WUNTRACED) &&
 1061        !(p->ptrace & PT_PTRACED))
 1062        continue;
 1063       retval = wait_task_stopped(p, ret == 2,
 1064           stat_addr, ru);
 1065       if (retval != 0) /* He released the lock. */
 1066        goto end_wait4;
 1067       break;
 1068     case TASK_ZOMBIE: 
 ...
 1072       if (ret == 2)
 1073        continue;
 1074       retval = wait_task_zombie(p, stat_addr, ru);
 1075       if (retval != 0) /* He released the lock. */
 1076        goto end_wait4;
 1077       break;
 1078     }
 1079    }
 ...
 1091    tsk = next_thread(tsk);
 1092    if (tsk->signal != current->signal)
 1093     BUG();
 1094   } while (tsk != current);
 ...
 -----------------------------------------------------------------------
 
 

Lines 1046 and 1094

The do-while loop first looks at the calling task itself and then, via next_thread() (line 1091), at the other threads in its thread group, repeating until it comes back around to the task it started with.

Line 1051

Repeat the action on every process in the task's children list. Remember that this is the parent process that is waiting on its children's exit. The process is currently in TASK_INTERRUPTIBLE and iterating over its children list.

Line 1054

Determine whether the child p qualifies under the pid and options parameters that were passed in; children that do not match are skipped.

Lines 1058-1079

Check the state of each of the task's children. Actions are performed only if a child is stopped or if it is a zombie. If a task is sleeping, ready, or running (the remaining states), nothing is done. If a child is in TASK_STOPPED, it is reported only when the WUNTRACED option has been used or the child is being ptraced; in that case, we verify whether the status of that child has been reported and return the child's information. If a child is in TASK_ZOMBIE, it is reaped.

 
 -----------------------------------------------------------------------
 kernel/exit.c
 ...
 1106   retval = -ECHILD;
 1107  end_wait4:
 1108   current->state = TASK_RUNNING;
 1109   remove_wait_queue(&current->wait_chldexit,&wait);
 1110   return retval;
 1111  }
 -----------------------------------------------------------------------
 
 

Line 1106

If we have gotten to this point, the PID specified by the parameter is not a child of the calling process. ECHILD is the error used to notify us of this event.

Lines 1107-1111

At this point, the children list has been processed, and any children that needed to be reaped have been reaped. The parent's state is set back to TASK_RUNNING, it is removed from the wait queue, and the return value is passed back to the caller.

At this point, you should be familiar with the various stages that a process goes through during its lifecycle, the kernel functions that make all this happen, and the structures the kernel uses to keep track of all this information. Now, we look at how the scheduler manipulates and manages processes to create the effect of a multithreaded system. We also see in more detail how processes go from one state to another.


3.6. Keeping Track of Processes: Basic Scheduler Construction

Until this point, we kept the concepts of the states and the transitions process-centric. We have not spoken about how the transition is managed, nor have we spoken about the kernel infrastructure, which performs the running and stopping of processes. The scheduler handles all these details. Having finished the exploration of the process lifecycle, we now introduce the basics of the scheduler and how it interacts with the do_fork() function during process creation.

3.6.1. Basic Structure

The scheduler operates on a structure called a run queue. There is one run queue per CPU on the system. The core data structures within a run queue are two priority-ordered arrays. One of these contains active tasks and the other contains expired tasks. In general, an active task runs for a set amount of time, the length of its timeslice or quantum, and is then inserted into the expired array to wait for more CPU time. When the active array is empty, the scheduler swaps the two arrays by exchanging the active and expired pointers. The scheduler then begins executing tasks on the new active array.

Figure 3.12 illustrates the priority arrays within the run queue. The definition of the priority array structure is as follows:

 
 -----------------------------------------------------------------------
 kernel/sched.c
 192  struct prio_array {
 193   int nr_active;
 194   unsigned long bitmap[BITMAP_SIZE];
 195   struct list_head queue[MAX_PRIO];
 196  };
 -----------------------------------------------------------------------
 
 

Figure 3.12. Priority Arrays in a Run Queue


The fields of the prio_array struct are as follows:

  • nr_active. A counter that keeps track of the number of tasks held in the priority array.

  • bitmap. This keeps track of the priorities within the array. The actual length of bitmap depends on the size of unsigned longs on the system. It will always be enough to store MAX_PRIO bits, but it could be longer.

  • queue. An array that stores lists of tasks. Each list holds tasks of a certain priority. Thus, queue[0] holds a list of all tasks of priority 0, queue[1] holds a list of all tasks of priority 1, and so on.

With this basic understanding of how a run queue is organized, we can now embark on following a task through the scheduler on a single CPU system.
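As an aside, the reason for pairing the bitmap with the queue array is that it lets the scheduler find the highest-priority non-empty list in constant time; the real kernel uses sched_find_first_bit() for this. The loop below is only an illustrative sketch of the same idea, not the kernel's code:

 -----------------------------------------------------------------------
 /* Illustrative sketch: finding the next task in a priority array.    */
 /* The 2.6 scheduler does this with sched_find_first_bit() instead of */
 /* a loop, but the effect is the same.                                */
 static struct task_struct *pick_next(struct prio_array *array)
 {
    int prio;

    for (prio = 0; prio < MAX_PRIO; prio++) {      /* 0 is the highest priority */
       if (!test_bit(prio, array->bitmap))
          continue;                                /* no runnable task at this level */
       return list_entry(array->queue[prio].next,
              struct task_struct, run_list);       /* first task in that list */
    }
    return NULL;                                   /* the array is empty */
 }
 -----------------------------------------------------------------------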

3.6.2. Waking Up from Waiting or Activation

Recall that when a process calls fork(), a new process is made. As previously mentioned, the process calling fork() is called the parent, and the new process is called the child. The newly created process needs to be scheduled for access to the CPU. This occurs via the do_fork() function.

Two important lines in do_fork() deal with the scheduler and the waking up of the new process. copy_process(), called on line 1184 of linux/kernel/fork.c, calls the function sched_fork(), which initializes the process for an impending insertion into the scheduler's run queue. wake_up_forked_process(), called on line 1222 of linux/kernel/fork.c, takes the initialized process and inserts it into the run queue. Initialization and insertion have been separated to allow for the new process to be killed, or otherwise terminated, before being scheduled. The process will only be scheduled if it is created and initialized successfully and has no pending signals.

3.6.2.1. sched_fork(): Scheduler Initialization for Newly Forked Process

The sched_fork() function performs the infrastructure setup the scheduler requires for a newly forked process:

 
 -----------------------------------------------------------------------
 kernel/sched.c
 719 void sched_fork(task_t *p)
 720 {
 721   /*  
 722   * We mark the process as running here, but have not actually
 723   * inserted it onto the runqueue yet. This guarantees that
 724   * nobody will actually run it, and a signal or other external
 725   * event cannot wake it up and insert it on the runqueue either.
 726   */
 727   p->state = TASK_RUNNING;
 728   INIT_LIST_HEAD(&p->run_list);
 729   p->array = NULL;
 730   spin_lock_init(&p->switch_lock);
 -----------------------------------------------------------------------
 
 

Line 727

The process is marked as running by setting the state attribute in the task structure to TASK_RUNNING to ensure that no event can insert it on the run queue and run the process before do_fork() and copy_process() have verified that the process was created properly. When that verification passes, do_fork() adds it to the run queue via wake_up_forked_process().

Lines 728-730

The process' run_list field is initialized. When the process is activated, its run_list field is linked into the queue structure of a priority array in the run queue. The process' array field is set to NULL to represent that it is not part of either priority array on a run queue. The next block of sched_fork(), lines 731 to 739, deals with kernel preemption. (Refer to Chapter 7 for more information on preemption.)

 
 -----------------------------------------------------------------------
 kernel/sched.c
 740   /*
 741   * Share the timeslice between parent and child, thus the
 742   * total amount of pending timeslices in the system doesn't change,
 743   * resulting in more scheduling fairness.
 744   */
 745   local_irq_disable();
 746   p->time_slice = (current->time_slice + 1) >> 1;
 747   /*
 748   * The remainder of the first timeslice might be recovered by
 749   * the parent if the child exits early enough.
 750   */
 751   p->first_time_slice = 1;
 752   current->time_slice >>= 1;
 753   p->timestamp = sched_clock();
 754   if (!current->time_slice) {
 755     /*
 756     * This case is rare, it happens when the parent has only
 757     * a single jiffy left from its timeslice. Taking the
 758     * runqueue lock is not a problem.
 759     */
 760     current->time_slice = 1;
 761     preempt_disable();
 762     scheduler_tick(0, 0);
 763     local_irq_enable();
 764     preempt_enable();
 765   } else
 766     local_irq_enable();
 767 }
 -----------------------------------------------------------------------
 
 

Lines 740-753

After disabling local interrupts, we divide the parent's timeslice between the parent and the child using the shift operator. The new process' first_time_slice flag is set to 1 because it hasn't yet had a chance to run, and its timestamp is initialized to the current time in nanoseconds.

Lines 754-767

If the parent's timeslice is 1, the division results in the parent having 0 time left to run. Because the parent was the current process on the scheduler, we need the scheduler to choose a new process. This is done by calling scheduler_tick() (on line 762). Preemption is disabled to ensure that the scheduler chooses a new current process without being interrupted. Once all this is done, we enable preemption and restore local interrupts.

At this point, the newly created process has had its scheduler-specific variables initialized and has been given an initial timeslice of half the remaining timeslice of its parent. By forcing a process to sacrifice a portion of the CPU time it's been allocated and giving that time to its child, the kernel prevents processes from seizing large chunks of processor time. If processes were given a set amount of time, a malicious process could spawn many children and quickly become a CPU hog.
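As a concrete illustration of the arithmetic on lines 746 and 752, the following stand-alone snippet (ordinary C, not kernel code) splits a parent timeslice of 7 ticks the same way sched_fork() does:

 -----------------------------------------------------------------------
 /* Illustrative arithmetic only: the timeslice split from sched_fork(). */
 #include <stdio.h>

 int main(void)
 {
    unsigned int parent_slice = 7;                        /* ticks left on the parent */
    unsigned int child_slice  = (parent_slice + 1) >> 1;  /* child receives 4 */

    parent_slice >>= 1;                                   /* parent keeps 3 */
    printf("child=%u parent=%u\n", child_slice, parent_slice);
    return 0;
 }
 -----------------------------------------------------------------------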

After a process has been successfully initialized, and that initialization verified, do_fork() calls wake_up_forked_process():

 
 -----------------------------------------------------------------------
 kernel/sched.c
 922 /*
 923 * wake_up_forked_process - wake up a freshly forked process.
 924 *
 925 * This function will do some initial scheduler statistics housekeeping
 926 * that must be done for every newly created process.
 927 */
 928 void fastcall wake_up_forked_process(task_t * p)
 929 {
 930   unsigned long flags;
 931   runqueue_t *rq = task_rq_lock(current, &flags);
 932
 933   BUG_ON(p->state != TASK_RUNNING);
 934
 935   /*
 936   * We decrease the sleep average of forking parents
 937   * and children as well, to keep max-interactive tasks
 938   * from forking tasks that are max-interactive.
 939   */
 940   current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
 941     PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
 942
 943   p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
 944     CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
 945
 946   p->interactive_credit = 0;
 947
 948   p->prio = effective_prio(p);
 949   set_task_cpu(p, smp_processor_id());
 950
 951   if (unlikely(!current->array))
 952     __activate_task(p, rq);
 953   else {
 954     p->prio = current->prio;
 955     list_add_tail(&p->run_list, &current->run_list);
 956     p->array = current->array;
 957     p->array->nr_active++;
 958     rq->nr_running++;
 959   }
 960   task_rq_unlock(rq, &flags);
 961  }
 -----------------------------------------------------------------------
 
 

Lines 930-934

The first thing that the scheduler does is lock the run queue structure. Any modifications to the run queue must be made with the lock held. We also throw a bug notice if the process isn't marked as TASK_RUNNING, which it should be thanks to the initialization in sched_fork() (see Line 727 in kernel/sched.c shown previously).

Lines 940-947

The scheduler calculates the sleep average of the parent and child processes. The sleep average is the value of how much time a process spends sleeping compared to how much time it spends running. It is incremented by the amount of time the process slept, and it is decremented on each timer tick while it's running. An interactive, or I/O bound, process spends most of its time waiting for input and normally has a high sleep average. A non-interactive, or CPU-bound, process spends most of its time using the CPU instead of waiting for I/O and has a low sleep average. Because users want to see results of their input, like keyboard strokes or mouse movements, interactive processes are given more scheduling advantages than non-interactive processes. Specifically, the scheduler reinserts an interactive process into the active priority array after its timeslice expires. To prevent an interactive process from creating a non-interactive child process and thereby seizing a disproportionate share of the CPU, these formulas are used to lower the parent and child sleep averages. If the newly forked process is interactive, it soon sleeps enough to regain any scheduling advantages it might have lost.

Line 948

The function effective_prio() modifies the process' static priority. It returns a priority between 100 and 139 (MAX_RT_PRIO to MAX_PRIO-1). The process' static priority can be modified by up to 5 in either direction based on its previous CPU usage and time spent sleeping, but it always remains in this range. From the command line, we talk about the nice value of a process, which can range from +19 to -20 (lowest to highest priority). A nice priority of 0 corresponds to a static priority of 120.
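A tiny helper makes that mapping concrete. This is only a sketch based on the 2.6 convention that static priority equals 120 plus the nice value; it is not the kernel's own NICE_TO_PRIO() macro:

 -----------------------------------------------------------------------
 /* Sketch: mapping a nice value onto a static priority (120 + nice).  */
 #define MAX_RT_PRIO 100

 static int nice_to_static_prio(int nice)   /* nice is in [-20, 19] */
 {
    return MAX_RT_PRIO + 20 + nice;         /* -20 -> 100, 0 -> 120, +19 -> 139 */
 }
 -----------------------------------------------------------------------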

Line 949

The process has its CPU attribute set to the current CPU.

Lines 951-960

The overview of this code block is that the new process, or child, copies the scheduling information from its parent, which is current, and then inserts itself into the run queue in the appropriate place. We have finished our modifications of the run queue, so we unlock it. The following paragraph and Figure 3.13 discuss this process in more detail.

Figure 3.13. Run Queue Insertion


The pointer array points to a priority array in the run queue. If the current process isn't pointing to a priority array, it means that the current process has finished or is asleep. In that case, the current process' run_list field is not in the queue of the run queue's priority array, which means that the list_add_tail() operation (on line 955) would fail. Instead, we insert the newly created process using __activate_task(), which adds the new process to the queue without referring to its parent.

In the normal case, when the current process is waiting for CPU time on a run queue, the process is added to the queue residing at slot p->prio in the priority array. The array that the process was added to has its process counter, nr_active, incremented and the run queue has its process counter, nr_running, incremented. Finally, we unlock the run queue lock.

The case where the current process doesn't point to a priority array on the run queue is useful in seeing how the scheduler manages the run queue and priority array attributes.

 
 -----------------------------------------------------------------------
 kernel/sched.c
 366 static inline void __activate_task(task_t *p, runqueue_t *rq)
 367 {
 368   enqueue_task(p, rq->active);
 369   rq->nr_running++;
 370 }
 -----------------------------------------------------------------------
 
 

__activate_task() places the given process p on to the active priority array on the run queue rq and increments the run queue's nr_running field, which is the counter for total number of processes that are on the run queue.

 
 -----------------------------------------------------------------------
 kernel/sched.c
 311 static void enqueue_task(struct task_struct *p, prio_array_t *array)
 312 {
 313   list_add_tail(&p->run_list, array->queue + p->prio);
 314   __set_bit(p->prio, array->bitmap);
 315   array->nr_active++;
 316   p->array = array;
 317 }
 -----------------------------------------------------------------------
 
 

Lines 311312

enqueue_task() takes a process p and places it on priority array array, while initializing aspects of the priority array.

Line 313

The process' run_list is added to the tail of the queue located at p->prio in the priority array.

Line 314

The priority array's bitmap at priority p->prio is set so when the scheduler runs, it can see that there is a process to run at priority p->prio.

Line 315

The priority array's process counter is incremented to reflect the addition of the new process.

Line 316

The process' array pointer is set to the priority array to which it was just added.

To recap, the act of adding a newly forked process is fairly straightforward, even though the code can be confusing because of similar names throughout the scheduler. A process is placed at the end of a list in a run queue's priority array at the slot specified by the process' priority. The process then records the location of the priority array and the list it's located in within its structure.


3.7. Wait Queues

We discussed the process transition between the states of TASK_RUNNING and TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. Now, we look at another structure that's involved in this transition. When a process is waiting on an external event to occur, it is removed from the run queue and placed on a wait queue. Wait queues are doubly linked lists of wait_queue_t structures. The wait_queue_t structure is set up to hold all the information required to keep track of a waiting task. All tasks waiting on a particular external event are placed in a wait queue. The tasks on a given wait queue are woken up, at which point they verify the condition they are waiting for and either resume sleeping or remove themselves from the wait queue and set themselves back to TASK_RUNNING. You might recall that the sys_wait4() system call uses a wait queue when a parent requests the status of its forked child. Note that a task waiting for an external event (and therefore no longer on the run queue[4]) is in either the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state.

[4] A task is removed from the run queue once it sleeps and, therefore, yields control to another process.

A wait queue is a doubly linked list of wait_queue_t structures that hold pointers to the task structures of the processes that are blocking. Each list is headed by a wait_queue_head_t structure, which marks the head of the list and holds the spinlock for the list to prevent race conditions when wait_queue_t entries are added or removed. Figure 3.14 illustrates the wait queue implementation. We now look at the wait_queue_t and the wait_queue_head_t structures:

 -----------------------------------------------------------------------
 include/linux/wait.h
 19  typedef struct __wait_queue wait_queue_t;
 ...
 23  struct __wait_queue {
 24   unsigned int flags;
 25  #define WQ_FLAG_EXCLUSIVE  0x01
 26   struct task_struct * task;
 27   wait_queue_func_t func;
 28   struct list_head task_list;
 29  };
 30
 31  struct __wait_queue_head {
 32   spinlock_t lock;
 33   struct list_head task_list;
 34  };
 35  typedef struct __wait_queue_head wait_queue_head_t;
 -----------------------------------------------------------------------
 

Figure 3.14. Wait Queue Structures


The wait_queue_t structure is comprised of the following fields:

  • flags. Can hold the value WQ_FLAG_EXCLUSIVE, which is set to 1, or ~WQ_FLAG_EXCLUSIVE, which would be 0. The WQ_FLAG_EXCLUSIVE flag marks this process as an exclusive process. Exclusive and non-exclusive processes are discussed in the next section.

  • task. The pointer to the task descriptor of the process being placed on the wait queue.

  • func. A pointer to the function used to wake the task on the wait queue. By default, this field is set to default_wake_function(), which is covered in detail in the section, "Waiting on the Event."

    wait_queue_func_t is defined as follows:

     ------------------------------------------------------------------
     include/linux/wait.h
     typedef int (*wait_queue_func_t)(wait_queue_t *wait, 
     unsigned mode, int sync);
     ------------------------------------------------------------------
     

    where wait is the pointer to the wait queue entry, mode is either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, and sync specifies whether the wakeup should be synchronous.

  • task_list. The structure that holds pointers to the previous and next elements in the wait queue.

The structure __wait_queue_head is the head of a wait queue list and is comprised of the following fields:

  • lock. One lock per list allows the addition and removal of items into the wait queue to be synchronized.

  • task_list. The structure that points to the first and last elements in the wait queue.

The "Wait Queues" section in Chapter 10, "Adding Your Code to the Kernel," describes an example implementation of a wait queue. In general, the way in which a process puts itself to sleep involves a call to one of the wait_event* macros (which is discussed shortly) or by executing the following steps, as in the example shown in Chapter 10:

1.
By declaring the wait queue, the process sleeps on by way of DECLARE_WAITQUEUE_HEAD.

2.
Adding itself to the wait queue by way of add_wait_queue() or add_wait_queue_exclusive().

3.
Changing its state to TASK_INTERRUPTIBLE or TASK_ UNINTERRUPTIBLE.

4.
Testing for the external event and calling schedule(), if it has not occurred yet.

5.
After the external event occurs, setting itself to the TASK_RUNNING state.

6.
Removing itself from the wait queue by calling remove_wait_queue().

The waking up of a process is handled by way of a call to one of the wake_up macros. These wake up all processes that belong to a particular wait queue. This places the task in the TASK_RUNNING state and places it back on the run queue.
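A compact sketch of those six steps, written as a hypothetical driver fragment, might look like the following; my_wait_queue and my_event_flag are invented names, and a real driver would also need locking around the flag:

 -----------------------------------------------------------------------
 /* Hypothetical fragment following the six steps above.               */
 static DECLARE_WAIT_QUEUE_HEAD(my_wait_queue);       /* step 1 */
 static int my_event_flag;

 static void my_wait_for_event(void)
 {
    DECLARE_WAITQUEUE(wait, current);

    add_wait_queue(&my_wait_queue, &wait);            /* step 2 */
    while (!my_event_flag) {
       set_current_state(TASK_INTERRUPTIBLE);         /* step 3 */
       if (!my_event_flag)                            /* step 4: test, then sleep */
          schedule();
    }
    set_current_state(TASK_RUNNING);                  /* step 5 */
    remove_wait_queue(&my_wait_queue, &wait);         /* step 6 */
 }

 static void my_post_event(void)
 {
    my_event_flag = 1;
    wake_up_interruptible(&my_wait_queue);            /* wakes the sleeper */
 }
 -----------------------------------------------------------------------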

Let's look at what happens when we call the add_wait_queue() functions.

3.7.1. Adding to the Wait Queue

Two different functions are used to add sleeping processes into a wait queue: add_wait_queue() and add_wait_queue_exclusive(). The two functions exist to accommodate the two types of sleeping processes. Non-exclusive waiting processes are those that wait for the return of a condition that is not shared by other waiting processes. Exclusive waiting processes are waiting for a condition that another waiting process might be waiting on, which potentially generates a race condition.

The add_wait_queue() function inserts a non-exclusive process into the wait queue. A non-exclusive process is one that will, under any circumstance, be woken up by the kernel when the event it is waiting for comes to fruition. The function clears the WQ_FLAG_EXCLUSIVE bit in the flags field of the wait_queue_t that represents the sleeping process, takes the wait queue's lock (with interrupts disabled, so that an interrupt handler touching the same wait queue cannot cause a race condition), adds the structure to the wait queue list, and then releases the lock so the queue is available to other processes:

 
 -----------------------------------------------------------------------
 kernel/fork.c
 93  void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
 94  {
 95   unsigned long flags;
 96
 97   wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 98   spin_lock_irqsave(&q->lock, flags);
 99   __add_wait_queue(q, wait);
 100   spin_unlock_irqrestore(&q->lock, flags);
 101  }
 -----------------------------------------------------------------------
 
 
 

The add_wait_queue_exclusive() function inserts an exclusive process into the wait queue. The function sets the WQ_FLAG_EXCLUSIVE bit in the flags field of the wait queue struct and proceeds in much the same manner as add_wait_queue(), with the exception that it adds the exclusive process at the tail end of the queue. This means that in a particular wait queue, the non-exclusive processes are at the front and the exclusive processes are at the end. This comes into play with the order in which the processes in a wait queue are woken up, as we see when we discuss waking up sleeping processes:

 
 -----------------------------------------------------------------------
 kernel/fork.c
 105  void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait)
 106  {
 107   unsigned long flags;
 108
 109   wait->flags |= WQ_FLAG_EXCLUSIVE;
 110   spin_lock_irqsave(&q->lock, flags);
 111   __add_wait_queue_tail(q, wait);
 112   spin_unlock_irqrestore(&q->lock, flags);
 113  }
 -----------------------------------------------------------------------
 
 

3.7.2. Waiting on the Event

The sleep_on(), sleep_on_timeout(), and interruptible_sleep_on() interfaces, although still supported in 2.6, will be removed for the 2.7 release. Therefore, we cover only the wait_event*() interfaces that are to replace the sleep_on*() interfaces.

The wait_event*() interfaces include wait_event(), wait_event_interruptible(), and wait_event_interruptible_timeout(). Figure 3.15 shows the call graph for these functions.

Figure 3.15. wait_event*() Call Graph


We go through and describe the interfaces related to wait_event() and mention how the other two functions differ. The wait_event() interface is a wrapper around the call to __wait_event() with an infinite loop that is broken only when the condition being waited on becomes true. wait_event_interruptible_timeout() takes a third parameter of type int, called ret, which is used to pass in the timeout time.

wait_event_interruptible() is the only one of the three interfaces that returns a value. This return value is -ERESTARTSYS if a signal broke the wait, or 0 if the condition was met:

 
 -----------------------------------------------------------------------
 include/linux/wait.h
 137  #define wait_event(wq, condition)     \
 138  do {            \
 139   if (condition)         \
 140    break;          \
 141   __wait_event(wq, condition);      \
 142  } while (0)
 -----------------------------------------------------------------------
 
 

The __wait_event() interface does all the work around the process state change and the descriptor manipulation:

 
 -----------------------------------------------------------------------
 include/linux/wait.h
 121  #define __wait_event(wq, condition)     \
 122  do {            \
 123   wait_queue_t __wait;        \
 124   init_waitqueue_entry(&__wait, current);    \
 125            \
 126   add_wait_queue(&wq, &__wait);      \
 127   for (;;) {          \
 128    set_current_state(TASK_UNINTERRUPTIBLE);   \
 129    if (condition)         \
 130     break;          \
 131    schedule();         \
 132   }            \
 133   current->state = TASK_RUNNING;      \
 134   remove_wait_queue(&wq, &__wait);     \
 135  } while (0)
 -----------------------------------------------------------------------
 
 

Lines 124-126

Initialize the wait queue descriptor for the current process and add the descriptor entry to the wait queue that was passed in. Up to this point, __wait_event_interruptible and __wait_event_interruptible_timeout look identical to __wait_event.

Lines 127-132

This code sets up an infinite loop that will only be broken out of when the condition is met. Before blocking on the condition, we set the state of the process to TASK_UNINTERRUPTIBLE by using the set_current_state macro. Recall that this macro references the pointer to the current process, so we do not need to pass in the process information. Once it blocks, it yields the CPU to the next process by means of a call to schedule(). __wait_event_interruptible() differs in one large respect at this point: it sets the state field of the process to TASK_INTERRUPTIBLE and checks signal_pending() on the current process, so a signal can break the wait. __wait_event_interruptible_timeout() is much like __wait_event_interruptible() except that it calls schedule_timeout() instead of schedule() when calling the scheduler. schedule_timeout() takes as a parameter the timeout length passed in to the original wait_event_interruptible_timeout() interface.

Lines 133-134

At this point in the code, the condition has been met or, in the case of the other two interfaces, a signal might have been received or the timeout reached. The state field of the process descriptor is now set back to TASK_RUNNING (the scheduler places this in the run queue). Finally, the entry is removed from the wait queue. The remove_wait_queue() function locks the wait queue before removing the entry, and then it restores the lock before returning.
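In practice, most kernel code reaches these macros through a one-line call. The following hedged sketch pairs wait_event_interruptible() with wake_up_interruptible(); data_ready_queue and data_ready are invented names, and checking the return value is what distinguishes the interruptible variant:

 -----------------------------------------------------------------------
 /* Hypothetical use of wait_event_interruptible(); names are invented. */
 static DECLARE_WAIT_QUEUE_HEAD(data_ready_queue);
 static int data_ready;

 static int my_wait_for_data(void)
 {
    /* Sleeps until data_ready becomes nonzero. Returns -ERESTARTSYS
     * if a signal interrupted the sleep, 0 once the condition is true. */
    if (wait_event_interruptible(data_ready_queue, data_ready != 0))
       return -ERESTARTSYS;
    return 0;
 }

 static void my_data_arrived(void)
 {
    data_ready = 1;
    wake_up_interruptible(&data_ready_queue);
 }
 -----------------------------------------------------------------------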

3.7.3. Waking Up

A process must be woken up to verify whether its condition has been met. Note that a process might put itself to sleep, but it cannot wake itself up. Numerous macros can be used to wake up tasks in a wait queue, but only three main "wake_up" functions exist. The macros wake_up, wake_up_nr, wake_up_all, wake_up_interruptible, wake_up_interruptible_nr, and wake_up_interruptible_all all call __wake_up() with different parameters. The macros wake_up_all_sync and wake_up_interruptible_sync both call __wake_up_sync() with different parameters. Finally, the wake_up_locked macro defaults to the __wake_up_locked() function:

 -----------------------------------------------------------------------
 include/linux/wait.h
 116  extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
 117  extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode));
 118  extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
 119
 120  #define wake_up(x)    __wake_up((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1)
 121  #define wake_up_nr(x, nr)   __wake_up((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, nr)
 122  #define wake_up_all(x)   __wake_up((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 0)
 123  #define wake_up_all_sync(x)   __wake_up_sync((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 0)
 124  #define wake_up_interruptible(x)  __wake_up((x),TASK_INTERRUPTIBLE, 1)
 125  #define wake_up_interruptible_nr(x, nr)  __wake_up((x),TASK_INTERRUPTIBLE, nr)
 126  #define wake_up_interruptible_all(x)  __wake_up((x),TASK_INTERRUPTIBLE, 0)
 127  #define wake_up_locked(x)   __wake_up_locked((x), TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE)
 128  #define wake_up_interruptible_sync(x)  __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 -----------------------------------------------------------------------

Let's look at __wake_up():

 
 -----------------------------------------------------------------------
 kernel/sched.c
 2336  void fastcall __wake_up(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 2337  {
 2338    unsigned long flags;
 2339
 2340    spin_lock_irqsave(&q->lock, flags);
 2341    __wake_up_common(q, mode, nr_exclusive, 0);
 2342    spin_unlock_irqrestore(&q->lock, flags);
 2343  }
 -----------------------------------------------------------------------
 
 

Line 2336

The parameters passed to __wake_up include q, the pointer to the wait queue; mode, the indicator of the type of thread to wake up (this is identified by the state of the thread); and nr_exclusive, which indicates whether it's an exclusive or non-exclusive wakeup. An exclusive wakeup (when nr_exclusive = 0) wakes up all the tasks in the wait queue (both exclusive and non-exclusive), whereas a non-exclusive wakeup wakes up all the non-exclusive tasks and only one exclusive task.

Lines 2340, 2342

These lines acquire and release the wait queue's spinlock. The lock must be taken before calling __wake_up_common() to ensure that no race condition comes up.

Line 2341

The function __wake_up_common() performs the bulk of the wakeup function:

 
 -----------------------------------------------------------------------
 kernel/sched.c
 2313  static void __wake_up_common(wait_queue_head_t *q, 
 unsigned int mode, int nr_exclusive, int sync)
 2314  {
 2315   struct list_head *tmp, *next;
 2316
 2317   list_for_each_safe(tmp, next, &q->task_list) {
 2318      wait_queue_t *curr;
 2319      unsigned flags; 
 2320     curr = list_entry(tmp, wait_queue_t, task_list);
 2321      flags = curr->flags;
 2322      if (curr->func(curr, mode, sync) &&
 2323       (flags & WQ_FLAG_EXCLUSIVE) &&
 2324       !--nr_exclusive)
 2325        break;
 2326    }
 2327  }
 -----------------------------------------------------------------------
 
 

Line 2313

The parameters passed to __wake_up_common are q, the pointer to the wait queue; mode, the type of thread to wake up; nr_exclusive, the type of wakeup previously shown; and sync, which states whether the wakeup should be synchronous.

Line 2315

Here, we set temporary pointers for list-entry manipulation.

Line 2317

The list_for_each_safe macro scans each item of the wait queue. This is the beginning of our loop.

Line 2320

The list_entry macro returns the address of the wait queue structure held by the tmp variable.

Line 2322

The wait_queue_t's func field is called. By default, this calls default_wake_function(), which is shown here:

 
 -----------------------------------------------------------------------
 kernel/sched.c
 2296  int default_wake_function(wait_queue_t *curr, unsigned mode, 
 int sync)
 2297  {
 2298   task_t *p = curr->task;
 2299   return try_to_wake_up(p, mode, sync);
 2300  }
 -----------------------------------------------------------------------
 
 

This function calls try_to_wake_up() (kernel/sched.c) on the task pointed to by the wait_queue_t structure. This function performs the bulk of the work of waking up a process, including putting it on the run queue.

Lines 2322-2325

The loop terminates once the requested number of exclusive processes (nr_exclusive) has been woken up, which in the common case means after the first exclusive process. This makes sense when we realize that all the exclusive processes are queued at the end of the wait queue: after we encounter the first exclusive task, any remaining tasks are also exclusive, so we do not want to wake them, and we break out of the loop.


3.8. Asynchronous Execution Flow

We mentioned that processes can transition from one state to another by means of interrupts; for instance, a process can go from TASK_INTERRUPTIBLE to TASK_RUNNING. One way this is attained is by means of asynchronous execution flow, which includes exceptions and interrupts. We have mentioned that processes move in and out of user mode and kernel mode. We now describe how exceptions work and follow that with an explanation of how interrupts work.

3.8.1. Exceptions

Exceptions, also known as synchronous interrupts, are events that occur entirely within the processor's hardware. These events are synchronous to the execution of the processor; that is, they occur not during but after the execution of a code instruction. Examples of processor exceptions include referencing a virtual memory location that is not physically present (known as a page fault) and a calculation that results in a divide by 0. The important thing to note with exceptions (sometimes called soft irqs) is that they typically happen after an instruction's execution. This differentiates them from external or asynchronous events, which are discussed later in Section 3.8.2, "Interrupts."

Most modern processors (the x86 and the PPC included) allow the programmer to initiate an exception by executing certain instructions. These instructions can be thought of as hardware-assisted subroutine calls. An example of this is the system call.

3.8.1.1. System Calls

Linux provides user mode programs with entry points into the kernel by which services or hardware access can be requested from the kernel. These entry points are standardized and predefined in the kernel. Many of the C library routines available to user mode programs, such as the fork() function in Figure 3.9, bundle code and one or more system calls to accomplish a single function. When a user process calls one of these functions, certain values are placed into the appropriate processor registers and a software interrupt is generated. This software interrupt then calls the kernel entry point. Although not recommended, system calls (syscalls) can also be accessed from kernel code. Where a syscall should be called from is the source of some discussion, because calling syscalls from within the kernel can improve performance; that improvement is weighed against the added complexity and reduced maintainability of the code. In this section, we explore the "traditional" syscall implementation where syscalls are called from user space.

Syscalls have the ability to move data between user space and kernel space. Two functions are provided for this purpose: copy_to_user() and copy_from_user(). As in all kernel programming, validation (of pointers, lengths, descriptors, and permissions) is critical when moving data. These functions have the validation built in. Interestingly, they return the number of bytes not transferred.
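A short sketch shows the calling convention; the function and buffer names below are invented, and the point to notice is that a nonzero return value means some bytes were not transferred:

 -----------------------------------------------------------------------
 /* Hypothetical helper showing the copy_to_user() return convention.  */
 static long give_data_to_user(void __user *ubuf, const void *kbuf,
              unsigned long len)
 {
    unsigned long not_copied;

    not_copied = copy_to_user(ubuf, kbuf, len);   /* returns bytes NOT copied */
    if (not_copied)
       return -EFAULT;                            /* bad user pointer or range */
    return len;                                   /* everything was transferred */
 }
 -----------------------------------------------------------------------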

By its nature, the implementation of the syscall is hardware specific. Traditionally, with Intel architecture, all syscalls have used software interrupt 0x80.[5]

[5] In an effort to gain in performance with the newer (PIV+) Intel processors, work has been done with the implementation of vsyscalls. vsyscalls are based on calls to user space memory (in particular, a "vsyscall" page) and use the faster sysenter and sysexit instructions (when available) over the traditional int 0x80 call. Similar performance work is also being pursued on many PPC implementations.

Parameters of the syscall are passed in the general registers with the unique syscall number in %eax. The implementation of the system call on the x86 architecture limits the number of parameters to 5. If more than 5 are required, a pointer to a block of parameters can be passed. Upon execution of the assembler instruction int 0x80, a specific kernel mode routine is called by way of the exception-handling capabilities of the processor. Let's look at an example of how a system call entry is initialized:

 set_system_gate(SYSCALL_VECTOR,&system_call);
 

This macro creates a user privilege descriptor at entry 128 (SYSCALL_VECTOR), which points to the address of the syscall handler in entry.S (system_call).

As we see in the next section on interrupts, PPC interrupt routines are "anchored" to certain memory locations; the external interrupt handler is anchored to address 0x500, the system timer is anchored to address 0x900, and so on. The system call instruction sc vectors to address 0xc00. Let's explore the code segment from head.S where the handler is set for the PPC system call:

 
 -----------------------------------------------------------------------
 arch/ppc/kernel/head.S
 484  /* System call */
 485   . = 0xc00
 486  SystemCall:
 487   EXCEPTION_PROLOG
 488   EXC_XFER_EE_LITE(0xc00, DoSyscall)
 -----------------------------------------------------------------------
 
 

Line 485

The anchoring of the address. This line tells the loader that the next instruction is located at address 0xc00. Because labels follow similar rules, the label SystemCall along with the first line of code in the macro EXCEPTION_PROLOG both start at address 0xc00.

Line 488

This macro dispatches the DoSyscall() handler.

For both architectures, the syscall number and any parameters are stored in the processor's registers.

When the x86 exception handler processes the int 0x80, it indexes into the system call table. The file arch/i386/kernel/entry.S contains low-level interrupt handling routines and the system call table, sys_call_table. Likewise for the PPC, the syscall low-level routines are in arch/ppc/kernel/entry.S and the sys_call_table is in arch/ppc/kernel/misc.S.

The syscall table is an assembly code implementation of an array in C with 4-byte entries. Each entry is initialized to the address of a function. By convention, we must prepend the name of our function with "sys_." Because the position in the table determines the syscall number, we must add the name of our function to the end of the list. Even with different assembly languages, the tables are nearly identical between the architectures. However, the PPC table has only 255 entries at the time of this writing, while the x86 table has 275.

The files include/asm-i386/unistd.h and include/asm-ppc/unistd.h associate the system calls with their positional numbers in the sys_call_table. The "sys_" is replaced with a "__NR_" in this file. This file also has macros to assist the user program in loading the registers with parameters. (See the assembly programming section in Chapter 2, "Exploration Toolkit," for a crash course in C and assembler variables and inline assembler.)

Let's look at how we would add a system call named sys_ourcall. The system call must be added to the sys_call_table. The addition of our system call into the x86 sys_call_table is shown here:

 
 -----------------------------------------------------------------------
 arch/i386/kernel/entry.S
 607  .data
 608  ENTRY(sys_call_table)
 609  .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting*/
 ...
 878  .long sys_tgkill   /* 270 */
 879  .long sys_utimes
 880  .long sys_fadvise64_64
 881  .long sys_ni_syscall   /* sys_vserver */
 882  .long sys_ourcall   /* our syscall will be 274 */
 883
 884  nr_syscalls=(.-sys_call_table)/4
 ----------------------------------------------------------------------- 
 
 

In x86, our system call would be number 274. If we were to add a syscall named sys_ourcall in PPC, the entry would be number 255. Here, we show how it would look when we introduce the association of our system call with its positional number into include/asm-ppc/unistd.h. __NR_ourcall is entry number 255 at the end of the table:

 
 -----------------------------------------------------------------------
 include/asm-ppc/unistd.h
 /*
  * This file contains the system call numbers.
  */
 
 #define __NR_restart_syscall  0
 #define __NR_exit   1
 #define __NR_fork   2
 ...
 #define __NR_utimes   271
 #define __NR_fadvise64_64  272
 #define __NR_vserver  273
 #define __NR_ourcall   274
 
 /* #define NR_syscalls 274  this is the old value before our syscall */
 #define NR_syscalls 275
 -----------------------------------------------------------------------
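Once the number exists in unistd.h, a user program can reach the new call through the generic syscall() wrapper from the C library (or through the architecture's _syscallN macros). The test program below is hypothetical, since sys_ourcall is only an illustration, and it assumes the x86 number 274 shown earlier:

 -----------------------------------------------------------------------
 /* Hypothetical user-space test for the illustrative syscall sys_ourcall. */
 #include <stdio.h>
 #include <unistd.h>
 #include <sys/syscall.h>

 #define __NR_ourcall 274        /* the illustrative x86 number used above */

 int main(void)
 {
    long ret = syscall(__NR_ourcall);   /* enters the kernel (int 0x80 on traditional x86) */
    printf("sys_ourcall returned %ld\n", ret);
    return 0;
 }
 -----------------------------------------------------------------------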
 
 

The next section discusses interrupts and the hardware involved in alerting the kernel to the need for handling them. Where exceptions as a group diverge somewhat from interrupts is in what their handlers do in response to being called. Although exceptions travel the same route as interrupts at handling time, exceptions tend to send signals back to the current process rather than work with hardware devices.

3.8.2. Interrupts

Interrupts are asynchronous to the execution of the processor, which means that interrupts can happen in between instructions. The processor receives notification of an interrupt by way of an external signal to one of its pins (INTR or NMI). This signal comes from a hardware device called an interrupt controller. Interrupts and interrupt controllers are hardware and system specific. From architecture to architecture, many differences exist in how interrupt controllers are designed. This section touches on the major hardware differences and functions tracing the kernel code from the architecture-independent to the architecture-dependent parts.

An interrupt controller is needed because the processor must be in communication with one of several peripheral devices at any given moment. Older x86 computers used a cascaded pair of Intel 8259 interrupt controllers configured in such a way[6] that the processor was able to discern between 15 discrete interrupt lines (IRQ) (see Figure 3.16). When the interrupt controller has a pending interrupt (for example, when you press a key), it asserts its INT line, which is connected to the processor. The processor then acknowledges this signal by asserting its acknowledge line connected to the INTA line on the interrupt controller. At this moment, the interrupt controller transfers the IRQ data to the processor. This sequence is known as an interrupt-acknowledge cycle.

[6] An IRQ from the first 8259 (usually IRQ2) is connected to the output of the second 8259.

Figure 3.16. Cascaded Interrupt Controllers


Newer x86 processors have a local Advanced Programmable Interrupt Controller (APIC). The local APIC (which is built into the processor package) receives interrupt signals from the following:

  • Processor's interrupt pins (LINT0, LINT1)

  • Internal timer

  • Internal performance monitor

  • Internal temperature sensor

  • Internal APIC error

  • Another processor (inter-processor interrupts)

  • An external I/O APIC (via an APIC bus on multiprocessor systems)

After the APIC receives an interrupt signal, it routes the signal to the processor core (internal to the processor package). The I/O APIC shown in Figure 3.17 is part of a processor chipset and is designed to receive 24 programmable interrupt inputs.

Figure 3.17. I/O APIC


The x86 processors with local APIC can also be configured with 8259 type interrupt controllers instead of the I/O APIC architecture (or the I/O APIC can be configured to interface to an 8259 controller). To find out if a system is using the I/O APIC architecture, enter the following on the command line:

 lkp:~# cat /proc/interrupts
 

If you see I/O-APIC listed, it is in use. Otherwise, you see XT-PIC, which means it is using the 8259 type architecture.

The PowerPC interrupt controllers for the Power Mac G4 and G5 are integrated into the Key Largo and K2 I/O controllers. Entering this on the command line:

 lkp:~# cat /proc/interrupts
 

on a G4 machine yields OpenPIC, which is an Open Programmable Interrupt Controller standard initiated by AMD and Cyrix in 1995 for multiprocessor systems. MPIC is the IBM implementation of OpenPIC, and is used in several of their CHRP designs. Old-world Apple machines had an in-house interrupt controller and, for the 4xx embedded processors, the interrupt controller core is integrated into the ASIC chip.

Now that we have had the necessary discussion of how, why, and when interrupts are delivered to the kernel by the hardware, we can analyze a real-world example of the kernel handling the Hardware System Timer interrupt and expand on where the interrupt is delivered. As we go through the System Timer code, we see that at interrupt time, the hardware-to-software interface is implemented in both the x86 and PPC architectures with jump tables that select the proper handler code for a given interrupt.

Each interrupt of the x86 architecture is assigned a unique number or vector. At interrupt time, this vector is used to index into the Interrupt Descriptor Table (IDT). (See the Intel Programmer's Reference for the format of the x86 gate descriptor.) The IDT allows the hardware to assist the software with address resolution and privilege checking of handler code at interrupt time. The PPC architecture is somewhat different in that it uses an interrupt table created at compile time to execute the proper interrupt handler. (Later in this section, there is more on the software aspects of initialization and use of the jump tables, when we compare x86 and PPC interrupt handling for the system timer.) The next section discusses interrupt handlers and their implementation. We follow that with a discussion of the system timer as an example of the Linux implementation of interrupts and their associated handlers.

We now talk about the different kinds of interrupt handlers.

3.8.2.1. Interrupt Handlers

Interrupt and exception handlers look much like regular C functions. They may, and often do, call hardware-specific assembly routines. Linux interrupt handlers are broken into a high-performance top half and a low-performance bottom half:

  • Top half. Must execute as quickly as possible. Top-half handlers, depending on how they are registered, can run with all local (to a given processor) interrupts disabled (a fast handler). Code in a top-half handler needs to be limited to responding directly to the hardware and/or performing time-critical tasks. To remain in the top-half handler for a prolonged period of time could significantly impact system performance. To keep performance high and latency (which is the time it takes to acknowledge a device) low, the bottom-half architecture was introduced.

  • Bottom half. Allows the handler writer to delay the less critical work until the kernel has more time.[7] Remember, the interrupt came in asynchronously with the execution of the system; the kernel might have been doing something more time critical at that moment. With the bottom-half architecture, the handler writer can have the kernel run the less critical handler code at a later time.

    [7] Earlier versions of Linux used a top-half/bottom-half handler for the system timer. It has since been rewritten to be a high-performance top half only.

Table 3.8 illustrates the four most common methods of bottom-half interrupt handling.

Table 3.8. Bottom-Half Interrupt Handling Methods

  • "Old" bottom halves. These pre-SMP handlers are being phased out because only one bottom half can run at a time regardless of the number of processors. This system has been removed in the 2.6 kernel and is mentioned only for reference.

  • Work queues. The top-half code is said to run in interrupt context, which means it is not associated with any process. With no process association, the code cannot sleep or block. Work queues run in process context and have the abilities of any kernel thread. Work queues have a rich set of functions for creation, scheduling, canceling, and so on. For more information on work queues, see the "Work Queues and Interrupts" section in Chapter 10.

  • Softirqs. Softirqs run in interrupt context and are similar to bottom halves except that softirqs of the same type can run on multiple processors simultaneously. Only 32 softirqs are available in the system. The system timer uses softirqs.

  • Tasklets. Similar to softirqs except that no limit exists. All tasklets are funneled through one softirq, and the same tasklet cannot run simultaneously on multiple processors. The tasklet interface is simpler to use and implement compared to softirqs.
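As a small illustration of deferring work to a bottom half, the sketch below declares a tasklet and schedules it from a hypothetical top-half handler; my_tasklet_fn, my_top_half, and the tasklet name are invented:

 -----------------------------------------------------------------------
 /* Hypothetical tasklet example: deferring work out of the top half.  */
 static void my_tasklet_fn(unsigned long data)
 {
    /* Runs later, still in interrupt context, but outside the handler. */
 }

 DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

 static irqreturn_t my_top_half(int irq, void *dev_id, struct pt_regs *regs)
 {
    /* ... acknowledge the hardware as quickly as possible here ... */
    tasklet_schedule(&my_tasklet);   /* defer the slower processing */
    return IRQ_HANDLED;
 }
 -----------------------------------------------------------------------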


3.8.2.2. IRQ Structures

Three main structures contain all the information related to IRQs: irq_desc_t, irqaction, and hw_interrupt_type. Figure 3.18 illustrates how they interrelate.

Figure 3.18. IRQ Structures


Struct irq_desc_t

The irq_desc_t structure is the primary IRQ descriptor. irq_desc_t structures are stored in a globally accessible array of size NR_IRQS (whose value is architecture dependent) called irq_desc.

 
 -----------------------------------------------------------------------
 include/linux/irq.h
 60  typedef struct irq_desc {
 61   unsigned int status;   /* IRQ status */
 62   hw_irq_controller *handler;
 63   struct irqaction *action; /* IRQ action list */
 64   unsigned int depth;   /* nested irq disables */
 65   unsigned int irq_count;  /* For detecting broken interrupts */
 66   unsigned int irqs_unhandled;
 67   spinlock_t lock;
 68  } ____cacheline_aligned irq_desc_t;
 69
 70  extern irq_desc_t irq_desc [NR_IRQS];
 -----------------------------------------------------------------------
 
 

Line 61

The value of the status field is determined by setting flags that describe the status of the IRQ line. Table 3.9 shows the flags.

Table 3.9. irq_desc_t->status Field Flags

  • IRQ_INPROGRESS. Indicates that we are in the process of executing the handler for that IRQ line.

  • IRQ_DISABLED. Indicates that the IRQ is disabled by software so that its handler is not executed even if the physical line itself is enabled.

  • IRQ_PENDING. A middle state that indicates that the occurrence of the interrupt has been acknowledged, but the handler has not been executed.

  • IRQ_REPLAY. The previous IRQ has not been acknowledged.

  • IRQ_AUTODETECT. The state the IRQ line is set to when it is being probed.

  • IRQ_WAITING. Used when probing.

  • IRQ_LEVEL. The IRQ is level triggered as opposed to edge triggered.

  • IRQ_MASKED. This flag is unused in the kernel code.

  • IRQ_PER_CPU. Used to indicate that the IRQ line is local to the calling CPU.

Line 62

The handler field is a pointer to the hw_irq_controller. The hw_irq_controller is a typedef for the hw_interrupt_type structure, which is the interrupt controller descriptor used to describe low-level hardware.

Line 63

The action field holds a pointer to the irqaction struct. This structure, described later in more detail, keeps track of the interrupt handler routine to be executed when the IRQ is enabled.

Line 64

The depth field is a counter of nested IRQ disables. The IRQ_DISABLED flag is cleared only when the value of this field is 0.

Lines 65-66

The irq_count field, along with the irqs_unhandled field, identifies IRQs that might be stuck. They are used in x86 and PPC64 in the function note_interrupt() (arch/<arch>/kernel/irq.c).

Line 67

The lock field holds the spinlock to the descriptor.

Struct irqaction

The kernel uses the irqaction struct to keep track of interrupt handlers and the association with the IRQ. Let's look at the structure and the fields we will view in later sections:

 
 -----------------------------------------------------------------------
 include/linux/interrupt.h
 35  struct irqaction {
 36   irqreturn_t (*handler)  (int, void *, struct pt_regs *);
 37   unsigned long flags;
 38   unsigned long mask;
 39   const char *name;
 40   void *dev_id;
 41   struct irqaction *next;
 42  };
 ------------------------------------------------------------------------ 
 
 

Line 36

The field handler is a pointer to the interrupt handler that will be called when the interrupt is encountered.

Line 37

The flags field can hold flags such as SA_INTERRUPT, which indicates the interrupt handler will run with all interrupts disabled, or SA_SHIRQ, which indicates that the handler might share an IRQ line with another handler.

Line 39

The name field holds the name of the interrupt being registered.

Struct hw_interrupt_type

The hw_interrupt_type or hw_irq_controller structure contains all the data related to the system's interrupt controller. First, we look at the structure, and then we look at how it is implemented for a couple of interrupt controllers:

 
 -----------------------------------------------------------------------
 include/linux/irq.h
 40  struct hw_interrupt_type {
 41   const char * typename;
 42   unsigned int (*startup)(unsigned int irq);
 43   void (*shutdown)(unsigned int irq);
 44   void (*enable)(unsigned int irq);
 45   void (*disable)(unsigned int irq);
 46   void (*ack)(unsigned int irq);
 47   void (*end)(unsigned int irq);
 48   void (*set_affinity)(unsigned int irq, cpumask_t dest);
 49  };
 ------------------------------------------------------------------------
 
 

Line 41

The typename holds the name of the Programmable Interrupt Controller (PIC). (PICs are discussed in detail later.)

Lines 42-48

These fields hold pointers to PIC-specific programming functions.

Now, let's look at our PPC controller. In this case, we look at the PowerMac's PIC:

 
 -----------------------------------------------------------------------
 arch/ppc/platforms/pmac_pic.c
 170  struct hw_interrupt_type pmac_pic = {
 171   " PMAC-PIC ",
 172   NULL,
 173   NULL,
 174   pmac_unmask_irq,
 175   pmac_mask_irq,
 176   pmac_mask_and_ack_irq,
 177   pmac_end_irq,
 178   NULL
 179  };
 ------------------------------------------------------------------------
 
 

As you can see, the name of this PIC is PMAC-PIC, and it has four of the seven functions defined. The pmac_unmask_irq and pmac_mask_irq functions enable and disable the IRQ line, respectively. The function pmac_mask_and_ack_irq acknowledges that an IRQ has been received, and pmac_end_irq takes care of cleaning up when we are done executing the interrupt handler.

 
 -----------------------------------------------------------------------
 arch/i386/kernel/i8259.c
 59  static struct hw_interrupt_type i8259A_irq_type = {
 60   "XT-PIC",
 61   startup_8259A_irq,
 62   shutdown_8259A_irq,
 63   enable_8259A_irq,
 64   disable_8259A_irq,
 65   mask_and_ack_8259A,
 66   end_8259A_irq,
 67   NULL
 68  };
 ------------------------------------------------------------------------
 
 

The x86 8259 PIC is called XT-PIC, and it defines the first six functions. The first two, startup_8259A_irq and shutdown_8259A_irq, start up and shut down the actual IRQ line, respectively.

3.8.2.3. An Interrupt Example: System Timer

The system timer is the heartbeat for the operating system. The system timer and its interrupt are initialized during system initialization at boot-up time. The initialization of an interrupt at this time uses interfaces different to those used when an interrupt is registered at runtime. We point out these differences as we go through the example.

As more complex support chips are produced, the kernel designer has gained several options for the source of the system timer. The most common timer implementation for the x86 architecture is the Programmable Interval Timer (PIT) and, for the PowerPC, it is the decrementer.

The x86 architecture has historically implemented the PIT with the Intel 8254 timer. The 8254 is used as a 16-bit down counter, interrupting on terminal count. That is, a value is written to a register and the 8254 decrements this value until it gets to 0. At that moment, it activates an interrupt to the IRQ 0 input on the 8259 interrupt controller, which was previously mentioned in this section.
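
As a rough illustration of the arithmetic involved (the constant and helper below are hypothetical, although the PIT input clock really is approximately 1.19 MHz), the reload value programmed into the 8254 is simply its input clock divided by the desired interrupt rate:

 -----------------------------------------------------------------------
 (illustrative sketch, not kernel source)
 #define PIT_INPUT_CLOCK 1193182UL   /* approximate 8254 input clock, in Hz */

 /* Reload value for a given interrupt rate; about 1193 for 1000 ints/sec. */
 static unsigned int pit_reload_value(unsigned int ints_per_sec)
 {
   return (PIT_INPUT_CLOCK + ints_per_sec / 2) / ints_per_sec;
 }
 -----------------------------------------------------------------------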

The system timer implemented in the PowerPC architecture is the decrementer clock, which is a 32-bit down counter that runs at the same frequency as the CPU. Similar to the 8254, it activates an interrupt at its terminal count. Unlike the Intel architecture, the decrementer is built into the processor.

Every time the system timer counts down and activates an interrupt, it is known as a tick. The rate or frequency of this tick is set by the HZ variable.

HZ

HZ is a variation on the abbreviation for Hertz (Hz), named for Heinrich Hertz (1857-1894). A pioneer in the study of radio waves, Hertz proved Maxwell's theories of electricity and magnetism by inducing a spark in a wire loop; Marconi later built on these experiments on the way to modern radio. In honor of this work, the fundamental unit of frequency is named after him: one cycle per second equals one hertz.

HZ is defined in include/asm-xxx/param.h. Let's take a look at what these values are in our x86 and PPC.

 -----------------------------------------------------------------------
 include/asm-i386/param.h
 005  #ifdef __KERNEL__
 006  #define HZ  1000   /* internal kernel timer frequency */
 -----------------------------------------------------------------------

 -----------------------------------------------------------------------
 include/asm-ppc/param.h
 008  #ifdef __KERNEL__
 009  #define HZ  100   /* internal kernel timer frequency */
 -----------------------------------------------------------------------

The value of HZ has typically been 100 across most architectures, but as machines have become faster, the tick rate has increased on certain models. Looking at the two main architectures we use for this book, we can see (above) that the default tick rate for x86 is now 1000, while the PPC default remains 100. The period of 1 tick is 1/HZ; with HZ set to 1000, the period (or time between interrupts) is 1 millisecond. As the value of HZ goes up, we get more interrupts in a given amount of time. While this yields better resolution from the timekeeping functions, it is important to note that more of the processor time is spent answering the system timer interrupts in the kernel. Taken to an extreme, this could slow the system's response to user-mode programs. As with all interrupt handling, finding the right balance is key.
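
As a small illustration of how HZ converts between real time and ticks, here is a hypothetical pair of helpers (the names are not kernel interfaces) built on the global tick counter jiffies and the standard time_after() macro:

 -----------------------------------------------------------------------
 (illustrative sketch, not kernel source)
 #include <linux/jiffies.h>   /* jiffies, HZ, time_after() */

 /* Convert milliseconds to ticks: 20 ms is 20 ticks at HZ=1000, 2 at HZ=100. */
 static inline unsigned long ms_to_ticks(unsigned int ms)
 {
   return (ms * HZ) / 1000;
 }

 /* The usual timeout idiom: have 'ms' milliseconds passed since 'start'? */
 static int deadline_passed(unsigned long start, unsigned int ms)
 {
   return time_after(jiffies, start + ms_to_ticks(ms));
 }
 -----------------------------------------------------------------------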


We now begin walking through the code with the initialization of the system timer and its associated interrupts. The handler for the system timer is installed near the end of kernel initialization; we pick up the code segments as start_kernel(), the primary initialization function executed at system boot time, first calls trap_init(), then init_IRQ(), and finally time_init():

 
 -----------------------------------------------------------------------
 init/main.c
 386  asmlinkage void __init start_kernel(void)
 387  {
 ...
 413  trap_init();
 ...
 415  init_IRQ();
 ...
 419  time_init();
 ...
   }
 -----------------------------------------------------------------------
 
 

Line 413

The macro trap_init() initializes the exception entries in the Interrupt Descriptor Table (IDT) for the x86 architecture running in protected mode. The IDT is a table set up in memory. The address of the IDT is set in the processor's IDTR register. Each element of the interrupt descriptor table is one of three gates. A gate is an x86 protected mode address that consists of a selector, an offset, and a privilege level. The gate's purpose is to transfer program control. The three types of gates in the IDT are system, where control is transferred to another task; interrupt, where control is passed to an interrupt handler with interrupts disabled; and trap, where control is passed to the interrupt handler with interrupts unchanged.
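
For reference, each gate entry in the IDT is an 8-byte descriptor; the following sketch of its layout uses illustrative field names (the type nibble is what distinguishes the gate kinds):

 -----------------------------------------------------------------------
 (illustrative sketch, not kernel source)
 /* 32-bit x86 IDT gate descriptor: type 0x5 = task gate,
  * 0xE = interrupt gate, 0xF = trap gate. */
 struct idt_gate_sketch {
   unsigned short offset_low;    /* handler address, bits 15..0             */
   unsigned short selector;      /* code segment selector for the handler   */
   unsigned char  reserved;
   unsigned char  type_dpl_p;    /* gate type, privilege level, present bit */
   unsigned short offset_high;   /* handler address, bits 31..16            */
 };
 -----------------------------------------------------------------------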

The PPC is architected to jump to specific addresses, depending on the exception. The function trap_init() is a no-op for the PPC. Later in this section, as we continue to follow the system timer code, we will contrast the PPC interrupt table with the x86 interrupt descriptor table initialized next.

 
 ----------------------------------------------------------------------- 
 arch/i386/kernel/traps.c
 900  void __init trap_init(void)
 901  {
 902  #ifdef CONFIG_EISA
 903   if (isa_readl(0x0FFFD9) == 'E'+('I'<<8)+('S'<<16)+('A'<<24)) {
 904    EISA_bus = 1;
 905   }
 906  #endif
 907
 908  #ifdef CONFIG_X86_LOCAL_APIC
 909   init_apic_mappings();
 910  #endif
 911
 912   set_trap_gate(0,&divide_error);
 913   set_intr_gate(1,&debug);
 914   set_intr_gate(2,&nmi);
 915   set_system_gate(3,&int3);  /* int3-5 can be called from all */
 916   set_system_gate(4,&overflow);
 917   set_system_gate(5,&bounds);
 918   set_trap_gate(6,&invalid_op);
 919   set_trap_gate(7,&device_not_available);
 920   set_task_gate(8,GDT_ENTRY_DOUBLEFAULT_TSS);
 921   set_trap_gate(9,&coprocessor_segment_overrun);
 922   set_trap_gate(10,&invalid_TSS);
 923   set_trap_gate(11,&segment_not_present);
 924   set_trap_gate(12,&stack_segment);
 925   set_trap_gate(13,&general_protection);
 926   set_intr_gate(14,&page_fault);
 927   set_trap_gate(15,&spurious_interrupt_bug);
 928   set_trap_gate(16,&coprocessor_error);
 929   set_trap_gate(17,&alignment_check);
 930  #ifdef CONFIG_X86_MCE
 931   set_trap_gate(18,&machine_check);
 932  #endif
 933   set_trap_gate(19,&simd_coprocessor_error);
 934
 935   set_system_gate(SYSCALL_VECTOR,&system_call) ;
 936
 937   /*
 938   * default LDT is a single-entry callgate to lcall7 for iBCS
 939   * and a callgate to lcall27 for Solaris/x86 binaries
 940   */
 941   set_call_gate(&default_ldt[0],lcall7);
 942   set_call_gate(&default_ldt[4],lcall27);
 943
 944   /*
 945   * Should be a barrier for any external CPU state.
 946   */
 947   cpu_init();
 948
 949   trap_init_hook();
 950  }
 ----------------------------------------------------------------------- 
 
 

Line 902

Look for the EISA signature. isa_readl() is a helper routine that allows reading the EISA bus by mapping I/O with ioremap().

Lines 908-910

If an Advanced Programmable Interrupt Controller (APIC) exists, add its address to the system fixed address map. See include/asm-i386/fixmap.h for "special" system address helper routines; init_apic_mappings() uses set_fixmap_nocache() to set the physical address of the APIC.

Lines 912-935

Initialize the IDT with trap gates, system gates, and interrupt gates.

Lines 941-942

These special intersegment call gates support the Intel Binary Compatibility Standard for running other UNIX binaries on Linux.

Line 947

For the currently executing CPU, initialize its tables and registers.

Line 949

Used to initialize system-specific hardware, such as different kinds of APICs. This is a no-op for most x86 platforms.

Line 415

The call to init_IRQ() initializes the hardware interrupt controller. Both x86 and PPC architectures have several device implementations. For the x86 architecture, we explore the i8259 device. For PPC, we explore the code associated with the Power Mac.

The PPC implementation of init_IRQ() is in arch/ppc/kernel/irq.c. Depending on the particular hardware configuration, init_IRQ() calls one of several routines to initialize the PIC. For a Power Mac configuration, the function pmac_pic_init() in arch/ppc/platforms/pmac_pic.c is called for the G3, G4, and G5 I/O controllers. This is a hardware-specific routine that tries to identify the type of I/O controller and set it up appropriately. In this example, the PIC is part of the I/O controller device. The process for interrupt initialization is similar to x86, with the minor difference being the system timer is not started in the PPC version of init_IRQ(), but rather in the time_init() function, which is covered later in this section.

The x86 architecture has fewer options for the PIC. As previously discussed, the older systems use the cascaded 8259, while the later systems use the IOAPIC architecture. This code explores the APIC with the emulated 8259 type controllers:

 
 ----------------------------------------------------------------------- 
 arch/i386/kernel/i8259.c
 342  void __init init_ISA_irqs (void)
 343  {
 344   int i;
 345  #ifdef CONFIG_X86_LOCAL_APIC
 346   init_bsp_APIC();
 347  #endif
 348   init_8259A(0);
 ...
 351   for (i = 0; i < NR_IRQS; i++) {
 352    irq_desc[i].status = IRQ_DISABLED;
 353    irq_desc[i].action = 0;
 354    irq_desc[i].depth = 1;
 355  
 356    if (i < 16) {
 357     /*
 358      * 16 old-style INTA-cycle interrupts:
 359      */
 360     irq_desc[i].handler = &i8259A_irq_type;
 361    } else {
 362     /*
 363      * 'high' PCI IRQs filled in on demand
 364      */
 365     irq_desc[i].handler = &no_irq_type;
 366    }
 367   }
 368  }
 ...
 409
 410  void __init init_IRQ(void)
 411  {
 412   int i;
 413
 414   /* all the set up before the call gates are initialized */
 415   pre_intr_init_hook();
 ...
 422  for (i = 0; i < NR_IRQS; i++) {
 423   int vector = FIRST_EXTERNAL_VECTOR + i;
 424   if (vector != SYSCALL_VECTOR) 
 425    set_intr_gate(vector, interrupt[i]) ;
 426  }
 ...
 431  intr_init_hook();
 ...
 437  setup_timer();
 ...
 }
 ----------------------------------------------------------------------- 
 
 

Line 410

This is the function entry point called from start_kernel(), which is the primary kernel initialization function called at system startup.

Lines 342-348

If the local APIC is available and desired, initialize it and put it in virtual wire mode for use with the 8259. Then, initialize the 8259 device using register I/O in init_8259A(0).

Lines 422-426

On line 424, syscalls are not included in this loop because they were already installed earlier in trap_init(). Linux uses an Intel interrupt gate (kernel-initiated code) as the descriptor for interrupts. This is set with the set_intr_gate() macro (on line 425). Exceptions use the Intel system gate and trap gate, set by set_system_gate() and set_trap_gate(), respectively. These macros can be found in arch/i386/kernel/traps.c.

Line 431

Set up interrupt handlers for the local APIC (if used) and call setup_irq() in irq.c for the cascaded 8259.

Line 437

Start the 8253 PIT using register I/O.

Line 419

Now, we follow time_init() to install the system timer interrupt handler for both PPC and x86. In PPC, time_init() (abbreviated here for the sake of this discussion) initializes the decrementer:

 
 ----------------------------------------------------------------------- 
 arch/ppc/kernel/time.c
 void __init time_init(void)
 {
 ...
 317   ppc_md.calibrate_decr();
 ...
 351   set_dec(tb_ticks_per_jiffy);
 ...
 }
 ----------------------------------------------------------------------- 
 
 

Line 317

Calculate the proper decrementer count for the system HZ value.

Line 351

Set the decrementer to the proper starting count.
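
Conceptually, the count passed to set_dec() is just the decrementer (timebase) frequency divided by HZ; a hypothetical helper makes the arithmetic explicit:

 -----------------------------------------------------------------------
 (illustrative sketch, not kernel source)
 /* One interrupt every 1/hz seconds from a counter that ticks at
  * timebase_freq Hz: for example, 33,000,000 / 100 = 330,000 ticks per jiffy. */
 static unsigned long ticks_per_jiffy(unsigned long timebase_freq, unsigned int hz)
 {
   return timebase_freq / hz;
 }
 -----------------------------------------------------------------------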

The interrupt architecture of the PowerPC and its Linux implementation does not require installing the timer interrupt handler at runtime. The decrementer interrupt vector comes in at 0x900; the handler call is hard coded at this location, and it is not shared:

 
 ----------------------------------------------------------------------- 
 arch/ppc/kernel/head.S
   /* Decrementer */
 479  EXCEPTION(0x900, Decrementer, timer_interrupt, EXC_XFER_LITE)
 -----------------------------------------------------------------------
 
 

More detail on the EXCEPTION macro for the decrementer is given later in this section. The handler for the decrementer is now ready to be executed when the terminal count is reached.

The following code snippets outline the x86 system timer initialization:

 
 ----------------------------------------------------------------------- 
 arch/i386/kernel/time.c
 void __init time_init(void)
 
 {
 ...
 340   time_init_hook();
 }
 -----------------------------------------------------------------------
 
 

The function time_init() flows down to time_init_hook(), which is located in the machine-specific setup file setup.c:

 
 ----------------------------------------------------------------------- 
 arch/i386/machine-default/setup.c
 072  static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT, 0, "timer", NULL, NULL};
 ...
 081  void __init time_init_hook(void)
 082  {
 083   setup_irq(0, &irq0);
 084  }
 -----------------------------------------------------------------------
 
 

Line 72

We initialize the irqaction struct that corresponds to irq0.

Lines 81-84

The function call setup_irq(0, &irq0) puts the irqaction struct containing the handler timer_interrupt() on the queue of shared interrupts associated with irq0.

This code segment has a similar effect to calling request_irq() for the general-case handlers (those not loaded at kernel initialization time). The initialization code for the timer interrupt took a shortcut to get the handler into irq_desc[]. Runtime code uses disable_irq(), enable_irq(), request_irq(), and free_irq() in irq.c. All these routines are utilities for working with IRQs, and each touches an irq_desc struct at some point.
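
A sketch of that runtime path follows (the IRQ number, device name, and handler are hypothetical); request_irq() builds an irqaction and chains it onto the matching irq_desc entry, and free_irq() unlinks it again:

 -----------------------------------------------------------------------
 (illustrative sketch, not kernel source)
 #include <linux/interrupt.h>
 #include <linux/signal.h>   /* SA_SHIRQ */
 #include <linux/errno.h>

 #define MY_DEV_IRQ 9        /* hypothetical IRQ line */

 static int my_dev;          /* token passed as dev_id so free_irq() can find us */

 static irqreturn_t my_handler(int irq, void *dev_id, struct pt_regs *regs)
 {
   /* ... service the device ... */
   return IRQ_HANDLED;
 }

 static int my_driver_init(void)
 {
   /* SA_SHIRQ marks the line as shareable with other handlers */
   if (request_irq(MY_DEV_IRQ, my_handler, SA_SHIRQ, "my_device", &my_dev))
     return -EBUSY;
   return 0;
 }

 static void my_driver_exit(void)
 {
   free_irq(MY_DEV_IRQ, &my_dev);
 }
 -----------------------------------------------------------------------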

Interrupt Time

For PowerPC, the decrementer is internal to the processor and has its own interrupt vector at 0x900. This contrasts with the x86 architecture, where the PIT is an external interrupt coming in from the interrupt controller. The PowerPC external controller comes in on vector 0x500. A similar situation would arise in the x86 architecture if the system timer were based on the local APIC.

Tables 3.10 and 3.11 describe the interrupt vector tables of the x86 and PPC architectures, respectively.

Table 3.10. x86 Interrupt Vector Table

Vector Number/IRQ   Description

0        Divide error
1        Debug extension
2        NMI interrupt
3        Breakpoint
4        INTO-detected overflow
5        BOUND range exceeded
6        Invalid opcode
7        Device not available
8        Double fault
9        Coprocessor segment overrun (reserved)
10       Invalid task state segment
11       Segment not present
12       Stack fault
13       General protection
14       Page fault
15       (Intel reserved. Do not use.)
16       Floating point error
17       Alignment check
18       Machine check
19-31    (Intel reserved. Do not use.)
32-255   Maskable interrupts


Table 3.11. PPC Offset of Interrupt Vector

Offset (Hex)   Interrupt Type

00000    Reserved
00100    System reset
00200    Machine check
00300    Data storage
00400    Instruction storage
00500    External
00600    Alignment
00700    Program
00800    Floating point unavailable
00900    Decrementer
00A00    Reserved
00B00    Reserved
00C00    System call
00D00    Trace
00E00    Floating point assist
00E10    Reserved
00FFF    Reserved
01000    Reserved, implementation specific
02FFF    (End of interrupt vector locations)


Note the similarities between the two architectures. These tables represent the hardware. The software interface to the Intel exception interrupt vector table is the Interrupt Descriptor Table (IDT) that was previously mentioned in this chapter.

As we proceed, we can see how the Intel architecture handles a hardware interrupt: by way of an IRQ, to a jump table in entry.S, to a call gate (descriptor), and finally to the handler code. Figure 3.19 illustrates this.

Figure 3.19. x86 Interrupt Flow


PowerPC, on the other hand, vectors to specific offsets in memory where the code to jump to the appropriate handler is located. As we see next, the PPC jump table in head.S is effectively indexed by its fixed position in memory. Figure 3.20 illustrates this.

Figure 3.20. PPC Interrupt Flow


This should become clearer as we now explore the PPC external (offset 0x500) and timer (offset 0x900) interrupt handlers.

Processing the PowerPC External Interrupt Vector

As previously discussed, the processor jumps to address 0x500 in the event of an external interrupt. Looking further at the EXCEPTION() macro in the file head.S, we can see that the following line of code is linked and loaded so that it is mapped to this memory region at offset 0x500. This architected jump table has the same effect as the x86 IDT:

 
 -----------------------------------------------------------------------
 arch/ppc/kernel/head.S
 453   /* External interrupt */
 454  EXCEPTION(0x500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE)
 -----------------------------------------------------------------------
 
 
The third parameter, do_IRQ(), is called next. Let's take a look at this function:
 
 
 -----------------------------------------------------------------------
 arch/ppc/kernel/irq.c
 510  void do_IRQ(struct pt_regs *regs)
 511  {
 512  int irq, first = 1;
 513  irq_enter();
 ...
 523  while ((irq = ppc_md.get_irq(regs)) >= 0) {
 524   ppc_irq_dispatch_handler(regs, irq);
 525   first = 0;
 526  }
 527  if (irq != -2 && first)
 528   /* That's not SMP safe ... but who cares ? */
 529   ppc_spurious_interrupts++;
 530  irq_exit();
 531  }
 -----------------------------------------------------------------------
 
 

Lines 513 and 530

Indicate to the preemption code that we are in a hardware interrupt.

Line 523

Read a pending interrupt from the interrupt controller and convert it to an IRQ number; this loops until all pending interrupts are handled.

Line 524

The ppc_irq_dispatch_handler() handles the interrupt. We look at this function in more detail next.

The function ppc_irq_dispatch_handler() is nearly identical to the x86 function do_IRQ():

 
 -----------------------------------------------------------------------
 arch/ppc/kernel/irq.c
 428  void ppc_irq_dispatch_handler(struct pt_regs *regs, int irq)
 429  {
 430  int status;
 431  struct irqaction *action;
 432  irq_desc_t *desc = irq_desc + irq;
 433
 434  kstat_this_cpu.irqs[irq]++;
 435  spin_lock(&desc->lock);
 436  ack_irq(irq);
 ...
 441  status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
 442  if (!(status & IRQ_PER_CPU))
 443   status |= IRQ_PENDING; /* we _want_ to handle it */
 ...
 449   action = NULL;
 450   if (likely(!(status & (IRQ_DISABLED | IRQ_INPROGRESS)))) {
 451     action = desc->action;
 452    if (!action || !action->handler) {
 453     ppc_spurious_interrupts++;
 454     printk(KERN_DEBUG "Unhandled interrupt %x, disabled\n",irq);
 455      /* We can't call disable_irq here, it would deadlock */
 456      ++desc->depth;
 457      desc->status |= IRQ_DISABLED;
 458      mask_irq(irq);
 459      /* This is a real interrupt, we have to eoi it,
 460       so we jump to out */
 461      goto out;
 462      }
 463      status &= ~IRQ_PENDING; /* we commit to handling */
 464      if (!(status & IRQ_PER_CPU))
 465      status |= IRQ_INPROGRESS; /* we are handling it */
 466    }
 467   desc->status = status;
 ...
 489  for (;;) {
 490   spin_unlock(&desc->lock);
 491   handle_irq_event(irq, regs, action);
 492   spin_lock(&desc->lock);
 493
 494   if (likely(!(desc->status & IRQ_PENDING)))
 495    break;
 496   desc->status &= ~IRQ_PENDING;
 497  }
 498 out:
 499  desc->status &= ~IRQ_INPROGRESS;
 ...
 511  }  
 -----------------------------------------------------------------------
 
 

Line 432

Get the IRQ from parameters and gain access to the appropriate irq_desc.

Line 435

Acquire the spinlock on the IRQ descriptor in case of concurrent accesses to the same interrupt by different CPUs.

Line 436

Send an acknowledgment to the hardware. The hardware then reacts accordingly, preventing further interrupts of this type from being processed until this one is finished.

Lines 441-443

The flags IRQ_REPLAY and IRQ_WAITING are cleared. In this case, IRQ_REPLAY indicates that the IRQ was dropped earlier and is being resent. IRQ_WAITING indicates that the IRQ is being tested. (Both cases are outside the scope of this discussion.) In a uniprocessor system, the IRQ_PENDING flag is set, which indicates that we commit to handling the interrupt.

Line 450

This block of code checks for conditions under which we would not process the interrupt. If IRQ_DISABLED or IRQ_INPROGRESS is set, we can skip over this block of code. The IRQ_DISABLED flag is set when we do not want the system to respond to a particular IRQ line being serviced. IRQ_INPROGRESS indicates that an interrupt is already being serviced by a processor; this matters when a second processor in a multiprocessor system tries to raise the same interrupt.

Lines 451-462

Here, we check to see if the handler exists. If it does not, we break out and jump to the "out" label in line 498.

Lines 463-465

At this point, we have cleared all the conditions for not servicing the interrupt, so we are committing to doing so. The flag IRQ_INPROGRESS is set and the IRQ_PENDING flag is cleared, which indicates that the interrupt is now being handled.

Lines 489-497

The interrupt is serviced. Before an interrupt is serviced, the spinlock on the interrupt descriptor is released. After the spinlock is released, the routine handle_irq_event() is called. This routine executes the interrupt's handler. Once done, the spinlock on the descriptor is acquired once more. If the IRQ_PENDING flag has not been set (by another CPU) during the course of the IRQ handling, break out of the loop. Otherwise, service the interrupt again.

Processing the PowerPC System Timer Interrupt

As noted earlier, the decrementer handler is hard coded at offset 0x900. We can assume the terminal count has been reached, and the handler timer_interrupt() in arch/ppc/kernel/time.c is called at this time:

 
 -----------------------------------------------------------------------
 arch/ppc/kernel/head.S
   /* Decrementer */
 479  EXCEPTION(0x900, Decrementer, timer_interrupt, EXC_XFER_LITE)
 -----------------------------------------------------------------------
 
 

Here is the timer_interrupt() function.

 
 -----------------------------------------------------------------------
 arch/ppc/kernel/time.c
 145  void timer_interrupt(struct pt_regs * regs)
 146  {
 ...
 152   if (atomic_read(&ppc_n_lost_interrupts) != 0)
 153    do_IRQ(regs);
 154
 155   irq_enter();
 ...
 159    if (!user_mode(regs))
 160     ppc_do_profile(instruction_pointer(regs));
 ...
 165   write_seqlock(&xtime_lock);
 166    
 167    do_timer(regs);
 ...   
 189  if (ppc_md.set_rtc_time(xtime.tv_sec+1 + time_offset) == 0) 
 ...   
 195   write_sequnlock(&xtime_lock);
 ...
 198    set_dec(next_dec);
 ...
 208   irq_exit();
 209  }
 -----------------------------------------------------------------------
 
 

Line 152

If an external interrupt was lost, go back and call the external interrupt handler, do_IRQ() (normally entered through the external vector at 0x500).

Line 159

Do kernel profiling for debugging kernel routines (only when the interrupt did not occur in user mode).

Lines 165 and 195

Take the write lock on the xtime_lock seqlock around this block of code.

Line 167

This calls do_timer(), the same function used in the x86 timer interrupt (coming up next).

Line 189

Update the RTC.

Line 198

Restart the decrementer for the next interrupt.

Line 208

Return from the interrupt.

The interrupted code now runs as normal until the next interrupt.

Processing the x86 System Timer Interrupt

Upon activation of an interrupt (in our example, the PIT has counted down to 0 and activated IRQ0), the interrupt controller activates an interrupt line going into the processor. The assembly code in entry.S has an entry point that corresponds to each descriptor in the IDT. IRQ0 is the first external interrupt and is vector 32 in the IDT. The code is then ready to jump to entry point 32 in the jump table in entry.S:

 
 -----------------------------------------------------------------------
 arch/i386/kernel/entry.S
 385  vector=0
 386  ENTRY(irq_entries_start)
 387  .rept NR_IRQS
 388   ALIGN
 389  1:  pushl $vector-256
 390   jmp common_interrupt
 391  .data
 392  .long 1b
 393  .text
 394  vector=vector+1
 395  .endr
 396
 397   ALIGN
 398  common_interrupt:
 399   SAVE_ALL
 400   call do_IRQ
 401   jmp ret_from_intr
 -----------------------------------------------------------------------
 
 

This code is a fine piece of assembler magic. The repeat construct .rept (on line 387) and its closing statement .endr (on line 395) create the interrupt jump table at compile time. Notice that as this block of code is repeatedly assembled, the vector number pushed at line 389 is incremented on each repetition (it is pushed as the negative value $vector-256). By pushing the vector, the kernel code knows which IRQ it is working with at interrupt time.

When we left off the code trace for x86, the code jumps to the proper entry point in the jump table and saves the IRQ on the stack. The code then jumps to the common handler at line 398 and calls do_IRQ() (arch/i386/kernel/irq.c) at line 400. This function is almost identical to ppc_irq_dispatch_handler(), which was described in the section "Processing the PowerPC External Interrupt Vector," so we will not repeat it here.

Based on the incoming IRQ, the function do_IRQ() accesses the proper element of irq_desc and jumps to each handler in the chain of action structures. Here, we have finally made it to the actual handler function for the PIT: timer_interrupt(). See the following code segments from time.c. Maintaining the same order as in the source file, the handler starts at line 274:

 
 -----------------------------------------------------------------------
 arch/i386/kernel/time.c
 274 irqreturn_t timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
 275  {
 ...
 287   do_timer_interrupt(irq, NULL, regs);
 ...
 290   return IRQ_HANDLED;
 291  }  
 -----------------------------------------------------------------------
 
 

Line 274

This is the entry point for the system timer interrupt handler.

Line 287

This is the call to do_timer_interrupt().

 
 -----------------------------------------------------------------------
 arch/i386/kernel/time.c
 208  static inline void do_timer_interrupt(int irq, void *dev_id,
 209       struct pt_regs *regs)
 210  {
    ...
 227   do_timer_interrupt_hook(regs); 
  ...
 250  }
 ------------------------------------------------------------
 
 

Line 227

Call to do_timer_interrupt_hook(). This function is essentially a wrapper around the call to do_timer(). Let's look at it:

 
 -----------------------------------------------------------------------
 include/asm-i386/mach-default/do_timer.h
 016  static inline void do_timer_interrupt_hook(struct pt_regs   *regs)
 017  {
 018   do_timer(regs);
 ...
 025   x86_do_profile(regs);
 ...
 030  }
 ------------------------------------------------------------------
 
 

Line 18

This is where the call to do_timer() gets made. This function performs the bulk of the work for updating the system time.
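
As a rough, standalone illustration (not the kernel/timer.c source), the per-tick bookkeeping that do_timer() drives amounts to advancing the global tick counter, folding 1/HZ seconds into the wall-clock time, and charging the tick to the interrupted task:

 -----------------------------------------------------------------------
 (conceptual sketch, not kernel source)
 struct tick_state {
   unsigned long long jiffies64;    /* total ticks since boot        */
   unsigned long      xtime_nsec;   /* wall-clock nanoseconds        */
   unsigned long      task_ticks;   /* CPU time charged to 'current' */
 };

 static void conceptual_tick(struct tick_state *s, unsigned int hz)
 {
   s->jiffies64++;
   s->xtime_nsec += 1000000000UL / hz;
   s->task_ticks++;
 }
 -----------------------------------------------------------------------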

Line 25

The x86_do_profile() routine looks at the eip register to see what code was running before the interrupt. Over time, this data shows which code the processor spends its time executing.

At this point, the system timer interrupt returns from do_IRQ() to entry.S for system housekeeping, and the interrupted thread resumes.

As previously discussed, the system timer is the heartbeat of the Linux operating system. Although we have used the timer as an example for interrupts in this chapter, its use is prevalent throughout the entire operating system.


Project: current System Variable

This chapter explored the task_struct and current, which is the system variable that points to the currently executing task's task_struct. This project's goal is to reinforce the idea that the kernel is an orderly yet ever-changing series of linked structures that are created and destroyed as programs run. As we have seen, the task_struct structure is one of the most important structures the kernel owns, in that it has all the information needed for any given task. This project module accesses that structure just as the kernel does and serves as a base for future explorations for the reader.

In this project, we access the current task_struct and print various elements from that structure. Starting with the file include/linux/sched.h, we find the task_struct definition. Using sched.h, we reference current->pid and current->comm, the current process ID and name, and follow these structures to the parent's pid and comm. Next, we borrow a routine from printk.c and send a message back to the current tty terminal we are using.

See the following code:

NOTE

From running the first program (hellomod), what do you think the name of the current process will be when the initialization routine prints out current->comm? What will be the name of the parent process? See the following code discussion.


Project Source Code[8]

[8] You can use the project source as a starting point to explore the running kernel. Although many tools exist for viewing kernel internals (for example, strace), building your own tools, such as this project, sheds light on the real-time aspects of the Linux kernel.

 -----------------------------------------------------------------------
 currentptr.c
 001  #include <linux/module.h>
 002  #include <linux/kernel.h>
 003  #include <linux/init.h>
 004  #include <linux/sched.h>
 005  #include <linux/tty.h>
 006
 007  void tty_write_message1(struct tty_struct *, char *);
 008
 009  static int my_init( void )
 010  {
 011
 012   char *msg="Hello tty!";
 013
 014   printk("Hello, from the kernel...\n");
 015   printk("parent pid =%d(%s)\n", current->parent->pid, current->parent->comm);
 016   printk("current pid =%d(%s)\n", current->pid, current->comm);
 017
 018   tty_write_message1(current->signal->tty, msg);
 019   return 0;
 020  }
 021
 022  static void my_cleanup( void )
 023  {
 024   printk("Goodbye, from the kernel...\n");
 025  }
 026
 027  module_init(my_init);
 028  module_exit(my_cleanup);
 029
 030  // This routine was borrowed from <printk.c>
 031
 032  void tty_write_message1(struct tty_struct *tty, char *msg)
 033  {
 034   if (tty && tty->driver->write)
 035    tty->driver->write(tty, 0, msg, strlen(msg));
 036   return;
 037  }
 -----------------------------------------------------------------------

Line 4

sched.h contains struct task_struct {}, which is where we reference the process ID (->pid) and the name of the current task (->comm), as well as the pointer to the parent task structure (->parent), through which we reach the parent's PID and name. We also find a pointer to the signal structure, which contains a reference to the tty structure (see lines 18-22).

Line 5

tty.h contains struct tty_struct {}, which is used by the routine we borrowed from printk.c (see lines 32-37).

Line 12

This is the simple message string we want to send back to our terminal.

Line 15

Here, we reference the parent PID and its name from our current task structure. The answer to the previous question is that the parent of our task is the current shell program; in this case, it was Bash.

Line 16

Here, we reference the current PID and its name from our current task structure. To answer the other half of the previous question, we entered insmod on the Bash command line and that is printed out as the current process.

Line 18

This is a function borrowed from kernel/printk.c. It is used to redirect messages to a specific tty. To illustrate our current point, we pass this routine the tty_struct of the tty (window or command line) from which we invoke our program. This is referenced by way of current->signal->tty. The msg parameter is the string we declared on line 12.

Lines 32-37

The tty write function checks that the tty exists and then calls the appropriate device driver with the message.

Running the Code

Compile the module and load it with insmod, as in the first project.
