8.5. The Beginning: start_kernel()
This discussion begins with the jump to the start_kernel() (init/main.c) function, the first architecture-independent part of the code to be called.
With the jump to start_kernel(), we execute Process 0, which is otherwise known as the root thread. Process 0 spawns off Process 1, known as the init process. Process 0 then becomes the idle thread for the CPU. When /sbin/init is called, we have only those two processes running:
----------------------------------------------------------------------
init/main.c
396 asmlinkage void __init start_kernel(void)
397 {
398 char * command_line;
399 extern char saved_command_line[];
400 extern struct kernel_param __start___param[], __stop___param[];
...
405 lock_kernel();
406 page_address_init();
407 printk(linux_banner);
408 setup_arch(&command_line);
409 setup_per_cpu_areas();
...
415 smp_prepare_boot_cpu();
...
422 sched_init();
423
424 build_all_zonelists();
425 page_alloc_init();
426 printk("Kernel command line: %s\n", saved_command_line);
427 parse_args("Booting kernel", command_line, __start___param,
428 __stop___param - __start___param,
429 &unknown_bootoption);
430 sort_main_extable();
431 trap_init();
432 rcu_init();
433 init_IRQ();
434 pidhash_init();
435 init_timers();
436 softirq_init();
437 time_init();
...
444 console_init();
445 if (panic_later)
446 panic(panic_later, panic_param);
447 profile_init();
448 local_irq_enable();
449 #ifdef CONFIG_BLK_DEV_INITRD
450 if (initrd_start && !initrd_below_start_ok &&
451 initrd_start < min_low_pfn << PAGE_SHIFT) {
452 printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
453 "disabling it.\n",initrd_start,min_low_pfn << PAGE_SHIFT);
454 initrd_start = 0;
455 }
456 #endif
457 mem_init();
458 kmem_cache_init();
459 if (late_time_init)
460 late_time_init();
461 calibrate_delay();
462 pidmap_init();
463 pgtable_cache_init();
464 prio_tree_init();
465 anon_vma_init();
466 #ifdef CONFIG_X86
467 if (efi_enabled)
468 efi_enter_virtual_mode();
469 #endif
470 fork_init(num_physpages);
471 proc_caches_init();
472 buffer_init();
473 unnamed_dev_init();
474 security_scaffolding_startup();
475 vfs_caches_init(num_physpages);
476 radix_tree_init();
477 signals_init();
478 /* rootfs populating might need page-writeback */
479 page_writeback_init();
480 #ifdef CONFIG_PROC_FS
481 proc_root_init();
482 #endif
483 check_bugs();
...
490 init_idle(current, smp_processor_id());
...
493 rest_init();
494 }
-----------------------------------------------------------------------
8.5.1. The Call to lock_kernel()
Line 405
In the 2.6 Linux kernel, the default configuration is to have a preemptible kernel. A preemptible kernel means that the kernel itself can be interrupted by a higher priority task, such as a hardware interrupt, and control is passed to the higher priority task. The kernel must save enough state so that it can return to executing when the higher priority task finishes.
Early versions of Linux implemented kernel preemption and SMP locking by using the Big Kernel Lock (BKL). Later versions of Linux correctly abstracted preemption into various calls, such as preempt_disable(). The BKL is still with us in the initialization process. It is a recursive spinlock that can be taken several times by a given CPU. A side effect of using the BKL is that it disables preemption, which is an important side effect during initialization.
Locking the kernel prevents it from being interrupted or preempted by any other task. Linux uses the BKL to do this. When the kernel is locked, no other process can execute. This is the antithesis of a preemptible kernel that can be interrupted at any point. In the 2.6 Linux kernel, we use the BKL to lock the kernel upon startup and initialize the various kernel objects without fear of being interrupted. The kernel is unlocked on line 493 within the rest_init() function. Thus, all of start_kernel() occurs with the kernel locked. Let's look at what happens in lock_kernel():
----------------------------------------------------------------------
include/linux/smp_lock.h
42 static inline void lock_kernel(void)
43 {
44 int depth = current->lock_depth+1;
45 if (likely(!depth))
46 get_kernel_lock();
47 current->lock_depth = depth;
48 }
-----------------------------------------------------------------------
Lines 44-48
The init task has a special lock_depth of -1. This ensures that in multi-processor systems, different CPUs do not attempt to simultaneously grab the kernel lock. Because only one CPU runs the init task, only it can grab the BKL: depth is 0 only on the first, outermost acquisition (otherwise, depth is greater than 0 and the spinlock is not taken again). A similar trick is used in unlock_kernel(), where we test (--current->lock_depth < 0).
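To make the recursion visible, here is a minimal sketch of the matching unlock path, consistent with the test just mentioned (the real definition sits next to lock_kernel() in include/linux/smp_lock.h):
----------------------------------------------------------------------
sketch
static inline void unlock_kernel(void)
{
	/* Only the outermost unlock (depth dropping back to -1)
	 * releases the underlying spinlock. */
	if (likely(--current->lock_depth < 0))
		put_kernel_lock();
}
-----------------------------------------------------------------------
Let's see what happens in get_kernel_lock():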
----------------------------------------------------------------------
include/linux/smp_lock.h
10 extern spinlock_t kernel_flag;
11
12 #define kernel_locked() (current->lock_depth >= 0)
13
14 #define get_kernel_lock() spin_lock(&kernel_flag)
15 #define put_kernel_lock() spin_unlock(&kernel_flag)
...
59 #define lock_kernel() do { } while(0)
60 #define unlock_kernel() do { } while(0)
61 #define release_kernel_lock(task) do { } while(0)
62 #define reacquire_kernel_lock(task) do { } while(0)
63 #define kernel_locked() 1
-----------------------------------------------------------------------
Lines 10-15
These macros describe the big kernel lock, which uses standard spinlock routines. In multiprocessor systems, it is possible that two CPUs might try to access the same data structure. Spinlocks, which are explained in Chapter 7, prevent this kind of contention.
Lines 59-63
In the case where the kernel is not preemptible and not operating over multiple CPUs, we simply do nothing for lock_kernel() because nothing can interrupt us anyway.
The kernel has now seized the BKL and will not let go of it until the end of start_kernel(); as a result, all the following commands cannot be preempted.
8.5.2. The Call to page_address_init()
Line 406
The call to page_address_init() is the first function we encounter that is involved with the initialization of the memory subsystem in this architecture-independent portion of the code. The definition of page_address_init() varies according to three different compile-time parameter definitions. The first two result in page_address_init() being stubbed out to do nothing by defining the body of the function to be do { } while (0), as shown in the following code. The third is the operation we explore here in more detail. Let's look at the different definitions and discuss when they are enabled:
----------------------------------------------------------------------
include/linux/mm.h
376 #if defined(WANT_PAGE_VIRTUAL)
382 #define page_address_init() do { } while(0)
385 #if defined(HASHED_PAGE_VIRTUAL)
388 void page_address_init(void);
391 #if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL)
394 #define page_address_init() do { } while(0)
----------------------------------------------------------------------
The #define for WANT_PAGE_VIRTUAL is set when the system has direct memory mapping, in which case simply calculating the virtual address of the memory location is sufficient to access the memory location. In cases where all of RAM is not mapped into the kernel address space (as is often the case when highmem is configured), we need a more involved way to acquire the memory address. This is why the initialization of page addressing is defined only in the case where HASHED_PAGE_VIRTUAL is set.
We now look at the case where the kernel has been told to use HASHED_PAGE_VIRTUAL and where we need to initialize the virtual memory that the kernel is using. Keep in mind that this happens only if highmem has been configured; that is, the amount of RAM the kernel can access is larger than that mapped by the kernel address space (generally 4GB).
In the process of following the function definition, various kernel objects are introduced and revisited. Table 8.2 shows the kernel objects introduced during the process of exploring page_address_init().
Table 8.2. Objects Introduced During the Call to page_address_init()

Object Name         | Description
--------------------|----------------
page_address_map    | Struct
page_address_slot   | Struct
page_address_pool   | Global variable
page_address_maps   | Global variable
page_address_htable | Global variable
----------------------------------------------------------------------
mm/highmem.c
510 static struct page_address_slot {
511 struct list_head lh;
512 spinlock_t lock;
513 } ____cacheline_aligned_in_smp page_address_htable[1<<PA_HASH_ORDER];
...
591 static struct page_address_map page_address_maps[LAST_PKMAP];
592
593 void __init page_address_init(void)
594 {
595 int i;
596
597 INIT_LIST_HEAD(&page_address_pool);
598 for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)
599 list_add(&page_address_maps[i].list, &page_address_pool);
600 for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
601 INIT_LIST_HEAD(&page_address_htable[i].lh);
602 spin_lock_init(&page_address_htable[i].lock);
603 }
604 spin_lock_init(&pool_lock);
605 }
----------------------------------------------------------------------
Line 597
The main purpose of this line is to initialize the page_address_pool global variable, which is a list_head that heads a list of free entries allocated from page_address_maps (line 591). Figure 8.11 illustrates page_address_pool.
Lines 598-599
We add each entry of page_address_maps to the doubly linked list headed by page_address_pool. We describe the page_address_map structure in detail next.
Lines 600-603
We initialize each page address hash table's list_head and spinlock. The page_address_htable variable holds the list of entries that hash to the same bucket. Figure 8.12 illustrates the page address hash table.
Line 604
We initialize the page_address_pool's spinlock.
Let's look at the page_address_map structure to better understand the lists we just saw initialized. This structure's main purpose is to maintain the association between a page and its virtual address. Tracking this association explicitly would be wasteful if pages mapped linearly to virtual addresses; it becomes necessary only when the addressing is hashed:
----------------------------------------------------------------------
mm/highmem.c
490 struct page_address_map {
491 struct page *page;
492 void *virtual;
493 struct list_head list;
494 };
-----------------------------------------------------------------------
As you can see, the object keeps a pointer to the page structure associated with this page, a pointer to the virtual address, and a list_head struct to maintain its position in whichever doubly linked page address list it is on.
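To see how these structures cooperate, here is a simplified sketch of a hashed lookup in the style of page_address() in mm/highmem.c: hash the page pointer to pick a bucket in page_address_htable, and then walk that bucket's list looking for the matching page_address_map (irq-safe locking and the lowmem fast path are trimmed):
----------------------------------------------------------------------
sketch based on mm/highmem.c
static struct page_address_slot *page_slot(struct page *page)
{
	return &page_address_htable[hash_ptr(page, PA_HASH_ORDER)];
}

void *page_address_sketch(struct page *page)
{
	struct page_address_slot *pas = page_slot(page);
	struct page_address_map *pam;
	void *ret = NULL;

	spin_lock(&pas->lock);
	list_for_each_entry(pam, &pas->lh, list) {
		if (pam->page == page) {
			ret = pam->virtual;	/* cached virtual address */
			break;
		}
	}
	spin_unlock(&pas->lock);
	return ret;	/* NULL if the page is not currently mapped */
}
-----------------------------------------------------------------------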
8.5.3. The Call to printk(linux_banner)
Line 407
This call is responsible for the first console output made by the Linux kernel. This introduces the global variable linux_banner:
----------------------------------------------------------------------
init/version.c
31 const char *linux_banner =
32 "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
-----------------------------------------------------------------------
The version.c file defines linux_banner as just shown. This string gives the user a record of the kernel version, who compiled it and on what host, the gcc version used, and the build timestamp.
8.5.4. The Call to setup_arch
Line 408
The setup_arch() function in arch/i386/kernel/setup.c is marked with the __init attribute (refer to Chapter 2 for a description of __init), so it runs only once, at system initialization time. The setup_arch() function takes in a pointer to any Linux command-line data entered at boot time and initializes many of the architecture-specific subsystems, such as memory, I/O, processors, and consoles:
----------------------------------------------------------------------
arch/i386/kernel/setup.c
1083 void __init setup_arch(char **cmdline_p)
1084 {
1085 unsigned long max_low_pfn;
1086
1087 memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
1088 pre_setup_arch_hook();
1089 early_cpu_init();
1090
1091 /*
1092 * FIXME: This isn't an official loader_type right
1093 * now but does currently work with elilo.
1094 * If we were configured as an EFI kernel, check to make
1095 * sure that we were loaded correctly from elilo and that
1096 * the system table is valid. If not, then initialize normally.
1097 */
1098 #ifdef CONFIG_EFI
1099 if ((LOADER_TYPE == 0x50) && EFI_SYSTAB)
1100 efi_enabled = 1;
1101 #endif
1102
1103 ROOT_DEV = old_decode_dev(ORIG_ROOT_DEV);
1104 drive_info = DRIVE_INFO;
1105 screen_info = SCREEN_INFO;
1106 edid_info = EDID_INFO;
1107 apm_info.bios = APM_BIOS_INFO;
1108 ist_info = IST_INFO;
1109 saved_videomode = VIDEO_MODE;
1110 if( SYS_DESC_TABLE.length != 0 ) {
1111 MCA_bus = SYS_DESC_TABLE.table[3] &0x2;
1112 machine_id = SYS_DESC_TABLE.table[0];
1113 machine_submodel_id = SYS_DESC_TABLE.table[1];
1114 BIOS_revision = SYS_DESC_TABLE.table[2];
1115 }
1116 aux_device_present = AUX_DEVICE_INFO;
1117
1118 #ifdef CONFIG_BLK_DEV_RAM
1119 rd_image_start = RAMDISK_FLAGS & RAMDISK_IMAGE_START_MASK;
1120 rd_prompt = ((RAMDISK_FLAGS & RAMDISK_PROMPT_FLAG) != 0);
1121 rd_doload = ((RAMDISK_FLAGS & RAMDISK_LOAD_FLAG) != 0);
1122 #endif
1123 ARCH_SETUP
1124 if (efi_enabled)
1125 efi_init();
1126 else
1127 setup_memory_region();
1128
1129 copy_edd();
1130
1131 if (!MOUNT_ROOT_RDONLY)
1132 root_mountflags &= ~MS_RDONLY;
1133 init_mm.start_code = (unsigned long) _text;
1134 init_mm.end_code = (unsigned long) _etext;
1135 init_mm.end_data = (unsigned long) _edata;
1136 init_mm.brk = init_pg_tables_end + PAGE_OFFSET;
1137
1138 code_resource.start = virt_to_phys(_text);
1139 code_resource.end = virt_to_phys(_etext)-1;
1140 data_resource.start = virt_to_phys(_etext);
1141 data_resource.end = virt_to_phys(_edata)-1;
1142
1143 parse_cmdline_early(cmdline_p);
1144
1145 max_low_pfn = setup_memory();
1146
1147 /*
1148 * NOTE: before this point _nobody_ is allowed to allocate
1149 * any memory using the bootmem allocator.
1150 */
1152 #ifdef CONFIG_SMP
1153 smp_alloc_memory(); /* AP processor realmode stacks in low memory*/
1154 #endif
1155 paging_init();
1156
1157 #ifdef CONFIG_EARLY_PRINTK
1158 {
1159 char *s = strstr(*cmdline_p, "earlyprintk=");
1160 if (s) {
1161 extern void setup_early_printk(char *);
1162
1163 setup_early_printk(s);
1164 printk("early console enabled\n");
1165 }
1166 }
1167 #endif
...
1170 dmi_scan_machine();
1171
1172 #ifdef CONFIG_X86_GENERICARCH
1173 generic_apic_probe(*cmdline_p);
1174 #endif
1175 if (efi_enabled)
1176 efi_map_memmap();
1177
1178 /*
1179 * Parse the ACPI tables for possible boot-time SMP configuration.
1180 */
1181 acpi_boot_init();
1182
1183 #ifdef CONFIG_X86_LOCAL_APIC
1184 if (smp_found_config)
1185 get_smp_config();
1186 #endif
1187
1188 register_memory(max_low_pfn);
1189
1190 #ifdef CONFIG_VT
1191 #if defined(CONFIG_VGA_CONSOLE)
1192 if (!efi_enabled || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY))
1193 conswitchp = &vga_con;
1194 #elif defined(CONFIG_DUMMY_CONSOLE)
1195 conswitchp = &dummy_con;
1196 #endif
1197 #endif
1198 }
-----------------------------------------------------------------------
Line 1087
Copy new_cpu_data, the cpuinfo_x86 struct filled in at boot time, into boot_cpu_data. This is similar for PPC.
Line 1088
Activate any machine-specific identification routines. This can be found in arch/xxx/machine-default/setup.c.
Line 1089
Identify the specific processor.
Lines 1103-1116
Get the system boot parameters.
Lines 1118-1122
Get RAM disk if set in arch/<arch>/defconfig.
Lines 1124-1127
Initialize Extensible Firmware Interface (if set in /defconfig) or just print out the BIOS memory map.
Line 1129
Save off Enhanced Disk Drive parms from boot time.
Lines 1133-1141
Initialize memory-management structs from the BIOS-provided memory map.
Line 1143
Begin parsing out the Linux command line. (See arch/<arch>/kernel/setup.c.)
Line 1145
Initialize/reserve boot memory. (See arch/i386/kernel/setup.c.)
Lines 1153-1155
Get a page for SMP initialization or initialize paging beyond the 8M that's already initialized in head.S. (See arch/i386/mm/init.c.)
Lines 1157-1167
Get printk() running even though the console is not fully initialized.
Line 1170
This line is the Desktop Management Interface (DMI), which gathers information about the specific system-hardware configuration from BIOS. (See arch/i386/kernel/dmi_scan.c.)
Lines 1172-1174
If the configuration calls for it, look for the APIC given on the command line. (See arch/i386/machine-generic/probe.c.)
Lines 1175-1176
If using Extensible Firmware Interface, remap the EFI memory map. (See arch/i386/kernel/efi.c.)
Line 1181
Look for local and I/O APICs. (See arch/i386/kernel/acpi/boot.c.) Locate and checksum System Description Tables. (See drivers/acpi/tables.c.) For a better understanding of ACPI, go to the ACPI4LINUX project on the Web.
Lines 1183-1186
Scan for SMP configuration. (See arch/i386/kernel/mpparse.c.) This section can also use ACPI for configuration information.
Line 1188
Request I/O and memory space for standard resources. (See arch/i386/kernel/std_resources.c for an idea of how resources are registered.)
Lines 1190-1197
Set up the VGA console switch structure. (See drivers/video/console/vgacon.c.)
A similar but shorter version of setup_arch() can be found in arch/ppc/kernel/setup.c for the PowerPC. This function initializes a large part of the ppc_md structure. A call to pmac_feature_init() in arch/ppc/platforms/pmac_feature.c does an initial probe and initialization of the pmac hardware.
8.5.5. The Call to setup_per_cpu_areas()
Line 409
The routine setup_per_cpu_areas() exists for the setup of a multiprocessing environment. If the Linux kernel is compiled without SMP support, setup_per_cpu_areas() is stubbed out to do nothing, as follows:
----------------------------------------------------------------------
init/main.c
317 static inline void setup_per_cpu_areas(void) { }
-----------------------------------------------------------------------
If the Linux kernel is compiled with SMP support, setup_per_cpu_areas() is defined as follows:
----------------------------------------------------------------------
init/main.c
327 static void __init setup_per_cpu_areas(void)
328 {
329 unsigned long size, i;
330 char *ptr;
331 /* Created by linker magic */
332 extern char __per_cpu_start[], __per_cpu_end[];
333
334 /* Copy section for each CPU (we discard the original) */
335 size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
336 #ifdef CONFIG_MODULES
337 if (size < PERCPU_ENOUGH_ROOM)
338 size = PERCPU_ENOUGH_ROOM;
339 #endif
340
341 ptr = alloc_bootmem(size * NR_CPUS);
342
343 for (i = 0; i < NR_CPUS; i++, ptr += size) {
344 __per_cpu_offset[i] = ptr - __per_cpu_start;
345 memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
346 }
347 }
-----------------------------------------------------------------------
Lines 329-332
The variables for managing a consecutive block of memory are initialized. The "linker magic" variables are defined during linking in the appropriate architecture's kernel directory (for example, arch/i386/kernel/vmlinux.lds.S).
Lines 334-341
We determine the size of memory a single CPU requires and allocate that memory for each CPU in the system as a single contiguous block of memory.
Lines 343-346
We cycle through the newly allocated memory, initializing each CPU's chunk of memory. Conceptually, we have taken a chunk of data that's valid for a single CPU (__per_cpu_start to __per_cpu_end) and copied it for each CPU on the system. This way, each CPU has its own data with which to play.
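The payoff of this copying is that, from now on, CPU n's private copy of any per-CPU datum can be reached by adding __per_cpu_offset[n] to the prototype's address. A conceptual sketch (the kernel's per_cpu() macro performs this same offset arithmetic, modulo linker details):
----------------------------------------------------------------------
sketch
/* Return CPU 'cpu's private copy of a variable whose prototype
 * instance lives at 'proto' in the original per-CPU section. */
static void *per_cpu_ptr_sketch(void *proto, int cpu)
{
	return (char *)proto + __per_cpu_offset[cpu];
}
-----------------------------------------------------------------------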
8.5.6. The Call to smp_prepare_boot_cpu()
Line 415
Similar to setup_per_cpu_areas(), smp_prepare_boot_cpu() is stubbed out when the Linux kernel does not support SMP:
----------------------------------------------------------------------
include/linux/smp.h
106 #define smp_prepare_boot_cpu() do {} while (0)
-----------------------------------------------------------------------
However, if the Linux kernel is compiled with SMP support, we need to allow the booting CPU to access its console drivers and the per-CPU storage that we just initialized. This is achieved by setting bits in CPU bitmasks.
A CPU bitmask is defined as follows:
----------------------------------------------------------------------
include/asm-generic/cpumask.h
10 #if NR_CPUS > BITS_PER_LONG && NR_CPUS != 1
11 #define CPU_ARRAY_SIZE BITS_TO_LONGS(NR_CPUS)
12
13 struct cpumask
14 {
15 unsigned long mask[CPU_ARRAY_SIZE];
16 };
-----------------------------------------------------------------------
This means that we have a platform-independent bitmask that contains the same number of bits as the system has CPUs.
smp_prepare_boot_cpu() is implemented in the architecture-dependent section of the Linux kernel but, as we soon see, it is nearly identical for i386 and PPC systems:
----------------------------------------------------------------------
arch/i386/kernel/smpboot.c
66 /* bitmap of online cpus */
67 cpumask_t cpu_online_map;
...
70 cpumask_t cpu_callout_map;
...
1341 void __devinit smp_prepare_boot_cpu(void)
1342 {
1343 cpu_set(smp_processor_id(), cpu_online_map);
1344 cpu_set(smp_processor_id(), cpu_callout_map);
1345 }
-----------------------------------------------------------------------
----------------------------------------------------------------------
arch/ppc/kernel/smp.c
49 cpumask_t cpu_online_map;
50 cpumask_t cpu_possible_map;
...
331 void __devinit smp_prepare_boot_cpu(void)
332 {
333 cpu_set(smp_processor_id(), cpu_online_map);
334 cpu_set(smp_processor_id(), cpu_possible_map);
335 }
-----------------------------------------------------------------------
In both these functions, cpu_set() simply sets bit smp_processor_id() in the cpumask_t bitmap to 1.
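As a sketch of what this reduces to, following the mask[] layout shown earlier (the real macro goes through a small inline helper):
----------------------------------------------------------------------
sketch
#define cpu_set(cpu, dst) set_bit((cpu), (dst).mask)

/* so smp_prepare_boot_cpu()'s
 *	cpu_set(smp_processor_id(), cpu_online_map);
 * is effectively
 *	set_bit(smp_processor_id(), cpu_online_map.mask);
 */
-----------------------------------------------------------------------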
8.5.7. The Call to sched_init()
Line 422
The call to sched_init() marks the initialization of all objects that the scheduler manipulates to manage the assignment of CPU time among the system's processes. Keep in mind that, at this point, only one process exists: the init process that currently executes sched_init():
----------------------------------------------------------------------
kernel/sched.c
3896 void __init sched_init(void)
3897 {
3898 runqueue_t *rq;
3899 int i, j, k;
3900
...
3919 for (i = 0; i < NR_CPUS; i++) {
3920 prio_array_t *array;
3921
3922 rq = cpu_rq(i);
3923 spin_lock_init(&rq->lock);
3924 rq->active = rq->arrays;
3925 rq->expired = rq->arrays + 1;
3926 rq->best_expired_prio = MAX_PRIO;
...
3938 for (j = 0; j < 2; j++) {
3939 array = rq->arrays + j;
3940 for (k = 0; k < MAX_PRIO; k++) {
3941 INIT_LIST_HEAD(array->queue + k);
3942 __clear_bit(k, array->bitmap);
3943 }
3944 // delimiter for bitsearch
3945 __set_bit(MAX_PRIO, array->bitmap);
3946 }
3947 }
3948 /*
3949 * We have to do a little magic to get the first
3950 * thread right in SMP mode.
3951 */
3952 rq = this_rq();
3953 rq->curr = current;
3954 rq->idle = current;
3955 set_task_cpu(current, smp_processor_id());
3956 wake_up_forked_process(current);
3957
3958 /*
3959 * The boot idle thread does lazy MMU switching as well:
3960 */
3961 atomic_inc(&init_mm.mm_count);
3962 enter_lazy_tlb(&init_mm, current);
3963 }
-----------------------------------------------------------------------
Lines 3919-3926
Each CPU's run queue is initialized: The active queue, expired queue, and spinlock are all initialized in this segment. Recall from Chapter 7 that spin_lock_init() sets the spinlock to 1, which indicates that the data object is unlocked.
Figure 8.13 illustrates the initialized run queue.
Lines 3938-3947
For each possible priority, we initialize the list associated with the priority and clear all bits in the bitmap to show that no process is on that queue. (If all this is confusing, refer to Figure 8.14; also see Chapter 7 for an overview of how the scheduler manages its run queues.) This code chunk just ensures that everything is ready for the introduction of a process. As of line 3947, the scheduler knows that no processes exist; it ignores the current and idle processes for now.
Lines 3952-3956
We add the current process to the current CPU's run queue and call wake_up_forked_process() on ourselves to initialize current into the scheduler. Now, the scheduler knows that exactly one process exists: the init process.
Lines 3961-3962
When lazy MMU switching is enabled, it allows a multiprocessor Linux system to perform context switches at a faster rate. The TLB is a translation lookaside buffer that caches recent page translation addresses. It takes a long time to flush the TLB, so we avoid doing so when possible. enter_lazy_tlb() ensures that the mm_struct init_mm isn't being used across multiple CPUs and can be lazily switched. On a uniprocessor system, this becomes a NULL function.
The sections that were omitted in the previous code deal with initialization of SMP machines. As a quick overview, those sections bootstrap each CPU to the default settings necessary to allow for load balancing, group scheduling, and thread migration. They are omitted here for clarity and brevity.
8.5.8. The Call to build_all_zonelists()
Line 424
The build_all_zonelists() function splits up the memory according to the zone types ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. As mentioned in Chapter 4, zones are linear separations of physical memory that are used mainly to address hardware limitations. Suffice it to say that this is the function where these memory zones are built. After the zones are built, pages are stored in page frames that fall within zones.
The call to build_all_zonelists() introduces numnodes and NODE_DATA. The global variable numnodes holds the number of nodes (or partitions) of physical memory.
The partitions are determined according to CPU access time. Note that, at this point, the page tables have already been fully set up:
----------------------------------------------------------------------
mm/page_alloc.c
1345 void __init build_all_zonelists(void)
1346 {
1347 int i;
1348
1349 for(i = 0 ; i < numnodes ; i++)
1350 build_zonelists(NODE_DATA(i));
1351 printk("Built %i zonelists\n", numnodes);
1352 }
----------------------------------------------------------------------
build_all_zonelists() calls build_zonelists() once for each node and finishes by printing out the number of zonelists created. This book does not go into more detail regarding nodes. Suffice it to say that, in our one-CPU example, numnodes is equal to 1, and each node can have all three types of zones. The NODE_DATA macro returns the node's descriptor from the node descriptor list.
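On a flat-memory (UMA) kernel, the node machinery collapses entirely; a sketch of the single-node case along the lines of include/linux/mmzone.h:
----------------------------------------------------------------------
sketch
/* With one node, all three zones hang off a single, statically
 * allocated node descriptor, so NODE_DATA() needs no lookup at all. */
extern struct pglist_data contig_page_data;
#define NODE_DATA(nid)  (&contig_page_data)
-----------------------------------------------------------------------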
8.5.9. The Call to page_alloc_init
Line 425
The function page_alloc_init() simply registers a function in a notifier chain. The registered function, page_alloc_cpu_notify(), is a page-draining function associated with dynamic CPU configuration.
Dynamic CPU configuration refers to bringing up and down CPUs during the running of the Linux system, an event referred to as "hotplugging the CPU." Although technically, CPUs are not physically inserted and removed during machine operation, they can be turned on and off in some systems, such as the IBM p-Series 690. Let's look at the function:
----------------------------------------------------------------------
mm/page_alloc.c
1787 #ifdef CONFIG_HOTPLUG_CPU
1788 static int page_alloc_cpu_notify(struct notifier_block *self,
1789 unsigned long action, void *hcpu)
1790 {
1791 int cpu = (unsigned long)hcpu;
1792 long *count;
1793
1794 if (action == CPU_DEAD) {
...
1796 count = &per_cpu(nr_pagecache_local, cpu);
1797 atomic_add(*count, &nr_pagecache);
1798 *count = 0;
1799 local_irq_disable();
1800 __drain_pages(cpu);
1801 local_irq_enable();
1802 }
1803 return NOTIFY_OK;
1804 }
1805 #endif /* CONFIG_HOTPLUG_CPU */
1806
1807 void __init page_alloc_init(void)
1808 {
1809 hotcpu_notifier(page_alloc_cpu_notify, 0);
1810 }
-----------------------------------------------------------------------
Line 1809
This line is the registration of the page_alloc_cpu_notify() routine into the hotcpu_notifier notifier chain. The hotcpu_notifier() routine creates a notifier_block that points to the page_alloc_cpu_notify() function with a priority of 0 and then registers the object in the cpu_chain notifier chain (kernel/cpu.c).
Line 1788
page_alloc_cpu_notify() has the parameters that correspond to a notifier call, as Chapter 2 explained. The system-specific pointer hcpu carries the CPU number (cast to a pointer).
Lines 1794-1802
If the CPU is dead, free up its pages. The variable action is set to CPU_DEAD when a CPU is brought down. (See __drain_pages() in this same file.)
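Conceptually, the hotcpu_notifier() registration described above boils down to wrapping the callback in a notifier_block and chaining it; a sketch (the _nb name follows the macro's token-pasting convention):
----------------------------------------------------------------------
sketch
/* roughly what hotcpu_notifier(page_alloc_cpu_notify, 0) expands to: */
static struct notifier_block page_alloc_cpu_notify_nb = {
	.notifier_call	= page_alloc_cpu_notify,
	.priority	= 0,	/* the second argument to hotcpu_notifier() */
};

/* ...followed by register_cpu_notifier(&page_alloc_cpu_notify_nb),
 * which links the block into the cpu_chain in kernel/cpu.c. */
-----------------------------------------------------------------------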
8.5.10. The Call to parse_args()
Line 427
The parse_args() function parses the arguments passed to the Linux kernel.
For example, nfsroot is a kernel parameter that sets the NFS root filesystem for systems without disks. You can find a complete list of kernel parameters in Documentation/kernel-parameters.txt:
----------------------------------------------------------------------
kernel/params.c
116 int parse_args(const char *name,
117 char *args,
118 struct kernel_param *params,
119 unsigned num,
120 int (*unknown)(char *param, char *val))
121 {
122 char *param, *val;
123
124 DEBUGP("Parsing ARGS: %s\n", args);
125
126 while (*args) {
127 int ret;
128
129 args = next_arg(args, &param, &val);
130 ret = parse_one(param, val, params, num, unknown);
131 switch (ret) {
132 case -ENOENT:
133 printk(KERN_ERR "%s: Unknown parameter '%s'\n",
134 name, param);
135 return ret;
136 case -ENOSPC:
137 printk(KERN_ERR
138 "%s: '%s' too large for parameter '%s'\n",
139 name, val ?: "", param);
140 return ret;
141 case 0:
142 break;
143 default:
144 printk(KERN_ERR
145 "%s: '%s' invalid for parameter '%s'\n",
146 name, val ?: "", param);
147 return ret;
148 }
149 }
150
151 /* All parsed OK. */
152 return 0;
153 }
-----------------------------------------------------------------------
Lines 116-125
The parameters passed to parse_args() are the following:
name.
A character string to be displayed if any errors occur while the kernel attempts to parse the kernel parameter arguments. In standard operation, this means that an error message, "Booting Kernel: Unknown parameter X," is displayed.
args.
The kernel parameter list of form foo=bar,bar2 baz=fuz wix.
params.
Points to the kernel parameter structure that holds all the valid parameters for the specific kernel. Depending on how a kernel was compiled, some parameters might exist and others might not.
num.
The number of kernel parameters in this specific kernel, not the number of arguments in args.
unknown.
Points to a function to call if a kernel parameter is specified that is not recognized.
Lines 126-153
We loop through the string args, set param to point to the first parameter, and set val to the first value (val might be null). This is done via next_arg(). For example, on the first call to next_arg() with args being foo=bar,bar2 baz=fuz wix, we set param to foo and val to bar,bar2; the space after bar2 is overwritten with a \0, and args is set to point at the beginning character of baz.
We pass our pointers param and val into parse_one(), which does the work of setting the actual kernel parameter data structures:
----------------------------------------------------------------------
kernel/params.c
46 static int parse_one(char *param,
47 char *val,
48 struct kernel_param *params,
49 unsigned num_params,
50 int (*handle_unknown)(char *param, char *val))
51 {
52 unsigned int i;
53
54 /* Find parameter */
55 for (i = 0; i < num_params; i++) {
56 if (parameq(param, params[i].name)) {
57 DEBUGP("They are equal! Calling %p\n",
58 params[i].set);
59 return params[i].set(val, &params[i]);
60 }
61 }
62
63 if (handle_unknown) {
64 DEBUGP("Unknown argument: calling %p\n", handle_unknown);
65 return handle_unknown(param, val);
66 }
67
68 DEBUGP("Unknown argument '%s'\n", param);
69 return -ENOENT;
70 }
-----------------------------------------------------------------------
Lines 46-54
These parameters are the same as those described under parse_args() with param and val pointing to a subsection of args.
Lines 55-61
We loop through the defined kernel parameters to see if any match param. If we find a match, we use val to call the associated set function; the set function itself is thus responsible for handling multiple, or null, arguments.
Lines 62-66
If the kernel parameter was not found, we call the handle_unknown() function that was passed in via parse_args().
After parse_one() is called for each parameter-value combination specified in args, we have set the kernel parameters and are ready to continue starting the Linux kernel.
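To tie the pieces together, here is a hypothetical sketch of driving parse_args() over a one-entry parameter table. The names mydebug, my_params, ignore_unknown(), and run_example() are invented for illustration; param_set_int() and param_get_int() are the standard integer handlers in kernel/params.c:
----------------------------------------------------------------------
sketch
static int mydebug;

static struct kernel_param my_params[] = {
	{ .name = "mydebug", .set = param_set_int,
	  .get = param_get_int, .arg = &mydebug },
};

static int ignore_unknown(char *param, char *val)
{
	return 0;	/* silently accept unrecognized parameters */
}

static int __init run_example(void)
{
	/* parse_args() modifies the string in place, so it must be writable */
	static char args[] = "mydebug=1 somethingelse=xyz";

	return parse_args("example", args, my_params,
			  ARRAY_SIZE(my_params), ignore_unknown);
	/* afterward, mydebug == 1 */
}
-----------------------------------------------------------------------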
8.5.11. The Call to trap_init()
Line 431
In Chapter 3, we introduced exceptions and interrupts. The function trap_init() is specific to the handling of interrupts in the x86 architecture. Briefly, this function initializes a table referenced by the x86 hardware. Each element in the table has a function to handle kernel or user-related issues, such as an invalid instruction or a reference to a page not currently in memory. Although the PowerPC can have these same issues, its architecture handles them in a somewhat different manner. (Again, all this is discussed in Chapter 3.)
8.5.12. The Call to rcu_init()
Line 432
The rcu_init() function initializes the Read-Copy-Update (RCU) subsystem of the Linux 2.6 kernel. RCU controls access to critical sections of code and enforces mutual exclusion in systems where the cost of acquiring locks becomes significant in comparison to the chip speed. The Linux implementation of RCU is beyond the scope of this book. We occasionally mention calls to the RCU subsystem in our code analysis, but the specifics are left out. For more information on the Linux RCU subsystem, consult the Linux Scalability Effort pages at http://lse.sourceforge.net/locking/rcupdate.html:
----------------------------------------------------------------------
kernel/rcupdate.c
297 void __init rcu_init(void)
298 {
299 rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
300 (void *)(long)smp_processor_id());
301 /* Register notifier for non-boot CPUs */
302 register_cpu_notifier(&rcu_nb);
303 }
-----------------------------------------------------------------------
8.5.13. The Call to init_IRQ()
Line 433
The function init_IRQ() in arch/i386/kernel/i8259.c initializes the hardware interrupt controller, the interrupt vector table and, if on x86, the system timer. Chapter 3 includes a thorough discussion of interrupts for both x86 and PPC, where the Real-Time Clock is used as an interrupt example:
----------------------------------------------------------------------
arch/i386/kernel/i8259.c
410 void __init init_IRQ(void)
411 {
412 int i;
...
422 for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) {
423 int vector = FIRST_EXTERNAL_VECTOR + i;
424 if (i >= NR_IRQS)
425 break;
...
430 if (vector != SYSCALL_VECTOR)
431 set_intr_gate(vector, interrupt[i]);
432 }
...
437 intr_init_hook();
...
443 setup_timer();
...
449 if (boot_cpu_data.hard_math && !cpu_has_fpu)
450 setup_irq(FPU_IRQ, &fpu_irq);
451 }
-----------------------------------------------------------------------
Lines 422-432
Initialize the interrupt vectors. This associates the x86 (hardware) IRQs with the appropriate handling code.
Line 437
Set up machine-specific IRQs, such as the Advanced Programmable Interrupt Controller (APIC).
Line 443
Initialize the timer clock.
Lines 449-450
Set up for FPU if needed.
The following code is the PPC implementation of init_IRQ():
----------------------------------------------------------------------
arch/ppc/kernel/irq.c
700 void __init init_IRQ(void)
701 {
702 int i;
703
704 for (i = 0; i < NR_IRQS; ++i)
705 irq_affinity[i] = DEFAULT_CPU_AFFINITY;
706
707 ppc_md.init_IRQ();
708 }
-----------------------------------------------------------------------
Line 704
In multiprocessor systems, an interrupt can have an affinity for a specific processor.
Line 707
For a PowerMac platform, this routine is found in arch/ppc/platforms/pmac_pic.c. It sets up the Programmable Interrupt Controller (PIC) portion of the I/O controller.
8.5.14. The Call to softirq_init()
Line 436
The softirq_init() function prepares the boot CPU to accept notifications from tasklets. Let's look at the internals of softirq_init():
----------------------------------------------------------------------
kernel/softirq.c
317 void __init softirq_init(void)
318 {
319 open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
320 open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
321 }
...
327 void __init softirq_init(void)
328 {
329 open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
330 open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
331 tasklet_cpu_notify(&tasklet_nb, (unsigned long)CPU_UP_PREPARE,
332 (void *)(long)smp_processor_id());
333 register_cpu_notifier(&tasklet_nb);
334 }
-----------------------------------------------------------------------
Lines 319-320
We initialize the actions to take when we get a TASKLET_SOFTIRQ or HI_SOFTIRQ interrupt. As we pass in NULL, we are telling the Linux kernel to call tasklet_action(NULL) and tasklet_hi_action(NULL) (in the cases of Line 319 and Line 320, respectively). The following implementation of open_softirq() shows how the Linux kernel stores the tasklet initialization information:
----------------------------------------------------------------------
kernel/softirq.c
177 void open_softirq(int nr, void (*action)(struct softirq_action*),
void * data)
178 {
179 softirq_vec[nr].data = data;
180 softirq_vec[nr].action = action;
181 }
----------------------------------------------------------------------
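With these two entries installed, deferred work can be driven through the tasklet interface; a hypothetical sketch (my_tasklet, my_tasklet_fn, and my_isr are invented names):
----------------------------------------------------------------------
sketch
static void my_tasklet_fn(unsigned long data)
{
	/* deferred work runs here, in softirq context */
}

DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

static irqreturn_t my_isr(int irq, void *dev_id, struct pt_regs *regs)
{
	/* raises TASKLET_SOFTIRQ, so tasklet_action() runs it later */
	tasklet_schedule(&my_tasklet);
	return IRQ_HANDLED;
}
-----------------------------------------------------------------------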
8.5.15. The Call to time_init()
Line 437
The function time_init() selects and initializes the system timer. This function, like trap_init(), is very architecture dependent; Chapter 3 covered this when we explored timer interrupts. The system timer gives Linux its temporal view of the world, which allows it to schedule when a task should run and for how long. The High Precision Event Timer (HPET) from Intel will be the successor to the 8254 PIT and RTC hardware. The HPET uses memory-mapped I/O, which means that the HPET control registers are accessed as if they were memory locations. Memory must be configured properly to access I/O regions. If set in arch/i386/defconfig, time_init() needs to be delayed until after mem_init() has set up memory regions. See the following code:
----------------------------------------------------------------------
arch/i386/kernel/time.c
376 void __init time_init(void)
377 {
...
378 #ifdef CONFIG_HPET_TIMER
379 if (is_hpet_capable()) {
380 late_time_init = hpet_time_init;
381 return;
382 }
...
387 #endif
388 xtime.tv_sec = get_cmos_time();
389 wall_to_monotonic.tv_sec = -xtime.tv_sec;
390 xtime.tv_nsec = (INITIAL_JIFFIES % HZ) * (NSEC_PER_SEC / HZ);
391 wall_to_monotonic.tv_nsec = -xtime.tv_nsec;
392
393 cur_timer = select_timer();
394 printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name);
395
396 time_init_hook();
397 }
-----------------------------------------------------------------------
Lines 379-387
If the HPET is configured, time_init() must run after memory has been initialized. The code for late_time_init() (on lines 358-373) is the same as time_init().
Lines 388-391
Initialize the xtime time structure used for holding the time of day.
Line 393
Select the first timer that initializes. This can be overridden. (See arch/i386/kernel/timers/timer.c.)
8.5.16. The Call to console_init()
Line 444
A computer console is a device where the kernel (and other parts of a system) output messages. It also has login capabilities. Depending on the system, the console can be on the monitor or through a serial port. The function console_init() is an early call to initialize the console device, which allows for boot-time reporting of status:
----------------------------------------------------------------------
drivers/char/tty_io.c
2347 void __init console_init(void)
2348 {
2349 initcall_t *call;
...
2352 (void) tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY);
...
2358 #ifdef CONFIG_EARLY_PRINTK
2359 disable_early_printk();
2360 #endif
...
2366 call = &__con_initcall_start;
2367 while (call < &__con_initcall_end) {
2368 (*call)();
2369 call++;
2370 }
2371 }
-----------------------------------------------------------------------
Line 2352
Set up the line discipline.
Line 2359
Keep the early printk support if desired. Early printk support allows the system to report status during the boot process before the system console is fully initialized. It specifically initializes a serial port (ttyS0, for example) or the system's VGA to a minimum functionality. Early printk support is started in setup_arch(). (For more information, see the code discussion on line 408 in this section and the files kernel/printk.c and arch/i386/kernel/early_printk.c.)
Line 2366
Initialize the console.
8.5.17. The Call to profile_init()
Line 447
profile_init() allocates memory for the kernel to store profiling data in. Profiling is the term used in computer science to describe data collection during program execution. Profiling data is used to analyze performance and otherwise study the program being executed (in our case, the Linux kernel itself):
----------------------------------------------------------------------
kernel/profile.c
30 void __init profile_init(void)
31 {
32 unsigned int size;
33
34 if (!prof_on)
35 return;
36
37 /* only text is profiled */
38 prof_len = _etext - _stext;
39 prof_len >>= prof_shift;
40
41 size = prof_len * sizeof(unsigned int) + PAGE_SIZE - 1;
42 prof_buffer = (unsigned int *) alloc_bootmem(size);
43 }
-----------------------------------------------------------------------
Lines 34-35
Don't do anything if kernel profiling is not enabled.
Lines 38-39
_etext and _stext are defined in kernel/head.S. We determine the profile length as delimited by _etext and _stext and then shift the value by prof_shift, which was defined as a kernel parameter.
Lines 41-42
We allocate a contiguous block of memory for storing profiling data of the size requested by the kernel parameters.
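As a worked example under assumed numbers: with 2MB of kernel text and a boot parameter of profile=2 (so prof_shift is 2), the buffer size works out as follows:
----------------------------------------------------------------------
sketch
/* prof_len = (_etext - _stext) >> prof_shift
 *          = 0x200000 >> 2 = 0x80000 counters
 * size     = prof_len * sizeof(unsigned int) + PAGE_SIZE - 1
 *          = 0x80000 * 4 + 4095 bytes, i.e., 2MB rounded up to
 *            whole-page coverage on a 4KB-page system
 */
-----------------------------------------------------------------------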
8.5.18. The Call to local_irq_enable()
Line 448
The function local_irq_enable() allows interrupts on the current CPU. It is usually paired with local_irq_disable(). In previous kernel versions, the sti(), cli() pair were used for this purpose. Although these macros still resolve to sti() and cli(), the keyword to note here is local. These affect only the currently running processor:
----------------------------------------------------------------------
include/asm-i386/system.h
446 #define local_irq_disable() __asm__ __volatile__("cli": : :"memory")
447 #define local_irq_enable() __asm__ __volatile__("sti": : :"memory")
----------------------------------------------------------------------
Lines 446-447
Referring to the "Inline Assembly" section in Chapter 2, the item in the quotes is the assembly instruction and memory is on the clobber list.
8.5.19. initrd Configuration
Lines 449-456
This #ifdef statement is a sanity check on initrd, the initial RAM disk.
A system using initrd loads the kernel and mounts the initial RAM disk as the root filesystem. Programs can run from this RAM disk and, when the time comes, a new root filesystem, such as the one on a hard drive, can be mounted and the initial RAM disk unmounted.
This operation simply checks to ensure that the initial RAM disk specified is valid. If it isn't, we set initrd_start to 0, which tells the kernel to not use an initial RAM disk.
8.5.20. The Call to mem_init()
Line 457
For both x86 and PPC, the call to mem_init() finds all free pages and sends that information to the console. Recall from Chapter 4 that the Linux kernel breaks available memory into zones. Currently, Linux has three zones:
ZONE_DMA.
Memory less than 16MB.
ZONE_NORMAL.
Memory starting at 16MB but less than 896MB. (The kernel uses the last 128MB.)
ZONE_HIGHMEM.
Memory greater than 896MB.
The function mem_init() finds the total number of free page frames in all the memory zones. This function prints out informational kernel messages regarding the beginning state of the memory. This function is architecture dependent because it manages early memory allocation data. Each architecture supplies its own function, although they all perform the same tasks. We first look at how x86 does it and follow it up with PPC:
----------------------------------------------------------------------
arch/i386/mm/init.c
445 void __init mem_init(void)
446 {
447 extern int ppro_with_ram_bug(void);
448 int codesize, reservedpages, datasize, initsize;
449 int tmp;
450 int bad_ppro;
...
459 #ifdef CONFIG_HIGHMEM
460 if (PKMAP_BASE+LAST_PKMAP*PAGE_SIZE >= FIXADDR_START) {
461 printk(KERN_ERR "fixmap and kmap areas overlap - this will crash\n");
462 printk(KERN_ERR "pkstart: %lxh pkend:%lxh fixstart %lxh\n",
463 PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE, FIXADDR_START);
464 BUG();
465 }
466 #endif
467
468 set_max_mapnr_init();
...
476 /* this will put all low memory onto the freelists */
477 totalram_pages += __free_all_bootmem();
478
479
480 reservedpages = 0;
481 for (tmp = 0; tmp < max_low_pfn; tmp++)
...
485 if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
486 reservedpages++;
487
488 set_highmem_pages_init(bad_ppro);
490 codesize = (unsigned long) &_etext - (unsigned long) &_text;
491 datasize = (unsigned long) &_edata - (unsigned long) &_etext;
492 initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin;
493
494 kclist_add(&kcore_mem, __va(0), max_low_pfn << PAGE_SHIFT);
495 kclist_add(&kcore_vmalloc, (void *)VMALLOC_START,
496 VMALLOC_END-VMALLOC_START);
497
498 printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk
data, %dk init, %ldk highmem)\n",
499 (unsigned long) nr_free_pages() << (PAGE_SHIFT-10),
500 num_physpages << (PAGE_SHIFT-10),
501 codesize >> 10,
502 reservedpages << (PAGE_SHIFT-10),
503 datasize >> 10,
504 initsize >> 10,
505 (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10))
506 );
...
521 #ifndef CONFIG_SMP
522 zap_low_mappings();
523 #endif
524 }
-----------------------------------------------------------------------
Line 459
This is a straightforward error check to ensure that the fixmap and kmap areas do not overlap.
Line 468
The function set_max_mapnr_init() (arch/i386/mm/init.c) simply sets the value of num_physpages, which is a global variable (defined in mm/memory.c) that holds the number of available page frames.
Line 477
The call to __free_all_bootmem() marks the freeing up of all low-memory pages. During boot time, all pages are reserved. At this late point in the bootstrapping phase, the available low-memory pages are released. The flow of the function calls is shown in Figure 8.15.
Let's look at the core portion of free_all_bootmem_core() to understand what is happening:
----------------------------------------------------------------------
mm/bootmem.c
257 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
258 {
259 struct page *page;
260 bootmem_data_t *bdata = pgdat->bdata;
261 unsigned long i, count, total = 0;
...
295 page = virt_to_page(bdata->node_bootmem_map);
296 count = 0;
297 for (i = 0; i < ((bdata->node_low_pfn-(bdata->node_boot_start >> PAGE_SHIFT))/8 +
PAGE_SIZE-1)/PAGE_SIZE; i++,page++) {
298 count++;
299 ClearPageReserved(page);
300 set_page_count(page, 1);
301 __free_page(page);
302 }
303 total += count;
304 bdata->node_bootmem_map = NULL;
305
306 return total;
307 }
-----------------------------------------------------------------------
For all the available low-memory pages, we clear the PG_reserved flag in the flags field of the page struct. Next, we set the count field of the page struct to 1 to indicate that it is in use and call __free_page(), thus passing it to the buddy allocator. Recall from Chapter 4's explanation of the buddy system that this function releases a page and adds it to a free list.
The function __free_all_bootmem() returns the number of low memory pages available, which is added to the running count of totalram_pages (an unsigned long defined in mm/page_alloc.c).
Lines 480-486
These lines update the count of reserved pages.
Line 488
The call to set_highmem_pages_init() marks the initialization of high-memory pages. Figure 8.16 illustrates the calling hierarchy of set_highmem_pages_init().
Let's look at the bulk of the code performed in one_highpage_init():
----------------------------------------------------------------------
arch/i386/mm/init.c
253 void __init one_highpage_init(struct page *page, int pfn, int bad_ppro)
254 {
255 if (page_is_ram(pfn) && !(bad_ppro && page_kills_ppro(pfn))) {
256 ClearPageReserved(page);
257 set_bit(PG_highmem, &page->flags);
258 set_page_count(page, 1);
259 __free_page(page);
260 totalhigh_pages++;
261 } else
262 SetPageReserved(page);
263 }
----------------------------------------------------------------------
Much like __free_all_bootmem(), all high-memory pages have their page struct flags field cleared of the PG_reserved flag, have PG_highmem set, and have their count field set to 1. __free_page() is also called to add these pages to the free lists and the totalhigh_pages counter is incremented.
Lines 490-506
This code block gathers and prints out information regarding the size of memory areas and the number of available pages.
Lines 521-523
The function zap_low_mappings() zeros out the low-memory entries of the initial page global directory and flushes the corresponding TLB entries.
The function mem_init() marks the end of the boot phase of memory allocation and the beginning of the memory allocation that will be used throughout the system's life.
The PPC code for mem_init() finds and initializes all pages for all zones:
----------------------------------------------------------------------
arch/ppc/mm/init.c
393 void __init mem_init(void)
394 {
395 unsigned long addr;
396 int codepages = 0;
397 int datapages = 0;
398 int initpages = 0;
399 #ifdef CONFIG_HIGHMEM
400 unsigned long highmem_mapnr;
402 highmem_mapnr = total_lowmem >> PAGE_SHIFT;
403 highmem_start_page = mem_map + highmem_mapnr;
404 #endif /* CONFIG_HIGHMEM */
405 max_mapnr = total_memory >> PAGE_SHIFT;
407 high_memory = (void *) __va(PPC_MEMSTART + total_lowmem);
408 num_physpages = max_mapnr; /* RAM is assumed contiguous */
410 totalram_pages += free_all_bootmem();
412 #ifdef CONFIG_BLK_DEV_INITRD
413 /* if we are booted from BootX with an initial ramdisk,
414 make sure the ramdisk pages aren't reserved. */
415 if (initrd_start) {
416 for (addr = initrd_start; addr < initrd_end; addr += PAGE_SIZE)
417 ClearPageReserved(virt_to_page(addr));
418 }
419 #endif /* CONFIG_BLK_DEV_INITRD */
421 #ifdef CONFIG_PPC_OF
422 /* mark the RTAS pages as reserved */
423 if ( rtas_data )
424 for (addr = (ulong)__va(rtas_data);
425 addr < PAGE_ALIGN((ulong)__va(rtas_data)+rtas_size) ;
426 addr += PAGE_SIZE)
427 SetPageReserved(virt_to_page(addr));
428 #endif
429 #ifdef CONFIG_PPC_PMAC
430 if (agp_special_page)
431 SetPageReserved(virt_to_page(agp_special_page));
432 #endif
433 if ( sysmap )
434 for (addr = (unsigned long)sysmap;
435 addr < PAGE_ALIGN((unsigned long)sysmap+sysmap_size) ;
436 addr += PAGE_SIZE)
437 SetPageReserved(virt_to_page(addr));
439 for (addr = PAGE_OFFSET; addr < (unsigned long)high_memory;
440 addr += PAGE_SIZE) {
441 if (!PageReserved(virt_to_page(addr)))
442 continue;
443 if (addr < (ulong) etext)
444 codepages++;
445 else if (addr >= (unsigned long)&__init_begin
446 && addr < (unsigned long)&__init_end)
447 initpages++;
448 else if (addr < (ulong) klimit)
449 datapages++;
450 }
452 #ifdef CONFIG_HIGHMEM
453 {
454 unsigned long pfn;
456 for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
457 struct page *page = mem_map + pfn;
459 ClearPageReserved(page);
460 set_bit(PG_highmem, &page->flags);
461 set_page_count(page, 1);
462 __free_page(page);
463 totalhigh_pages++;
464 }
465 totalram_pages += totalhigh_pages;
466 }
467 #endif /* CONFIG_HIGHMEM */
469 printk("Memory: %luk available (%dk kernel code, %dk data, %dk init, %ldk highmem)\n",
470 (unsigned long)nr_free_pages()<< (PAGE_SHIFT-10),
471 codepages<< (PAGE_SHIFT-10), datapages<< (PAGE_SHIFT-10),
472 initpages<< (PAGE_SHIFT-10),
473 (unsigned long) (totalhigh_pages << (PAGE_SHIFT-10)));
474 if (sysmap)
475 printk("System.map loaded at 0x%08x for debugger, size: %ld bytes\n",
476 (unsigned int)sysmap, sysmap_size);
477 #ifdef CONFIG_PPC_PMAC
478 if (agp_special_page)
479 printk(KERN_INFO "AGP special page: 0x%08lx\n", agp_special_page);
480 #endif
482 /* Make sure all our pagetable pages have page->mapping
483 and page->index set correctly. */
484 for (addr = KERNELBASE; addr != 0; addr += PGDIR_SIZE) {
485 struct page *pg;
486 pmd_t *pmd = pmd_offset(pgd_offset_k(addr), addr);
487 if (pmd_present(*pmd)) {
488 pg = pmd_page(*pmd);
489 pg->mapping = (void *) &init_mm;
490 pg->index = addr;
491 }
492 }
493 mem_init_done = 1;
494 }
-----------------------------------------------------------------------
Lines 399-410
These lines find the amount of memory available. If HIGHMEM is used, those pages are also counted. The global variable totalram_pages is modified to reflect this.
Lines 412-419
If used, clear any pages that the boot RAM disk used.
Lines 421-432
Depending on the boot environment, reserve pages for the Run-Time Abstraction Services (RTAS) and AGP (video), if needed.
Lines 433-450
If required, reserve some pages for system map.
Lines 452-467
If using HIGHMEM, clear any reserved pages and modify the global variable totalram_pages.
Lines 469-480
Print memory information to the console.
Lines 482-492
Loop through the page directory and set each page-table page's mapping (to init_mm) and index fields.
8.5.21. The Call to late_time_init()
Lines 459-460
The function late_time_init() uses HPET (refer to the discussion under "The Call to time_init" section). This function is used only with the Intel architecture and HPET. This function has essentially the same code as time_init(); it is just called after memory initialization to allow the HPET to be mapped into physical memory.
8.5.22. The Call to calibrate_delay()
Line 461
The function calibrate_delay() in init/main.c calculates and prints the value of the much celebrated "BogoMIPS," which is a measurement that indicates the number of delay() iterations your processor can perform in a clock tick. calibrate_delay() allows delays to be approximately the same across processors of different speeds. The resulting value, at most an indicator of how fast a processor is running, is stored in loops_per_jiffy, and the udelay() and mdelay() functions use it to set the number of delay() iterations to perform:
----------------------------------------------------------------------
init/main.c
void __init calibrate_delay(void)
{
unsigned long ticks, loopbit;
int lps_precision = LPS_PREC;
186 loops_per_jiffy = (1<<12);
printk("Calibrating delay loop... ");
189 while (loops_per_jiffy <<= 1) {
/* wait for "start of" clock tick */
ticks = jiffies;
while (ticks == jiffies)
/* nothing */;
/* Go .. */
ticks = jiffies;
__delay(loops_per_jiffy);
ticks = jiffies - ticks;
if (ticks)
break;
200 }
/* Do a binary approximation to get loops_per_jiffy set to equal one clock
(up to lps_precision bits) */
204 loops_per_jiffy >>= 1;
loopbit = loops_per_jiffy;
206 while ( lps_precision-- && (loopbit >>= 1) ) {
loops_per_jiffy |= loopbit;
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
__delay(loops_per_jiffy);
if (jiffies != ticks) /* longer than 1 tick */
loops_per_jiffy &= ~loopbit;
214 }
/* Round the value and print it */
217 printk("%lu.%02lu BogoMIPS\n",
loops_per_jiffy/(500000/HZ),
219 (loops_per_jiffy/(5000/HZ)) % 100);
}
----------------------------------------------------------------------
Line 186
Start loops_per_jiffy at 1<<12 (0x1000).
Lines 189-200
Keep doubling loops_per_jiffy until __delay(loops_per_jiffy) takes longer than one jiffy.
Line 204
Divide loops_per_jiffy by 2.
Lines 206-214
Successively add descending powers of 2 to loops_per_jiffy, keeping each bit only if the resulting delay still completes within one jiffy; this binary approximation refines the value to lps_precision bits.
Lines 217-219
Print the value out as if it were a float.
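As a worked example with assumed numbers: if HZ = 100 and calibration settles on loops_per_jiffy = 2,500,000 (250,000,000 delay loops per second), the printk arithmetic yields:
----------------------------------------------------------------------
sketch
/* integer part:    loops_per_jiffy / (500000/HZ)
 *                = 2500000 / 5000 = 500
 * fractional part: (loops_per_jiffy / (5000/HZ)) % 100
 *                = (2500000 / 50) % 100 = 0
 * printed result: "500.00 BogoMIPS"
 */
-----------------------------------------------------------------------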
8.5.23. The Call to pgtable_cache_init()
Line 463
The key function in this x86 code block is the system function kmem_cache_create(). This function creates a named cache. The first parameter is a string used to identify it in /proc/slabinfo:
----------------------------------------------------------------------
arch/i386/mm/init.c
529 kmem_cache_t *pgd_cache;
530 kmem_cache_t *pmd_cache;
531
532 void __init pgtable_cache_init(void)
533 {
534 if (PTRS_PER_PMD > 1) {
535 pmd_cache = kmem_cache_create("pmd",
536 PTRS_PER_PMD*sizeof(pmd_t),
537 0,
538 SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
539 pmd_ctor,
540 NULL);
541 if (!pmd_cache)
542 panic("pgtable_cache_init(): cannot create pmd cache");
543 }
544 pgd_cache = kmem_cache_create("pgd",
545 PTRS_PER_PGD*sizeof(pgd_t),
546 0,
547 SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_A LIGN,
548 pgd_ctor,
549 PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
550 if (!pgd_cache)
551 panic("pgtable_cache_init(): Cannot create pgd cache");
552 }
----------------------------------------------------------------------
----------------------------------------------------------------------
arch/ppc64/mm/init.c
976 void pgtable_cache_init(void)
977 {
978 zero_cache = kmem_cache_create("zero",
979 PAGE_SIZE,
980 0,
981 SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
982 zero_ctor,
983 NULL);
984 if (!zero_cache)
985 panic("pgtable_cache_init(): could not create zero_cache!\n");
986 }
----------------------------------------------------------------------
Lines 532–542
Create the pmd cache.
Lines 544–551
Create the pgd cache.
On the PPC, which has hardware-assisted hashing, pgtable_cache_init() is a no-op:
----------------------------------------------------------------------
include/asm-ppc/pgtable.h
685 #define pgtable_cache_init() do { } while (0)
----------------------------------------------------------------------
8.5.24. The Call to buffer_init()
Line 472
The buffer_init() function in fs/buffer.c initializes the structures that hold data from filesystem devices:
----------------------------------------------------------------------
fs/buffer.c
3031 void __init buffer_init(void)
{
int i;
int nrpages;
3036 bh_cachep = kmem_cache_create("buffer_head",
sizeof(struct buffer_head), 0,
0, init_buffer_head, NULL);
3039 for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++)
init_waitqueue_head(&bh_wait_queue_heads[i].wqh);
3044 nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
hotcpu_notifier(buffer_cpu_notify, 0);
3048 }
----------------------------------------------------------------------
Line 3036
Create the buffer_head SLAB cache, bh_cachep.
Line 3039
Initialize the table of buffer wait queue heads.
Line 3044
Limit buffer-head occupancy of low memory to 10 percent.
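To make the line 3044 arithmetic concrete, the following user-space sketch plugs in assumed numbers: 100,000 free buffer pages, 4KB pages, and a buffer_head of roughly 52 bytes:
----------------------------------------------------------------------
Example: worked buffer_init() limit (values assumed)
#include <stdio.h>

int main(void)
{
    unsigned long free_buffer_pages = 100000; /* nr_free_buffer_pages(), assumed */
    unsigned long page_size = 4096;
    unsigned long bh_size = 52;               /* approx sizeof(struct buffer_head) */

    unsigned long nrpages = (free_buffer_pages * 10) / 100;  /* 10% of lowmem */
    unsigned long max_buffer_heads = nrpages * (page_size / bh_size);

    printf("max_buffer_heads = %lu\n", max_buffer_heads);    /* 780000 here */
    return 0;
}
----------------------------------------------------------------------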
8.5.25. The Call to security_scaffolding_startup()
Line 474
The 2.6 Linux kernel contains code for loading kernel modules that implement various security features. security_scaffolding_startup() simply verifies that a security operations object exists, and if it does, calls the security module's initialization functions.
How security modules can be created and what kind of issues a writer might face are beyond the scope of this text. For more information, consult Linux Security Modules (http://lsm.immunix.org/) and the Linux-security-module mailing list (http://mail.wirex.com/mailman/listinfo/linux-security-module).
8.5.26. The Call to vfs_caches_init()
Line 475
The VFS subsystem depends on memory caches, called SLAB caches, to hold the structures it manages. Chapter 4 discusses SLAB caches in detail. The vfs_caches_init() function initializes the SLAB caches that the subsystem uses. Figure 8.17 shows the overview of the main function hierarchy called from vfs_caches_init(). We explore in detail each function included in this call hierarchy. You can refer to this hierarchy to keep track of the functions as we look at each of them.
Table 8.3 summarizes the objects introduced by the vfs_caches_init() function or by one of the functions it calls.
----------------------------------------------------------------------
fs/dcache.c
1623 void __init vfs_caches_init(unsigned long mempages)
1624 {
1625 names_cachep = kmem_cache_create("names_cache",
1626 PATH_MAX, 0,
1627 SLAB_HWCACHE_ALIGN, NULL, NULL);
1628 if (!names_cachep)
1629 panic("Cannot create names SLAB cache");
1630
1631 filp_cachep = kmem_cache_create("filp",
1632 sizeof(struct file), 0,
1633 SLAB_HWCACHE_ALIGN, filp_ctor, filp_dtor);
1634 if(!filp_cachep)
1635 panic("Cannot create filp SLAB cache");
1636
1637 dcache_init(mempages);
1638 inode_init(mempages);
1639 files_init(mempages);
1640 mnt_init(mempages);
1641 bdev_cache_init();
1642 chrdev_init();
1643 }
-----------------------------------------------------------------------
Table 8.3. Objects Introduced by vfs_caches_init()

Object Name | Description
names_cachep | Global variable
filp_cachep | Global variable
inode_cache | Global variable
dentry_cache | Global variable
mnt_cache | Global variable
namespace | Struct
mount_hashtable | Global variable
root_fs_type | Global variable
file_system_type | Struct (discussed in Chapter 6)
bdev_cachep | Global variable
Line 1623
The routine takes in the global variable num_physpages (whose value is calculated during mem_init()) as a parameter that holds the number of physical pages available in the system's memory. This number influences the creation of SLAB caches, as we see later.
Lines 1625–1629
The next step is to create the names_cachep memory area. Chapter 4 describes the kmem_cache_create() function in detail. This memory area holds objects of size PATH_MAX, the maximum number of characters a pathname is allowed to have. (This value is set in linux/limits.h as 4,096.) At this point, the cache is empty of objects, or memory areas, of size PATH_MAX. The actual memory areas are allocated upon the first, and potentially subsequent, calls to getname(). As discussed in Chapter 6, the getname() routine is called at the beginning of some of the file-related system calls (for example, sys_open()) to read the file pathname from the process address space. Objects are freed back to the cache with the putname() routine.
If the names_cache cache cannot be created, the kernel jumps to the panic routine, exiting the function's flow of control.
Lines 1631–1635
The filp_cachep cache is created next, with objects the size of the file structure. The object holding the file structure is allocated by the get_empty_filp() (fs/file_table.c) routine, which is called, for example, upon creation of a pipe or the opening of a file. The file descriptor object is deallocated by a call to the file_free() (fs/file_table.c) routine.
Line 1637
The dcache_init() (fs/dcache.c) routine creates the SLAB cache that holds dentry descriptors. The cache itself is called dentry_cache. Dentry descriptors are created for each hierarchical component of the pathnames that processes reference when accessing a file or directory. The structure associates each file or directory component with the inode that represents it, which speeds up subsequent lookups of that component's inode.
Line 1638
The inode_init() (fs/inode.c) routine initializes the inode hash table and the wait queue head array used for storing hashed inodes that the kernel wants to lock. The wait queue heads (wait_queue_head_t) for hashed inodes are stored in an array called i_wait_queue_heads. This array gets initialized at this point of the system's startup process.
The inode_hashtable gets created at this point. This table speeds up inode searches. The last thing that occurs is the creation of the SLAB cache used to hold inode objects. It is called inode_cache. The memory areas for this cache are allocated upon calls to alloc_inode() (fs/inode.c) and freed upon calls to destroy_inode() (fs/inode.c).
Line 1639
The files_init() routine is called to determine the maximum number of in-memory file objects the system allows, based on available memory. The max_files field of the files_stat structure is set; it is then referenced upon file creation to determine whether enough memory is available to open the file. Let's look at this routine:
----------------------------------------------------------------------
fs/file_table.c
292 void __init files_init(unsigned long mempages)
293 {
294 int n;
...
299 n = (mempages * (PAGE_SIZE / 1024)) / 10;
300 files_stat.max_files = n;
301 if (files_stat.max_files < NR_FILE)
302 files_stat.max_files = NR_FILE;
303 }
----------------------------------------------------------------------
Line 299
The page size is divided by the amount of space that a file (along with associated inode and cache) will roughly occupy (in this case, 1K). This value is then multiplied by the number of pages to get the total amount of "blocks" that can be used for files. The division by 10 shows that the default is to limit the memory usage for files to no more than 10 percent of the available memory.
Lines 301–302
NR_FILE (include/linux/fs.h) is set to 8,192; it serves as the floor for max_files.
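The formula on line 299 is easy to check with sample numbers. This sketch assumes a 128MB machine (32,768 pages of 4KB), for which max_files works out to 13,107; on a machine small enough that the formula yields less than NR_FILE, the floor of 8,192 would win instead:
----------------------------------------------------------------------
Example: worked files_init() limit (memory size assumed)
#include <stdio.h>

#define NR_FILE 8192                    /* floor value from include/linux/fs.h */

int main(void)
{
    unsigned long mempages = 32768;     /* 128MB of 4KB pages (assumed) */
    unsigned long page_size = 4096;

    unsigned long n = (mempages * (page_size / 1024)) / 10;
    unsigned long max_files = (n < NR_FILE) ? NR_FILE : n;

    printf("max_files = %lu\n", max_files);  /* 13107: 10% of memory in 1K units */
    return 0;
}
----------------------------------------------------------------------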
Line 1640
The next routine, called mnt_init(), creates the cache that will hold the vfsmount objects the VFS uses for mounting filesystems. The cache is called mnt_cache. The routine also creates the mount_hashtable array, which stores references to objects in mnt_cache for faster access. It then issues calls to initialize the sysfs filesystem and mounts the root filesystem. Let's closely look at the creation of the hash table:
----------------------------------------------------------------------
fs/namespace.c
1137 void __init mnt_init(unsigned long mempages)
{
1139 struct list_head *d;
1140 unsigned long order;
1141 unsigned int nr_hash;
1142 int i;
...
1149 order = 0;
1150 mount_hashtable = (struct list_head *)
1151 __get_free_pages(GFP_ATOMIC, order);
1152
1153 if (!mount_hashtable)
1154 panic("Failed to allocate mount hash table\n");
...
1161 nr_hash = (1UL << order) * PAGE_SIZE / sizeof(struct list_head);
1162 hash_bits = 0;
1163 do {
1164 hash_bits++;
1165 } while ((nr_hash >> hash_bits) != 0);
1166 hash_bits--;
...
1172 nr_hash = 1UL << hash_bits;
1173 hash_mask = nr_hash-1;
1174
1175 printk("Mount-cache hash table entries: %d (order: %ld, %ld bytes)\n", nr_hash,
order, (PAGE_SIZE << order));
...
1179 d = mount_hashtable;
1180 i = nr_hash;
1181 do {
1182 INIT_LIST_HEAD(d);
1183 d++;
1184 i--;
1185 } while (i);
..
1189 }
----------------------------------------------------------------------
Lines 1139–1144
The hash table array consists of a full page of memory. Chapter 4 explains in detail how the routine __get_free_pages() works. In a nutshell, this routine returns a pointer to a memory area of 2^order pages. In this case, order is 0, so we allocate one page to hold the hash table.
Lines 1161–1173
The next step is to determine the number of entries in the table. nr_hash is set to the number of list heads that fit into the allocated pages. hash_bits is calculated as the number of bits needed to represent the highest power of two that fits in nr_hash. Line 1172 then rounds nr_hash down to that single leftmost bit, and the bitmask is calculated from the new nr_hash value.
Lines 1179–1185
Finally, we initialize the hash table through calls to the INIT_LIST_HEAD macro, which takes a pointer to the memory area where a new list head is to be initialized. We do this nr_hash times, once for each entry the table can hold.
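For reference, INIT_LIST_HEAD simply points the new node at itself in both directions, which is the empty state of the kernel's circular doubly linked lists:
----------------------------------------------------------------------
include/linux/list.h
#define INIT_LIST_HEAD(ptr) do { \
    (ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)
----------------------------------------------------------------------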
Let's walk through an example: We assume a PAGE_SIZE of 4KB and a struct list_head of 8 bytes. Because order is equal to 0, the value of nr_hash becomes 512; that is, up to 512 list_head structs can fit in one 4KB table. The (1UL << order) term is the number of pages that have been allocated. For example, if the order had been 1 (meaning we had requested 2^1 pages for the hash table), 0000 0001 bit-shifted once to the left becomes 0000 0010 (or 2 in decimal notation). Next, we calculate the number of bits the hash key needs. Walking through each iteration of the loop, we get the following:
Beginning values are hash_bits = 0 and nr_hash = 512 (0010 0000 0000 in binary).
Iteration 1:
hash_bits = 1, and (512 >> 1) != 0 (0010 0000 0000 >> 1) = 0001 0000 0000
Iteration 2:
hash_bits = 2, and (512 >> 2) != 0 (0010 0000 0000 >> 2) = 0000 1000 0000
Iteration 3:
hash_bits = 3, and (512 >> 3) != 0 (0010 0000 0000 >> 3) = 0000 0100 0000
Iteration 4:
hash_bits = 4, and (512 >> 4) != 0 (0010 0000 0000 >> 4) = 0000 0010 0000
Iteration 5:
hash_bits = 5, and (512 >> 5) != 0 (0010 0000 0000 >> 5) = 0000 0001 0000
Iteration 6:
hash_bits = 6, and (512 >> 6) != 0 (0010 0000 0000 >> 6) = 0000 0000 1000
Iteration 7:
hash_bits = 7, and (512 >> 7) != 0 (0010 0000 0000 >> 7) = 0000 0000 0100
Iteration 8:
hash_bits = 8, and (512 >> 8) != 0 (0010 0000 0000 >> 8) = 0000 0000 0010
Iteration 9:
hash_bits = 9, and (512 >> 9) != 0 (0010 0000 0000 >> 9) = 0000 0000 0001
Iteration 10:
hash_bits = 10, and (512 >> 10) == 0, so we break out of the loop.
After breaking out of the while loop, hash_bits is decremented to 9, nr_hash is set to 0010 0000 0000 (512), and the hash_mask is set to 0001 1111 1111 (511).
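The same sizing computation can be verified in user space. This sketch (assuming a 4KB page and an 8-byte list_head, as in the example) reproduces nr_hash = 512, hash_bits = 9, and hash_mask = 0x1ff:
----------------------------------------------------------------------
Example: user-space check of mnt_init()'s hash sizing
#include <stdio.h>

int main(void)
{
    unsigned long page_size = 4096, order = 0;
    unsigned long lh_size = 8;        /* assumed sizeof(struct list_head) */
    unsigned int nr_hash, hash_bits, hash_mask;

    nr_hash = (1UL << order) * page_size / lh_size;   /* 512 entries */
    hash_bits = 0;
    do {
        hash_bits++;
    } while ((nr_hash >> hash_bits) != 0);
    hash_bits--;                      /* 9 for nr_hash == 512 */

    nr_hash = 1UL << hash_bits;       /* round down to a power of two */
    hash_mask = nr_hash - 1;          /* 0x1ff */

    printf("nr_hash=%u hash_bits=%u hash_mask=0x%x\n",
           nr_hash, hash_bits, hash_mask);
    return 0;
}
----------------------------------------------------------------------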
After the mnt_init() routine initializes mount_hashtable and creates mnt_cache, it issues three calls:
----------------------------------------------------------------------
fs/namespace.c
...
1189 sysfs_init();
1190 init_rootfs();
1191 init_mount_tree();
1192 }
----------------------------------------------------------------------
sysfs_init() is responsible for the creation of the sysfs filesystem. init_rootfs() and init_mount_tree() are together responsible for mounting the root filesystem. We closely look at each routine in turn.
----------------------------------------------------------------------
init_rootfs()
fs/ramfs/inode.c
218 static struct file_system_type rootfs_fs_type = {
219 .name = "rootfs",
220 .get_sb = rootfs_get_sb,
221 .kill_sb = kill_litter_super,
222 };
...
237 int __init init_rootfs(void)
238 {
239 return register_filesystem(&rootfs_fs_type);
240 }
----------------------------------------------------------------------
The rootfs filesystem is an initial filesystem the kernel mounts. It is a simple and quite empty directory that becomes overmounted by the real filesystem at a later point in the kernel boot-up process.
Lines 218–222
This code block is the declaration of the rootfs_fs_type file_system_type struct. Only the two methods for getting and killing the associated superblock are defined.
Lines 237–240
The init_rootfs() routine merely registers rootfs with the kernel. This makes all the information regarding the filesystem type (the information stored in the file_system_type struct) available within the kernel.
----------------------------------------------------------------------
init_mount_tree()
fs/namespace.c
1107 static void __init init_mount_tree(void)
1108 {
1109 struct vfsmount *mnt;
1110 struct namespace *namespace;
1111 struct task_struct *g, *p;
1112
1113 mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
1114 if (IS_ERR(mnt))
1115 panic("Can't create rootfs");
1116 namespace = kmalloc(sizeof(*namespace), GFP_KERNEL);
1117 if (!namespace)
1118 panic("Can't allocate initial namespace");
1119 atomic_set(&namespace->count, 1);
1120 INIT_LIST_HEAD(&namespace->list);
1121 init_rwsem(&namespace->sem);
1122 list_add(&mnt->mnt_list, &namespace->list);
1123 namespace->root = mnt;
1124
1125 init_task.namespace = namespace;
1126 read_lock(&tasklist_lock);
1127 do_each_thread(g, p) {
1128 get_namespace(namespace);
1129 p->namespace = namespace;
1130 } while_each_thread(g, p);
1131 read_unlock(&tasklist_lock);
1132
1133 set_fs_pwd(current->fs, namespace->root,
namespace->root->mnt_root);
1134 set_fs_root(current->fs, namespace->root,
namespace->root->mnt_root);
1135 }
-----------------------------------------------------------------------
Lines 1116–1123
Initialize the process namespace. This structure keeps pointers to the mount tree-related structures and the corresponding dentry. The namespace object is allocated, the count set to 1, the list field of type list_head is initialized, the semaphore that locks the namespace (and the mount tree) is initialized, and the root field corresponding to the vfsmount structure is set to point to our newly allocated vfsmount.
Line 1125
The current task's (the init task's) process descriptor namespace field is set to point at the namespace object we just allocated and initialized. (The current process is Process 0.)
Lines 1133–1134
These two routines set the values of four fields in the fs_struct associated with our process. fs_struct holds the fields for the process's root and current working directory entries, which these two routines point at the root of the newly mounted rootfs.
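For context, here is the 2.6-era layout (abridged) of the fs_struct these routines update; set_fs_root() assigns the rootmnt/root pair and set_fs_pwd() the pwdmnt/pwd pair:
----------------------------------------------------------------------
include/linux/fs_struct.h (abridged)
struct fs_struct {
    atomic_t count;        /* processes sharing this fs context */
    rwlock_t lock;
    int umask;
    struct dentry *root, *pwd, *altroot;            /* dentries */
    struct vfsmount *rootmnt, *pwdmnt, *altrootmnt; /* mounts */
};
----------------------------------------------------------------------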
We just finished exploring what happens in the mnt_init() function. Let's continue exploring vfs_caches_init().
----------------------------------------------------------------------
1641 bdev_cache_init()
fs/block_dev.c
290 void __init bdev_cache_init(void)
291 {
292 int err;
293 bdev_cachep = kmem_cache_create("bdev_cache",
294 sizeof(struct bdev_inode),
295 0,
296 SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT,
297 init_once,
298 NULL);
299 if (!bdev_cachep)
300 panic("Cannot create bdev_cache SLAB cache");
301 err = register_filesystem(&bd_type);
302 if (err)
303 panic("Cannot register bdev pseudo-fs");
304 bd_mnt = kern_mount(&bd_type);
305 err = PTR_ERR(bd_mnt);
306 if (IS_ERR(bd_mnt))
307 panic("Cannot create bdev pseudo-fs");
308 blockdev_superblock = bd_mnt->mnt_sb; /* For writeback */
309 }
----------------------------------------------------------------------
Lines 293298
Create the bdev_cache SLAB cache, which holds bdev_inodes.
Line 301
Register the bdev special filesystem. It has been defined as follows:
----------------------------------------------------------------------
fs/block_dev.c
294 static struct file_system_type bd_type = {
295 .name = "bdev",
296 .get_sb = bd_get_sb,
297 .kill_sb = kill_anon_super,
298 };
----------------------------------------------------------------------
As you can see, the file_system_type struct of the bdev special filesystem has only two routines defined: one for fetching the filesystem's superblock and the other for removing/freeing the superblock. At this point, you might wonder why block devices are registered as filesystems. In Chapter 6, we saw that systems that are not technically filesystems can use filesystem kernel structures; that is, they do not have mount points but can make use of the VFS kernel structures that support filesystems. Block devices are one instance of a pseudo filesystem that makes use of the VFS filesystem kernel structures. As with bdev, these special filesystems generally define only a limited number of fields because not all of them make sense for the particular application.
Lines 304–308
The call to kern_mount() sets up all the mount-related VFS structures and returns the vfsmount structure. (See Chapter 6 for more information on setting the global variables bd_mnt to point to the vfsmount structure and blockdev_superblock to point to the vfsmount superblock.)
This function initializes the character device objects that surround the driver model:
----------------------------------------------------------------------
1642 chrdev_init
fs/char_dev.c
431 void __init chrdev_init(void)
432 {
433 subsystem_init(&cdev_subsys);
434 cdev_map = kobj_map_init(base_probe, &cdev_subsys);
435 }
----------------------------------------------------------------------
8.5.27. The Call to radix_tree_init()
Line 476
The 2.6 Linux kernel uses a radix tree to manage pages within the page cache. Here, we create the SLAB cache from which radix tree nodes are allocated and initialize the tree's lookup parameters:
----------------------------------------------------------------------
lib/radix-tree.c
798 void __init radix_tree_init(void)
799 {
800 radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
801 sizeof(struct radix_tree_node), 0,
802 SLAB_PANIC, radix_tree_node_ctor, NULL);
803 radix_tree_init_maxindex();
804 hotcpu_notifier(radix_tree_callback, 0);
805 }
-----------------------------------------------------------------------
----------------------------------------------------------------------
lib/radix-tree.c
768 static __init void radix_tree_init_maxindex(void)
769 {
770 unsigned int i;
771
772 for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
773 height_to_maxindex[i] = __maxindex(i);
774 }
-----------------------------------------------------------------------
Notice how radix_tree_init() creates the SLAB cache for radix tree nodes and radix_tree_init_maxindex() configures the lookup table, height_to_maxindex[], which maps a tree's height to the maximum index it can cover.
hotcpu_notifier() (on line 804) refers to Linux 2.6's capability to hotswap CPUs. When a CPU is hotswapped out, the kernel calls radix_tree_callback(), which cleanly frees the radix tree nodes that were preallocated for that CPU.
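To see how the radix_tree_node cache ends up being used, here is a minimal sketch of the 2.6 radix tree API in module form. The tree, index, and payload are invented for illustration; the page cache does the equivalent with each address_space's page_tree, keyed by page index:
----------------------------------------------------------------------
Example: minimal radix tree usage (illustrative module)
#include <linux/module.h>
#include <linux/radix-tree.h>

static RADIX_TREE(demo_tree, GFP_KERNEL);  /* statically initialized root */
static int payload = 42;

static int __init rtdemo_init(void)
{
    /* Intermediate nodes come from radix_tree_node_cachep,
     * the cache created by radix_tree_init(). */
    if (radix_tree_insert(&demo_tree, 7, &payload))
        return -ENOMEM;
    printk("lookup(7) = %p\n", radix_tree_lookup(&demo_tree, 7));
    return 0;
}

static void __exit rtdemo_exit(void)
{
    radix_tree_delete(&demo_tree, 7);
}

module_init(rtdemo_init);
module_exit(rtdemo_exit);
MODULE_LICENSE("GPL");
----------------------------------------------------------------------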
8.5.28. The Call to signals_init()
Line 477
The signals_init() function in kernel/signal.c initializes the kernel signal queue:
----------------------------------------------------------------------
kernel/signal.c
2565 void __init signals_init(void)
2566 {
2567 sigqueue_cachep =
2568 kmem_cache_create("sigqueue",
2569 sizeof(struct sigqueue),
2570 __alignof__(struct sigqueue),
2571 0, NULL, NULL);
2572 if (!sigqueue_cachep)
2573 panic("signals_init(): cannot create sigqueue SLAB cache");
2574 }
-----------------------------------------------------------------------
Lines 2567–2571
Allocate SLAB memory for sigqueue.
8.5.29. The Call to page_writeback_init()
Line 479
The page_writeback_init() function initializes the values controlling when a dirty page is written back to disk. Dirty pages are not immediately written back to disk; they are written after a certain amount of time passes or a certain number or percent of the pages in memory are marked as dirty. This init function attempts to determine the optimum number of pages that must be dirty before triggering a background write and a dedicated write. Background dirty-page writes take up much less processing power than dedicated dirty-page writes:
----------------------------------------------------------------------
mm/page-writeback.c
488 /*
489 * If the machine has a large highmem:lowmem ratio then scale back the default
490 * dirty memory thresholds: allowing too much dirty highmem pins an excessive
491 * number of buffer_heads.
492 */
493 void __init page_writeback_init(void)
494 {
495 long buffer_pages = nr_free_buffer_pages();
496 long correction;
497
498 total_pages = nr_free_pagecache_pages();
499
500 correction = (100 * 4 * buffer_pages) / total_pages;
501
502 if (correction < 100) {
503 dirty_background_ratio *= correction;
504 dirty_background_ratio /= 100;
505 vm_dirty_ratio *= correction;
506 vm_dirty_ratio /= 100;
507 }
508 mod_timer(&wb_timer, jiffies + (dirty_writeback_centisecs * HZ) / 100);
509 set_ratelimit();
510 register_cpu_notifier(&ratelimit_nb);
511 }
-----------------------------------------------------------------------
Lines 495–507
If we are operating on a machine whose page cache is large compared to the number of buffer pages (that is, a large highmem:lowmem ratio), we scale down the dirty-page writeback thresholds. If we did not lower them, too many dirty highmem pages could accumulate, pinning an inordinate number of buffer_heads in low memory. (This is the meaning of the comment before page_writeback_init().)
The default background writeback, dirty_background_ratio, starts when 10 percent of the pages are dirty. A dedicated writeback, vm_dirty_ratio, starts when 40 percent of the pages are dirty.
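A worked example makes the line 500 correction concrete. The page counts below are assumed (roughly 880MB of lowmem on a 4GB x86 machine with 4KB pages) and scale the two ratios down to 8 and 34 percent:
----------------------------------------------------------------------
Example: worked highmem correction (page counts assumed)
#include <stdio.h>

int main(void)
{
    long buffer_pages = 225280;       /* nr_free_buffer_pages(), assumed */
    long total_pages  = 1048576;      /* nr_free_pagecache_pages(), assumed */
    long dirty_background_ratio = 10, vm_dirty_ratio = 40;

    long correction = (100 * 4 * buffer_pages) / total_pages;  /* 85 */
    if (correction < 100) {
        dirty_background_ratio = dirty_background_ratio * correction / 100;
        vm_dirty_ratio = vm_dirty_ratio * correction / 100;
    }
    printf("background=%ld%% dedicated=%ld%%\n",
           dirty_background_ratio, vm_dirty_ratio);  /* 8% and 34% */
    return 0;
}
----------------------------------------------------------------------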
Line 508
We modify the writeback timer, wb_timer, to trigger periodically, every 5 seconds by default: dirty_writeback_centisecs defaults to 500, and (500 * HZ) / 100 works out to 5 seconds' worth of jiffies.
Line 509
set_ratelimit() is called, which is documented excellently. I defer to these inline comments:
----------------------------------------------------------------------
mm/page-writeback.c
450 /*
451 * If ratelimit_pages is too high then we can get into dirty-data overload
452 * if a large number of processes all perform writes at the same time.
453 * If it is too low then SMP machines will call the (expensive)
454 * get_writeback_state too often.
455 *
456 * Here we set ratelimit_pages to a level which ensures that when all CPUs are
457 * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
458 * thresholds before writeback cuts in.
459 *
460 * But the limit should not be set too high. Because it also controls the
461 * amount of memory which the balance_dirty_pages() caller has to write back.
462 * If this is too large then the caller will block on the IO queue all the
463 * time. So limit it to four megabytes - the balance_dirty_pages() caller
464 * will write six megabyte chunks, max.
465 */
466
467 static void set_ratelimit(void)
468 {
469 ratelimit_pages = total_pages / (num_online_cpus() * 32);
470 if (ratelimit_pages < 16)
471 ratelimit_pages = 16;
472 if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
473 ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
474 }
-----------------------------------------------------------------------
Line 510
The final command of page_writeback_init() registers the ratelimit notifier block, ratelimit_nb, with the CPU notifier. The ratelimit notifier block calls ratelimit_handler() when notified, which in turn, calls set_ratelimit(). The purpose of this is to recalculate ratelimit_pages when the number of online CPUs changes:
----------------------------------------------------------------------
mm/page-writeback.c
483 static struct notifier_block ratelimit_nb = {
484 .notifier_call = ratelimit_handler,
485 .next = NULL,
486 };
-----------------------------------------------------------------------
Finally, we need to examine what happens when the wb_timer (from line 508) goes off and calls wb_timer_fn():
----------------------------------------------------------------------
mm/page-writeback.c
414 static void wb_timer_fn(unsigned long unused)
415 {
416 if (pdflush_operation(wb_kupdate, 0) < 0)
417 mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
418 }
-----------------------------------------------------------------------
Lines 416–417
When the timer goes off, the kernel triggers pdflush_operation(), which awakens one of the pdflush threads to perform the actual writeback of dirty pages to disk. If pdflush_operation() cannot awaken any pdflush thread, it tells the writeback timer to trigger again in 1 second to retry. See Chapter 9, "Building the Linux Kernel," for more information on pdflush.
8.5.30. The Call to proc_root_init()
Lines 480–482
As Chapter 2 explained, the CONFIG_* #defines refer to compile-time variables. If, at compile time, the proc filesystem is selected, the next step in initialization is the call to proc_root_init():
----------------------------------------------------------------------
fs/proc/root.c
40 void __init proc_root_init(void)
41 {
42 int err = proc_init_inodecache();
43 if (err)
44 return;
45 err = register_filesystem(&proc_fs_type);
46 if (err)
47 return;
48 proc_mnt = kern_mount(&proc_fs_type);
49 err = PTR_ERR(proc_mnt);
50 if (IS_ERR(proc_mnt)) {
51 unregister_filesystem(&proc_fs_type);
52 return;
53 }
54 proc_misc_init();
55 proc_net = proc_mkdir("net", 0);
56 #ifdef CONFIG_SYSVIPC
57 proc_mkdir("sysvipc", 0);
58 #endif
59 #ifdef CONFIG_SYSCTL
60 proc_sys_root = proc_mkdir("sys", 0);
61 #endif
62 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
63 proc_mkdir("sys/fs", 0);
64 proc_mkdir("sys/fs/binfmt_misc", 0);
65 #endif
66 proc_root_fs = proc_mkdir("fs", 0);
67 proc_root_driver = proc_mkdir("driver", 0);
68 proc_mkdir("fs/nfsd", 0); /* somewhere for the nfsd filesystem to be mounted */
69 #if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
70 /* just give it a mountpoint */
71 proc_mkdir("openprom", 0);
72 #endif
73 proc_tty_init();
74 #ifdef CONFIG_PROC_DEVICETREE
75 proc_device_tree_init();
76 #endif
77 proc_bus = proc_mkdir("bus", 0);
78 }
-----------------------------------------------------------------------
Line 42
This line initializes the inode cache that holds the inodes for this filesystem.
Line 45
The file_system_type structure proc_fs_type is registered with the kernel. Let's closely look at the structure:
----------------------------------------------------------------------
fs/proc/root.c
33 static struct file_system_type proc_fs_type = {
34 .name = "proc",
35 .get_sb = proc_get_sb,
36 .kill_sb = kill_anon_super,
37 };
----------------------------------------------------------------------
The file_system_type structure, which defines the filesystem's name simply as proc, has the routines for retrieving and freeing the superblock structures.
Line 48
We mount the proc filesystem. See the sidebar on kern_mount for more details as to what happens here.
Lines 54–78
The call to proc_misc_init() is what creates most of the entries you see in the /proc filesystem. It creates entries with calls to create_proc_read_entry(), create_proc_entry(), and create_seq_entry(). The remainder of the code block consists of calls to proc_mkdir() for the creation of directories under /proc/, the call to the proc_tty_init() routine to create the tree under /proc/tty, and, if CONFIG_PROC_DEVICETREE is set at configuration time, the call to the proc_device_tree_init() routine to create the /proc/device-tree subtree.
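For a feel for the API that proc_misc_init() builds on, here is a minimal 2.6-style module that publishes a read-only entry; the entry name and message are invented for illustration:
----------------------------------------------------------------------
Example: minimal /proc read entry (illustrative module)
#include <linux/module.h>
#include <linux/proc_fs.h>

static int hello_read_proc(char *page, char **start, off_t off,
                           int count, int *eof, void *data)
{
    int len = sprintf(page, "hello from /proc\n");
    *eof = 1;                 /* one short read; no paging needed */
    return len;
}

static int __init hello_init(void)
{
    /* Creates /proc/hello_example (mode 0444) at the /proc root. */
    if (!create_proc_read_entry("hello_example", 0, NULL,
                                hello_read_proc, NULL))
        return -ENOMEM;
    return 0;
}

static void __exit hello_exit(void)
{
    remove_proc_entry("hello_example", NULL);
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");
----------------------------------------------------------------------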
8.5.31. The Call to init_idle()
Line 490
init_idle() is called near the end of start_kernel() with parameters current and smp_processor_id() to prepare start_kernel() for rescheduling:
----------------------------------------------------------------------
kernel/sched.c
2643 void __init init_idle(task_t *idle, int cpu)
2644 {
2645 runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle));
2646 unsigned long flags;
2647
2648 local_irq_save(flags);
2649 double_rq_lock(idle_rq, rq);
2650
2651 idle_rq->curr = idle_rq->idle = idle;
2652 deactivate_task(idle, rq);
2653 idle->array = NULL;
2654 idle->prio = MAX_PRIO;
2655 idle->state = TASK_RUNNING;
2656 set_task_cpu(idle, cpu);
2657 double_rq_unlock(idle_rq, rq);
2658 set_tsk_need_resched(idle);
2659 local_irq_restore(flags);
2660
2661 /* Set the preempt count _outside_ the spinlocks! */
2662 #ifdef CONFIG_PREEMPT
2663 idle->thread_info->preempt_count = (idle->lock_depth >= 0);
2664 #else
2665 idle->thread_info->preempt_count = 0;
2666 #endif
2667 }
-----------------------------------------------------------------------
Line 2645
We store the run queue of the CPU that we're on and the run queue of the CPU that the given task idle is on. In our case, with current and smp_processor_id(), these run queues are equal.
Lines 2648–2649
We save the IRQ flags and obtain the lock on both run queues.
Line 2651
We set the current task of the run queue of the CPU that we're on to the task idle.
Lines 2652–2656
These statements remove the task idle from its run queue and move it to the run queue of cpu.
Lines 2657–2659
We release the locks on the two run queues that we previously locked. Then, we mark task idle for rescheduling and restore the IRQ flags that we previously saved. Finally, we set the preemption counter if kernel preemption is configured.
8.5.32. The Call to rest_init()
Line 493
The rest_init() routine is fairly straightforward. It essentially creates what we call the init thread, releases the initialization kernel lock, and calls the idle thread:
----------------------------------------------------------------------
init/main.c
388 static void noinline rest_init(void)
389 {
390 kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
391 unlock_kernel();
392 cpu_idle();
393 }
-----------------------------------------------------------------------
Line 388
You might have noticed that this is the first routine start_kernel() calls that is not __init. If you recall from Chapter 2, when a function is preceded by __init, the memory that maintains the function's code and variables is cleared/freed once initialization nears completion. This is done through a call to free_initmem(), which we see in a moment when we explore what happens in init(). The reason rest_init() is not an __init function is that it starts the init thread before it completes (meaning the call to cpu_idle()). Because the init thread executes the call to free_initmem(), there is the possibility of a race condition whereby free_initmem() is called before rest_init() (or the root thread) has finished.
Line 390
This line creates the init thread, which is also referred to as the init process or Process 1. For brevity, all we say here is that this thread shares all kernel data structures with the calling process. The kernel thread calls the init() function, which we look at in the next section.
Line 391
The unlock_kernel() routine does nothing if only a single processor exists. Otherwise, it releases the BKL.
Line 392
The call to cpu_idle() is what turns the root thread into the idle thread. This routine yields the processor to the scheduler and is returned to when the scheduler has no other pending process to run.
At this point, we have completed the bulk of the Linux kernel initialization. We now briefly look at what happens in the call to init().