A segment may consist of many pages; such memory is said to be page-segmented.
In that case the segment descriptor is a table of page addresses.
Old Macintoshes used pages of 64 KB,
and each application was given one stack page and one code page.
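To make this concrete, here is a minimal sketch in C of how such a page-table
descriptor can be used to translate an address. The 16-entry table, the field
names and the 64 KB page size are assumptions for illustration only, not a
description of the Macintosh hardware.

    #include <stdint.h>

    #define PAGE_BITS 16                    /* 64 KB pages            */
    #define PAGE_SIZE (1u << PAGE_BITS)

    typedef struct {
        uint32_t page_frame[16];            /* descriptor: table of page addresses */
        uint32_t page_count;                /* pages actually in this segment      */
    } segment;

    /* Translate an offset within the segment to a physical address. */
    uint32_t translate(const segment *seg, uint32_t seg_offset) {
        uint32_t page   = seg_offset >> PAGE_BITS;       /* which page of the segment */
        uint32_t offset = seg_offset & (PAGE_SIZE - 1);  /* where inside that page    */
        /* a real MMU would raise a fault here if page >= seg->page_count */
        return seg->page_frame[page] + offset;
    }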
The Processor
Efficient use of processor resources depends more on the logic of the processes
than on the physics of the processor itself.
The technology for sharing processor resources was developed back in the 1960s,
and little has changed since then.
The terms process and task are essentially synonyms,
and from the point of view of allocating processor time there is also
no fundamental difference between a thread and a process.
The terms timesharing and multitasking appeared in the 1960s
and refer to supporting many processes on a minimal number of processors.
The terms multiprocessing and multitasking are not synonyms;
multiprocessing refers to a system with many CPUs.
Back in 1970 a Unix user at a remote terminal could afford the luxury
of running no more than about four processes at once.
In later window-oriented systems, one or more processes
are attached to each window.
But on most personal computers, from the 1970s to the mid-1990s,
multitasking was in its infancy.
Multitasking is implemented by the CPU scheduler.
The scheduler maintains a queue of processes.
Each time a process is selected from this queue,
a context switch takes place.
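As an illustration, here is a minimal sketch in C of the kind of ready-queue
structure a scheduler might keep; the PCB fields and names are assumptions
chosen only for this sketch.

    #include <stddef.h>

    /* A process control block (PCB); only the fields needed to keep the
       process on the ready queue and to restore it on a context switch. */
    typedef struct pcb {
        void       *saved_sp;     /* saved stack pointer / register context */
        struct pcb *next;         /* link for the FIFO ready queue          */
    } pcb;

    static pcb *ready_head = NULL, *ready_tail = NULL;

    /* Put a process at the tail of the ready queue. */
    void enqueue_ready(pcb *p) {
        p->next = NULL;
        if (ready_tail) ready_tail->next = p; else ready_head = p;
        ready_tail = p;
    }

    /* Take the next process to run from the head of the queue;
       switching to it (restoring saved_sp and so on) is the context switch. */
    pcb *dequeue_ready(void) {
        pcb *p = ready_head;
        if (p) {
            ready_head = p->next;
            if (!ready_head) ready_tail = NULL;
        }
        return p;
    }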
Input/Output
When several requests arrive at a single device at the same time,
as happens, for example, with a network card,
the system must handle all of them.
For a single-user system a simple first-in-first-out I/O policy is good enough.
On multi-user servers the slightest flaw in the I/O scheduler can bring the system down.
Disk Space
Disk space is one of the most important resources managed by the system.
The main job of the file system is to allocate space for storing files.
The mechanism for allocating disk space is fundamentally different from
the mechanism for allocating memory.
Disk devices are structured in blocks,
and they have their own limits on the number of read and write operations.
Security and Reliability
A protection mechanism is a necessary component of an operating system,
although some systems manage to get by without one.
Security is the ability of a system to withstand external attempts to break in.
Reliability is the ability of a system to prevent external and internal damage to itself.
Resource protection is needed wherever a break-in is possible at all.
Every user in the system must have protected access to his or her own
resources - memory, the file system, and so on.
The protection problem arises whenever several users
or processes need access to shared resources.
Reliability implies protecting the system both from user applications
and from any other system-level components.
In the early days of operating systems a single error could crash the whole system.
In modern multiprocessor systems, when one processor fails,
the rest of the system keeps running.
In large-scale systems the occurrence of errors is the norm.
Exercise: Consider the internet, with around one
million machines. Assuming each machine has a mean-time between
failures of one month, what is the rate of failures, per unit time?
Process Management
Process states:
                  --- Waiting <---
          signal /                \ wait
                /                  \
               V   <---preempt---   |
            Ready                 Running
               ^   ---schedule---> |
                \                  /
          create \                / terminate
                  ----- Dead <-----
- Running
  A running process may wait for events or may be preempted by another process.
  At any given moment only one process can be running.
- Ready
  A process in the ready state may become the running process
  whenever the scheduler decides so.
- Waiting
  All other processes are in the waiting state.
- Dead
  A process that has terminated.
If there is no process left to run, there is still the null process,
the so-called idle process.
This process may busy itself with system chores:
diagnosing memory, collecting garbage, and so on.
Waiting processes are kept in a queue structure.
This may be a linked-list FIFO queue supporting the two usual operations
(enqueue and dequeue), in which all processes have the same priority.
If processes are assigned priorities, the queue is no longer FIFO.
Most real-time systems assign each process its own priority.
In Unix a process may change and even gain priority,
for example when it does I/O, or it may lose priority
when it is preempted.
Preemption Mechanism
The running process may be preempted by another process.
This may be done, for example, by a real-time-clock interrupt when the
time slice expires, as shown in the following pseudo-code:
Clock interrupt service routine:
begin
Disable Interrupts (if not already disabled)
-- Save process state of interrupted process
current.pc := interrupt-return.pc
current.registers := interrupt-return.registers
-- Change interrupted process state to ready
Enqueue( current, ready queue )
-- Make new process into the current process
current := Dequeue( ready queue )
-- Restore process state of new process
interrupt-return.pc := current.pc
interrupt-return.registers := current.registers
Enable Interrupts (if necessary)
return
end
In older systems the timer fired at a fixed rate,
for example 60 times per second.
In modern systems the scheduler may adjust the time slice
(the interval between interrupts) depending on the behavior of the processes.
Interrupts must be disabled while the running process
is being switched.
Semaphores
The basic classical scheduler supports two main process states: ready and running.
A separate question is what to do with a process that is waiting for something.
Such a process could be put into a polling loop:
loop while desired condition not yet met
relinquish
endloop
Semaphores were introduced to solve this problem
(E. W. Dijkstra, Hierarchical Ordering of Sequential Processes,
Acta Informatica Vol. 1, 1971, pages 115-138).
In Unix, semaphores appeared with System V.
A semaphore is an object used to coordinate processes.
For each condition it waits on, a process needs a separate semaphore;
a process waits for an event with which a unique semaphore is associated.
A process that waits on a semaphore is blocked until some other
process signals that semaphore.
A semaphore can be represented by a single integer:
wait( sem ) or P( sem ) or sem.wait or sem.P:
while sem <= 0 do nothing
sem = sem - 1
return
signal( sem ) or V( sem ) or sem.signal or sem.V:
sem = sem + 1
P and V are the waiting and signalling operations;
P can be remembered as "pause".
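For comparison, here are the same two operations written against the POSIX
semaphore interface, where sem_wait plays the role of P and sem_post the role
of V (a usage sketch only; error checking is omitted):

    #include <semaphore.h>
    #include <stdio.h>

    int main(void) {
        sem_t sem;

        sem_init(&sem, 0, 1);   /* not shared between processes; count = 1 */

        sem_wait(&sem);         /* P: decrement, blocking while the count is 0 */
        printf("inside the section guarded by the semaphore\n");
        sem_post(&sem);         /* V: increment, waking one waiter if any */

        sem_destroy(&sem);
        return 0;
    }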
A fuller representation of a semaphore:
semaphore representation
queue: a queue of processes blocked on this semaphore
count: an integer
If the count is zero, the queue holds the list of processes blocked
on the semaphore; if the count is nonzero, the queue is empty.
A typical implementation of the semaphore operations:
wait( sem ) or P( sem ) or sem.wait or sem.P:
    Disable Interrupts
    if sem.count > 0
        sem.count = sem.count - 1
    else
        -- Save process state
        current.pc = proc-return.pc
        current.regs = proc-return.regs
        -- Change state to waiting
        enqueue( current, sem.queue )
        -- Make new process current process
        current = Dequeue( ready queue )
        -- Restore process state
        proc-return.pc = current.pc
        proc-return.regs = current.regs
    endif
    Enable Interrupts
    return

signal( sem ) or V( sem ) or sem.signal or sem.V:
    Disable Interrupts
    if empty( sem.queue )
        sem.count := sem.count + 1
    else
        temp := dequeue( sem.queue );
        enqueue( temp, ready queue )
    endif
    Enable Interrupts
    return
This scheme is correct as long as all processes have the same priority.
Otherwise a low-priority process may be run while higher-priority
processes sit in the queue.
This is called priority inversion.
Exercise: Fix signal so it does not cause priority inversions
when processes have an integer attribute called priority, and assuming
that all queues of processes honor this priority field.
Petri Nets
In 1961 Conway, thinking about multiprocessor systems,
proposed the following scheme for describing parallelism in application programs.
It is conveniently expressed with Petri nets:
     ___            ___
    |   |    |     |   |
    | A |--->|---->| B |        A occurs before B
    |___|    |     |___|
     ___            ___
    |   |    |     |   |
    | A |--->|---->| B |
    |___|    |     |___|
      |
    __V__                       A occurs before either B or C, but
      |                         not both; A could be a conditional
     _V_                        test, with control flowing to either
    |   |                       B or C depending on the condition.
    | C |
    |___|

                     ___
                    |   |
                    | A |
                    |___|
                      |
                    __V__
                      |
     ___             _V_        C will occur after A
    |   |    |      |   |       and C will occur after B
    | B |--->|----->| C |       entirely independently.
    |___|    |      |___|
A Petri net is a graph made up of boxes (places) and vertical bars (transitions).
We put a token at the starting place and then move it
from box to box through the transitions of the net.
Here is a parallel program:
            ___
           |   |
           | A |              The fork operation:
           |___|              A occurs before both
             |                B and C in parallel.
      _______V_______
      |             |
     _V_           _V_
    |   |         |   |
    | B |         | C |
    |___|         |___|
        ___           ___
       |   |         |   |
       | A |         | B |
       |___|         |___|
         |             |
       __V_____________V__
                |
               _V_            The join operation:
              |   |           A and B must both
              | C |           occur before C happens.
              |___|
These can be generalized to N-way forks and N-way joins.
Parallelism in a Program
Consider a program consisting of four statements: A, B, C and D.
We want A to be executed first,
then B and C in parallel, and then D.
Conway used the following graphical notation:
            ___
           |   |
           | A |
           |___|
             |
      _______V_______
      |             |
     _V_           _V_
    |   |         |   |
    | B |         | C |
    |___|         |___|
      |             |
      __V___________V__
             |
            _V_
           |   |
           | D |
           |___|
We will use begin-end blocks from Algol.
Everything inside such a block is executed sequentially.
For parallel execution we use a cobegin ... coend block:
A;
cobegin
B;
C;
coend
D;
It is understood that this parallel execution takes place on a multiprocessor system.
The same thing can be written in another notation:
A;
B, C;
D;
Implementing Parallelism
There are two processes, a parent and a child.
The parent executes the branch A-B-D, and the child executes C.
Parent process:             Child process:

A;
fork(L);                    L: C
B;                          join(V);
await-join(V);
D;
In Conway's scheme the two processes share common memory.
Conway introduced a special shared variable to synchronize the two processes;
in today's terms this variable looks much like a semaphore.
The following UNIX code solves the same problem:
A;
if (fork() == 0) {
B; /* child */
exit(0); /* kill child, signal parent */
} else {
C; /* parent */
wait(NULL); /* await signal from child */
}
D;
Here A, C and D are executed in the parent process;
fork() creates a child process in which B is executed.
In Unix every process has its own address space.
Operation B will see all the changes made in A,
but any changes made in B will not be visible in C or D.
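A self-contained variant of the fragment above, with the letters replaced by
print statements and a variable (here called x, purely for illustration) that
shows the effect of the separate address spaces:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int x = 0;                       /* each process gets its own copy   */

    int main(void) {
        printf("A: x = %d\n", x);    /* A runs in the parent             */
        x = 1;                       /* the child will inherit x == 1    */

        if (fork() == 0) {
            x = 2;                   /* B: changes only the child's copy */
            printf("B (child): x = %d\n", x);
            exit(0);                 /* terminate child, signal parent   */
        } else {
            printf("C (parent): x = %d\n", x);
            wait(NULL);              /* await termination of the child   */
        }
        printf("D (parent): x = %d\n", x);   /* still 1, not 2           */
        return 0;
    }

Running it shows that D still prints x = 1: the assignment made by B in the
child never reaches the parent.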
Exercise: Look up the fork(), exit() and wait() primitives in the
Unix Programmer's Reference Manual. They are in section 2 of the manual
(system calls); the Unix man command under the Unix shell searches
this manual. Try man man for help on this command.
Exercise: A; cobegin B; C; D; coend; E;
Write this in C to run under Unix. There are two classes of solutions,
differing in whether one parent has two children, or whether the parent
has one child that in turn has a child (the grandchild of the parent).
Read the manual carefully! One of these variants is harder to
program than the other.
Mutual Exclusion
Consider the problem of counting the nonzero elements of a 1,000,000 element
array. Here is a parallel solution, expressed informally using two processes:
cobegin
-- process A
for i := 1 to 500,000 do
if a[i] <> 0
then count := count + 1;
-- process B
for j := 500,001 to 1,000,000 do
if a[j] <> 0
then count := count + 1;
coend
After processes A and B have completed, you'd expect that count would
correctly reflect the desired quantity, but there is a significant chance
that it won't, because the operation of incrementing count in each of these
processes will most likely be accomplished by some sequence of operations
such as the following:
load register with count
increment register
store register to count
Unfortunately, even on a uniprocessor, it is possible that both
processes A and B will attempt to execute this same instruction
sequence at the same time, and one of the possible instruction sequences
will be something like the following:
process A                        process B

load rA with count
        -- preempt process A --
                                 load rB with count
                                 increment rB
                                 store rB in count
                                 ...
        -- preempt process B --
increment rA
store rA to count
It is not hard to see that each time something like this occurs, one or
more increment operations will be miscounted. On a uniprocessor using
preemptive scheduling once per timeslice, once the CPU is preempted from
process A and given to B as shown above, one full time-slice of computation
by process B will be lost when process A regains control!
To prevent this, it is
essential that only one process at a time attempt to increment count.
Therefore, we say that this sequence of instructions is a
critical section, or that access by processors to count must
be mutually exclusive. In general, whenever there is a shared variable,
code that accesses that shared variable will potentially involve critical
sections, and each of these must be guarded!
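The lost-update scenario above is easy to reproduce with POSIX threads; the
following sketch (compile with -pthread) is only an illustration of the bug,
not part of the original example. The final total normally comes out well
below the expected two million.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    /* volatile only keeps the compiler from collapsing the loop;
       it does NOT make the update atomic. */
    static volatile long count = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (long i = 0; i < N; i++)
            count = count + 1;       /* not atomic: load, add, store */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("expected %d, got %ld\n", 2 * N, count);
        return 0;
    }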
Simple Mutual Exclusion on Uniprocessors
On a uniprocessor, mutual exclusion may be provided by disabling interrupts
on entry to a critical section and reenabling them on exit.
disable interrupts -- begin critical section
load register with count
increment register
store register to count
enable interrupts -- end critical section
Disabling interrupts prevents preemption, and so long as the
code within the critical section includes no relinquish or signal
operations, we have a guarantee that no other processes will run.
This approach is very common in systems such as small microcontrollers
that have no operating system and it is very common inside the kernel of
an operating system, but there are two problems that limit its use:
First, the instructions to enable and disable interrupts are usually
privileged, which is to say, the operating system kernel and hardware
cooperate to prevent users from using these instructions. If this were
not done, a user could execute the disable interrupt instruction and
then go into an infinite loop, shutting down the system. Therefore,
this approach cannot be used by user programs on protected systems.
Second, this approach is not selective. If one process is incrementing
a particular shared variable, it only needs to exclude other processes
that are intent on modifying the same variable. If there are many programs
that operate on shared variables, and if each of them shuts down all
scheduler activity when it enters a critical section, the net result
can be serious performance problems.
Exercise: The code given in the previous lecture for implementing a
scheduler relied on exactly this method! Find the critical sections in that
code.
Simple Mutual Exclusion on Multiprocessors
Many CPUs designed for use in multiprocessors, particularly in the
1960's and 1970's, included instructions such as exchange(r,m) or
test-and-set(m). These are atomic, which means that, if one CPU is
executing one of these operations, no other CPU may intervene.
The exchange operation swaps the contents of a register with the contents
of a location in memory, while the test-and-set operation sets a memory
location to zero while testing its previous value and reporting the
results in the condition codes.
Given one or another of these operations, the variable count
used in our running example may be protected with a new variable
called a spin lock. As a convention, the spin lock protecting a
variable, for example, count will be called count-lock.
Given this, we can implement our critical section as:
lock( count-lock ) -- begin critical section
load register with count
increment register
store register to count
unlock( count-lock ) -- end critical section
The two states of a spin lock are locked and unlocked; any
two values may be used to represent these states, however, we will use
zero to represent locked and one to represent unlocked because of the way
these values generalize to the use of semaphores.
Initially, the lock should be one; lock sets it to zero, and unlock
resets it to one. Lock and unlock can be implemented as follows:
lock( v )
register := 0;
repeat
exchange( register, v );
until register = 1;
unlock( v )
v := 1;
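On current hardware the same idea is usually written with the C11 atomics
library rather than with an explicit exchange instruction. A minimal sketch;
note that, unlike the 0/1 convention above, atomic_flag treats "set" as
locked and "clear" as unlocked:

    #include <stdatomic.h>

    /* atomic_flag_test_and_set atomically sets the flag and returns its
       previous value -- the same idea as the test-and-set and exchange
       instructions discussed above, with the opposite polarity. */
    typedef struct {
        atomic_flag flag;
    } spinlock;

    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }    /* starts unlocked */

    void spin_lock(spinlock *s) {
        /* keep trying until the previous value was "clear" (unlocked) */
        while (atomic_flag_test_and_set(&s->flag))
            ;                                     /* busy wait */
    }

    void spin_unlock(spinlock *s) {
        atomic_flag_clear(&s->flag);
    }

A lock guarding count would then be declared, for example, as
static spinlock count_lock = SPINLOCK_INIT;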
Spin locks have the disadvantage that they involve busy waiting.
Nonetheless, they are the preferred mutual exclusion mechanism
for very short critical sections on shared-memory multiprocessors.
Specifically, if the expected waiting time for entry to the critical
section in the event of contention is shorter than the time it would
take to do two context switches (switching the waiting process out of
the CPU and back in again, under the control of the scheduler), this is
the preferred method.
On some uniprocessor operating systems, including much of the older code
in UNIX, the poor design of the scheduler or compatibility with older versions
of the system forces the use of spin-locks.
Typically, in high level low-priority code, a relinquish operation of
some type is
inserted in the busy waiting loop in order to let other ready processes use
the CPU while one process waits for the lock.
In modern computer architectures, those designed in the 1980's or later,
both the test-and-set and exchange operations are uncommon. Instead, it
is common to find what is known as a snooping mechanism in each CPU,
along with a pair of special instructions, load-locked and store-conditional.
The load-locked instruction loads from a memory location to a register
and sets the snooping mechanism to monitor the bus and note any attempt
to change the contents of that location. The store-conditional instruction
stores from register to memory, but only if the snooping mechanism is
currently monitoring the destination address and there has been no change
to that address since the most recent load-locked on this CPU. The
store-conditional instruction always reports its success or failure, for
example, using a condition code.
Exercise: Write code for the lock and unlock operations using
test-and-set.
Exercise: Write code for the lock and unlock operations using
load-locked and store-conditional.
Mutual Exclusion Without (much) Help from Hardware
The problem of mutual exclusion on multiprocessors where there are no
atomic exchange or test-and-set operations has been the subject of years
of study, during which many ingenious solutions to the
critical section problem have been found. Dekker's solution is the oldest
of these, solving the problem for 2 processes with ID's 0 and 1, and
making no assumption about special hardware support other than the
fact that the operations of reading or writing a variable are atomic.
Dekker represented each spin-lock by three variables:
state[0..1] -- process i sets state[i] to enter section
turn -- indicates whose turn it is to try next
Given this, and that all variables are initially zero, the code for
the lock and unlock operations is as follows, assuming that each process
passes its process ID to the lock and unlock operations:
other(pid)
return( 1 - pid );
lock(pid):
state[pid] = 1;
while state[other(pid)] = 1 do
state[pid] = 0;
while turn <> pid do nothing;
state[pid] = 1;
end loop;
return;
unlock(pid):
state[pid] = 0;
turn = other(pid);
return;
When A is inside the critical section, state[A] is one and
state[B] is zero. If both try to enter at the same time,
so both state variables are set to one, they both enter the loop
waiting for the other to reset their state to zero. One will immediately
reassert their claim, while the other is blocked awaiting its turn.
Furthermore, on exit from the critical section, each process always
sets turn to give priority to the other process, preventing one or
the other from hogging the critical section.
This code is difficult to understand! Furthermore, it is difficult to
generalize to more than two processes. One interesting generalization
uses a binary tree of processes, where processes enter the tree at the
leaves and use a different process ID at each internal node, for example,
using 0 if they came to that node from the left branch and 1 if they
came to that node from the right branch. There are many more solutions
to this problem, some of which we will discuss as the semester continues.
A General Mutual Exclusion Mechanism
Semaphores can be used to provide mutual exclusion!
The variable count used in our running example may be
protected with a semaphore; we'll call this count-sem. Given
this, we can implement our example critical section as:
P( count-sem ) -- begin critical section
load register with count
increment register
store register to count
V( count-sem ) -- end critical section
Initially, the semaphore should have a count of one; the P operation sets
it to zero, and the V operation resets it to one.
The parallel between this and the spin-lock solution should be obvious!
In fact, spin-locks are considered to be one implementation of a special
case of semaphores called binary semaphores. Binary semaphores are
those that never take on values other than zero and one.
Semaphores implemented using the process scheduler are the preferred
implementation of user-level and some system-level mutual exclusion on
uniprocessors, and they are the preferred implementation on multiprocessors
for those critical sections where the expected waiting time for entry
to the section in the event of contention is greater than the time
taken for two context switches.
Exercise:
Suggest a rule for system programmers on uniprocessors: When should
the system programmer use semaphores to guard a critical section, and when
should the system programmer simply disable interrupts?
Interprocess Communication
Producers and Consumers
Mutual exclusion is not enough! All it provides is the illusion that
entire critical sections are atomic, but effective parallel programming
rests on more than this.
Consider a situation where one process produces data to be consumed by
another:
Producer --> Consumer
The producer can easily fill a shared variable, then let the consumer
empty the variable, and obviously, both the filling and emptying of the
shared variable should appear atomic from the perspective of other processes.
Mutual exclusion is sufficient for this, but it is not enough:
there must be some way for the producer to signal to the consumer that
data is available!
Signalling as an Abstraction
One way to view this signalling problem is to think purely in terms of
the sequence of computations. We want the producer to produce data, and
only when the producer has produced do we want to allow the consumer to
consume. If the producer produces, but the consumer is not yet ready, we
do not want to make the producer wait, but if the consumer is ready and
the producer has not yet produced, we must block the consumer.
Using Petri-net notation, we can express these constraints as follows:
     Process A                      Process B
         |                              |
    _____V_____                         |
   |  Produce  |                        |
   | some data |                        |
   |___________|                        |
         |                              |
      ___V___________  X                |
         |           |                  |
         |        ___V___               |
         |       |       |              |
         |       |   S   |              |
         |       |_______|              |
         |           |                  |
         |        ___V__________________V___  Y
         |                  |
         |             _____V_____
         |            |  Consume  |
         |            | the data  |
         |            |___________|
         |                  |
This is just a Petri-net fragment because it shows nothing of
the producer or consumer control flow other than the actual produce
and consume steps. If the two transitions in this net fragment are
viewed as fork and join transitions, then we have not only a producer
and consumer process, but also a short process from X to Y with only
one place, S, and no internal transitions. This is not a productive
view of the place S!
For a better understanding of S, notice that, after each time the
producer performs the produce step, a token is added to the place S,
and before each time the consumer performs the consume, a token is removed
from the place S. The count of tokens in S is therefore a count of the
number of items that have been produced but not yet consumed. Therefore,
we can model S in a program by a semaphore, and we can model the
transition X as a signal operation on this semaphore, while we model
the transition Y as a wait operation on this semaphore.
In fact, this model generalizes. All semaphores can be modeled as
places in a petri net, with each signal operation modeled by a fork
that deposits a token in that place, and each wait operation modeled
by a join that consumes a token from that place.
This understanding of synchronization as being theoretically equivalent
to control flow is a relatively new insight!
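Read this way, the fragment translates almost directly into code: the place S
becomes a counting semaphore that starts at zero, transition X becomes a
signal (V), and transition Y becomes a wait (P). A sketch using POSIX threads
and semaphores, with the loop bound and messages chosen only for illustration:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    /* The place S modeled by a counting semaphore: the producer's
       sem_post is transition X, the consumer's sem_wait is transition Y. */
    static sem_t S;

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 0; i < 5; i++) {
            printf("produced item %d\n", i);  /* produce some data      */
            sem_post(&S);                     /* X: put a token in S    */
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (int i = 0; i < 5; i++) {
            sem_wait(&S);                     /* Y: take a token from S */
            printf("consumed item %d\n", i);  /* consume the data       */
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        sem_init(&S, 0, 0);                   /* S starts with no tokens */
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        sem_destroy(&S);
        return 0;
    }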
Bounded Buffers
The simplest general implementation of a FIFO queue between a producer and a
consumer is called a bounded buffer. A bounded buffer has a fixed maximum
data capacity, but at any time, it may be empty, it may contain fewer items
than its capacity, or it may be full. The items in the queue are stored
in an array, the buffer, between the head and tail pointers, as follows:
     _________________________________
    |  |  |\\|\\|\\|\\|\\|\\|  |  |  |
    |__|__|\\|\\|\\|\\|\\|\\|__|__|__|
     Empty |       Data      | Empty
           |                 |
          Head              Tail
In the illustration above, the head pointer refers to the first item in
the queue, the item that will be returned by the next call to the dequeue
operation, and the tail pointer refers to the free space that will be
filled by the next call to enqueue. This arrangement assumes that the
pointers are incremented after each use. A second possibility is to
increment before use, in which case, the head pointer would reference
the free location from which data was most recently dequeued, and the
tail pointer would reference the data most recently enqueued.
The operations on this structure are simple; if we ignore the problems of
preventing enqueueing data in a full queue or dequeueing data from an
empty queue, they are:
enque(d)
buffer[tail] = d
tail = tail + 1 (modulo queue size)
dequeue(d)
d = buffer[head]
head = head + 1 (modulo queue size)
We can detect that the queue is empty or full fairly easily:
empty := (head = tail)
full := (head = tail)
Unfortunately, these conditions are the same, and we can only distinguish
them by noting the history. If, after enqueue, the head and tail pointers
are the same, the queue is full. If, after dequeue, this happens, the queue
is empty.
In applications where busy waiting is appropriate, the full and empty
conditions are sometimes distinguished by the simple expedient of wasting
one slot in the buffer; in effect, we declare the queue to be full when
there is actually one free spot remaining:
full := (head + 1 = tail) (modulo queue size)
Exercise: Propose other solutions to detecting the full and empty
conditions in a circular buffer, perhaps by counting the data elements in
the buffer. Compare your solution with the one proposed above! Does it
take more or less space for a bounded buffer of 256 characters? How about
a bounded buffer of 256 floating point numbers? Does your alternative have
a higher or lower run-time cost than the approach outlined above for
detecting the queue full and empty conditions?
Shared Queues
When a queue implemented as a bounded buffer is shared between multiple
processes, there are two different problems. One is assurance of mutual
exclusion, so that, for instance, no two processes try to enqueue an item
of data in the same queue slot, and the other is keeping track of the
number of available data items and free spaces. All of these can be
accomplished using semaphores:
Binary semaphores used to provide mutual exclusion:
headlock -- protects the head pointer
taillock -- protects the tail pointer
General semaphores used to count available resources:
free -- counts number of free spaces in the buffer
data -- counts number of data items in the buffer
Initially,
headlock.count = 1
taillock.count = 1
free.count = buffer capacity
data.count = 0
Given these semaphores, the operations on a shared bounded buffer queue
can be implemented as follows:
enque(d)
P(free) -- wait for free space to put data
P(taillock) -- wait for exclusive use of tail
buffer[tail] = d
tail = tail + 1 (modulo queue size)
V(taillock) -- release the tail
V(data) -- signal that data is available
return
dequeue(d)
P(data) -- wait for data to be available
P(headlock) -- wait for exclusive use of head
d = buffer[head]
head = head + 1 (modulo queue size)
V(headlock) -- release the head
V(free) -- signal free space available
return
This implementation of shared FIFO queues is universally applicable whenever
queues are used to communicate between processes that run under the process
manager. There are some variations worth noting:
First, the mutual exclusion locks are only needed if there are multiple
producers or consumers. When there is only one producer process, the
producer is the only process that will inspect and manipulate the tail
pointer, so no mutual exclusion semaphore is needed to protect the tail.
The same holds for the head.
Second, the critical sections in this code are very brief, so it is
unlikely that any process will wait long for entry. As a result,
in system-level code on a uniprocessor, these can be protected by
disabling interrupts, and in a multiprocessor setting, spin-locks are
appropriate.
It is also worth noting that if queue items are very large, we do not
need to read or write the item from the queue inside the critical section;
all we need to do is sample and increment the head or tail pointer in the
section, and then do the data copy outside, as follows:
enque(d)
local variable t
P(free) -- wait for free space to put data
P(taillock) -- wait for exclusive use of tail
t = tail
tail = tail + 1 (modulo queue size)
V(taillock) -- release the tail
buffer[t] = d -- copy data here!
V(data) -- signal that data is available
return
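For reference, here is a sketch of the basic shared bounded buffer in C with
POSIX primitives; the counting semaphores play the roles of free and data,
ordinary pthread mutexes stand in for the binary semaphores headlock and
taillock, and the names and capacity are illustrative assumptions:

    #include <pthread.h>
    #include <semaphore.h>

    #define CAPACITY 16

    static int buffer[CAPACITY];
    static int head = 0, tail = 0;

    static sem_t free_slots;                  /* counts free spaces      */
    static sem_t data_items;                  /* counts queued items     */
    static pthread_mutex_t headlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t taillock = PTHREAD_MUTEX_INITIALIZER;

    void queue_init(void) {
        sem_init(&free_slots, 0, CAPACITY);
        sem_init(&data_items, 0, 0);
    }

    void enqueue(int d) {
        sem_wait(&free_slots);                /* wait for free space     */
        pthread_mutex_lock(&taillock);        /* exclusive use of tail   */
        buffer[tail] = d;
        tail = (tail + 1) % CAPACITY;
        pthread_mutex_unlock(&taillock);      /* release the tail        */
        sem_post(&data_items);                /* signal data available   */
    }

    int dequeue(void) {
        int d;
        sem_wait(&data_items);                /* wait for data           */
        pthread_mutex_lock(&headlock);        /* exclusive use of head   */
        d = buffer[head];
        head = (head + 1) % CAPACITY;
        pthread_mutex_unlock(&headlock);      /* release the head        */
        sem_post(&free_slots);                /* signal free space       */
        return d;
    }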
Variations on Queues
If the buffer has a capacity of one, the head and tail pointers are
superfluous, as are the locks protecting them. In this case, the
producer and consumer processes would be expected to operate in lock-step,
producing one data item, then consuming it before the next is produced.
Exercise: Write the code for this limiting case for a bounded
buffer queue. How many semaphores are required? What purposes do the
remaining semaphores serve, and what range of values do they take on?
Lock step operation of the producer and consumer is frequently inefficient
because unless the computational requirements of the two processes are
identical, either the producer will spend time waiting for the consumer,
or vice versa. Nonetheless, this one-item queue allows the actual
computations of the producer and consumer to overlap, thus allowing
parallelism that would not have been possible in a purely sequential
setting.
From elementary queueing theory, we can predict the following: If the average
time taken to produce an item is the same as the average time taken to
consume an item, and the times vary randomly, the performance of the system
will be maximized with an infinite queue capacity and an infinite average
number of items in the queue. Furthermore, as the size of the bounded
buffer is decreased, the percentage of the time that the producer spends
waiting for space or the consumer spends waiting for data will increase
smoothly. Put briefly, so long as there is any available memory, performance
can be increased by using this memory to increase the queue capacity.
The bounded buffer with a capacity of two is a commonly used special case,
particularly in 1960's vintage input/output subsystems. In that context,
it is frequently described as "Double Buffered I/O".
Exercise: Write the code for this common case for a bounded
buffer queue. How many semaphores are required if you assume a single
consumer and a single producer? What purposes do the remaining semaphores
serve, and what range of values do they take on?
When multiple FIFO queues are implemented using linked lists instead of
arrays, it is common to have a single shared pool of free list items
instead of a bounded supply of list items for each queue. For a given
amount of memory, this can markedly improve performance, but it adds slightly
to the complexity of the code. Specifically, instead of having a separate
space semaphore to count the amount of free space in each queue, there is
a single free space semaphore.
Exercise: Write the code for linked list queues in the general
case where there are multiple queues, where each queue
may serve multiple producers and multiple consumers. Assume that there
is only one type of object enqueued in any of the many queues -- call it
a buffer, and assume that each buffer has a data field and however many
links are required for the list structuring scheme you use.
How many semaphores are required per queue, and how many additional
semaphores are shared by all queues? Identify the purpose of each!
Other problems
Do not be misled into thinking that the mutual exclusion and producer-consumer
problems are the only problems in parallel programming. They are the most
important and most common problems, but there are many others:
The readers-writers problem is a common problem in file-system design. A
file may either be closed, open for read access by any number of users, or
open for read-write access by exactly one user.
Problem: Implement the synchronization
required for the open-for-read, open-for-write and close operations
under the constraints of the readers-writers problem, using semaphores
and shared variables.
Deadlock, the situation that arises when multiple processes are blocked,
each awaiting some action by another blocked process, is a common failing
in parallel programs. There are complex algorithms for both deadlock
avoidance and deadlock detection. We will talk about these later.
Memory Management
Key Issues
The key questions for any resource management system are:
- WHO requested the resource?
- WHAT resource was requested?
- WHERE should the resource be allocated?
- HOW should the resource be allocated?
- WHY should the resource be allocated?
- WHEN should the resource be allocated?
WHO requested the resource?
This is not trivial! On most systems, resources are allocated to processes,
but this is not universal!
If the process is the object to which all allocation is attributed, then the
process becomes a very heavy object, and changing processes becomes expensive.
As a result, on systems where this is done, the system's notion of process
is frequently augmented with lightweight processes, to which no resources
other than the CPU are allocated. These are called threads.
On the old Berkeley Timesharing System, in the late 1960's, resources were
allocated to interactive users. Each interactive user had a single virtual
address space containing up to 64 pages of memory, and the user could have
any number of processes running in this address space, where at any time,
each process could directly address 8 pages of this 64-page address space.
Berkeley Timesharing System processes were very lightweight objects, with
a total process description contained in only 8 words of memory! Using
today's terminology, we would say that a Berkeley Timesharing System user
was a process, while the processes in this system correspond to threads
in modern terminology. In fact, the Berkeley system was the first to
incorporate this modern idea of multithreaded processes.
Aside: The word process originally always meant a locus of control, or some
entity in a system that could execute a sequential program. Another
good definition (though not usually used) is that each process runs on
a virtual CPU created by the scheduler.
This definition has been largely forgotten in the UNIX community,
where it is common to use the word process to refer
to the unit of activity to which allocated resources are bound, even
if that unit contains many lightweight threads, each of which is
technically a process in the older sense of the word. For now, we
must live with the resulting confusion.
WHAT resource was requested?
Memory can be allocated by the byte, the word, the page, or the segment.
In any particular context, the system typically only provides one unit
of allocation, and it is up to the user or the applications library to
manufacture the desired units of memory out of those allocated by the
system.
Bytes and Words are typically the smallest addressable units
of memory. Allocating by the byte or word is quite rare, except in systems
where the word is sufficiently large to accommodate two pointers. In that
case, randomly allocated words are sufficient to build arbitrary
binary-linked data structures, which in turn, can be used to mimic any
other data structure. This is the foundation of the LISP worldview.
For those unfamiliar with LISP, the world (containing both programs and
data structures) is represented entirely with S-expressions. The simplest
S-expressions are known as atoms; atoms include numbers (represented in 1
word) and textual atoms (represented using a special word that includes
the property list for the atom; one property in this list is the printable
name of the atom). Complex S-expressions may be formed by composing
S-expressions into lists. The compose operator (CONS) takes 2 S-expressions
as parameters and returns a new S-expression that joins these. The new
S-expression is represented by a word holding 2 pointers, one to the left
component (the CAR) and one to the right component (the CDR).
Pages are fixed sized blocks of memory, with the size typically
determined by some characteristic of the addressing hardware on the machine.
Because the page size is determined by the hardware and not by the application,
it is generally necessary to construct large logical objects out
of multiple pages, or to divide single pages into multiple logical objects.
Typically, the memory management unit hardware allows multiple pages to
occupy logically contiguous addresses independent of their physical
locations in memory, so that large objects can easily span many pages.
Segments are variable sized blocks of memory, with the size typically
determined by some characteristic of the application. Segments are typically
supported by hardware, while otherwise identical blocks of memory that are
purely software artifacts are usually referred to simply as Blocks.
Each block or segment typically contains one logical object, but sometimes,
the economics of allocation are such that the application sub-allocates
a block or segment.
WHERE should the memory be allocated?
A naive user typically assumes that blocks of memory requested by an
application are allocated directly in main memory, so that the application
directly uses main memory addresses. This was true in first-generation
systems, including both the mainframe systems of the 1950's and the
first generation of microcomputer operating systems in the 1970's and
1980's, but by 1970, most mainframe vendors were moving away from this
view, and by 1995, the operating systems on microcomputers were well on
their way to abandoning this simple view.
Most interesting operating systems only use physical memory addresses for
internal data structures of the operating system itself.
All user data structures are allocated in memory segments allocated to the
users by the system. This allows the system, by means of the memory
management unit, to protect itself from errant user programs, and it allows
user programs to be isolated from each other, with the only permitted
communication being through authorized channels.
The question of where memory is allocated has another interpretation:
Within any given address space from which memory may be allocated,
where should it be taken? The answer is trivial if all but the requested
amount has already been allocated. Another trivial case is when each
block of allocatable memory is identical and interchangeable with all others,
as is the case when the unit of allocation is the page or any other fixed
size block, for example, the LISP S-expression node.
A final aspect of this question arises in machines with non-uniform memory
access times (so-called NUMA architectures). On such machines, it is
important to try to allocate memory "near" the CPU that is the primary user
of that memory, where distance is measured in terms of expected access time.
HOW should the memory be allocated?
Allocation for all time may be different from temporary allocation, and
some systems allow preemption of memory, either with the cooperation
of the user to which the memory was allocated, or without notice.
Almost all preemptable resources are used for implementing virtual
resources. Preemptive schedulers create one virtual CPU per process.
Demand paged storage managers preempt page frames with every page
fault in order to create virtual address spaces. Window managers
preempt regions on the screen to support different overlapped virtual
screens (windows).
Non preemptable resources exist at the bottom level in each such
virtual resource hierarchy. The lowest level backing store in a
virtual memory environment cannot easily be preempted. The (possibly
virtual) memory used to implement a window manager is similar, as is
the memory used to hold the states of the ready and waiting processes
under a process manager.
Note that there are many different types of virtual resources in the
above example, but all are implemented in terms of the same underlying
non preemptable resource. This is a common pattern.
WHY should the resource be allocated?
Why not! It is up to the user to decide why they want a resource.
Operating systems have no basis for asking this question, although
it should be asked by program designers.
WHEN should the resource be allocated?
Can the user wait? Can the user give advance notice? Must the user give
notice? If we attempt to allocate physical resources on demand, and if
we block users who ask for resources that are currently unavailable,
we run the risk that one process may be blocked waiting for a resource
currently held by another process, while the other process is blocked waiting
for a resource held by the first. This is one example of the more
general problem of deadlock.
With preemptable memory resources, such as page frames, demand
allocation is common, but performance can frequently be improved
by either forcing some users to wait before their demands are met,
or by asking users to provide advanced notice of their expected demand.
Sometimes the system can anticipate a user's allocation request and
allocate resources before they are needed.
Memory management algorithms
Consider the simple case of allocating blocks of physical memory to system
or user code on a system with no protection mechanisms. With small changes,
the same mechanism may be used to allocate blocks from the virtual memory
segments of a user program. The interface to such
a memory manager is frequently specified by the programming language.
The definition of Pascal, for example, includes the special procedure
new(p) that sets the pointer p to point to a new block of memory
sized to fit an object of the type referenced by p. Most implementations of
Pascal include the nonstandard procedure dispose(p) that deallocates
the block referenced by p and sets p to nil.
The C standard library, similarly, includes the function
malloc(s) that returns a pointer to a new block of s bytes of
memory, and the function free(p) that deallocates the block referenced
by p, if that block was previously allocated by malloc. C is a
weakly typed language, so the programmer is responsible for making sure
that the size, in bytes, passed to malloc matches the size of the object
required.
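For example, the usual C idiom looks like this (the struct is arbitrary;
the point is that the programmer, not the compiler, supplies the size):

    #include <stdlib.h>

    struct point { double x, y; };

    int main(void) {
        /* the programmer, not the compiler, must get the size right */
        struct point *p = malloc(sizeof *p);
        if (p == NULL)
            return 1;          /* allocation can fail                */
        p->x = 1.0;
        p->y = 2.0;
        free(p);               /* return the block to the heap       */
        return 0;
    }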
It is easy to see that the C and Pascal memory management interfaces are
basically the same. Both allocate memory from a memory space called the
heap. It is noteworthy that C++ was originally implemented
by a preprocessor that converted the C++ program to an equivalent C program.
When the C++ preprocessor encounters code requiring the instantiation of a
new object, it generates C code calling malloc to create
the structure representing the object.
There are a huge number of algorithms for implementing the basic allocate
and deallocate operations. Two representative examples will be given here,
the binary buddy system and a boundary tag method.
Note that memory management was once a central problem on operating systems,
when user processes shared real memory with each other. In modern
uniprocessors, it may seem far less critical because user code runs in
virtual memory, and allocation and preemption of page frames is
relatively trivial.
The use of virtual memory may seem to push the burden of memory
management into the user processes, where it shows up either as an
application programming problem or as part of the language run-time
system, for example, in the implementation of new and dispose or
malloc and free in Pascal and C, or as the garbage collector
underlying an implementation of LISP, Simula, Smalltalk or Java.
Of course, systems must manage the backing store, but this is typically
allocated in units of interchangable disk tracks or sectors. There
are performance penalties for allocating disk space poorly, but the
problem is one of performance, not absolute necessity. Simple
allocation heuristics such as allocating that free sector closest to
the other allocated sectors of the same file (or virtual address space)
can give excellent results.
In fact, much of the message of the above paragraphs is incorrect. When an
application chooses the wrong allocation algorithm, it can paralyze the
system. On some UNIX systems from the late 1980's, for example, the system
needed to be periodically rebooted because the X-windows window manager,
an applications program, consumed all of the available virtual memory.
The problem was severe external fragmentation within the memory allocated
to the window manager by the system.
Problems faced in memory allocation:
Fragmentation is the term used to describe how the free space in
a pool from which memory is allocated is slowly broken up into many small
pieces. There may be hundreds of words of free space, and yet a request to
allocate a ten word block of memory may fail because there are no ten
consecutive free words!
Overhead is the term used to describe additional memory that must
be allocated, above and beyond that requested by the users, in order to
provide for management of the pool of memory. For example, it may require
a few words of memory to describe the location of each allocated memory block.
The consequence of fragmentation and overhead is that, although the total
user demand for memory may be less than the number of words available, the
system may be unable to meet that demand because the largest available block
of free memory is smaller than the smallest user demand that has yet to be
satisfied.
Fragments of free space, and fragmentation in general, comes from two
sources:
Internal fragments are fragments of free space that are allocated
to users even though the user does not request them. For example, on many
modern processors byte addressing is quite expensive, so each allocated
block of memory must be aligned on a word boundary. If a user wants an odd
number of bytes, this will require rounding up the user request to the next
word boundary, allocating space to that user that the user does not want.
External fragments are fragments of free space between allocated
blocks. External fragmentation can be eliminated by a compacting garbage
collector; one that can move blocks of memory around in the heap
in order to consolidate the fragments, but these are uncommon outside the
specialized area of run-time support for Java.
A trivial memory management algorithm
Allocate fixed size blocks with a size greater than the maximum
possible user request.
This leads to awful internal fragmentation unless all user requests are
for the same sized block. Allocation and deallocation are very fast,
requiring a single linked list, free-list, to keep track of free blocks.
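A sketch of such an allocator in C, with free blocks chained through their
own storage; the block size, pool size and names are arbitrary, and no
locking is shown:

    #include <stddef.h>

    #define BLOCK_SIZE  64
    #define POOL_BLOCKS 1024

    /* While a block is free, its first word is the free-list link;
       while it is allocated, the whole block belongs to the user. */
    typedef union block {
        union block  *next;
        unsigned char payload[BLOCK_SIZE];
    } block;

    static block  pool[POOL_BLOCKS];
    static block *free_list = NULL;

    void pool_init(void) {
        for (int i = 0; i < POOL_BLOCKS; i++) {   /* link every block   */
            pool[i].next = free_list;
            free_list = &pool[i];
        }
    }

    void *pool_alloc(void) {                      /* pop the list head  */
        block *b = free_list;
        if (b != NULL)
            free_list = b->next;
        return b;                                 /* NULL if exhausted  */
    }

    void pool_free(void *p) {                     /* push back on list  */
        block *b = p;
        b->next = free_list;
        free_list = b;
    }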
If there are a variety of block sizes being requested, this algorithm can
be improved by using a separate free list for each popular block size.
The problem with this is that it introduces a new fragmentation problem,
there may be plenty of free space, but only in sizes that don't match the
user's request. If blocks of one size can be combined to make larger blocks
or broken up to make smaller blocks, this problem is solved. The result
is a class of algorithms known as buddy systems.
The Binary Buddy System
In the simplest of the buddy systems, all block sizes are powers of two.
Any large block may be split in half to make two smaller blocks, and any
two blocks that were originally split from the same parent may be combined
to reconstruct that parent.
The central data structure used in all buddy systems is an array of free-lists:
       Array of
       free-list
       heads               ______          ______
        ________      ---->|_____o-------->|______|
  128K |________|     |    | 64k  |        | 64k  |
   64K |_______o------     |      |        |      |
   32K |_______o--------   |______|        |______|
   16K |________|      |    ______          ______
    8K |_______o----   -->|_____o-------->|_____o---
    4K |________|   |     | 32k  |        | 32k  |  |
    2K |________|   |     |______|        |______|  |
    1K |________|   |      ______          ______   |
   512 |________|   |     |______|<-------o______|<--
   256 |________|   |     | 32k  |        | 32k  |
       (indexed     |     |______|        |______|
        by block    |      ______          ______
        size)        ---->|_____o-------->|______|
                          |__8k__|        |__8k__|
Note that the above picture shows the state of a buddy system storage allocator
with an array of free lists that allows for blocks of size
up to 128k bytes. The system currently has two blocks of size
64k bytes, four blocks of size 32k bytes, and two blocks of 8k bytes.
When a block is in use, the block is entirely available to the
user. When the block is free, the first few bytes of the block
are used to hold a pointer to the next free block of the same size.
The minimum block size is determined by the size of a pointer.
If a user asks for less than this, there will be internal fragmentation
because we cannot form blocks into free lists if we cannot store
one pointer in each block.
The number of table entries required is determined by the maximum
and minimum block sizes. In the above example, the maximum was
128k (2^17), and the minimum on a 32 bit machine would
typically be 4 (2^2); in this case, the table itself
requires entries for sizes 2^2 through 2^17, that is, 16 entries. This is
the storage overhead of the binary buddy system.
Binary Buddy System Algorithms
to allocate a block of size n:
    pick i such that 2^(i-1) < n <= 2^i
    if freelist[i] is not empty
        return the first element of freelist[i] to the user
        and set freelist[i] to point to the next element
    else
        allocate a block of size 2^(i+1) (this is recursive)
        and split it in half.  Return half to the user
        and put the other half on freelist[i].
    end
The allocation code outlined above is recursive! Furthermore, because
we are constrained to allocate powers of two, if user requests are
for uniformly distributed random block sizes, we expect that the
typical request will only use 3/4 of each block, resulting in an average
loss to internal fragmentation of 1/4 of the total memory capacity.
to deallocate a block of size n:
    pick i such that 2^(i-1) < n <= 2^i
    compute b, the buddy of the deallocated block.
    if b is already in freelist[i]
        remove b from freelist[i].
        combine b and its buddy, and deallocate the
        resulting block of size 2^(i+1) (this is recursive)
    else
        add the deallocated block to freelist[i].
Again, this code is expressed in recursive form. It should be noted
that this version of deallocate uses what is referred to as prompt
merger of free blocks. That is, free blocks are merged with their
buddies as soon as possible after deallocation. The result of prompt
merger is that there will always be the best possible supply of large
free blocks, but if most of the traffic is in one particular block size,
there will be unnecessary computation involved in the repeated merger
and splitting of blocks.
Prompt merger is a good solution for real-time systems, where our concern
is a bounded worst-case time per operation. If, instead, we are more
worried about minimizing the total computation time, we should use what
is called lazy merger. That is, we should avoid merging free blocks for
as long as possible, letting the free lists for popular block sizes grow
as long as possible and exhausting the free lists for unpopular block sizes.
In lazy merger, free blocks are only merged when an allocation request
comes in for a block size where the free lists for that size and all larger
sizes are empty. At that point, the allocate routine begins performing
mergers of smaller free blocks until it builds a block large enough to
satisfy the request. This means that there may be a huge number of mergers
in response to one allocation request.
Exercise: For a total heap size of N words, what is the worst case
number of splits and mergers for allocate and deallocate? Do this first
for the prompt version, and then for the lazy version.
Finding the buddy of a block i of size s is easy in the binary buddy
system:
buddy(i) = i xor s
Note that the above rule violates most people's ideas of strong type checking,
since we compute the pointer to the buddy of a block by exclusive-oring the
size of the block (an integer) with the pointer to the block!
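In C the computation might look as follows; the cast through uintptr_t is
exactly the type-checking violation just mentioned. The code works with
offsets relative to heap_base (an assumed, suitably aligned start of the
managed heap) and a size that must be a power of two:

    #include <stddef.h>
    #include <stdint.h>

    /* Return the address of the buddy of the block at "block", where
       "size" is the block size (a power of two) and "heap_base" is the
       start of the heap that the buddy offsets are measured from. */
    void *buddy_of(void *block, size_t size, void *heap_base) {
        uintptr_t offset = (uintptr_t)block - (uintptr_t)heap_base;
        return (void *)((uintptr_t)heap_base + (offset ^ size));
    }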
The Fibonacci Buddy System
The Fibonacci buddy system is an alternative where block sizes are
Fibonacci numbers. This reduces internal fragmentation by providing
for a larger variety of free block sizes.
The Fibonacci numbers are 1 1 2 3 5 8 13 21 ...
Each Fibonacci number is the sum of its two predecessors. The problem
with this system is that there is no trivial way to compute the address
of the buddy of a block given the address of that block.
Exercise: Implement the Fibonacci buddy system.
Boundary Tag Storage Allocation
Boundary tags are data structures on the boundary between blocks in
the heap from which storage is allocated. The use of such tags allows
blocks of arbitrary size to be used -- the tag describes, for each block,
how big it is, its status, and its neighbors:
    HEAP
     ________
    |________| tag
    |  used  |
    |        |
    |        |
    |________|
    |________| tag
    |  used  |
    |________|
    |________| tag
    |  free  |
    |        |
    |________|
    |________| tag
    |  used  |
    |________|
    |________| tag
    |  free  |
    |________|
    |________| tag
The consequence of allocating arbitrary sized blocks on the heap is
that the bookkeeping needed to split blocks, merge adjacent free
blocks, find the right size block, and find the neighbors of a
block all become more complex. The boundary tag data structure
permits this, but we pay a price, the overhead of the tag itself.
The decision to use a boundary tag method poses many choices. Among these
are the lazy versus prompt merger decision that was present with the
buddy systems. Another important choice is that between best fit
and first fit allocation: When searching a collection of free blocks,
should we take the first block we find that is big enough to satisfy a user's
demand, or should we do a more careful search, looking for the best fit?
Intuition suggests that best-fit allocation should reduce fragmentation, but
in fact, if an exact fit is unavailable and there is a wide variety of
commonly requested block sizes, empirical evidence shows that first fit
generally works better because it produces larger fragments that are more
likely to prove useful at a later time. The problem with best-fit
allocation is that it produces large numbers of very small fragments.
It is very important to note that in most heap management environments, an
exact fit is very likely to be available! Most applications repeatedly
allocate and deallocate objects of only a few sizes, and a good storage
manager should take advantage of this. On the other hand, it is important
to remember that there are applications that have genuinely random
sequences of allocations. One example of such a random sequence of requests
comes when opening a succession of windows to view images, where each image
was hand cropped (trimmed to size) to eliminate distracting or irrelevant
content. Some web pages are full of such randomly sized images, and viewing
such pages can lead to significant heap fragmentation.
Much of the early research in the area of heap management was done by
the Burroughs Corporation in the early 1960's, but research in this area
continues. Some of the best recent papers in the area were published in
the Proceedings of the ACM SIGPLAN International Symposium on
Memory Management, published as SIGPLAN Notices, Vol 34, No 3,
March 1999. The most relevant paper in this proceedings is
"The Memory Fragmentation Problem: Solved?" by Johnstone and Wilson,
pages 26-48.
Boundary Tags
|_______|
|_______| <--
| | |-- Block boundary tags
| | | give
| block | | 1) Block status
| | | (used or free)
| | | 2) Block size
|_______| | 3) Addresses of neighbors
|_______| <--
| | (2 & 3 are correlated!)
With the buddy system, the heap manager needed no record of the
blocks allocated to users, but the arbitrary sized blocks used in
boundary tag methods require that the manager keep a record of each
block in the heap, whether it is allocated or free. These records
are easily maintained in the heap itself, in the form of a prefix or
suffix on each block. From the user's point of view, these prefixes
and suffixes are not part of the block -- they are part of the
overhead of the storage manager.
When the heap is viewed as a sequence of blocks, the suffix on one
block is adjacent to the prefix on the next block, and every boundary
between blocks, whether they are used blocks or free blocks, contains
a prefix-suffix pair. This pair is the boundary tag from which the
entire class of storage allocation methods takes its name.
For each block, the boundary tag must contain the status of that block
so that, at the appropriate time, it can be determined whether the
block is free, available for allocation or merger with another free
block, or in use, and therefore unavailable. Another piece of
necessary information is the size of the block, so that the allocation
algorithm can determine whether the block is big enough to satisfy some
allocation request. Finally, when it is necessary to merge free
blocks, it must be possible to find the neighbors of a free block in
order to see whether they are free.
The size of a block and the addresses of the neighbors are redundant,
since if you know the address of the following block and the size
of the boundary tag fields, you can compute the size of a block, and
if you know the size of a block, you can compute the address of either
end if you are given the address of the other end.
Boundary Tag Alternatives
First Alternative Second Alternative
| | | ^ |________|
| | | | |__size__|
->->|________| | | |
| |_status_| | | block |
| |__next_o-- |________|
| |__prev_o---- |__size__|
| | | | |_status_|
| | | v | |
There are two popular ways of storing the information needed in the
boundary tags. In the left method shown above, the boundary tags
are arranged into a doubly linked list. Each tag contains a pointer
to the next tag and to the previous tag, and each tag contains the
status of the block it immediately precedes. As shown in the
illustrations, all pointers point to the first word of a block in the
heap, so positive offsets from the pointer are indices into the block
and negative offsets are indices into the boundary tag. This
convention is convenient but not essential.
The right method shown above records the size of each block as a
prefix and suffix on that block, and includes the status in either
the prefix or suffix or, on some systems, both (above, it is shown
in the prefix).
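As a concrete illustration, here is one possible C layout for the second
alternative, with the block size stored as both a prefix and a suffix and the
status kept in the prefix. The type and field names are invented for this
sketch and are not taken from any particular allocator; the size field is
assumed to count the whole block, tags included:

    #include <stddef.h>

    #define BLOCK_FREE 0
    #define BLOCK_USED 1

    typedef struct prefix {
        size_t size;     /* size of the whole block, tags included */
        int    status;   /* BLOCK_FREE or BLOCK_USED */
    } prefix_t;

    typedef struct suffix {
        size_t size;     /* duplicate of the size in the prefix */
    } suffix_t;

    /* given a block's prefix, find the prefix of the next block in the heap */
    static prefix_t *next_block(prefix_t *p)
    {
        return (prefix_t *)((char *)p + p->size);
    }

    /* given a block's prefix, find the previous block by way of the suffix
       that sits immediately before this prefix */
    static prefix_t *prev_block(prefix_t *p)
    {
        suffix_t *s = (suffix_t *)((char *)p - sizeof(suffix_t));
        return (prefix_t *)((char *)p - s->size);
    }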
Boundary Tag Free Lists
^ | | In the worst case,
| | free | free blocks must
| |________| be arranged into
--o_______| a doubly-linked
-----o_______| free list. This
| ^ |////////| tag allows fast
| | | | allocation and eager
| | | used | merger.
| | |________|
| | |////////| tag For some other
| | | | choices of allocation
| | | free | method, a simpler
| | |________| free list structure
| --o_______| will suffice.
-> --o_______|
| |////////| tag
v | |
The free list is maintained in the free blocks
themselves, not in the boundary tags, so the minimum free block
size depends on the number of pointers per list element: two if
the list is doubly linked, typically one word each. The boundary
tag itself requires two pointers or two block sizes (typically,
the block size is the same size as a pointer) and one status bit
(typically a whole word), so in order to allocate a two word block
on the heap, it is typically necessary to find a total of five free words.
The above argument implies that boundary tag methods cannot eliminate
internal fragmentation! When a user requests a block of size X and
the only block available is size X + E (E is the extra part), then
the available block must be split into a block of size exactly X and
a block of size E - T (where T is the size of a boundary tag). If
E - T is smaller than needed to store the pointers needed to link it
into the free-list, then the split cannot be done. For the example
data structure shown above, E must be at least 5 words, so there will
be occasional internal fragmentation involving fragments of 4 or fewer
words.
In the extreme case, no free list is needed. When a user wishes to do
allocation, a linear search over the list of boundary tags will always
manage to find a free block, but if the heap is largely full, this
search will be unnecessarily slow as it examines all of the allocated
blocks in the heap in the search for a free block of the right size.
If free block merger is lazy, then the merger can be done by a linear
scan of the entire heap, rebuilding the free list as free blocks are
found and merged. In this case, the free list can be singly linked.
If free block merger is done on an eager basis (at deallocate time)
or on a lazy basis (at allocate time, during the scan of the free
list), where the free list is randomly linked with no attention to the
addresses of blocks in the heap, then double linking is needed because
when two adjacent free blocks are merged, one has to be removed from
the free list, and without a double link, its predecessor in the list
cannot be found.
The so-called fastest fit storage allocation algorithm uses a simple
heap manager with a singly linked free list as its foundation and lazy
block mergers; free blocks of popular sizes are stored in secondary
free lists, one for each popular block size. Allocations for commonly
used sizes check the appropriate secondary free list first, before
using the underlying heap manager. Deallocations of common block
sizes simply place the blocks in the secondary list (uncommon sizes can
be returned to the main free list).
Only when an allocation request to the underlying heap manager cannot
be satisfied are free blocks merged -- this is done by scanning the
entire heap and rebuilding the main free list, leaving all the
secondary free lists empty.
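The following C fragment sketches this idea. The names quick_list,
fast_alloc and fast_free are invented for the example; slow_alloc and
slow_free stand in for the underlying boundary-tag manager and are
assumptions, not real library calls. Block sizes are measured in words,
and every block is assumed big enough to hold one pointer:

    #include <stddef.h>

    #define MAX_QUICK 64    /* sizes 1..MAX_QUICK words get a quick list */

    typedef struct qnode { struct qnode *next; } qnode;

    static qnode *quick_list[MAX_QUICK + 1];   /* one singly linked list per size */

    extern void *slow_alloc(size_t words);            /* underlying heap manager */
    extern void  slow_free(void *block, size_t words);

    void *fast_alloc(size_t words)
    {
        if (words <= MAX_QUICK && quick_list[words] != NULL) {
            qnode *b = quick_list[words];   /* reuse a block of exactly this size */
            quick_list[words] = b->next;
            return b;
        }
        return slow_alloc(words);           /* fall back to the general manager */
    }

    void fast_free(void *block, size_t words)
    {
        if (words <= MAX_QUICK) {           /* popular size: just cache the block */
            qnode *b = block;
            b->next = quick_list[words];
            quick_list[words] = b;
        } else {
            slow_free(block, words);        /* uncommon size: return it directly */
        }
    }

Deallocation of a popular size never touches the underlying manager, which is
what makes the average case so fast.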
The fastest fit algorithm gives what may be the best average-case
performance, but its worst case performance is not good. For the worst
case, first-fit is better! This is one of many examples of situations
where the best algorithm for real-time use is not the same as the best
algorithm for minimizing the total run-time!
Exercise: Write code for a boundary tag storage manager, using
prompt merger.
Virtual Memory
What if the aggregate memory requirements of the system plus applications
exceed the available main memory? If there is auxiliary memory available,
for example, on disk, there are at least three ways of using this as a
secondary memory or backing storage:
Overlays
Overlays are sections of an application's code or data that are
stored on secondary memory when not in use. It is up to the application
to read overlays into main memory when they are needed,
and it is up to the application to write overlays back to secondary memory,
if necessary, when they are unneeded.
Typically, each overlay contains a procedure or function. The main program
must load the overlay before calling the routine, and the routine in the
overlay may call any routines that were loaded with the main program. In
addition, an overlay may contain secondary routines called only from within
that overlay, and it may load other overlays.
In the most primitive overlay systems, memory to hold the overlays is
allocated statically. In more advanced overlay systems, memory to hold
overlays is allocated dynamically. In such a system, a call to a routine
contained in an overlay might be coded as follows:
    { /* load overlay X and then call function x within it */
        int X;        /* file descriptor */
        int Xsiz;     /* size of file */
        void * Xpt;   /* pointer to overlay */

        /* first find out about the file */
        X = open("X", ...);
        Xsiz = lseek( X, 0, SEEK_END );  /* offset of the end = file size */

        /* second, make space for the contents */
        Xpt = malloc( Xsiz );

        /* third, load the file into memory */
        lseek( X, 0, SEEK_SET );         /* rewind without clobbering Xsiz */
        read( X, Xpt, Xsiz );

        /* then, call the function, treating Xpt as a function pointer */
        ((void (*)())Xpt)(params);

        /* finally, dispose of the memory it occupied */
        free( Xpt );
    }
The above C code assumes that the overlay can be loaded into memory
by a simple call to read(). This will be true if the file contains
position independent code, but a more complicated loader is required
if any relocation or linkage needs to be performed.
A common form of overlay is the load-on-call procedure. The purpose of
this is to hide the details of overlay management from the user of the
overlay, packaging these details in a special procedure called an overlay
stub. The only part of the procedure initially loaded in main memory is
the load-on-call stub; a call to this loads the procedure into main memory and
then transfers control to it. Thus, the act of calling the procedure
causes it to be loaded.
The word stub is widely used in discussions of program development,
dynamic linkage and distributed software, so it is important to
make it clear that this word is not just technical jargon, but also
common English.
A stub is the remainder of a branch or limb that has been cut or broken
off of a tree or animal. In the case of software, the tree that is of
interest is the abstract tree describing the history of procedure and
function calls during the execution of a program. The root of this tree
is the main program, and each routine called from the root corresponds
to a branch. A routine that calls no other routines is a leaf of the
tree, and if some branch of the tree is run on a different machine, or loaded
by a different loader, or involves code that has not yet been written,
the code at the base of that branch is referred to as a stub.
Consider the procedure X. In a program calling X where the memory is not
sufficient for all routines needed by that program, a call to X might be
linked to a stub instead of being linked to X. The following stub is
designed to mimic the external interface of X, and when it is called, it
loads X into memory, calls it, returns the results of X, and then deallocates
the memory used so that it may be used for other routines.
Xstub(params)
Xpt = malloc( sizeof(X) )
load( X, Xpt )
(*Xpt)(params)
free( Xpt )
return
Here, sizeof(f) returns the size of the memory region needed by the object
file f, and load(f,p) is a system routine that loads object file f into
the memory region pointed to by p.
More intelligent overlay management stubs can swap in a routine only if
that routine was not already in memory. Consider the following stub for X
that shares memory with routines Y and Z:
Xstub(params)
    if Xpt is null then
        Xpt = malloc( sizeof(X) )
        if Xpt is null then -- the allocate failed
            if Ypt is not null then
                free(Ypt)
            else if Zpt is not null then
                free(Zpt)
            endif
            Xpt = malloc( sizeof(X) )
        endif
        load( X, Xpt )
    endif
    (*Xpt)(params)
    return
Here, when X is first called, the stub will allocate space and load the
code. On subsequent calls, the code will remain loaded so no new allocation
is needed. If, however, space is needed for routines Y or Z, X may be
deallocated. If X is needed and space is unavailable, the stub for X will
deallocate Y or Z to make space.
The above scheme rests on the assumption that the stubs for X, Y and Z
cooperate to share memory, and it rests on the assumption that the code
for X can be deallocated safely when Y or Z need the space! This assumption
can be met if no procedure called from X calls Y or Z, no procedure called
from Y calls X or Z, and no procedure called from Z calls X or Y.
By the
late 1960's, most computer manufacturers offered elaborate linkage editors
that would analyze the procedure call graph of a program to determine
what procedures could operate as peers in overlay management schemes such
as the above. Today, such overlay managers are almost obsolete because
of the widespread use of virtual memory.
Overlays allow a single program to run in a main memory region smaller than
that program. The next technique we will discuss allows multiple processes
to run in a memory that cannot hold all of them at once. Note, however, that
it is almost always possible to break a single program into multiple
communicating processes, where the purpose of the breakup is not to achieve
parallelism, but merely to reduce the size of the address space needed by
any single component of the original program. Therefore, we can also use
the next idea as an alternative to overlays!
Swapping
Swapping involves the process scheduler. The term was originally
based on the common English verb to swap meaning to exchange, because
the basic idea involved exchanging a process (or process image) stored on
disk for one stored in main memory. When swapping is used,
memory assigned to ready processes is considered preemptable. If memory
is needed, a ready process is selected to be swapped out. Swapping a process
out of main memory involves copying the data segment(s) of that process to a
swapping device, usually a disk. When a process is selected to be run, if
that process is swapped out, the system must first find or make space in
main memory, then swap that process in, before running it.
This leads to a new process state swapped. Ready processes become
swapped when their memory resources are moved to secondary memory, and
swapped processes become ready when their memory resources are moved
back to primary memory. Swapped processes cannot be run because the CPU
is unable to execute instructions from secondary memory and unable to
load or store operands of machine instructions in secondary memory!
preempt
or swap
relinquish out
---->-- ---->-
/ \ / \
RUNNING READY SWAPPED
\ / \ /
--<---- -<----
schedule swap
in
Swapping systems are at their best if main memory can be divided into
at least two partitions, each able to hold one process.
In this case, the process in one partition can be running while
processes in other partitions are being copied to or from disk.
The classic IBM mainframe operating systems of the 1960's operated
this way, as did early timesharing systems for the DEC PDP-8 and PDP-10
(a minicomputer and a mainframe, respectively). The first versions of
UNIX, on the DEC PDP-9 and PDP-11 were also based on this idea.
It is worth noting that, on these systems of the 1960's, swapping was
frequently done to a drum memory, so early system descriptions frequently
speak of the swapping drum.
If memory on a uniprocessor was divided into N partitions, there could be
at most N-1 ready processes and one running process. If the total number
of processes that could be run was greater than N, there would
typically be N-2 ready processes because at any given instant,
the process in one of the partitions was in transit to or from disk.
Paging
Virtual memory based on
paging and the related technique of segmenting involves
hardware address translation mechanisms that allow individual pages
of the main memory used by a program to be preemptively claimed by
the system for use by other programs.
The subject of virtual memory lies on one of the most important
interfaces between operating systems and computer architecture because
it depends on the nature of the memory management unit hardware, and
the memory management unit hardware exists primarily to support
virtual memory. The memory management unit sits between the
processor and main memory:
_____ data _____
| | |<------------------------>| |main |
| CPU |--| _____ |--| RAM |
|_____| |-------->| |--------->| |_____|
virtual | MMU | physical
address |_____| address
Addresses issued by the CPU are considered to be virtual addresses.
The memory management unit translates these to physical addresses
or it complains (with a trap request) that the translation is
impossible or that the proposed access is not allowed.
Two primary models of address translation have been explored by the
designers of memory management units:
Paged memory management units operate in terms of fixed-size
pages of memory, while segmented memory management units operate
in terms of variable-size segments of memory. Mixed models are common
in modern hardware; most of these are based on variable sized segments where
each segment is composed of one or more fixed size pages.
We will concentrate on paged memory, but there are many segmented
machines. Notably, the 80x86 family of microprocessors supports a
segmented memory model that is most clearly expressed in the 80286.
This was not effectively exploited by most operating systems developed
for the x86 family.
Pages are typically used without regard to the logical arrangement
of memory -- that is, data structures are not typically arranged
in any particular way with respect to page boundaries. Programmers
on paged systems do not usually pay any attention to the presence of
pages. The exception to this is typically the heap manager; better
heap managers know of the presence of pages in the heap and enlarge
the heap by requesting extra pages when it fills, and also sometimes
relinquish unneeded pages in the middle of large free blocks.
In a segmented system, each data structure is typically assigned to
a particular segment. Thus, a segment might be dedicated to
each procedure or to each activation record on the stack in a fine-
grained segmented system. In a coarse-grained use of segmenting,
one segment might be reserved for the stack, another for the code,
and a third for the heap and global data of a program.
On paged systems, users generally view the virtual address
as a single integer that indexes a word (or byte) in
a single linear virtual address space. In contrast,
on segmented systems, the virtual address space may be a single
linear space, or it may be divided into separate spaces. The 80x86
is an example of the latter, where the segment being addressed is
determined by the context. By default, PUSH and POP reference the stack
segment, most loads and stores reference the data segment, and
instruction fetches reference the code segment, but this default
behavior may be overridden by special instruction prefixes.
Paged Address Translation
Here is a diagram of the data flow through the memory management
unit hardware used to translate a virtual address to a physical
address in a simple paged virtual address space.
Virtual Address from CPU
______________________
|__________|___________| Word in Page
| Page |___________
| ___ _________ |
| | | | | |
array|__| | Page | |
index | | Table | |
V | | |
-->| | | |
|_|_______| |
| |Frame |
__| |__ |
| | |
v _____v_________v______
Access |__________|___________|
Control
Physical Address to Memory
The picture shown here shows the logical function of the
memory management unit, but there are many possible physical
organizations -- if the page table is large, it is not practical
to store it entirely in a table within the MMU! However the page
table is stored, the virtual address is divided into two fixed
size fields, where the most significant field gives the page number
in the address space, and the least significant gives the word number
in that page (or byte-in-page, for byte addressable memory). The
physical address is similarly divided into a frame number field and
a word-in-frame field. The translation function, done by a lookup
in the page table, only deals with translating the page number to a
frame number; the least significant bits of the address are not changed.
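The translation itself is easy to sketch in C, assuming a 32-bit virtual
address with 4 KB pages (a 12-bit word-in-page field) and a page table that
is simply an array indexed by page number; the pte_t layout here is an
invention for the sake of the example:

    #include <stdint.h>

    #define PAGE_BITS 12
    #define PAGE_SIZE (1u << PAGE_BITS)
    #define PAGE_MASK (PAGE_SIZE - 1)

    typedef struct {
        uint32_t frame;    /* frame number where this page is stored */
        int      valid;    /* is the translation usable at all?      */
    } pte_t;

    /* translate a virtual address; sets *fault and returns 0 on a page fault */
    uint32_t translate(const pte_t *page_table, uint32_t vaddr, int *fault)
    {
        uint32_t page   = vaddr >> PAGE_BITS;   /* most significant field   */
        uint32_t offset = vaddr & PAGE_MASK;    /* byte in page, unchanged  */

        if (!page_table[page].valid) {          /* the MMU would trap here  */
            *fault = 1;
            return 0;
        }
        *fault = 0;
        return (page_table[page].frame << PAGE_BITS) | offset;
    }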
The Contents of a Page Table Entry
The format of one entry in the page table will typically be
something similar to the following:
_ _ _ _______ ______________________
| _ _ _ |_______|______________________|
| | |
| | Frame -- where is the
| | page stored.
| Access
| Control ------- what actions are allowed,
| by hardware on this page:
Soft || invalid
fields -- used by software || read-only
to record other || read-write
(optional) attributes of the || execute-only
page. || dirty
Some hardware allows extra bits in each page-table entry where those bits
have no hardware function; these bits are available for use by software
for any purpose the software may have, so they are called the soft fields
of the page table entry. If the software needs extra space beyond what
the hardware may have provided, it is always possible to allocate this
in a parallel array unknown to the MMU hardware but also indexed by the page
number.
Page Table Storage Options
On systems with a small virtual address space, the page table can be
stored entirely in hardware registers within the memory management unit.
Consider the MODCOMP IV computer, a minicomputer first sold in around 1973.
The MODCOMP Classic series of computers descended from the MODCOMP IV remained
in production for many years, and in fact, www.modcomp.com still exists and
still offers some support for their Classic line of processors.
The MODCOMP IV had a 16-bit virtual address, 8 bits giving the page number
and 8 bits giving the 16-bit word in page. The address map for this
machine therefore occupied 256 internal registers in the MMU. This allowed
very fast address translation, at the cost of slow context switching.
Loading or storing one address map took as much as 256 memory cycles!
The designers of this machine built the MMU with 1K internal registers,
so that the MMU could hold 4 complete address maps. As a result, so long
as the number of ready processes remained small, it was rarely necessary
to load and store maps when context switching between processes. All that
was needed was to change a small map-select register.
If the page table is too large to store in special purpose registers inside
the MMU, it can be stored in main memory. In this case, the MMU needs
to contain a page-table base register plus a memory data register. Context
switching between virtual address spaces, can be very fast, all that is needed
is a change to the page-table base register. Unfortunately, this has its
cost! For each address to be translated, the MMU must read a page table
from main memory to the internal memory data register, and then use this
plus information from the virtual address to compute a physical address.
Thus, each memory cycle of the CPU takes two memory cycles, one to translate
the address and one to do useful work.
Despite this overhead, this solution is very common! The key to efficient
implementation of this is to add a special cache memory called the translation
lookaside buffer. This is a cache memory dedicated to memory address
translation and integrated into the memory management unit. So long as
the address being translated is in this cache, no extra memory cycles are
needed and the MMU operates as quickly as a simple MMU holding the entire
page table in internal registers. From a software perspective, context
switching is simple: just change the page-table base register in the MMU.
This has a run-time cost, however, because each time this register is
changed, the entire contents of the cache are invalidated by hardware.
The Ferranti Atlas computer, introduced in 1960, was the very first
machine to support paged address translation, and the MMU scheme it
used was remarkably close to a TLB based scheme, although the MMU
on the Atlas was actually implemented in a distributed manner, distributed
over all of the physical memory modules of the machine.
Each entry in the TLB of a typical modern machine has the following form:
Cache Entry
_______________________________
|____________|__________________|
| Page | Page Table
| Number | Entry
| |
Subject to Returned by
Associative Successful
Search Search
In its simplest form, the associative search involved in searching
the cache involves a simple parallel comparison of the page number
in each cache entry with the page number issued by the CPU. This
can be done very quickly with appropriate hardware. Cache hardware
to support fully associative searches is expensive, but the
alternatives are the subject of an architecture course and
will not be discussed here.
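In software terms, a fully associative lookup amounts to comparing the page
number against every entry; the sketch below does this with a linear scan,
whereas real hardware performs all the comparisons in parallel. The entry
layout and the TLB size are assumptions made for illustration:

    #include <stdint.h>

    #define TLB_ENTRIES 64

    typedef struct {
        uint32_t page;    /* virtual page number (the search key)    */
        uint32_t pte;     /* cached page table entry (the result)    */
        int      valid;   /* does this slot hold a live translation? */
    } tlb_entry;

    /* returns 1 and stores the cached page table entry on a hit, 0 on a miss */
    int tlb_lookup(const tlb_entry tlb[TLB_ENTRIES], uint32_t page, uint32_t *pte)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].page == page) {
                *pte = tlb[i].pte;   /* hit: translation with no extra memory cycle */
                return 1;
            }
        }
        return 0;                    /* miss: consult the page table (or fault) */
    }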
If the cache size is the same as the number of page frames in main
memory, a cache miss signals a page fault, and the software can be
put in charge of replacing the cache entry as it replaces the page
in the page frame. This is what was done on the Atlas.
This approach results in the simplest possible
cache hardware, since the hardware need not manage cache misses and
replacement.
Several modern RISC processors use variations on this simple model,
but most MMU designs from the 1970's and early 1980's (including that of the
Pentium) rest on hardware cache management.
The Page-Fault Trap Service Routine
The page-fault trap service routine
gains control when the MMU detects an attempt to use a
page-table entry that is invalid or an attempt to perform
an operation that is not permitted by the access-control
bits in the page-table entry.
A trap is very similar to an interrupt. The traditional
distinction is that interrupts are caused by asynchronous
events, while traps are caused by the actions of the CPU.
Other traps include such things as arithmetic exception traps
(caused by the values of operands to arithmetic instructions),
and bus-fault traps (caused by attempts to access physical
addresses to which no physical memory is attached).
On some machines, there is an additional convention -- When
an interrupt is requested, the current instruction is allowed
to complete before the PC is saved and the interrupt service
routine is entered. When a trap occurs, the current instruction
is aborted and the trap service routine is entered directly.
For it to be possible to serve a page fault, the instruction
must be aborted before it has any side effects, or (as was done on the
DEC PDP-11) all side effects must be recorded so that they can be
undone by the trap service routine prior to any attempt to
restart the instruction.
The exception to the above rule is on machines such as the
Intel 80x86 family, where some instructions are designed to
be interruptable -- typically, this applies to block copy
instructions. These instructions, if interrupted, save their
state in CPU registers in such a way that they can be restarted
from the point of interruption instead of restarting from the
beginning.
The memory management unit hardware requests a page-fault trap when
an untranslatable virtual address is encountered. The response to this
trap is entirely under the control of the operating system software,
specifically the page fault trap service routine.
For all interrupt and trap service routines, the hardware and software
cooperate to save the program counter and other key registers as control
is transferred to the trap service routine. The details of how this is
done vary considerably from machine to machine. On completion of the
entry to the page-fault trap service routine, the saved PC is the
address of the instruction that caused the trap.
Before return, the trap service routine must adjust
the page table so that re-executing the instruction that caused the trap
will not cause the same page fault. On return, the state of the CPU
is restored exactly as it was prior to executing the instruction that
caused the trap. As a result, from the perspective of the program, it
will be as if there was no page fault trap!
The PDP-11, sold from 1970 into the 1980's, had a complex instruction set
where some instructions could have several side effects before they computed
a memory address that could cause a trap. The memory management unit on
this machine included a special register that recorded these side effects
(what registers had been incremented or decremented by how much); on entry
to the page-fault service routine, the first computation required was to
undo these side effects, restoring the CPU state to what it had been before
the instruction that caused the trap.
A single instruction can potentially cause many page faults. Consider a
memory to memory move double word instruction on a machine like the PDP-11
or the Motorola M68000. The instruction itself will typically be more
than one word long, so the instruction may span a page boundary.
Furthermore, both the source and destination addresses may reference double
words that span page boundaries. Thus, an attempt to execute this single
instruction could cause as many as 6 page faults!
Clearly, if there are fewer than 6 page frames on the machine, this
instruction cannot be executed. In general, the instruction set of
a virtual machine must be examined to determine the maximum number of
pages a single instruction can reference, and then at least that number of
page frames must be guaranteed.
Each entry to the page fault trap service routine must bring in one
of the pages needed to complete an instruction. It need not bring
in all of the required pages, since it is frequently faster to simply
retry the instruction again and again so that the CPU can compute the
successive addresses and let the MMU check each to see if it causes a
fault.
Paged Memory
Page Faults
When a program references a page that the memory management unit is unable
to handle, or when the reference requires access rights that are not allowed
by the memory management unit, the memory management unit reports a
page fault by forcing the CPU to enter a trap service routine, usually
a part of the operating system, but there are systems such as Mach where
page faults may be handled by application-level code.
The instruction that was executing at the time of the trap is aborted!
On entry to any trap service routine or interrupt service routine, a properly
designed CPU will cooperate with the prologue (the introductory sequence of
instructions) of the service routine to save the values of all of the
registers of the interrupted program, including the PC. In the case of
trap handlers, the saved values should be the values that were in effect
immediately prior to the instruction that caused the trap.
Note:
On some machines, the values for some saved registers include changes made
by the instruction that caused the trap. For example, the program counter
may have been incremented as a side effect of the instruction fetch prior to
the memory cycle that caused a page fault, or the stack pointer may have been
decremented prior to the memory cycle that caused the fault. If such a
machine is to be used to support demand-paged virtual memory, there must be
a way to determine what registers have been changed so that the prologue
on the page fault trap service routine can undo these changes!
In theory, given the PC and registers prior to the instruction that caused the
trap, the trap service routine can simulate the execution of the next
instruction to determine all of the memory addresses it might use; for each
address, it can check
to see which of these might have caused the trap. This is simple enough that
on some machines, the memory management units do not save the actual
virtual address that was used to cause the trap; in this case, the address
must be computed by the prologue to the trap service routine. The software
is simplified, however, if the MMU saves the virtual address that caused the
trap, thus saving the trap service routine from having to simulate an
instruction cycle.
Here, in outline form, is a generic page fault service routine:
page fault service routine:
    -- prologue
    save the state prior to the instruction that caused the trap

    -- find necessary information about cause of fault
    va = virtual address that was referenced
    op = operation that was attempted (read or write)

    -- is this an access violation?
    if map[va.pagenumber].rights = read_only and op = write
        handle as user access violation (bus trap!)
    else
        -- find a free page frame
        if there are free page frames
            pa = free frame address
        else
            -- page replacement policy
            pa = address of page to replace
            va' = virtual address of page in that frame
            write( pa, diskaddress( va' ))
            update virtual address map, etc to reflect change
        endif

        -- bring in the required page
        read( pa, diskaddress( va ))
        update virtual address map, etc to reflect change

        -- epilogue
        restore saved state
        return from page fault service routine
    endif
This outline leaves us with two important choices; what page(s) should be
copied into main memory from disk, and, assuming that main memory is full,
what page(s) should be copied out from main memory to disk in order to
make space.
There is also the question of what disk address should be
mapped to each virtual address. Some systems simply set aside a disk device
or a partition of a disk device (a virtual disk) to hold the entire range
of legal virtual addresses. On these systems, if we assume that the page
size equals the size of a disk sector, we can use the page number directly as
a sector number on the disk or disk partition. Another approach is to set
aside a disk file in the file system for each virtual address space. This
makes sense in systems that support multiple virtual address spaces; when this
is done, we can use the virtual address of a page as the offset into the file
for reading and writing the saved copy of each page.
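The second scheme makes the diskaddress() computation used in the outline
above trivial. A sketch, assuming 4 KB pages and one backing file per
virtual address space; the names are invented for illustration:

    #include <stdint.h>

    #define PAGE_BITS 12
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* byte offset in the backing file where a given virtual page is kept */
    static uint64_t disk_offset(uint32_t page_number)
    {
        return (uint64_t)page_number * PAGE_SIZE;
    }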
What page should be copied into main memory?
When a page fault occurs because a program tried to access a specific
virtual address that was not currently mapped into main memory, we must
bring that page into memory. If this is the only way pages come into main
memory, we say that our virtual memory system uses pure
demand paging. That is, pages are only brought into memory
on demand.
If, on the other hand, our system attempts to guess the identity of pages
that will be needed in the near future and brings these into main memory
along with the pages that were demanded, we say it uses
anticipatory paging. That is, the system tries to anticipate pages
that will be needed soon.
A typical anticipation scheme is based on the observation that most
instructions are executed sequentially, and many operand references are
located in sequential memory locations. Therefore, we can estimate the
likelihood that a memory reference to page X will be followed in the near
term by a reference to page X+1. If the likelihood is under 1/2, it is
obvious we should not bring in page X+1; if the likelihood is sufficiently
greater than half, there is a good chance that we will gain by an anticipatory
transfer.
If we study program execution patterns to find the expected size of a function,
call this S, we can use this as an estimate of the likelihood that executing
location X will predict the execution of locations up to X+S in the near
future. More generally, we can study the sequences of operand addresses
issued by running programs to get an estimate of the likelihood of references
to X+S within the next T time units, for all S and all T, and we can use this
as a basis for a sound anticipatory paging policy.
Of course, not all programs behave typically. Experience on systems that
support anticipatory paging shows that it does not always pay off.
Furthermore, it is sometimes possible to predict, as you write a program,
that the program will behave poorly under an anticipatory paging system,
for example, when traversing linked lists that were constructed at random.
One way of dealing with this is to allow application programs to turn off
anticipation when the programmer believes it will behave poorly, and to turn
it back on again afterward.
What page in main memory should be replaced?
What page in main memory should be replaced? We refer to this as the
question of page replacement policy. In the above outline of a
page-fault service routine, this issue was glossed over by simply asking for
the address of the page to replace.
There are a number of page replacement policies that have widely accepted names:
- Random page replacement:
  Random replacement requires no intelligence and works moderately well.
  When a page frame is needed, the random algorithm simply selects any page
  frame at random, dumps the page from that frame onto disk, and then uses
  it. If the memory reference pattern of the application program is very
  irregular, for example, as occurs during some list processing
  applications, it may even be the best solution.
- FIFO page replacement:
  First-in first-out replacement is an obvious policy; here, the page that
  has been in main memory the longest is the one replaced. For most
  applications, this turns out to be as bad as random replacement in
  practice. The fact that a page has been in memory for a long time does
  not generally imply that it will not be needed soon!
- LIFO page replacement:
  Last-in first-out replacement is included in this list only for the sake
  of completeness. This policy is among the worst possible policies! The
  page most recently brought into main memory is frequently among those
  that will be used very soon, and sending it back to disk is almost always
  a very bad idea.
- LRU page replacement:
  The least-recently-used replacement policy is usually the best choice,
  if appropriate hardware is available to support it. This policy selects
  for replacement the page in main memory that was least recently used,
  that is, the page that has not been used for the longest time. If a page
  has not been recently used, it is reasonably likely it will not be needed
  anytime soon. Unfortunately, LRU replacement requires that the MMU perform
  some fairly complex record keeping, for example, by keeping a constant
  record of when each page was most recently accessed.
- Clock page replacement:
  Clock replacement is almost as good as LRU replacement. This scheme
  divides the set of pages in main memory into those that have recently
  been used and those that are less recently used, and selects for
  replacement one of the less recently used pages using an algorithm to be
  discussed later. Where the pure LRU replacement scheme needs several bits
  per page frame to select the page for replacement, for example, enough
  bits to store a timestamp, the clock algorithm needs only one bit to
  distinguish between those pages that have been recently used and those
  that have not. Therefore, the hardware support for clock replacement is
  relatively simple, and in fact, special hardware can be entirely
  dispensed with.
- Belady's optimal page replacement policy:
  Belady, working at IBM in the late 1960's, developed an optimal page
  replacement policy. This policy is entirely impractical, but because the
  number of page faults required by this policy can be worked out
  retrospectively, it is possible to determine how close the practical
  policies come to optimality. The optimal policy is simple: replace the
  page that will not be needed for the longest time in the future!
LRU Replacement
Required Hardware
The best of the practical page replacement policies is to replace the
least recently used page, that is, the page which has not been referenced
by the CPU for the longest time in the past.
This requires special hardware to track every memory access. One way to
do this would be to have the memory management unit manage a table of page
frames, where each page frame contains a timestamp that is updated with each
reference to the page in that frame.
This nearly doubles the complexity of the memory management unit, but
if we move the timestamp to the page table entries, recognizing that
this field will only have use for those pages that are currently in frames
in main memory, we can simplify the hardware. The resulting page table
structure might be as follows:
A Page Table Entry
_________________________________
|__________|______|_______________|
| Time |Rights Frame |
| Stamp | |
| | |
| added | original fields |
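Given such a timestamp field, selecting the LRU victim is a simple scan of
the frames. The sketch below assumes a frame_to_page[] array mapping each
frame to the page it currently holds; both that array and the field names
are assumptions made for illustration:

    #include <stdint.h>

    typedef struct {
        uint64_t timestamp;   /* conceptually updated by the MMU on every reference */
        uint32_t frame;
        int      rights;
    } lru_pte;

    /* return the frame holding the least recently used page */
    uint32_t lru_victim(const lru_pte page_table[], const uint32_t frame_to_page[],
                        uint32_t num_frames)
    {
        uint32_t victim = 0;
        for (uint32_t f = 1; f < num_frames; f++) {
            if (page_table[frame_to_page[f]].timestamp <
                page_table[frame_to_page[victim]].timestamp)
                victim = f;    /* an older timestamp means less recently used */
        }
        return victim;
    }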
Why Does LRU Replacement Work Well?
Demand paged virtual memory works, that is, it offers access times
comparable to main memory, at a price comparable to
secondary memory because programs exhibit what is called
a working set. The working set of a program is the set of
pages it needs in order to make progress at some instant in time.
If enough page frames are available in main memory to store the entire
working set of a program, that program will execute almost as quickly
as it would if it was executing out of main memory with no paging.
In practice, most programs exhibit a well defined working set that
changes only slowly with time. That is, they exhibit strong
locality of reference. The locality property involves two
components: In programs that exhibit strong temporal locality,
the fact that a memory location has recently been referenced strongly
suggests that it will soon be referenced again. In programs that
exhibit strong spatial locality, the fact that some memory
location has recently been referenced strongly suggests that nearby
locations will soon be needed. Sequential execution of programs and
sequential processing of arrays lead to spatial locality, while
iterative control structures lead to temporal locality for both code and
data.
The LRU policy directly addresses the issue of temporal locality, and the
organization of memory into pages that contain more than one
consecutive memory location addresses spatial locality. Programs with
good locality have well defined working sets, while programs with
poor locality have poorly defined working sets.
Belady's optimal page replacement algorithm is optimal because it
examines the future to make page replacement decisions. The LRU
policy, in contrast, uses the predictions it can make by assuming that
the program has good locality, and usually, these are good predictions.
LRU is usually considered to be optimal among practical algorithms,
although there are contexts where some anticipatory paging can lead to
marginal improvements.
MULTICS on the GE 600 used true LRU replacement, back in the late
1960's, but few systems since that time have used true LRU.
The MULTICS implementation used a special high-speed core
memory to implement its cache of all currently available page frames.
The following sequential algorithm was used by the address translation
unit, implemented in fast hardware:
I = 0
Fetch entry(I) into temp
while I < cache_size
and temp.page /= virtual_address.page do
I = I + 1;
exchange temp with entry(I)
endloop
if temp.page = virtual_address.page
Store temp into entry(0)
else
Report a page fault for virtual_address.page
Report that temp.frame is the least recently used page frame
Copy the map entry from main memory into cache entry(0).
This keeps the entries in the cache ordered by recency of use.
This ordering guarantees that the cache will be fast for references
to recently used frames (those that are likely to be frequently used)
while it is slow for frames that weren't recently used (those that
are unlikely to be frequently used).
The MULTICS scheme is an elegant historical curiosity, of only
limited practical value today because it is predicated on core
memory technology.
MULTICS was the first design for a large-scale timesharing system, developed
with federal funds at Project MAC at MIT starting in the mid 1960's. The
original development involved a consortium involving MIT, GE and Bell Labs,
but Bell Labs dropped out of the project, and Honeywell bought GE's computer
business. In the 1970's, MULTICS became one of two standard operating
systems offered on Honeywell mainframes.
MULTICS was originally conceived of as a computer utility. The idea was
that such a utility would be operated in much the same way as the telephone
and power utilities, but instead of selling power or audio communication
services, the computer utility would sell CPU time and disk storage to its
customers. Today, this isn't a radical idea -- there are many computer
utilities, ranging from national services like AOL to smaller local
providers. The concept was a radical one in the mid 1960's, and this idea
drove a number of technical developments.
Clock page replacement
The clock algorithm allows a good approximation of LRU replacement, although
if there are too few available page frames, it degenerates to FIFO replacement.
Clock replacement requires that there be one bit per page frame to record
whether that page has been recently referenced. The mark bit is set by
the memory management unit hardware whenever the page in some frame is
referenced, and it is reset by the page-fault service routine, in the process
of deciding what page frame to replace.
To select a page frame to replace, the page fault service routine software
sweeps through the page frames, resetting the referenced bit on each page frame
and stopping when it finds a page frame with this bit already reset.
The page in this frame is the one that is replaced (copied out to disk to
make an empty frame).
The clock algorithm is usually pictured with a clock face,
where each minute on the clock face corresponds to a page frame.
The clock hand remains fixed except during the execution of the
page fault service routine.
When a page fault occurs, the clock hand begins its sweep. The
hand stops as soon as it sees an unmarked page frame. The clock hand
removes the marks from each page frame it sweeps over.
Data Structures and Code for the Clock Algorithm
The Clock Algorithm requires a frame table, that is, a table indexed by
physical memory page frame number and containing, for each page frame, the
virtual address or page number of the page in that frame, along with at
least one status bit, the mark bit. The mark bit can be moved to the page
table, at the expense of a level of indirection in each test of the mark bit
by software in the replacement algorithm. This is typically done in order
to simplify the MMU by having it deal only with the page table.
Frame Table
________________
|_X__|___________|
|____|___________|
Clock Hand - |____|___________|
\ |____|___________|
----->|____|___________|
| |_X__|___________|
Sweep Direction| |_X__|___________|
| |____|___________|
V |_X__|___________|
|mark| what page |
|bit | is in this|
| | frame? |
find page to replace:
loop while frame_table[clock_hand].mark is set
reset frame_table[clock_hand].mark
clock_hand = (clock_hand + 1) mod table_size
endloop
-- now, frame_table[clock_hand].what_page
-- is the page number to be written back to disk,
-- and clock_hand is the frame number to use with
-- the page being brought in from disk.
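The same sweep is easy to write in C. The frame table layout below follows
the picture above; the table size, variable names and the fact that the
clock hand is kept in a static variable are assumptions made for the sketch:

    #include <stdint.h>

    #define NUM_FRAMES 256

    typedef struct {
        int      mark;        /* set by the MMU on each reference to this frame */
        uint32_t what_page;   /* which virtual page currently lives here        */
    } frame_entry;

    static frame_entry frame_table[NUM_FRAMES];
    static uint32_t    clock_hand;      /* persists between page faults */

    /* returns the frame to reuse; its old page must be written back first */
    uint32_t find_page_to_replace(void)
    {
        while (frame_table[clock_hand].mark) {
            frame_table[clock_hand].mark = 0;             /* the sweep resets marks */
            clock_hand = (clock_hand + 1) % NUM_FRAMES;
        }
        return clock_hand;   /* frame_table[clock_hand].what_page goes to disk */
    }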
Clock replacement has been widely used on many computer systems, including
the IBM System 370 and its descendants in the IBM mainframe family,
later versions of the Multics system, and many others.
Modified Clock Replacement
It turns out that the clock algorithm can be modified to work without
any special hardware support. The first large-scale use of this
modification was in the DEC VAX family of computers introduced in the
late 1970's, but many modern RISC systems also take advantage of this
to simplify their memory management units.
As the clock hand sweeps by a page frame, all page table entries referencing
that frame are marked as invalid. In fact, this page is still in main memory,
but by marking it as invalid, we force the memory management unit to request
a page-fault trap the next time the application references that page.
We treat the combination page-in-main-memory but marked-as-invalid in the page
table as equivalent to mark-bit-reset in the original clock algorithm,
while page-in-main-memory and marked-as-valid in the page table is used
to mean mark-bit-set.
When the CPU attempts to reference a page marked as invalid in the page table,
there is a page fault. If the page is actually in memory, this is called
a soft page fault. The page table entry is simply changed from
invalid to valid when this occurs, so it is a very simple fault to handle.
Of course, we must add at least one new bit to each page table entry to
distinguish between hard and soft page faults. This bit is never tested
or set by the hardware, but is only used by the page-fault service routine.
Therefore, it is one of the soft bits in the page table entry.
One of the most interesting models for representing the soft bits in the
page table uses a complete copy of the access-rights field in the soft bits.
The soft rights field gives the real access rights the user has to the
page, while the hardware access-rights field gives the rights that the memory
management unit will enforce, requesting a page-fault interrupt if they are
violated.
_____________________________
|______|______|_______________|
| Soft |Rights Frame |
|Rights| |
| | |
| soft | hardware fields |
| bits | |
The state of the page, from the point of view of the modified clock algorithm,
is represented by the relationship between the hardware rights field and the
soft rights field:
- Soft Rights = Rights = accessible
  Marked. This page is not a candidate for clock replacement; it has been
  recently used. Legal references to the page by the application program
  will take place at full speed. Illegal references will cause page faults
  that should be interpreted as access rights violations.
- Soft Rights = accessible; Rights = inaccessible
  Not marked. A candidate for replacement. When there is a page fault
  involving this page, it is a soft page fault, and the only action is to
  change the hardware rights field to match the soft rights field, changing
  the state to marked.
- Soft Rights = Rights = inaccessible
  Paged out! Faults involving this page are "hard" -- page replacement and
  disk I/O are needed to resolve this fault.
- Soft Rights = inaccessible; Rights = accessible
  Strange! This combination should never occur.
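The page-fault service routine can classify a fault by comparing the two
rights fields, along the lines of the following sketch. The rights encoding
and the names are assumptions; a real handler would distinguish more access
modes and would go on to perform the replacement or the re-marking:

    /* sketch of fault classification under the soft-rights scheme above */
    enum rights { NONE, READ_ONLY, READ_WRITE };

    typedef struct {
        enum rights soft_rights;   /* the rights the user really has        */
        enum rights hw_rights;     /* the rights the MMU currently enforces */
        unsigned    frame;
    } soft_pte;

    enum fault_kind { ACCESS_VIOLATION, SOFT_FAULT, HARD_FAULT };

    enum fault_kind classify_fault(soft_pte *p, int is_write)
    {
        if (p->soft_rights == NONE)
            return HARD_FAULT;               /* the page is not in memory at all */
        if (is_write && p->soft_rights == READ_ONLY)
            return ACCESS_VIOLATION;         /* the user really may not do this  */
        /* page is resident but was swept by the clock hand: just re-mark it */
        p->hw_rights = p->soft_rights;
        return SOFT_FAULT;
    }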
Dirty Pages
A dirty page is a page that has been modified since it was copied
from disk to main memory. Dirty pages must be copied back to disk when they
are replaced so that the frames that hold them may be used to hold other
pages. Clean pages, those that are not dirty, need not be copied
back to disk if the original copy of the page on disk is still available.
This can cut the time needed to service some page faults in
half.
Tracking clean and dirty pages requires one bit per page frame; typically,
we assume that the hardware sets this bit to mark the page as dirty whenever
there is a write operation on that page, and we let the software reset this
bit whenever it writes the contents of a page frame to disk. The latter
operation is called cleaning the page frame.
This can be combined with the clock algorithm as follows:
As the clock hand sweeps by dirty page frames, they are scheduled for cleaning;
the actual write operations to disk take place in parallel with computation
and are typically done in whatever order is convenient for the disk interface
hardware and software.
The clock hand only stops its sweep when it encounters a page frame that
is both unreferenced and clean.
This scheme (modified clock replacement)
is typical of modern virtual memory operating
systems. This scheme requires minimal hardware support and yet
delivers performance comparable to LRU replacement.
If the computer does not include hardware to record the fact that a page is
dirty, this may be done by software by marking clean pages as read-only.
When a write operation is made to a read-only page, the resulting soft
page fault marks the page as read-write, which is interpreted to mean that
it is a dirty page.
Some implementations of such schemes are enhanced with anticipatory
paging -- for example, when there is a page fault for page i, they
might bring in pages i, i+1 and i+2. This assumes that a program
that references page i is highly likely to use the following pages.
Since this assumption is not always a good one, such systems usually
provide a way to turn off the anticipation logic in the page
replacement algorithm. Anticipatory paging works very well for
programs that use numerous large arrays and it provides few benefits
for programs dominated by complex linked lists jumbled into the heap.
Exercise: Outline a procedure to be called from the page-fault service
routine that will search the set of page frames and select a page to replace.
Do this for the simple clock algorithm, the clock algorithm with clean-dirty
page management, and each of the above with memory management unit hardware
that does not mark used and dirty pages.
Before you do this, you may want to solve the following:
Subsidiary Exercise: What are the states of a page frame under the
modified (software-only) clock algorithm with clean-dirty replacement?
Subsidiary Exercise: The clock algorithm cyclically sweeps through the
table of all page frames. This is not the same as the address map of the
virtual address space! Describe the relationship between these two data
structures!
A digression into architecture
Memory management units are fairly simple to design only if the entire
page table is held in the MMU and if the MMU does not support the mark
and dirty bits needed by the clock algorithm. The latter fact explains
why, starting with the DEC VAX architecture, many modern systems, mostly
associated with RISC architectures, have
omitted hardware support for the clock and dirty bits, forcing the
use of software solutions to these problems.
If the entire page table cannot fit into the MMU, it is common for the
MMU to use a cache memory, so that the most recently used page table
entries are in the MMU while the entire page table is in main memory.
This is quite effective, except that cache design is itself a fairly
messy business. As a result, some MMU designs in use today take this
one step further; the MMU cache, called the translation lookaside
buffer, is simplified so that, if the desired page table entry is not
in the TLB, the MMU issues a page fault! Thus, the page-fault service
routine is responsible for both MMU TLB management and page replacement
in main memory.
This introduces a new kind of soft page fault, where the desired
page is in main memory but the desired page table entry is not in the
TLB, so the page fault service routine merely loads the page table
entry into the MMU and returns, with no actual paging activity required.
You can find an example MMU specification illustrating this extremity in
the fictional Hawk architecture, described in
http://www.cs.uiowa.edu/~jones/arch/hawk/overview.html#mmu
Page Tables and Frame Tables
Note that many implementations of paged virtual memory rely on two distinct
data structures, the page table and the frame table.
The page table describes the state of each page in the virtual address space.
There is one entry in the page table for every addressable page in the
virtual address space. This table frequently grows very large for large
address spaces.
The page table is indexed by the virtual address of the page, and
each page-table entry has a field that describes where that
page is currently stored -- for pages currently not in main memory, this
would typically be the disk address where the page is stored; for pages
in main memory, this would be the page frame holding the page.
The frame table is indexed by the physical address of the frame, and the
entries describe the state of each page frame in main
memory. For those frames currently in use, it describes what pages are
in them. There is one entry for each page frame in the physical
address space. At any instant, the frame table can typically be derived
from the page table, but this derivation is computationally expensive.
The minimum frame table is just an inverted index, holding the index of
the page table entry holding the page currently stored in each page frame.
On some systems, more information is stored in the frame table; such
information as whether the frame has been recently used and whether the
contents of the frame have been changed by the CPU are all as useful if
stored in the frame table as they are if stored in the page table.
Page table Frame table
Page Number __________ Frame ___________
| |_|________| number |_|_________|
| |_|________| | |_|_________|
| |_|________| ------>|_|_________|
-------->|_|________| |_|_________|
|_|________| |_|_________|
|_|________| | |
|_|________| <-------- ------->
| | mark page number
<---------- -------->
rights frame number
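In C, the two tables might be declared along the following lines. The frame
table is the minimal inverted index described in the text, with a mark bit
added for the clock algorithm; the field widths and names are assumptions
made for illustration:

    #include <stdint.h>

    typedef struct {
        int      resident;   /* is this page currently in a page frame?       */
        uint32_t where;      /* frame number if resident, disk address if not */
        int      rights;     /* access rights enforced by the MMU             */
    } page_entry;            /* page table: indexed by virtual page number    */

    typedef struct {
        uint32_t page;       /* which virtual page is held in this frame      */
        int      mark;       /* recently-used bit for page replacement        */
    } frame_slot;            /* frame table: indexed by physical frame number */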
Address translation problems
Recall that the easiest to understand virtual to physical address translation
hardware holds the entire page table in the memory management unit:
Word in page
Virt addr |----------------------->| Phys Addr
---------->| Page ______ Frame |--------->
|--- | | ---->|
| | Page | |
| |Table | |
--->| | |
address |______| |
invalid | | |
<----------------- -----
For large page tables, this is not implementable. For example, on the
Intel 80x86 (for x > 2), the virtual address is 32 bits, of which
12 bits specify byte-in-page. Thus, the page table must contain a million
entries.
The entries themselves turn out to be 32 bits each, so this approach to
implementing the memory management unit would require that the memory
management unit contain 4 megabytes of special purpose memory for the
page table, and a context switch from one virtual address space to another
would require loading and storing this entire special memory area!
The original Atlas virtual memory system used an alternative approach
to translation. The address translation hardware held a copy of the
frame table. When a virtual address was presented to the hardware, it
compared the desired page number with the page number stored in each
frame table entry. If there was no match, a page fault was reported;
if there was a match, the index of the frame table entry holding the match
was the frame number. This comparison was done using fast
parallel hardware, in what we would now call an associative memory, so the
time taken for the search did not depend on the number of page frames in
the system.
The Atlas scheme only works if the number of frames is itself relatively
small. On a modern machine with many megabytes of memory, the number of
page frames may itself grow into the tens of thousands, so neither the
entire frame table nor the entire page table can be held in the memory
management unit registers.
The usual solution to this problem is to hold both the page table and
the frame table in main memory, so only those entries that reference
recently used pages reside in registers within the memory management unit.
In this case, we say that the memory management unit contains a virtual
address translation cache or an address translation lookaside buffer.
From the viewpoint of system software, we then view the address translation
process as if it was done through an address translation table in main memory.
The problem of large page tables
A typical modern machine, whether an Intel Pentium or an IBM RS6000,
has a virtual address space of 4 gigabytes, but main memory is only
measured in megabytes. The typical virtual address map needed
to describe such an address space contains over 1 million map entries, and
requires on the order of 4 megabytes to store. Users with 4 megabytes of main
memory (today, an admittedly small size) would be quite unhappy if the system
required the entire memory just to store the address map!
The solution to this problem is to break up the virtual address map and
store most of it in secondary memory. In effect, the map itself
is stored in virtual memory, and only those pages of the address map that
contain map entries for recently used pages occupy page frames in main memory
at any given time. Each page of the virtual address map is called a
segment of the map, so the data structure describing the virtual
address space of the program can be pictured as follows:
_ _ _ _
Segment Table |_|_|_|_|
_______________| | | |________
Page | ___| X |
Tables _ _|_ _ _ _|_ _ _ _|_ _
for |_|_|_|_| |_|_|_|_| |_|_|_|_|
Segments_| | | |_ _| | | |_ _| | | |_
| | | | | | | | | | | |
Pages |_||_| |_||_| |_||_| |_||_| |_||_| |_||_|
The page table for a segment may itself be moved to disk when all pages in
that segment have been moved to disk, and once this is done, a reference to
any page in that segment requires that the page table for that segment be
brought back into main memory first, prior to bringing in the referenced
page.
When such a system is used, virtual addresses are typically broken up
as follows:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
| | | |
segment page in segment byte in page
The segment field specifies what segment of the address map contains
the desired map entry, while the page field specifies what entry in
that segment of the map should be used. The sizes of the fields shown above
are typical of many 32-bit machines. With 12 bits for the byte-in-page field,
pages are 4K bytes or 1K words. With 10 bits in the segment and
page-in-segment fields, both the root segment table and the page table for
each segment have 1K entries, and if each page-table entry occupies 1 word,
each segment of the page table occupies exactly one page-frame of main
memory.
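To make the field layout concrete, here is a minimal C sketch of how software
might pick such a 32-bit virtual address apart into its three fields. The
10/10/12 split is the one assumed above; the constant names and the example
address are inventions of the sketch.

    #include <stdint.h>
    #include <stdio.h>

    /* Field widths assumed in the discussion above: 10-bit segment number,
       10-bit page-in-segment and 12-bit byte-in-page. */
    #define BYTE_BITS 12
    #define PAGE_BITS 10
    #define SEG_BITS  10

    int main(void) {
        uint32_t va = 0x12345678;     /* an arbitrary example address */
        uint32_t byte_in_page = va & ((1u << BYTE_BITS) - 1);
        uint32_t page_in_seg  = (va >> BYTE_BITS) & ((1u << PAGE_BITS) - 1);
        uint32_t segment      = (va >> (BYTE_BITS + PAGE_BITS))
                                & ((1u << SEG_BITS) - 1);
        printf("segment %u, page %u, byte %u\n",
               (unsigned) segment, (unsigned) page_in_seg,
               (unsigned) byte_in_page);
        return 0;
    }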
The term segment is used here deliberately because, on many machines
that use paged-segmented virtual memory, the designers intended that each
segment of the page table correspond to a single logical segment of the
address space. For example, a program might have a stack segment, a code
segment and a heap segment, each composed of from 1 to 2^10 (1024) pages,
while other segments might be set aside for any data structures too
large to fit in the heap, secondary parts of the program or other purposes.
Today, many systems divide the
virtual address space into segments that are much larger than the
segments of the page table established by the hardware.
Assuming that the virtual address translation hardware uses a cache
(translation lookaside buffer),
there are now four different outcomes possible for each address translation
operation:
Hit: The desired page table entry was in the hardware address
translation cache. Address translation takes place as expected.
Miss: The desired page table entry is not in the hardware address
translation cache. The address translation unit performs one memory cycle
to fetch the page table entry for the desired page. Address translation
takes place as expected by the software, but it is slowed by the additional
hardware activity.
Segment Fault: A miss occurs, but the desired page table entry cannot
be fetched because the desired segment of the page table is not in main
memory. The address translation unit therefore requests a fault, aborting
the current instruction and allowing the operating system segment fault trap
service routine to bring in the desired segment of page table.
Page Fault: A miss occurs, the desired page table entry is available,
but the page is marked as invalid. The address translation unit requests
a fault, aborting the current instruction and allowing the page fault trap
service routine to bring in the desired page.
The Software View
With a paged segmented virtual address space, the software view of
address translation is summarized by the following picture:
___ ____
Seg | | | Segment
Virt Addr |----->| | | Table
---------->|--- | |____| ____
|- | --->|____|----->| | Page
| | |____| | | | Table
| | Page | | |
| ------------->| |____| ____
| --->|____|----->| | Page
| |____| | | |
| Byte in Page | | |
--------------------------->| |____|
physical addr--->|____|
|____|
To translate one virtual address to a physical address, first
a word must be fetched from the segment table; this holds the
address of the page table for that segment. Next, a word must
be fetched from the page table for that segment; this holds the
frame number of the desired page, from which the physical address
may be computed.
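The following C fragment is a rough sketch of this two-step software view of
translation. The segment-table and page-table structures, the valid bits and
the names are assumptions of the sketch, not features of any particular
machine; it reports segment faults and page faults at the same points the
hardware would.

    #include <stdint.h>

    #define PAGE_BITS  12
    #define SEG_SHIFT  22           /* 10-bit segment field, as above */
    #define FIELD_MASK 0x3FF        /* 10 bits */

    typedef struct { int valid; uint32_t frame; } pte_t;     /* page table entry    */
    typedef struct { int valid; pte_t *page_table; } ste_t;  /* segment table entry */

    typedef enum { XLATE_OK, SEGMENT_FAULT, PAGE_FAULT } xlate_result;

    /* Translate virtual address va to a physical address *pa using the
       two-level map rooted at segtab.  A minimal sketch; access rights
       and dirty/used bookkeeping are ignored. */
    xlate_result translate(ste_t *segtab, uint32_t va, uint32_t *pa) {
        uint32_t seg  = (va >> SEG_SHIFT) & FIELD_MASK;
        uint32_t page = (va >> PAGE_BITS) & FIELD_MASK;
        uint32_t byte = va & ((1u << PAGE_BITS) - 1);

        ste_t *ste = &segtab[seg];
        if (!ste->valid)
            return SEGMENT_FAULT;   /* page table for this segment not in memory */

        pte_t *pte = &ste->page_table[page];
        if (!pte->valid)
            return PAGE_FAULT;      /* the page itself is not in memory */

        *pa = (pte->frame << PAGE_BITS) | byte;   /* frame number plus byte in page */
        return XLATE_OK;
    }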
A cache is always included in the address translation hardware so
that these extra memory accesses are only required when there is
a cache miss. In fact, if the cache entries may be directly manipulated
by software, the cache itself can be entirely managed by software, in
response to page faults, and the memory management unit need never
actually initiate a memory cycle to fetch page table entries into the
unit. In this case, all cache misses initiate page faults, but some
faults, called soft page faults, can be serviced by merely bringing
the indicated page-table-entry into the memory management unit with no need
for I/O activity.
Address Translation and Protection
Virtual address translation hardware does more than just translate addresses.
It can also serve a central role in protection. A protection domain
describes, for one logical resource user, for example, a process or a person,
what resources that user should be able to access and how. Each logical
resource user typically operates in a different domain.
When there is one big virtual address space, shared by all users,
a change of protection domains can be accomplished by changing the markings
on the page table entries. Those table entries describing pages that the
new user should not access can be marked inaccessible; those
entries describing pages the user should not be able to modify can be
marked read-only, and so on.
Editing address maps can be expensive, particularly if they are large, so
it is more common to assign a separate virtual address space to each domain.
In this case, all that needs to be done when the domain changes is to change
the pointer to the currently active page table. The page table
then becomes the primary record of what pages are in each domain and what
access each domain allows to each page.
With one virtual address space per domain, it is common to find that most
domains are fairly small; if a full page table occupies a few
megabytes, this can lead to a significant waste of resources, so many virtual
addressing systems allow some mechanism to eliminate storage of irrelevant
parts of the page table. For example, entire segments of the page table may be
marked as invalid in the segment table, so that no space must be allocated
to the corresponding page tables, or there may be a limit register that
describes the size of the page table, or segments of the page table may be variably
sized.
With one virtual address space per domain, sharing of pages between domains
is accomplished by replicating the page table entries for the shared pages
between the page tables for each domain that shares those pages. When the
page table is segmented, shared segments are sometimes used instead of
shared pages.
Locality and the Working Set
In all cases, virtual memory rests on the fundamental empirical observation
that most programs exhibit a property called locality of reference.
A program exhibits strong temporal locality if the set of addresses
most recently referenced is a strong predictor of the set of addresses that
will be referenced soon. The fact that most programs are iterative leads
naturally to temporal locality.
A program exhibits strong spatial locality if most memory references
are to locations near other referenced locations. The fact that instructions
are usually executed sequentially and the fact that sequential search is
common both naturally lead to spatial locality. It is important to remember
that locality is an empirical property, and that there are a few programs that
exhibit very bad locality.
As a consequence of the locality of programs, we can talk about the
working set of a process or group of processes. This is
the set of pages that process or group of processes needs in order to
make progress. When a process comes into execution in a demand paged
environment, there will typically be a sequence of page faults as its
working set is brought into memory. Once the working set is in memory,
that process will typically run steadily, with very few page faults; these
page faults typically reflect changes in the working set as the
program enters different phases of its computation.
When the number of page frames on a computer is smaller than the number
of pages in the working set of the set of ready and running processes
on a system, performance is typically very bad.
A large fraction of all memory accesses on such a system cause page
faults, and with each page fault, some page that will be
needed in the near future is forced out to disk, guaranteeing that there
will be yet another page fault very soon. A system with this behavior
is said to be thrashing.
The application or mix of applications determines the working set size.
For any particular application, a graph of page-faults per second versus
number of page frames made available to that application will tend to
look like the following:
|
|______ Working set size
| --_ |
page | -|
faults | _
per |Thrashing CPU bound
second | or |-
|I/O bound | _
| | -__
| W -------
--+------------------------
| page frames available
In this graph, the number of page frames in the working set is empirically
determined by the location of the transition from normal CPU bound operation
to the thrashing state.
Some applications exhibit a sharp thrashing transition -- these are programs
with high locality and a well defined working set. Others exhibit a gentle
transition to thrashing -- these are programs with poor locality and an ill
defined working set.
There are two cures for thrashing: First, we can buy more memory in order
to enlarge the set of available page frames. Second, if there are multiple
processes, we can shrink the working set by stopping some of these processes.
If running both processes A and B in parallel causes the system to thrash,
so long as A and B are independent, we are better off finishing A before we
start B (or vice versa). Many Unix systems include a process called the
swapper that monitors the system's performance and stops low priority processes
when it detects thrashing, allowing the page replacement algorithm to reclaim
the page frames belonging to those processes and use them to complete the
higher priority processes.
Input/Output
Review
This lecture focuses on a review of three different aspects of
input output:
- User level I/O requests
- Virtual devices -- the abstractions provided by the system
- I/O drivers -- the implementation of those abstractions
The term virtual device is not standard, but it is realistic!
The device interface presented to applications running under an operating
system is usually quite different from the actual device interface, at least
as different as the virtual memory abstraction is from the actual physical
memory resources, and therefore, it seems quite reasonable to use the
adjective virtual to distinguish the view of the device supported
by the system from the actual resources provided by the device.
The Physical Device Interfaces
The physical device interface of a real system typically consists of
a number of device registers. For example, on a typical IBM PC compatible,
parallel port zero is presented to the software by the hardware as follows:
__ __ __ __ __ __ __ __
0378 data |D7|D6|D5|D4|D3|D2|D1|D0|
|__|__|__|__|__|__|__|__|
__ __ __ __ __ __ __ __
0379 status |BS|AK|PE|OL|ER|x |x |x |
|__|__|__|__|__|__|__|__|
BS = 1 when printer not busy
AK = 1 when data transfer in progress
PE = 1 when printer is out of paper
OL = 1 when printer is on line
ER = 1 when there is no error
__ __ __ __ __ __ __ __
037A control |x |x |x |IE|SL|IN|LF|ST|
|__|__|__|__|__|__|__|__|
IE = 1 to enable printer interrupt
SL = 1 to select printer
IN = 1 for normal operation (zero causes printer reset)
LF = 1 to force printer to advance paper
ST = 1 to transfer data into printer
The protocol for printing one character using this interface was
established by Centronics Corp for their line printers in the early
1970's, significantly before the IBM PC came to market. To print one
character, the software must load the character in the parallel port
data register and then send out a positive pulse on the
ST (strobe) line. The printer acknowledges the strobe pulse with a
positive pulse on the AK (acknowledge) line and a negative pulse on
the BS (busy) line. The contents of the data register must remain valid
until the end of the AK pulse, and no new transfer is legal until the
end of the BS pulse. Usually, these two pulses will be coincident.
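The following C sketch follows this protocol under the assumption that the
software has direct access to the I/O ports through outb() and inb() helpers
(for example, glibc's <sys/io.h> after a successful ioperm() call).
Busy-waiting on the status bit stands in for a real interrupt-driven driver,
and the bit positions are taken from the tables above.

    #include <sys/io.h>        /* outb() and inb(); the caller must first gain
                                  port access, e.g. with ioperm(), and run privileged */

    #define LPT_DATA    0x378
    #define LPT_STATUS  0x379
    #define LPT_CONTROL 0x37A

    #define STATUS_BS   0x80   /* BS = 1 when printer not busy (per the table above) */
    #define CONTROL_ST  0x01   /* ST = 1 to transfer data into printer               */

    /* Print one character using the Centronics handshake described above. */
    void lpt_putc(unsigned char c)
    {
        while (!(inb(LPT_STATUS) & STATUS_BS))    /* wait while the printer is busy */
            ;
        outb(c, LPT_DATA);                        /* present the data               */
        unsigned char ctrl = inb(LPT_CONTROL);
        outb(ctrl | CONTROL_ST, LPT_CONTROL);     /* raise ST: start of strobe pulse */
        outb(ctrl & ~CONTROL_ST, LPT_CONTROL);    /* drop ST: end of strobe pulse    */
    }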
If interrupts are enabled (IE = 1), the printer interface will request
an interrupt when AK (acknowledge) goes from 1 to zero. The particular
interrupt requested may depend on the settings of jumpers or option switches
on the parallel port interface, but it is normally IRQ7, the lowest priority
hardware interrupt supported on the PC family of machines.
The SL, IN, LF and ST lines from the control register correspond directly
to corresponding output pins in the parallel port connector, and all of the
status lines correspond to input pins in the parallel port connector.
Some of these lines are inverted, so the voltage levels corresponding to
1 and 0 in the above documentation aren't always as expected.
Aside from the IE (interrupt enable) bit and the interrupt logic
attached to AK (acknowledge), the parallel port hardware makes no
assumptions about the use made of the control, status and data lines.
The meanings of those lines are established by the hardware connected to
the printer port.
The PC parallel printer port interface is typical of the interfaces to
low performance devices where each byte of input or output is transferred
via explicit execution of instructions executed by the CPU. High performance
devices frequently use direct-memory access interfaces. The simplest
direct memory access interface must have at least the following device
registers:
_________________________________
DMA address | |
|_________________________________|
_________________________________
DMA count | |
|_________________________________|
________ __ __ ________
control | |IE| | COM |
|________|__|__|________|
IE interrupt enable on DMA transfer complete
COM command field
________ __ ___________
status | |DN| |
|________|__|___________|
DN indication of transfer complete
The direct-memory access processor idles until the program running
on the CPU loads the control register
with a command requesting an input or output transfer.
On receipt of this command, the DMA processor repeatedly transfers
data between memory and the device, in a direction indicated by the
command. Each transfer addresses the memory location pointed to by
the DMA address register, and after each transfer, the DMA address
register is incremented by the size of the transfer and the DMA count
register is decremented, typically also by the size. Some direct
memory access devices transfer one byte at a time, while others transfer
larger blocks, up to the word size of memory. When the DMA count
register reaches zero, the DMA processor sets the done bit in the status
register; if the interrupt enable bit is set, this also results in an
interrupt request to the CPU.
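As a sketch of how a driver might program such an interface, the fragment
below treats the registers as a memory-mapped structure. The register layout,
bit positions and command encodings are hypothetical, matching only the
picture above, not any real controller.

    #include <stdint.h>

    /* Hypothetical memory-mapped register block matching the figure above. */
    typedef struct {
        volatile uint32_t dma_address;  /* physical address of the buffer     */
        volatile uint32_t dma_count;    /* number of bytes to transfer        */
        volatile uint32_t control;      /* IE bit plus COM command field      */
        volatile uint32_t status;       /* DN bit set when transfer completes */
    } dma_regs;

    #define CTRL_IE   0x0100            /* interrupt enable (assumed position) */
    #define COM_READ  0x0001            /* device-to-memory (assumed encoding) */
    #define COM_WRITE 0x0002            /* memory-to-device (assumed encoding) */
    #define STAT_DN   0x0100            /* done bit (assumed position)         */

    /* Start an input transfer of len bytes into the buffer at buf_phys and
       enable the completion interrupt.  The interrupt handler, not shown,
       would check STAT_DN. */
    void dma_start_read(dma_regs *r, uint32_t buf_phys, uint32_t len)
    {
        r->dma_address = buf_phys;
        r->dma_count   = len;
        r->control     = CTRL_IE | COM_READ;   /* writing the command starts the DMA */
    }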
The interface described above would be sufficient to support direct memory
access transfers to simple devices such as the parallel port, but some devices
require other device control registers. Consider, for example, a disk drive.
Typical disks allow random access to any sector of any track of any cylinder
of the disk, and we can speak of the specification of the sector, track and
cylinder as the address, on the disk itself, of the data we are referencing.
Before a transfer to or from disk, therefore, we must somehow tell the
disk drive what disk address to use. More generally, for random-access
devices, we speak of the device address and there must be a
device address register somewhere in the interface for each such device:
_________________________________
device address | |
|_________________________________|
Note that the device address register may have multiple fields, for example,
one for the sector number, one for the track number and one for the cylinder
number on disk. When these cannot all be packed into one register, the
device may have separate registers for these fields.
Prior to any transfer to or from a random access device, the program running
on the CPU must load the device address register appropriately. Floppy
disks are low performance devices, so the data transfer to and from a
floppy disk is frequently done without benefit of direct memory access
mechanisms, but direct memory access is required when high-speed hard drives
are used. Therefore, the software controlling a typical disk interface must
generally load the device address register, the DMA address and count
registers and finally the device command register in order to initiate
the reading or writing of one sector of data.
The Stream Abstraction
The stream is a typical (but by no means universal) input/output
abstraction. Typical stream operations are get to read a character
from a stream and put to write a character to a stream. In the
C standard library, for example, the fgetc(s) function returns
the next character read from the stream s, and the fputc(c,s)
function writes the character c to the stream s.
Whether these are packaged in procedural or object oriented form, these
functions are usually hidden under layers of middleware that allow reading
and writing higher-level data types such as character strings or integers.
We should consider the conversion routines between these higher level types
and the character level stream to be attributes of those data types, so we
will not discuss such conversions here.
Streams are abstractions in the sense that they don't necessarily map
directly to the actual input/output operations on any particular
physical device. Some devices are inherently stream oriented, for example,
the simple parallel port discussed above, as well as asynchronous serial
communications lines and keyboards, while others are not, for example,
disks, high performance network interfaces and graphics display devices.
The UNIX stream abstraction is somewhat more complex than the simple put and
get operations suggested above. In UNIX, the operations
are:
len = read( f, buf, buflen )
len = write( f, buf, buflen )
In both of these f names the operating-system level stream to be
read or written, and data is transferred to or from the buffer buf.
This buffer is of size buflen; on output, this many characters
will be written unless there is a problem; on input, this many characters
will be read unless some other condition terminates input sooner. Both
of these return len, the number of characters actually read or
written.
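For example, a minimal Unix filter that copies its standard input to its
standard output using only these two system calls might look like the
following sketch (error handling is kept to the bare minimum):

    #include <unistd.h>

    int main(void) {
        char buf[4096];
        ssize_t len;
        /* read() returns 0 at end of file and -1 on error */
        while ((len = read(0, buf, sizeof buf)) > 0) {
            if (write(1, buf, len) != len)   /* 0 = stdin, 1 = stdout */
                return 1;
        }
        return len < 0 ? 1 : 0;
    }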
Note that read(f,&ch,1) is therefore equivalent, in a simpleminded
way, to ch=fgetc(s). There are two differences worth noting:
First, the parameter f to read is a file descriptor,
the integer index of an open file in the open file table for the
process, while the parameter s to fgetc is a pointer to
a stream object (confusingly enough, this is of type FILE*).
Second, each call to read involves a relatively expensive
context switch across the user/kernel protection boundary, while the
call to fgetc is a simple function call. If the time taken by
a function call is too much, the standard C include files also provide
getc, a macro that expands as inline code to extract a character
from the stream. Typical C stream implementations include a stream buffer
with a capacity of around 4K bytes, and only when this is empty does
fgetc call read to refill the buffer.
Also note that there are undefined and device-dependent characteristics
associated with kernel-level Unix files: for example, when reading a stream
attached to the keyboard, it is possible to configure the system to force
read to terminate when the user types some key (typically return), or to
configure it so that read terminates after a timeout if the user types
nothing.
It is important to emphasize that both the Unix file and the C stream
abstractions are conceptually objects with methods to operate on them.
The fact that the UNIX system call interface and the C standard library
interfaces are procedural can hide this, but logically, the operations
on an open UNIX file should be thought of as references to methods of a
file object. Thus, for example, it is constructive to think of
read(f,buf,len) as entirely equivalent to f.read(buf,len).
Block Oriented Abstractions
A typical abstraction used for input/output supports block-oriented
random access to input/output devices. In this case, typical primitives
are:
Read( device, memory_buffer, device_address )
Write( device, memory_buffer, device_address )
Here, the buffer is a block of consecutive memory locations in the caller's
address space, and the device address names a block of consecutive locations
in the random-access I/O abstraction, perhaps a relative address in a disk
drive or disk file.
The block size in this scheme may be fixed system wide, fixed by the
particular device, or passed as an explicit parameter.
This abstraction maps naturally to disk devices in the case where the
block size is one sector and the device_address is the sequential sector
number on the device.
It maps poorly to almost all other real devices, but disks are so
important that it is frequently used.
The file abstraction of the Unix kernel is a generally successful attempt
to generalize streams to include both sequential and random-access block I/O.
The operations are:
len = read( f, buffer, length )
len = write( f, buffer, length )
lseek( f, offset, base )
The lseek() operation is used to position
the read-write pointer in the
indicated open file. The offset may be relative to the start of the stream,
to the end of the stream, or to the current position in the stream, depending
on the value given for base. Obviously,
seeking in some streams is meaningless (consider seeking relative to the end
of the stream of characters being read from a communications line),
and seek operations are not always supported (seeking
in the stream from the keyboard would require storing the entire history
of what was typed, and this is not worth doing), but on those file types
where seeking is useful and can be efficiently implemented, for example,
on disks and tapes, this interface works quite well.
With a buffer size of one byte and no seek operations, this interface
exactly emulates a pure stream. With a buffer size of one disk sector,
with a seek offset that is a multiple of the sector size prior to each
read or write, this interface exactly emulates a conventional random-access
model. If the sector size is not matched to the buffer size or if seeks
are made to device addresses that are not sector aligned, intermediate
buffers and possibly extra input-output transfers
must be used to reblock the data between the user and the device.
Of course, if the underlying device is incapable of a seek operation,
for example, if it is a printer or a keyboard, the seek operation
in the user interface is meaningless.
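As an illustration, the fragment below emulates the random-access block model
on top of the Unix interface by seeking to a sector-aligned offset and reading
exactly one sector; the 512-byte sector size and the already-open file
descriptor are assumptions of the sketch.

    #include <unistd.h>
    #include <sys/types.h>

    #define SECTOR_SIZE 512   /* assumed block size */

    /* Read sector number 'sector' from the open file descriptor fd into buf,
       which must hold at least SECTOR_SIZE bytes.  Returns bytes read or -1. */
    ssize_t read_sector(int fd, off_t sector, char *buf)
    {
        if (lseek(fd, sector * SECTOR_SIZE, SEEK_SET) == (off_t) -1)
            return -1;
        return read(fd, buf, SECTOR_SIZE);
    }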
Class Hierarchy for Streams?
One of the very first uses of object oriented programming can be found in
Christopher Strachey's stream implementation in his OS-6 project in the late
1960's. Strachey, the primary developer of the CPL programming language,
naturally chose to write this system in BCPL (Basic CPL, the ancestor of C),
but he wrote his code in clearly object oriented form, and today, we would
say that he used a class hierarchy something like the following:
Write Only Stream
write(c)
|
Read/Write Stream
read(c)
write(c)
|
Random-Access Stream
read(c)
write(c)
seek(c)
In fact, the whole notion of having the class hierarchy have any relationship
to the implementation hierarchy fails for I/O, because, for some
devices, disks for example, random block access is the primitive, while stream
access is a higher level abstraction. For other devices, such as asynchronous
serial ports, the situation is reversed; stream access is primitive, and
blocks must be introduced as a higher level abstraction. Thus, one of the
original inspirations for object oriented programming remains a good test of
the methodology!
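Strachey worked in BCPL, but the flavor of his object-oriented streams can be
sketched in C as a record of function pointers. The names below are invented
for the sketch and simply mirror the hierarchy pictured above; a stream that
does not support an operation would leave the corresponding pointer null.

    /* A stream "object": state plus the operations that act on it. */
    typedef struct stream stream;
    struct stream {
        void *state;                            /* device- or file-specific data  */
        int  (*put)(stream *s, int c);          /* write-only streams have this   */
        int  (*get)(stream *s);                 /* read/write streams add this    */
        int  (*seek)(stream *s, long offset);   /* random-access streams add this */
    };

    /* Callers test for null pointers to discover what a given stream supports. */
    static int stream_copy(stream *in, stream *out) {
        int c;
        while ((c = in->get(in)) >= 0)
            if (out->put(out, c) < 0)
                return -1;
        return 0;
    }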
Implementation of Streams
A device driver is the component of an operating system that
supports input or output operations to some device. Conceptually, this
bridges the gap between the abstractions presented to the user and the
concrete interface of the hardware itself, but in systems such as Unix,
the front-end interface to user code, for example, the
read(), write() and lseek() kernel functions
of Unix, includes standard code that applies to all devices and is therefore
above the level of the input/output drivers.
Stream input/output provides an excellent example of the implementation of
an input/output driver. The driver is usually subdivided into two halves:
one half deals with the hardware interface and includes the interrupt
service routines for the device, and the other half deals with the user
interface. It is quite common to include a FIFO queue between the two
in order to allow computation and input/output activity to overlap.
The following diagram illustrates this overall structure
for a parallel port printer handler under an operating system for the IBM PC:
|| ||
write(f,buf,len) ---> FIFO ---> Interrupt Handler
when f refers queue ||
to the parallel ||
port ||
|| ||
user || device driver || hardware
|| (in the operating system) ||
Note that the user interface suggested here is that of Unix, but this is
only for the sake of example. What is important is that this user interface
routine enqueues the successive characters passed by the user in the queue,
waiting for space when the queue is full, but otherwise returning to the user
as soon as possible. The interrupt handler, in turn, dequeues one character
from the queue for each interrupt and transfers that character to the hardware.
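The structure of such a driver can be sketched in C roughly as follows. The
ring buffer stands in for the FIFO, and lpt_send(), lpt_enable_interrupt()
and lpt_disable_interrupt() are hypothetical names for the hardware-facing
operations; a real driver would also protect the shared queue from concurrent
access and block, rather than spin, when the queue is full.

    #define QUEUE_SIZE 256

    /* Hardware-facing helpers, assumed to be defined elsewhere in the driver. */
    extern void lpt_send(char c);              /* give one byte to the printer port    */
    extern void lpt_enable_interrupt(void);    /* set IE so the port will interrupt    */
    extern void lpt_disable_interrupt(void);   /* clear IE when there is nothing to do */

    static volatile char fifo[QUEUE_SIZE];     /* the FIFO between the two halves */
    static volatile unsigned head = 0, tail = 0;

    static int fifo_full(void)  { return (tail + 1) % QUEUE_SIZE == head; }
    static int fifo_empty(void) { return head == tail; }

    /* Upper half: called from write() when f refers to the parallel port. */
    void lpt_write(const char *buf, int len)
    {
        for (int i = 0; i < len; i++) {
            while (fifo_full())
                ;                              /* a real driver would block the process here */
            fifo[tail] = buf[i];
            tail = (tail + 1) % QUEUE_SIZE;
            lpt_enable_interrupt();            /* make sure output is (re)started */
        }
    }

    /* Lower half: called once per printer interrupt. */
    void lpt_interrupt(void)
    {
        if (fifo_empty()) {
            lpt_disable_interrupt();           /* nothing left to print */
        } else {
            char c = fifo[head];
            head = (head + 1) % QUEUE_SIZE;
            lpt_send(c);                       /* transfer one character to the hardware */
        }
    }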
For input stream devices, for example, the input driver for a unidirectional
input-only communication line interface, we can draw a similar picture:
|| ||
read(f,buf,len) <--- FIFO <--- Interrupt Handler
when f refers ||
to this input ||
port ||
|| ||
user || device driver || hardware
|| (in the operating system) ||
The only difference between this example and the output example above
is that the queue is now fed by the device and
emptied by the user. When the user calls read, characters are dequeued
from the queue for return to the user. With each dequeue, the user may
be forced to wait until data arrives, but if data has arrived before
the user calls read, the user does not wait.
On first generation operating systems, whether on mainframes of the 1950's
or on microcomputers of the 1970's, programs were limited
to synchronous I/O, which is to say, neither interrupts nor queues were
used in the I/O drivers, and user programs were therefore blocked until
each I/O operation was completed. With no overlap of user computation
with I/O, any input that arrived while the user program was busy was
simply lost unless the hardware contained holding buffers.
The IBM PC BIOS and most of the old MS/DOS operating systems were the
last widespread survivors of this first generation of operating systems.
Modern operating systems for PC compatibles generally only use the
BIOS routines to load the operating system and get descriptions of the
devices for which drivers should be loaded. Once the system is running,
the drivers in the system typically work directly with the hardware
device interfaces.
Echoing of Keyboard Input
Keyboard input is somewhat more complex than other input devices because
users have come to expect everything they type to be echoed on some kind
of display. Teletype terminals, based on hardware dating from the 1920's,
always offered two operating modes, full duplex and half duplex.
In full duplex mode, everything typed on the keyboard was simply transmitted
over the asynchronous line, and the printer printed everything that was
received. In effect, the Teletype acted as two independent devices, one
input only, and one output only. In full duplex mode, it was up to the
remote system to echo keypresses so that everything typed would appear on
the printer. In half duplex mode, the printer was connected
directly to the keyboard; everything typed was therefore printed locally
as it was transmitted to the remote system. Half duplex operation
simplified the design of modems and it reduced the complexity of remote
software, but it was less flexible.
By 1970, most interactive computer use was done using full duplex Teletypes as
terminals, but there were still several open options for echoing what the
user typed. The responsibility for echoing could be left to the application,
or it could be handled by the system, and if handled by the system, it could
be done in several ways. These are illustrated below:
|| ||
Read <--- FIFO <--- Interrupt Handler
|| | ___________/ | ||
|| A| / B FIFO C ||
|| | | | ||
|| V V V ||
Write ---> FIFO ---> Interrupt Handler
|| ||
user || device driver || hardware
In path A above, the read routine enqueues each character to be echoed
as it dequeues it from the input queue. To the interactive user,
this is indistinguishable from echoing done by the application itself.
As a result, input typed while the application is in the midst of
generating output will not be mixed with that output but will instead
wait in the input FIFO queue until it is read by the application.
In path B above, the input interrupt routine handles echoing. As a result,
characters are echoed soon after they are typed, independent of when they are
read by the application. Because they are enqueued in the same FIFO as
other output, echoing may not be immediate when other output is in that
queue. Most Unix interactive I/O drivers use this model.
In path C above, an additional (short) queue is introduced. The
output interrupt driver must check both queues, with the new queue
of characters to be echoed taking priority over the normal output queue.
This leads to the fastest possible echoing of input, but in addition
to allowing input to be mixed arbitrarily with output, it also
makes the output driver more complex.
On Unix, an application connected to an interactive terminal
may select the echoing mode using some of the less
common system calls, stty() and ioctl().
These calls allow nonstandard and device-dependent options to be
set, and the set of options they allow for use with the terminal I/O
driver is huge, including the ability to turn echoing on and off.
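On most modern Unix systems the same control is exercised through the POSIX
termios interface, which is layered over these calls; a minimal sketch that
turns echoing off (or on) for a terminal looks like this:

    #include <termios.h>
    #include <unistd.h>

    /* Turn keyboard echoing off (on = 0) or back on (on = 1) for the
       terminal open on file descriptor fd. */
    int set_echo(int fd, int on)
    {
        struct termios t;
        if (tcgetattr(fd, &t) != 0)
            return -1;
        if (on)
            t.c_lflag |= ECHO;
        else
            t.c_lflag &= ~ECHO;
        return tcsetattr(fd, TCSANOW, &t);
    }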
Block I/O Implementation
As with stream I/O, there are two interfaces that concern us, and the
system provides a mapping between these, as well as providing for
buffering to improve performance.
|| ||
Write Queue ||
|| or ---> of ---> Interrupt Handler
Read Requests ||
|| ||
user || device driver || hardware
|| (in the operating system) ||
Read and Write operate on the same queue, placing requests for
block I/O in it, instead of data. Each time the device issues
an interrupt, it signals that it is ready to begin a new transfer,
at which point the driver removes an entry from the queue and gives
it to the device, starting the next transfer.
Each entry in the queue contains at least the following fields:
- The device address -- what disk sector, or the equivalent for other devices.
- The buffer address -- where in main memory is the buffer.
- The transfer direction -- read from disk or write to disk.
In a Unix-like system, the user interface assumes that the user wishes
to wait for the I/O operation to complete. Therefore the read or write
routine can only return to the user when the I/O transfer is actually done.
This happens after the I/O request is dequeued by the interrupt service
routine, when the interrupt comes in indicating the completion of the last
transfer from the buffer. Therefore, the queue entry must contain the
synchronization information needed to return control to the user.
Exercise: Consider the alternatives for solving this problem.
You could put a pointer to the user process itself in the queue entry, so that
the interrupt handler puts the process back in the ready list when it finishes
the operation. Alternatively, you could put a flag in the queue entry; with
this scheme, the user can test the flag to see if the I/O is done. Finally,
you could put a semaphore in the queue entry; when the I/O is complete, the
interrupt service routine signals this semaphore, and when the user wants to
see if it is complete, the user can test this semaphore. Which of these
schemes is easiest to use, which is easiest to implement, and which maintains
the best isolation between the scheduler and the I/O code?
The simple structure suggested above assumes that each buffer provided
by the user holds exactly one block of data to be transferred to or from
disk. Most devices have a fixed hardware-defined block size, so this
requires that the user code be written in terms of this block size.
The Unix model allows random access to an arbitrary byte in the stream,
and it allows user buffers to be of any size. Some Unix device drivers
can do I/O directly to or from the user buffer if the user happens to request
exactly one disk block, but all Unix drivers allow non-aligned I/O transfers,
where the user buffer contains more or less than one block and begins at any
byte position. This is handled by an additional software layer between the
interface described here and the user-level interface to the kernel.
Exercise: Design an appropriate intermediate layer allowing
seeking to any byte in a file and reading or writing any number of bytes
starting at that byte, assuming that all file I/O must be done in 512 byte
blocks that are aligned on 512 byte boundaries. Is your solution efficient?
Input/Output Latency
For most input/output devices, we must recognize that some of the time taken
to transfer data to or from the device is spent waiting. For example, with
disk devices, we must move the heads to the appropriate track and then wait
for the disk to rotate until the desired data sector is under the heads.
During the time the heads are moving and while we wait for the right sector,
no input/output is taking place.
In general, we use the term latency to describe the time between
the initiation of an operation and the actual start of the operation.
Latency is a general English word, so for example, we can speak of a latent
infection during the time between the actual contact with some germ and
the first symptoms of that infection.
For disks, CD-ROM devices and similar rotating memory devices, we use the
term rotational latency to refer to the delay
caused by the rotation of the storage medium. For example, if a disk rotates
at 100 revolutions per second, we might have to wait up to 1/100 of a second
before a particular bit of data on that disk is rotated under the heads.
It is worth noting that we use the term rotational latency to refer not
only to delays caused by physical rotation, but also to delays caused by
logical rotation. Thus, we can speak of the rotational latency of a
shift-register memory, where nothing physical rotates, but the data circulates
around a shift register as if the data was on a rotating belt.
We also speak of head positioning latency on devices where there is
only one head that may need to be moved to the appropriate storage track
before data on that track can be read or written. Disks have been built
with one head permanently mounted above each track, but most disks have
only one head per recording surface, and this head must be moved to the
appropriate track prior to any use of that track.
Rotational and head positioning latency are both examples of
mechanical latency (unless we are speaking of the rotational latency
of a shift-register memory device). Some devices and device interfaces
also have latency characteristics that arise from electronic or logical
requirements. We may speak, for example, of the head selection latency
on a multi-head disk drive, although this is usually negligible. Some network
protocols lead to a delay before a machine is allowed to send a message;
we can describe these as network protocol latency.
Disk Input/Output
Disks
Rotating magnetic memory devices date back to the early 1950's. The first
such devices were drums, with one head fixed over every track. The drum
on ILLIAC I, for example, had 64 tracks, with an 8 by 8 array of heads.
(This drum is still owned by the U of Illinois computer science department).
In the late 1950's and early 1960's, workers at IBM figured out how to make
a record/playback head "fly" over the flat surface of a spinning disk.
The result was a new technology, the moving head disk, where one head
was positioned over each recording surface on the disk spindle and moved
in or out to the appropriate track.
For large-capacity, high performance disks, whether in 1965 or 2000,
a number of disk platters are rigidly attached to each disk spindle.
The number of platters on one spindle has varied greatly over the years.
IBM's original disk drive had over 20 platters, while it was common
to see disks with 12 platters on a common spindle in the early 1970's.
Usually, both sides of each platter are used for data storage; today's
larger capacity disks typically have between 4 and 8 recording surfaces.
When there are multiple recording surfaces, there is typically only one
head per surface, with all heads sharing a common head-positioning mechanism,
as illustrated below:
spindle
|| v------|| __________
=====[]===== ||___| |
|| ^------|| | head |
|| v------||___|positioner|
=====[]===== || |__________|
__||__^------||
| |
| motor|
|______|
It's fun to note that the tiny disks used in today's laptop computers
frequently have the motor literally inside the hub of the disk, while
the disk drives of the 1960's had motors like those found in
washing machines, with a belt-drive from motor to disk spindle.
Moving head disks did not displace older technologies immediately.
Drums continued to be made until the mid 1960's, and fixed head disks,
disks with one head fixed over every track, persisted into the 1970's.
Drums and fixed-head disks were particularly popular as paging and
swapping devices because their lower latency led to reduced page-fault
service times. At the other extreme, moving head disks have been made
with more than one head per recording surface, and they have been made
with more than one head positioner per disk drive. Such eccentric
designs are uncommon and will not be discussed here!
The first production moving-head disks with one head per surface date
from the very
early 1960's. The IBM 1401 computer, introduced in 1961, came with the
IBM 1505 disk drive (the combination was sold under the name RAMAC -- The
Random Access Method of Accounting Control, named after its initial business
application).
This disk drive had 25 platters, 200 tracks per surface, 5 sectors
per track and 200 6-bit bytes per sector. 5 and 10 platter disks were
common in the late 1960's and early 1970's, with platters about 14 inches
in diameter. Some early disks used hydraulic head positioning mechanisms,
but the dominant head positioning technology for high performance disks
from the late 1960's to the present has been the voice-coil servo drive.
Reynold B. Johnson, 1906-1998, led the IBM lab in San Jose that developed
the RAMAC; he also developed one of the first mark-sense readers and the
1/2-inch video-tape. (Professor Lindquist of the University of Iowa developed
another early mark-sense reader to grade his college entrance exam, the ACT.)
While the speed and capacity of rotating storage media have increased
immensely since the early days of computing, and sizes have fallen from
large free-standing cabinets to palm-sized enclosures, the basic physics
has remained unchanged, and thus, the basic issues with which the software
must deal are the same now as they were in the early 1950's.
Disk Latency
The rotational speed of all high performance disks is fixed; typical
disks of the early 1970's ran at 20 revolutions per second, while modern
high performance disks may rotate much faster. The stored
kinetic energy in the rotating disk is sufficient that it may take a few
seconds to bring it up to speed on power-up, and any speed change after that
is prohibitively expensive. This is as true of today's miniature disk drives
with flea-sized motors as it was in the days when the disk assembly weighed
20 pounds and was spun by a washing-machine motor.
Some floppy disk drives, notably the old Apple Macintosh 800K drives, used
a variable speed scheme to pack a bit more data onto a disk than they
could pack onto a fixed speed disk. This worked only because the floppy
disk itself is very lightweight and rotates relatively slowly; most of
the motor power is needed to overcome friction and it takes only a small
amount of power to accelerate or decelerate the disk.
It is easier to use a variable speed clock to solve this problem. While
variable speed clocks are easy to build at the low data rates typical of
floppy disks, variable speed clocking for high performance hard drives is
difficult and rare. The problem is that the highest performance drives move
data to and from the disk at speeds close to the limit of the digital logic
itself.
Magnetic disk recording surfaces are divided into numerous circular tracks;
typically, each track is subdivided into sectors, and the basic operation on
the disk is to read or write one sector of data. On a disk or drum with
one head fixed over each track, the time delay between the issue of a read
or write and the completion of that operation can be divided into two
components, the rotational latency time and the transfer time.
The rotational latency time is the time it takes for the rotation of the
disk to bring the start of the desired sector into position under the head.
This may vary from one full revolution, if the start of the desired sector
had just passed under the head at the time the transfer was requested, to
nothing, if the start of the desired sector is just about to pass under the
head at the time the transfer is requested.
The transfer time is the time it takes for the rotation of the disk to
rotate one sector past the head.
With the advent of moving head disk technology in the early 1960's, a new
component was added to the time taken for a single read or write operation,
the head positioning latency time or seek time. (The latter
term derives from the common use of the verb `to seek' to describe the act
of moving the heads to the desired track.) This is the time it takes for
the disk head positioning mechanism to move the heads to the desired track;
it may vary from nothing, in the case where the head was already on the
desired track, to an appreciable fraction of a second.
Typical head positioning hardware smoothly accelerates the head from the
starting track to the end track, and a bit of simple physics leads to the
result that the distance covered grows quadratically with time, so the head
movement time is proportional to the square root of the number of tracks
crossed. That is, if a move of one track takes one time unit,
a move of four tracks will take two time units and a move of nine
tracks will take three time units.
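Under the constant-acceleration assumption, a correspondingly rough model of
seek time can be written down directly; the single-track time constant below
is an assumption of the sketch, and real drives add a fixed settling time on
top of this.

    #include <math.h>

    /* A rough model of seek time under the constant-acceleration assumption:
       time grows with the square root of the number of tracks crossed.
       unit_time is the (assumed) time to move a single track. */
    double seek_time(int tracks_to_move, double unit_time)
    {
        if (tracks_to_move <= 0)
            return 0.0;
        return unit_time * sqrt((double) tracks_to_move);
    }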
It is worth noting that the standard CD format, being designed for audio
recording, uses a single spiral track. Like a magnetic disk, this track
is divided into sectors, and random access to the recorded data, whether
to play a specific track of an audio CD or to read a random sector of a
CD-ROM on a computer, typically involves two steps, first moving the head
blindly to the estimated radial distance of the desired sector, and then
reading the first sector header encountered in order to see if the move was
far enough. As a result, CD-ROM drives have head motion latency and
rotational latency not too different from conventional magnetic disks.
Note on Data Formats
A typical sector on a disk is organized as follows:
__________________________________________________________
|________|______________________________________|__________|
prolog data epilog
Typically, the prolog includes synchronizing marks that allow the clock
used for reading the sector to be synchronized with the data and that
allow the hardware to recognize the start of a disk sector. Following
this, the prolog typically includes the full disk address of the sector.
The disk address is included in the prolog for many reasons. First, it
allows verification that the head positioning hardware and disk surface
selection logic is working correctly, or at least, that it is working
the same way it was when the disk was formatted. Second, on some disks,
the head positioning servo actually uses the track or cylinder numbers
read off the disk to move the heads in or out to the right track, and
finally, the addressing logic of many disk systems relies on the sector
number recorded at the start of each sector to determine whether the
sector is the one the user wishes to access.
The end of the prolog is typically a checksum on the prolog. Once
the prolog is recorded by the disk formatter, it is usually never written
again. When a write operation is requested, the hardware begins by
reading and verifying the prolog. Only then does it begin to write the
data.
The epilog typically consists of a checksum. All read and write operations
must process the entire sector, either writing a full sector and then writing
the checksum, or reading the full sector and then verifying the checksum.
If the disk controller allows partial sector reads or writes, this is an
illusion. A partial sector write almost always overwrites the full sector,
appending arbitrary data to the user's data in order to achieve this. A
partial sector read only reports a fraction of the sector's data to the
user, but the entire sector must be read by the controller in order to verify
the checksum.
Formatting a disk
When a "raw" disk is mounted on a disk drive, it must be formatted.
Typically, this involves writing predetermined data in each
sector of each track of the disk. The important part of the formatting
operation from this perspective is not the data recorded but rather
that the formatter writes valid prologs and epilogs on all of the
sectors.
Usually, the disk formatting software also verifies that each sector is
readable and writable, gathers the good sectors into a free-space data
structure, and creates a file system of some type on the disk.
The latter subject will be discussed in more detail later, and in fact,
close examination of most formatting programs shows that these two
jobs are largely independent.
When the disk interface is configured for disk formatting, the hardware
generally allows the software to write out arbitrary data to fill entire
sectors, including the prolog and epilog. This gives the software the
option of laying down sectors in arbitrary order on each track of the
disk. If they are put down in consecutive ascending order, it may be
impossible to read or write consecutive sectors at full speed. This is
because, by the time the interrupt marking the end of one read or write
has been processed, it may be too late to start work on the next sector
without waiting a full revolution.
To avoid this problem, many systems arrange the sectors
on a disk in interleaved order, so that logically consecutive sectors
are separated by one or more other sectors. The following example
illustrates the interleaving of 16 sectors, using consecutive letters of
the alphabet to label logically consecutive sectors.
A B C D E F G H I J K L M N O P -- no interleaving
A I B J C K D L E M F N G O H P -- 2-way interleaving, almost
A G L B H M C I N D J O E K P F -- 3-way interleaving
A E I M B F J N C G K O D H L P -- 4-way interleaving, almost
A N K H E B O L I F C P M J G D -- 5-way interleaving
The 2-way and 4-way interleaving patterns shown above are imperfect
because 2 and 4 evenly divide 16. The 3 and 5-way interleaving patterns
are perfect. Interleaving can be done in hardware, or it can be done in
software.
When the physical addresses recorded in the sector prologs on the disk are
not interleaved, the software may interleave them. In general, if there
are n sectors, numbered consecutively from 0 to n-1 on each track,
and if we want m-way interleaving, then so long as n and m are relatively prime
(they have no common factors) the software can generate interleaved sector
numbers using the formula:
interleave(i) = (m * i) mod n
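Applying this formula in C to build a table mapping physical slots to logical
sector numbers is straightforward; the function below is a sketch using
invented names.

    /* Fill layout[] so that logical sector i is recorded at physical slot
       (m * i) mod n; n and m must be relatively prime for this to visit
       every slot exactly once. */
    void build_interleave(int layout[], int n, int m)
    {
        for (int i = 0; i < n; i++)
            layout[(m * i) % n] = i;   /* physical slot -> logical sector number */
    }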
The software interface
To read or write one sector of a disk, the software must:
- Specify the address of a buffer in memory.
- Specify one of the several disk drives that may be attached to a single
controller.
- Specify how to position the heads on the drive in question. This
is usually called the cylinder portion of the disk address.
- Specify which recording head or recording surface is to be used
on the selected drive.
- Specify which sector of the track selected by the cylinder and
surface is to be used.
- Specify the data transfer direction.
In simple disk controllers, each of the above might appear as a separate
register in the controller, writable by software. Typically, the actual
transfer begins only when the final command, the transfer direction, is
issued; it is the act of loading a value into this final register that
actually starts the controller in operation. Thus, the loading of a memory address
or a field of the disk address has no effect until the software actually
specifies a read or write operation, at which point, the values previously
loaded are used to initiate a seek on the appropriate drive, await the
right sector of the selected track, and then transfer the data.
When a data transfer is completed, a status register typically displays
this fact, along with an indication of any errors detected. In addition,
the done condition usually results in an interrupt request.
Added Complexity
Interface busses such as the SCSI bus separate the responsibility
for each operation into two halves, as illustrated below:
__________ __________ ____
CPU ----->| DMA | | DISK |-|disk|
|controller|<========>|controller|-|disk|
Memory <=>|__________| |__________|-|____|
This design adds significant complexity, and it is noteworthy that
such hardware has been common since the 1960s, when the IBM 360 channel
architecture served as the archetypal example. There, the DMA controller
was called a channel processor. Today's examples include the SCSI, USB
and Firewire controllers; as with the IBM 360, most of these allow
multiple disks (and their controllers) to be attached to one such
bus.
Of course, everything is smaller today than in the 1960's. Today's disk
controllers are made of a few chips mounted on a single printed circuit
board attached to each disk drive, and today's channel processors are
similarly small and in many cases are built into the motherboard of the
host computer. In the 1960's, each of these occupied a free-standing
cabinet full of electronics.
Then and now, commands from the CPU to the channel controller, perhaps a SCSI
or Firewire controller, direct it to transfer data to or from some memory
buffer and direct it to forward certain additional commands to the disk
controller.
All of the usual I/O operations are possible with such systems, but most such
systems also allow many nonsensical operations such as directing the channel
controller to read data from the disk to memory, while directing the disk
controller to write data from the memory to disk.
Another complicating factor is introduced by the fact that it is frequently
useful to start one disk seeking while another reads or writes data. This
is particularly important in large servers. To support this,
many disk controllers allow a seek operation to be initiated on some disk
without initiating either a read or write. Software that actually uses this
feature is common on the high performance operating systems and is only slowly
becoming common on personal computers as these migrate from single-user
operating systems to scaled-back versions of general purpose operating systems.
Virtual memory adds another layer of complexity. Generally, the user's
I/O buffer is not forced to lie entirely within one page of the user's address
space, but instead, it may span several pages. While these may be consecutive
in the user's virtual address space, they might be anywhere at all in the
physical address space. DMA controllers that understand the structure of page
tables have been built, but for most simple DMA controllers, the I/O software
must first gather data from the fragments of the user's buffer into a
physically contiguous buffer, and then do a DMA write from that buffer to
disk, and similarly, after a DMA read from disk, the system must scatter
the data from the system's buffer to the fragments of the user buffer.
Many high end DMA controllers support scatter read and
gather write operations. For these controllers, there is not a
single base/count register pair describing the user's buffer, but rather,
a list of base/count pairs describing the fragments of the user's buffer.
Usually, this list is in main memory, and the DMA controller has a
register that points to the next pair to process when the current pair
is completed. Only when the list is exhausted does the transfer stop.
In fact, this is an old feature! The IBM 360 channel controllers supported
an even more complex variation on this through what were called
channel programs.
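Such a list of base/count pairs might be described to a controller with
structures like the following sketch; real controllers each define their own
descriptor format, so the field names and sizes here are assumptions.

    #include <stdint.h>

    /* One fragment of the user's buffer: a base/count pair. */
    struct sg_entry {
        uint64_t base;     /* physical address of this fragment         */
        uint32_t count;    /* number of bytes to transfer to or from it */
    };

    /* The controller is given the address and length of this array and
       walks it, fragment by fragment, until the list is exhausted. */
    struct sg_list {
        struct sg_entry *entries;
        unsigned         n_entries;
    };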
Some disk controllers allow multiple sectors to be read in
sequence, so if the buffer is larger than one sector, it can be entirely
read in one disk transfer, so long as the desired sectors are consecutively
numbered on disk. Thus, depending on the controller, it is sometimes possible
to read or write as much as an entire track or even a cylinder
to or from the disk in one transfer.
Finally, at the other extreme of performance, some modest and low
performance disk controllers have been built without the full generality
of DMA. Instead, the disk controllers are built with an internal buffer
able to hold one sector or one track, and all disk I/O is to or from this
buffer. Prior to writing the disk, software running on the CPU must copy
the data from the application's memory area to the disk buffer, and after a
read, it is up to software to copy from the disk buffer to the user's
buffer. The disk buffer in such controllers may or may not be
addressable as a normal memory area; in the lowest performance disks,
data is transferred between the CPU and this buffer one byte at a time.
Disk Request Queues
Typically, assuming a simple DMA disk interface, without the added features
discussed above, a modern operating system will maintain a queue of disk
I/O requests. Each request is a record of a disk input or output operation
that has been requested by a user but not yet carried out by the disk
interface. A typical disk I/O queue will contain
records with a format such as the following:
Disk Request Record:
    Buffer Address in main memory
    Disk Address:
        cylinder
        surface
        sector
    transfer direction
    done semaphore
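In C, a request record along these lines might be declared roughly as follows; the semaphore type and the exact field names are placeholders for whatever the kernel actually provides:
    typedef int semaphore_t;   /* placeholder for the kernel's semaphore type */

    struct disk_request {
        void *buffer;                    /* buffer address in main memory */
        struct {                         /* disk address */
            unsigned cylinder;
            unsigned surface;
            unsigned sector;
        } address;
        enum { DISK_READ, DISK_WRITE } direction;   /* transfer direction */
        semaphore_t done;                /* signalled when the transfer completes */
        struct disk_request *next;       /* link field for the request queue */
    };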
A user process, or rather, the file system or page fault handler acting on
behalf of that user, might pose a disk request as follows:
low-level user disk I/O request routine:
    enqueue( request, disk-queue )
    enable( disk_interrupt )
    P( request.semaphore )    -- wait for I/O complete
    return
The interrupt service routine might operate as follows:
disk interrupt service routine:
    disable( disk_interrupt )
    if request not null then V( request.semaphore )
        -- the above lets a waiting user continue
    if not empty( disk-queue ) then
        request := dequeue( disk-queue )
        buffer-address-register := request.buffer-address
        disk-address-registers := request.disk-address
        disk-command := request.transfer-direction
        enable( disk_interrupt )
        return
    else
        request := null
        return
Disk interrupts signal the completion of the previously requested operation,
so the first thing the interrupt service routine does is signal the semaphore
on which the process requesting the previous service may have blocked.
Having done this, the interrupt service routine continues by starting a
new disk transfer, if one is pending in the disk request queue.
Note that, if the queue is empty, the disk routine returns, leaving the
interrupt disabled but doing nothing to make the controller drop its
interrupt request. When a user enqueues a new request, the user code
enables the interrupt, causing an immediate interrupt and allowing the
controller to immediately start the newly requested transfer.
Exercise: The above pseudocode for the disk interrupt service routine
and for enqueueing disk requests assumes that disk I/O is being requested for
typical user code. Imagine the case where some disk requests are enqueued
by the page-fault service routine. In that case, it may be awkward for the
service routine to block on a semaphore (some models of interrupt service
prohibit service routines from blocking on anything), so it may be better
to have the disk interrupt service routine finish the processing of the
page fault. Propose an augmented data structure for the disk request queue
and pseudocode for the page-fault handler and disk interrupt service routine
to do this.
Improving Disk Performance, Disk Caches
One way to improve the performance of a disk system is to use the queue
of pending disk requests as a cache. New read requests are only posted
to the queue if no write is pending to the same sector, and new write
requests cause previously posted write requests addressed to the same
sector to be deleted. The following version of the user's code to post
a new disk I/O request to the disk request queue describes this:
low-level user disk I/O request routine:
    if request.disk-address matches prior.disk-address
       where prior is a request already in the disk-queue
        if request.direction = in and prior.direction = out
            copy prior.buffer into request.buffer
            return    -- the read is satisfied from the pending write
        if request.direction = out and prior.direction = out
            delete prior from disk-queue
            -- fall through and enqueue the new write in its place
    enqueue( request, disk-queue )
    enable( disk_interrupt )
    P( request.semaphore )
    return
The result of this logic is that there will never be more than one
write queued for any disk sector, and if a write is queued for a sector, there
will be no reads queued for that sector.
Exercise: Multiple reads from the same sector can also be combined.
Propose additions to the disk request queue data structure and to the
pseudocode given here that would do this.
The use of a disk cache poses risks, whether the cache is based on
the request queue, as outlined above, or whether it is based on specifically
reserved buffer space in main memory. In either case, there is the possibility
that some write may never be recorded to disk! If some software keeps
writing some sector, each such write will cancel the write already in the
queue, and the result is that no write will ever be done. So long as there
are no system failures, this is not a problem, but if there is a power
failure, this could cause the loss of very important data. Therefore, it
is common for systems that support disk cache schemes to include, in the
kernel, a specific operation that forces all pending writes to a disk
to be completed. In the UNIX system, the sync() kernel call does
this.
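For example, an application that has just written critical data can ask that cached writes be pushed out using the standard POSIX calls; a minimal sketch:
    #include <unistd.h>

    /* fd is an open file descriptor for the file holding the critical data */
    void commit_critical_data(int fd)
    {
        fsync(fd);   /* force this file's dirty blocks out to the device */
        sync();      /* and schedule all cached writes, system wide */
    }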
Improving Disk Performance, Disk Schedulers
Disk scheduler software can markedly improve the performance of disk I/O for
any disk that serves multiple independent sources of I/O requests. This
is typical of disks in large timesharing systems and of disks in network
file servers. The key is to
note that, if there are multiple requests in the disk I/O queue, the time
taken to process those requests depends on the order in which they are
processed. This is because the latency time depends on the order in which
they are accessed.
For example, consider a request queue containing two requests for operations on
track 1 and two requests for operations on track 400. Clearly, if these are
done in the order (1, 400, 1, 400) the result will be quite a bit slower
than if they are done in the order (1, 1, 400, 400). The former requires
three long seeks, while the latter requires only one long seek.
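The difference is easy to quantify if seek cost is taken to be roughly proportional to the number of cylinders crossed; the following throwaway C fragment makes the comparison explicit (the starting cylinder is an arbitrary assumption):
    #include <stdio.h>
    #include <stdlib.h>

    /* total head movement, in cylinders, for a given service order */
    static int total_seek(const int order[], int n, int start)
    {
        int total = 0, at = start;
        for (int i = 0; i < n; i++) {
            total += abs(order[i] - at);
            at = order[i];
        }
        return total;
    }

    int main(void)
    {
        int bad[]  = { 1, 400, 1, 400 };
        int good[] = { 1, 1, 400, 400 };
        /* prints 1197 vs 399 when the head starts at cylinder 1 */
        printf("%d vs %d\n", total_seek(bad, 4, 1), total_seek(good, 4, 1));
        return 0;
    }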
Thus, if disk requests are scheduled in the appropriate order, disk performance
can be markedly improved. It is generally impossible to point to a separate
procedure in the operating system and say "this is the disk scheduler".
Instead, the disk scheduler is broken between the two ends of the disk request
queue, and the queue data structure is itself the structure used to organize
the scheduler.
Typically, the queue for each disk is broken into separate pieces for
each cylinder, and the enqueue routine on the user side of the disk
request queue is made responsible for enqueueing the request in the
appropriate cylinder's queue. The scheduling policy is then the responsibility
of the disk interrupt service routine.
Typical scheduling policies are:
- FIFO, the base case, is inherently fair to all users, but it does nothing
to improve throughput. (A fun exercise is to quantify fairness and then
prove that the FIFO policy is optimal).
- Shortest seek time first.
After processing all requests for the current cylinder, the next cylinder
to be processed is the one with pending requests that is closest to the
current cylinder. This policy produces good throughput (the number of I/O
requests processed per second), but it is unfair: it can cause some users
to wait indefinitely. A short C sketch of this selection rule appears after
this list. (The optimality of this method depends on how the
seek time varies as a function of the number of cylinders moved. For a fun
exercise, figure out the class of functions for which this is optimal.)
- The bidirectional elevator algorithm.
After processing all requests for the current cylinder, the next cylinder
to be processed is the one closest to the current cylinder in a fixed
direction, for example, towards the center of the disk. Only if no work
is pending in any cylinder in that direction does the preferred direction
reverse so that future seeks will be in the opposite direction. This
technique offers a good throughput, but it discriminates against disk requests
near either extreme of the range of valid cylinder numbers.
- The unidirectional elevator algorithm.
After processing all requests for the current cylinder, the next cylinder
to be processed is the one closest to the current cylinder in a fixed
direction, for example, towards the center of the disk. Only if no work
is pending in any cylinder in that direction does a seek in the opposite
direction take place. In that case, the most distant available cylinder
with pending work is selected, and the preferred direction remains unchanged.
This technique offers equal service to all cylinders while only slightly
reducing the throughput of the bidirectional elevator.
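Here is the sketch of the shortest-seek-time-first selection rule promised above, written in C; queue_empty() and the per-cylinder queue array it implies are placeholders for the real kernel data structures:
    #define NO_CYLINDER (-1)

    /* assumed to report whether cylinder c has any pending requests */
    extern int queue_empty(int c);

    /* pick the non-empty cylinder closest to the current head position */
    int sstf_next_cylinder(int current, int cylinders)
    {
        int best = NO_CYLINDER;
        int best_dist = cylinders + 1;
        for (int c = 0; c < cylinders; c++) {
            int dist = c > current ? c - current : current - c;
            if (!queue_empty(c) && dist < best_dist) {
                best = c;
                best_dist = dist;
            }
        }
        return best;   /* NO_CYLINDER means no work is pending anywhere */
    }
Note that ties are broken arbitrarily, and that nothing here prevents the starvation the text warns about.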
The elevator algorithms are based on the model used by elevators in
tall buildings. The elevator does not visit floors in FIFO order,
that is, going from floor to floor in the order those floors were requested
by users pushing buttons requesting visits to various levels of the
building. Instead, the elevator sweeps systematically up the building,
visiting those floors for which it has pending requests and stopping when
it reaches the highest requested floor. At that point, it begins a downward
sweep, stopping at the lowest requested floor. This is exactly the
bidirectional elevator algorithm, with floor numbers substituted for
cylinder numbers.
                       Queues for each
                        disk cylinder
                        ---------
                        0 0 0 0 0|
                        ---------
                        0 0 0    |
    Source              ---------
      of      ---->     0 0 0 0  |
    requests            ---------          /---\    Cylinder
                        0 0      |        |     |   currently
                        ---------          \---/    being
                        0        |                  served
                        ---------
The conflict between throughput and fairness illustrated by the
differences between the unidirectional and bidirectional elevator
algorithms is typical of a number of scheduling issues in operating systems.
Attempts to optimize against a single goal are fairly straightforward,
but there is no solution which is simultaneously optimal against all
criteria!
The following pseudocode implements the elevator algorithm; in this
code, the disk request queue has been replaced with an array of
queues, one per cylinder:
low-level user disk I/O request routine:
    enqueue( request, disk-queue[ request.cylinder ] )
    enable( disk_interrupt )
    P( request.semaphore )
The interrupt service routine consumes requests from the queues as
follows:
disk interrupt service routine:
    disable( disk_interrupt )
    if request not null then V( request.semaphore )
        -- the above lets a waiting user continue
    if empty( current-queue ) then
        -- time to move the elevator to a new floor
        i := current-cylinder
        repeat
            i := (i + 1) mod cylinders
        until i = current-cylinder or not empty( disk-queue[i] )
        current-queue := disk-queue[i]
        disk-queue[i] := empty
        current-cylinder := i
    end if
    if not empty( current-queue ) then
        request := dequeue( current-queue )
        buffer-address-register := request.buffer-address
        disk-address-registers := request.disk-address
        disk-command := request.transfer-direction
        enable( disk_interrupt )
        return
    else
        request := null
        return
Note that the disk interrupt service routine uses a separate queue,
current-queue to hold the requests for the cylinder
it is currently servicing. This prevents late arrivals at that cylinder
from being served until the next visit to that cylinder; if this was not
done, and if some process continuously maintained a supply of pending
disk addresses for one cylinder, no other cylinder would ever be served.
Another detail to note is that the search for the next cylinder to
process need not be a sequential search, as outlined above. If the
disk queue array is maintained as a circular sparse array (a linked
list structure), so that empty queues are not represented at all, the
next queue will always be adjacent to the current one and the computation
shown as a search above can be recoded as a simple pointer assignment.
(for a pleasant exercise, write the necessary code.)
Device Independence
The desire for device independent input/output was one of the original
motives behind the development of polymorphic object-oriented programming
methodologies. Typically, we ask that each device support some set
of input/output operations, for example:
    Read( ch )                Arbitrary software
    Write( ch )      <----->  that performs actions
    Read( buf, add )          that look like I/O to
    Write( buf, add )         the user, and may even
                              involve I/O.
                              Used for disk files,
                              windows, network ports,
                              and similar abstractions.
The advantages of using a standard software interface for every device,
whether disk or terminal or printer or network port have been known since
the development of the FORTRAN programming language in the late 1950's.
To achieve this, it is necessary to have a user-level
device interface that is sufficiently flexible to operate with both random
access and sequential devices, supporting both stream and block modes of
operation on any device. The low level Unix interface (read, write, seek)
is one good example of such an interface.
This interface may be considered a virtual device interface, with the
input output software of the operating system serving the function of
virtual to physical input output operation translation.
The virtual device may also be considered as an object, in the sense used
in object-oriented programming. In fact, the first clear description of how
objects in this sense can be implemented came from the field of operating
systems, in a paper by Christopher Strachey in his OS 6 system from the
mid 1960's. In this system, he used the following implementation:
An open file descriptor
                    ________
    F ------------>|   o----|--------> Read( F, ch )
                   |   o----|--------> Write( F, ch )
                   |________|
    Data specific  |        |   Pointers to code
    to this file   |        |   to do the device dependent
    and device     |        |   part of the I/O operation
                   |________|
    Read( F, ch )             <------ the device independent read
    {                                 routine called by the user.
        F^.Read( F, ch )
    }                                 It simply calls the device
                                      specific routine, passing
                                      the pointer to the data
                                      needed to do the operation.
This was originally implemented in BCPL (the Basic Computer
Programming Language), a predecessor of C based on CPL.
Strachey invented CPL, which stood either for Christopher's
Programming Language, the Combined Programming Language, or
the Cambridge Programming Language, depending on who you
talk to.
In systems with no protection, a file variable can be as
simple as a pointer to the descriptor for the open file.
This descriptor is typically created by the Open system
service. If there is protection, the descriptor must be
protected from the user program, so F, the file variable,
is more complex than just a pointer. On Unix, for example,
F is an integer index into a system maintained array of
pointers to descriptors (the array is called the file table
of the current process).
In object oriented terms, each of Strachey's open files is an object,
an instance of the class file.
The class file is amenable to multiple
implementations, each requiring different code; such a type
is called a polymorphic type, and one requirement of the
operating system context is that multiple implementations of
the abstraction coexist, so that, for example, a user can
have one open file refer to a file on disk, while another
represents the keyboard.
In object oriented programming languages, the same approach
described here is commonly used to implement polymorphic
objects. The data structure representing the object contains
pointers to the action routines that implement the abstract
operations on the object.
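In C, which lacks classes, the same arrangement is usually expressed as a struct full of function pointers; the following sketch merely illustrates Strachey's idea and is not the interface of any particular system:
    struct open_file {
        /* pointers to the device dependent implementation */
        int (*read)(struct open_file *f, char *ch);
        int (*write)(struct open_file *f, char ch);
        void *data;   /* device and file specific state */
    };

    /* the device independent read routine called by the user; it simply
       forwards to the device specific routine through the descriptor */
    int file_read(struct open_file *f, char *ch)
    {
        return f->read(f, ch);
    }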
File Systems
An open disk file is an example of a virtual device just as much as a
disk or a printer interface is, when presented through a device
independent interface. Thus, we can extend our polymorphic class
hierarchy for open files to include not only open files which refer
directly to physical devices such as printer ports, serial communications
lines, and disk drives, but also open files that refer to files on the
file system, windows under the window manager, and other virtual I/O
devices.
Therefore, the underlying data structure for an open disk file under a
modern file system might resemble that shown in the following diagram:
               ________
    F ------->|   o----|--------> File System
              |   o----|--------> Code
              |________|
              |        |            ________
              |   o----|---------->|   o----|------> Disk I/O
              |        |           |   o----|------> Code
              |________|           |________|
    Descriptor of                  |        |   Data needed to
    open disk file                 |        |   access the disk
                                   |________|
                                   Descriptor of
                                   the disk itself.
Code to read and write an open disk file
* May convert block size to that of disk.
* Translates address of block in file
to disk address on the real disk.
The latter is Virtual Address Translation
The descriptor of the open disk file will typically
contain buffers for block size conversions, and it
will contain the data structure needed to convert
addresses from file-relative form to absolute disk
addresses.
The descriptor of the disk device itself will
typically contain the information needed for
disk scheduling, the queue needed to communicate
with the disk interrupt service routines, and other
information specific to the actual disk being used.
The read and write routines for the open disk file
can be written in such a way that they call the
standard, device independent read and write routines
of the disk itself -- in effect, the read and write
routines that implement the open disk file can be
written as if they were user-level code.
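Under these assumptions, the read routine of an open disk file can be sketched in C as an ordinary client of the disk's own standard interface; the struct layout and disk_address() are illustrative, standing in for whatever description data and translation the file system really uses:
    /* a hypothetical device independent block interface, as in the text */
    struct block_device {
        int (*read)(struct block_device *d, void *buffer, int sector);
        int (*write)(struct block_device *d, void *buffer, int sector);
        void *data;
    };

    struct open_disk_file {
        struct block_device *disk;   /* the underlying disk or disk cache */
        /* ... buffers, mapping table, and other description data ... */
    };

    /* hypothetical file-relative to disk-relative sector translation */
    extern int disk_address(struct open_disk_file *f, int file_sector);

    /* read one sector of the open file by calling the disk through its own
       standard interface, exactly as user-level code would */
    int disk_file_read(struct open_disk_file *f, void *buffer, int file_sector)
    {
        return f->disk->read(f->disk, buffer, disk_address(f, file_sector));
    }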
Real time Clocks
Timer hardware
Before discussing file systems, it's useful to spend some time on a simpler
class of devices: timers. In a sense, a timer is the simplest of all I/O
devices, because it performs no I/O. There are many kinds of timers; the
simplest is the periodic timer. All it does is request a periodic interrupt.
Most modern operating systems, Unix, Windows and Mac/OS included, use periodic
timers. In the case of early Unix the timer fired 60 times per second, a
frequency derived from the power supply. Modern Unix systems almost
universally have the system clock fire 100 times per second.
In the case of the DOS and Windows environments, the rate has classically
been 18.2 ticks per second. Most modern systems use hardware that is able to
provide far more general timer service, but commercially available operating
systems (excluding those sold as real-time systems)
frequently ignore the generality of the hardware in favor of
using it to provide a simple periodic interrupt.
As an example of a very simple hardware interface, consider a timer that
fires every millisecond. This might present the following interface to
the software:
expired -- once a millisecond, the timer sets this single bit interface
register. When this bit is set, the timer also requests an interrupt. The
timer interrupt service software must reset this bit before returning from
the interrupt or re-enabling interrupts, in order to turn off the interrupt
request.
This simple timer will be used for the remainder of this lecture, but note
that far more complex timers are common. For example, interval timers
contain a register that is decremented at a constant rate, for example,
once per microsecond. The timer requests an interrupt when the register
reaches zero, and the interrupt service routine may load a new count in the
register to request the next interrupt.
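As a sketch of the software side of such a simple timer, consider the following C fragment; the register address and the handler name are invented for the example:
    #include <stdint.h>

    /* hypothetical memory-mapped timer register; writing 0 clears the
       expired bit and drops the interrupt request */
    #define TIMER_EXPIRED ((volatile uint32_t *) 0xFFFF0100)

    volatile uint64_t clock_ms;   /* milliseconds since system startup */

    void timer_interrupt_handler(void)
    {
        *TIMER_EXPIRED = 0;   /* reset the expired bit, as the text requires */
        clock_ms++;           /* the rest of the periodic work goes here */
    }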
Timer Software
It is rare to find application software that requires a periodic interrupt
at the same rate as the hardware period of the timer. When the application
requires a periodic interrupt at longer intervals, however, this can be
synthesized from the basic periodic interrupt of the hardware clock as
follows:
periodic-interrupt-service-routine:
    reset( expired )
    countdown := countdown - 1
    if countdown = 0 then call user-service
    return
Here, the procedure user-service is called at the end of an
interval defined by the periodic decrementing of the variable countdown;
user-service should reset countdown each time it is called. If
user-service always sets countdown to the same value, the user
routine will be called periodically; if user-service sets different
values each time it is called, it will be aperiodic.
This creates, in software, a programmable interval timer, but most modern
real-time-clock hardware directly supports this function, with a
countdown register decrementing at several megahertz and the rule that
the hardware will request an interrupt when the register reaches zero.
Application programs rarely want just one interval timer. What they
typically want is a service such as the following, callable from within
any user process:
delay( t ) -- when a process calls this operating system service,
that process will be delayed t time units (the particular time unit
depends on the system; we will use milliseconds for the examples presented
here).
Exercise: How can you implement the example delay()
service in terms of the Unix kernel's timer primitives -- see the man
page for setitimer() for documentation.
On a system with only one process and an interval timer, this service
would be trivial to implement. On a system with more than one process
and a periodic timer, this service is far more interesting. In such a
context, many processes may be awaiting different times, and there may
easily be more than one process waiting for the same time instant.
Thus, there must be a list of waiting processes, with a record, for each
process, of the time it is awaiting. When a timer expires, it must somehow
search this list and wake up each process waiting for the current instant.
The need to block processes and then wake them up suggests that each timer
record will have a semaphore on which the process blocks. This leads to
the following data structure:
Timer record:
    link fields, as needed for list management
    delay -- the requested time delay
    sem   -- semaphore used to block the process
Timer list:
    a linked list of timer records
The delay service seen by an application program would then be as follows:
delay( t ):
    get a timer record r
    initialize r.sem to a count of zero
    r.delay := t
    append( r, timer-list )
    P( r.sem )
A single timer record can be permanently attached to each process. The
interrupt routine needed to support this service is as follows:
periodic-interrupt-service-routine:
    reset( expired )
    for each timer record r in timer-list
        r.delay := r.delay - 1
        if r.delay <= 0 then
            remove( r, timer-list )
            V( r.sem )
        endif
    endloop
    return
Code written along the above lines is actually used on some systems, but
it begins to behave very badly on systems with large numbers of waiting
processes. The problem is that interrupt service routines should never
perform open ended computations such as scanning of linked lists. This
work should be moved to an application process if at all possible.
Nonetheless, the above code implements what can properly be called a virtual
interval timer available to each application process. This notion of
virtual timers is an important example of virtual device interfaces; its
primary importance is that timers, as a class of device, have the simplest
of semantics.
Exercise: Explore the documentation for the UNIX timer services
and write a critical essay on the deficiencies of these services. There are
many!
Improving Timer Performance
The key to avoiding complex interrupt processing in the implementation of
timer services is to rely on a sorted list of timer records, each holding
the time being awaited. This requires the following data structure:
Timer record:
    link fields, as needed for list management
    time -- the requested time at which the timer should fire
    sem  -- semaphore used to block the process
Timer queue:
    a priority queue of timer records, sorted by time
Clock:
    the current time
There are many priority queue implementations; only the simplest, based on
a linear list, is suggested by the figure below:
     _____________     ______     ______     ______
    | Timer Queue |-->| Link |-->| Link |-->| Link |--X
    |_____________|   |------|   |------|   |------|
                      | time |   | time |   | time |
                    / |------|   |------|   |------|
     Sort by order /  | sem  |   | sem  |   | sem  |
     of time fields!  |______|   |______|   |______|
If the system is to run for long periods, the variables used to hold clock
and the time fields of timer records must have sufficient precision that
overflows are not likely during the life of the system. A year's worth
of milliseconds will overflow a 32 bit counter, so realistic systems
should use 48 or 64 bit counters.
Some systems maintain all time values in, for example, milliseconds since
some arbitrary dawn of time. Others maintain time values in more expensive
forms, with fields divided out for year, day, hour, minute, second and
millisecond. There is little reason to do the latter, however, since these
divisions are usually only needed for the output of dates, and internal
time and interval arithmetic is simplified by holding time values in a simple
form.
Exercise: The classic Unix representation of the date uses a signed
32 bit count of the seconds since Midnight January 1, 1970, GMT. When will
the UNIX date overflow? (It is interesting to note that MacOS and UNIX
were both designed without Y2K errors in their kernels.)
Assuming that the time fields are absolute times, in milliseconds, and that
the timer interrupt service routine maintains a global variable called
clock, measuring milliseconds from the beginning of time, the delay service
can be coded as follows:
delay( t ):
    get a timer record r
    initialize r.sem to a count of zero
    r.time := clock + t
    enqueue( r, timer-queue )
    P( r.sem )
The interrupt routine needed to support this service is as follows:
periodic-interrupt-service-routine:
    reset( expired )
    clock := clock + 1    -- count time
    while not empty( timer-queue ) and head( timer-queue ).time <= clock
        r := dequeue( timer-queue )
        V( r.sem )
    end loop
    return
This code rests on a priority queue. The simplest such data structure
is a sorted linked list, where the enqueue service sorts new items
into the list in time order, the head of the list is always available,
and the dequeue service simply removes the head element from the
list.
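A minimal C sketch of that sorted-list enqueue, assuming singly linked timer records like those described above (and a placeholder semaphore type), might read:
    typedef int semaphore_t;   /* placeholder for the kernel's semaphore type */

    struct timer_record {
        struct timer_record *next;
        unsigned long long time;   /* absolute time at which to fire */
        semaphore_t sem;           /* the waiting process blocks on this */
    };

    /* insert r so that the list remains sorted by time; the head pointer is
       passed by reference so that insertion at the front also works */
    void timer_enqueue(struct timer_record **head, struct timer_record *r)
    {
        while (*head != NULL && (*head)->time <= r->time)
            head = &(*head)->next;   /* walk to the insertion point */
        r->next = *head;
        *head = r;
    }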
This simple implementation moves most of the work to the enqueue service,
where it belongs, but if the list is long, the enqueue service may take
some time. There are faster algorithms that use binary trees or heap-ordered
trees to achieve O(log n) operation times on the queue. You can get more
information on this subject from "An Empirical Comparison of Priority Queue
and Event Set Implementations" in Communications of the ACM, 29,
April 1986, pages 300-311.
Code for several interesting priority queue algorithms is available
on the web.
Adapting the above code to a programmable interval timer poses some problems.
Primary among these is the problem of maintaining the clock. Every time the
interval timer expires, the clock can be updated to reflect the interval that
has just passed, but for inquiries about the clock value between updates,
the running count (countdown, in the example given earlier) must
be inspected to get the current value.
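In other words, the current time is the clock value recorded at the last timer interrupt plus the part of the current interval that has already elapsed; as a sketch, with invented names for the hardware access involved:
    /* clock_value    -- the text's clock, updated once per timer interrupt
       interval       -- the count loaded into the timer at the last interrupt
       read_countdown -- hypothetical access to the timer's running count */
    extern volatile unsigned long long clock_value;
    extern unsigned long interval;
    extern unsigned long read_countdown(void);

    unsigned long long current_time(void)
    {
        return clock_value + (interval - read_countdown());
    }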
Date Overflow Considerations
The above code relies on a global variable "clock" that can be incremented
once per timer interrupt. This is a source of significant problems in real
systems. For example, consider a timer that increments once per millisecond.
This will increment by 1000 per second (so about 10 bits of timer precision
are devoted to counting seconds). Counting 60 seconds per minute takes roughly
another 6 bits of precision, and counting 60 minutes per hour takes another
6 bits, leaving 10 bits out of a typical 32 bit word available for counting
hours. 1000 hours is just 40 days of 25 hours each, so we can quickly estimate
that our 32 bit counter will overflow in just 6 weeks! If we replace rough
estimates such as those used above with accurate arithmetic, we get a clock
lifetime of 49.71027 days.
Using the above algorithm, we would expect our system to run correctly for
a month, and then we would expect a brief burst of irregular behavior as the
clock rolled over, after which it would deliver correct performance for
another month. This is hardly acceptable!
Therefore, we seek a better solution!
One approach is to simply use a higher precision counter. For example, we
could replace our 32 bit counter with a 64 bit counter; with this, our
system would last for millennia instead of mere months. We pay a price,
however, in slow arithmetic, unless we're running on a machine with a 64 bit
ALU. Actually, the situation is not that bad. On most machines where
a register to register add takes one instruction, a double-word register
to register add can be done in two or three instructions.
Another approach is to note that, while we may well want our system to run
for longer than one month, most of the timed delays requested by users are
likely to be on the order of seconds. We can easily rewrite the timer data
structures to support this by replacing the time field in each timer record
with a relative delay field:
Timer record:
    link fields, as needed for list management
    delay -- the delay between the previous timer and this one
    sem   -- semaphore used to block the process
Timer queue:
    a sequential queue of timer records, sorted by expiration time
Clock:
    the current time
     _____________     ______     ______     ______
    | Timer Queue |-->| Link |-->| Link |-->| Link |--X
    |_____________|   |------|   |------|   |------|
                      | delay|   | delay|   | delay|
    Relative delay  / |------|   |------|   |------|
    from one record   | sem  |   | sem  |   | sem  |
    to the next.      |______|   |______|   |______|
This data structure adds only marginally to the complexity of the timer
services, while it allows very fast arithmetic to be used for typical short
delays.
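Insertion into such a delta list is only slightly more involved than insertion into the absolute-time list: the requested delay is reduced by the delays of the records passed over, and the record after the insertion point gives up the remainder. A C sketch, under the same placeholder assumptions as before:
    typedef int semaphore_t;   /* placeholder for the kernel's semaphore type */

    struct timer_record {
        struct timer_record *next;
        unsigned long delay;   /* delay relative to the previous record */
        semaphore_t sem;       /* the waiting process blocks on this */
    };

    /* insert a new record that should fire t ticks from now */
    void delta_enqueue(struct timer_record **head, struct timer_record *r,
                       unsigned long t)
    {
        while (*head != NULL && (*head)->delay <= t) {
            t -= (*head)->delay;      /* consume the delay of each record passed */
            head = &(*head)->next;
        }
        r->delay = t;
        r->next = *head;
        if (*head != NULL)
            (*head)->delay -= t;      /* the successor now waits relative to r */
        *head = r;
    }
With this representation the interrupt routine only ever decrements the delay field of the first record, no matter how many timers are outstanding.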
Summary: The above notes suggest a number of ways to implement
a large number of virtual timers for use by user processes, when given only
a single hardware timer. There are many more ways of doing this! This
lesson applies to any of a number of virtual resources created by operating
systems, so, for example, a file system may create many virtual disks from
one physical disk in many different ways, and a window manager may create
many virtual display screens within the confines of one physical display
screen in many different ways.
Real Time Considerations
The semantics of timers is not always immediately obvious. When a process
says delay s seconds, this does not mean to wait exactly s seconds and
then continue. Rather, it means to wait at least s seconds. If there are
other ready processes at the end of the wait, we generally make no attempt
to predict which process will be running at the end of the wait; we merely
insist that the waiting process become ready when the timer expires.
This is consistent with the usual weak assumptions made about parallel
processes -- that they advance unpredictably, so that we can make no
assumptions about how CPU resources are divided between them. Usually,
proof of correctness under these weak assumptions is easier than proof under
more complex assumptions, so we use these weak assumptions even when stronger
assumptions are reasonable.
Note that the presence of a timed-delay or equivalent service is one of the
defining characteristics of a real-time operating system! Real-time
applications demand stronger assumptions. Typically, the defining
characteristic of a real-time system is the presence of deadlines by which
the software must complete some actions.
When parallel programming is applied to real-time systems, we must ask
if there is a feasible schedule. That is, we ask not only
if there is enough CPU time overall to complete the entire computation,
but we also ask if each task can accomplish the necessary computation
before each deadline.
If there is no feasible schedule, the system cannot succeed, but if there
is a feasible schedule, we must also ask whether the scheduling algorithm
is able to find the feasible schedule among the infinite set of possible
schedules. Two classes of real-time scheduling algorithms have been widely
studied.
Rate monotonic priority scheduling will always find feasible schedules
if there is approximately 1/3 idle CPU capacity. In this scheme, processes
are scheduled with priorities related to the intervals between deadlines.
Processes dealing with deadlines that have a short interval are given
high priorities, while those dealing with deadlines that have a long
interval are given relatively lower priorities.
Usually, rate monotonic scheduling is discussed under the assumption that
intervals are periodic, in which case, the priority of a process is fixed
proportionally to the rate at which it must meet deadlines. In fact, there
is no reason to limit this scheme to periodic deadlines, but introducing
aperiodic deadlines does force the system to use dynamically varying
priorities.
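The figure of roughly 1/3 idle capacity comes from the classical Liu and Layland bound: n periodic tasks are guaranteed schedulable under rate monotonic priorities if their total utilization does not exceed n(2^(1/n) - 1), which falls toward ln 2, about 0.69, as n grows. A small check of the bound:
    #include <math.h>
    #include <stdio.h>

    /* rate monotonic utilization bound for n periodic tasks */
    double rm_bound(int n)
    {
        return n * (pow(2.0, 1.0 / n) - 1.0);
    }

    int main(void)
    {
        /* prints roughly 0.83, 0.72 and 0.69 -- hence the rule of thumb
           of keeping about a third of the CPU idle */
        printf("%f %f %f\n", rm_bound(2), rm_bound(10), rm_bound(1000));
        return 0;
    }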
Deadline based scheduling uses the time of the next deadline to be met
by a process as its priority. Thus, if a process must finish some job by time
t, the priority of that job is t. Distant future deadlines have low priority,
while imminent deadlines have high priority. Deadline based scheduling is
somewhat uncommon in practice, but it can find a feasible schedule even when
there is no excess CPU capacity.
The File System
File System Layers
A well designed file system (as opposed, for example, to the file
systems used on most commercially successful operating systems)
will typically be built in many layers,
for example, with a structure such as the following:
- The device driver
- Interrupt service routine
- Disk request scheduler
- Low level disk-request queue interface
- Low-level file system
- Buffered streams
- Named files
The low-level file system offers a device interface quite similar to that
offered by the device driver, and possibly identical to it. The difference
is that the device driver enqueues requests for I/O to a real device, while
the low-level file system accepts requests for I/O to virtual devices that
are each implemented by one or more real devices.
Higher level layers of the file system create the illusion of buffered stream
I/O, and independence from the actual block sizes of the underlying device,
and create a system of directories and file names for the files implemented
by the low-level file system. Here, we will focus on how a decent low level
file system can be built.
A standard low-level interface
If you assume a standard device interface, shown pictorially as:
_______
| READ | -- device.read( buffer, device-address )
|_______|
| WRITE | -- device.write( buffer, device-address )
|_______|
It is clear that this can be supported by the low-level interface to the
device drivers of a large variety of random-access
storage devices, ranging from hard and floppy disks to software emulation
of such devices using fast RAM for data storage. For the purpose of this
section, we will assume one fixed buffer size, equal to the sector size of
the device, and we will assume that device addresses are linearized
over a range such as 0 to Max-Address instead of being constructed from
components such as sector, cylinder and surface numbers.
To go beyond these simplifying assumptions, we could add a service to our
standard low-level random-access device interface:
_______
| READ | -- device.read( buffer, device-address )
|_______|
| WRITE | -- device.write( buffer, device-address )
|_______|
| INFO | -- device.info( infoblock )
|_______|
The new info service for each device would be expected to return information
about that device's disk geometry answering questions such as: What is
Max-Address for this device? Is the sector size fixed, and if so, to what
value? Is the number of sectors per track fixed and if so, to what number,
and is this device divided into multiple surfaces and cylinders, and if so,
how many of each? The answers to these questions may vary from device to
device, creating serious problems for higher level software.
Higher level I/O primitives, such as the standard random-access stream
primitives of Unix or the even higher level stream primitives of C can easily
be implemented on top of this layer, if needed, and these primitives can
hide most of the complexity exposed by the addition of the info service
suggested above.
In addition to constructing physical device drivers that use this
interface, we can also construct virtual devices that use it. For
example, consider the following picture:
                              ___
     _______ ________________|          _______
    | READ  |                |         | READ  |
    |_______|      Disk      |         |_______|
    | WRITE |      Cache     |         | WRITE |
    |_______|________________|         |_______|
                             |___
This shows a disk cache implementation that assumes the availability of
an underlying device supporting this interface, where the cache itself
supports exactly the same interface as the underlying device.
In object oriented terms, both the device interface itself and the
disk cache are subclasses of the same abstract polymorphic class,
and all instances of the cache subclass are implemented in terms of
some instance of the same abstract class. Mathematical type theorists
say this is an illegal use of the type theory on which object oriented
programming rests, but it is a commonplace one!
If interposed between any physical device and a user of that
device, this disk cache can improve the apparent speed of the disk, from
the user's viewpoint, by caching recently used sectors in some area of
main memory, for example, using an LRU replacement algorithm.
The disk cache might, for example, maintain a set of buffers in memory
that hold copies of the most recently accessed disk sectors; on read, if
the desired sector is buffered, no actual disk I/O takes place. On write,
a copy of the data written can be held in the cache for later reading,
and cache buffers can be assigned to sectors on an LRU basis, with all
of this done by the software.
Our focus here is on file systems. In effect, our lowest level model of
an open file is a virtual device that also supports this standard device
interface, although perhaps with a noticeably smaller value of Max Address.
Thus, the following picture applies:
                              ___
     _______ ________________|          _______ _____
    | READ  |                |         | READ  |
    |_______|     Opened     |         |_______|  Physical
    | WRITE |      File      |         | WRITE |  Disk
    |_______|________________|         |_______|_____
                             |___
The implementation of an open file must map read and write requests from
the user into read and write requests on the underlying file. Of course,
the file system need not be implemented directly on a physical device.
The following structure is quite possible:
                        ___                            ___
     _______ __________|        _______ ______________|          _______ _____
    | READ  |          |       | READ  |              |         | READ  |
    |_______|  Opened  |       |_______|     Disk     |         |_______|  Physical
    | WRITE |   File   |       | WRITE |     Cache    |         | WRITE |  Disk
    |_______|__________|       |_______|______________|         |_______|_____
                       |___                           |___
A file system consists of two essentially separate parts, one that creates
opened files in response to open requests, and one that implements opened
files. We will focus on the former first.
Opened File Semantics
Just like the underlying disk or cached disk on which it is implemented,
an opened file supports our two basic operations, read and write.
The fact that it is an open disk file does nothing to change the data
that flows to or from disk with these operations. What it does do is
change the disk addresses used. Thus, the top-level code of the read
and write operations on an open file can be expressed as:
open-file representation and access methods:
    device      -- identity of the device used to implement this file
    description -- some kind of description of this file, serving to
                   distinguish it from other files on the same
                   underlying device
    read( buffer, file_address ):
        device.read( buffer, disk_address( file_address, description ))
    write( buffer, file_address ):
        device.write( buffer, disk_address( file_address, description ))
A user of a bare disk or a disk cache uses real disk addresses. A user of
an opened file uses addresses relative to that file. In effect, these can
be considered to be virtual disk addresses, and all of the address
mapping mechanisms applicable to virtual address translation are equally
applicable to translating file addresses to disk addresses.
Additive linear mapping is the simplest way to translate
virtual to physical addresses. This is illustrated below:
disk_address( file_address, description ) =
    if file_address < description.size
        then return file_address + description.base
        else error
In this case, the description of an open file consists of just two fields,
a base and a limit, where both are sector numbers.
In old minicomputer and early microcomputer operating systems, file systems
were frequently constructed this way. A directory entry consisted of a file
name, a base address and a file size. If this is the only form of file mapping
on a system, the primary weakness shows up when there are large numbers of
small files. As the file system evolves, the disk space grows progressively
more fragmented, until sufficiently large free blocks cannot be found to allow
new files to be created. At this point, the file system must be compacted,
sliding all existing files together and consolidating the free space into one
large free block.
On modern systems, such simple additive linear mapping is commonly called
partitioning, and it is quite common for large disks to be divided into many
smaller virtual disks, each called a partition, and each supporting an
independent file system.
Partitioning is done for a number of reasons:
- To allow partial backups. It is common to divide a disk into a user
partition and a system partition because the system is only changed
occasionally, and therefore only needs to be backed up occasionally.
- To support an old file system. This is an unfortunate reason for
partitioning a disk! As disk drives get larger, they have frequently
exceeded the maximum size anticipated by the designers of older file systems.
Many of today's operating systems were originally designed in an era when
disks with capacities over a few hundred megabytes were unavailable, and the
designs frequently failed to anticipate the availability of larger drives.
Small virtual disk drives can easily support the old file system with
minimal changes!
- To control resource contention. If each subcommunity of users allocates
their files in independent partitions, the system management can control the
impact of misuse of disk space. Well designed multi-user file systems
generally have far more powerful tools for solving this problem, but
partitioning remains in common use for this purpose.
- To allow coexisting file systems. When one hardware system must support
multiple operating systems, it is common to partition at least one of the
disk drives, reserving a partition for each of the operating systems. Thus,
we find PC systems with Linux partitions and DOS partitions. The coexisting
operating systems must cooperate to support this by supporting compatible
disk partitioning schemes; if one file system partitions the disk by
cylinders while the other partitions the disk by recording surface,
coexistence will be difficult!
Table lookup mapping is a common method - this is exactly the method
used with virtual memory address maps, except that it is done entirely
in file system software instead of involving a memory management unit:
disk_address( file_address, description ) = description[ file_address ]
Here, the description of each file is an array, indexed by file address,
giving the corresponding disk addresses.
Each file has its own mapping table; the difficulty with
this is that most files are quite small, while the table size is determined
by the largest legal file. In such a case, most table entries would contain
error indicators. A more reasonable implementation would involve storing
only the non-error entries in the table:
disk_address( file_address, description ) =
    if file_address < description.size
        then return description.table[ file_address ]
        else error
Here, the description of an open file is a record containing the size of the
mapping array, along with that array. This allows a dynamically allocated
array of size just large enough for the file.
For large files, it is not sensible to keep the entire mapping table in main
memory! The obvious place to put it is on disk, and one way to do this
is to put the mapping table in another file. This is recursive, but it
isn't infinite regress if there is a minimum threshold size used for this
method. In that case, the translation code might look something like:
disk_address( file_address, description ) =
    if file_address >= description.size
        then error
    else if description.size = 1
        then return description.address
    else
        sector := file_address div slots_per_sector
        slot := file_address mod slots_per_sector
        description.file.read( buffer, sector )
        return buffer[ slot ]
Here, an open file description has 3 fields, the size of the file, the
disk address of the sector holding that file, if it is only one sector,
or the file holding the table of sectors of that file, if it is larger
than one sector. Each address translation for a large file involves
reading a sector of the description file to get the required address.
Note that this approach is only efficient if a disk cache sits between
the file system and the disk -- if not, an extra disk access would be
required for every sector of a large file read or written. Because
it is highly likely that consecutive sectors will be accessed or that
one sector will be accessed multiple times, perhaps with a read and then
a write, the locality principle operates here, and the use of an
appropriate disk cache will eliminate most of the extra I/O operations.
In fact, the scheme outlined above can also be viewed as storing disk
files as tree structures:
                                   _ _ _
    root (1 sector)               |_|_|/|
                                 __|   |__
                               _|_ _     _|_ _
    description file          |_|_|_|   |_|/|/|
                            ___| | |___    |___
                          _|_  _|_  _|_      _|_
    data sectors         |___||___||___|    |___|
In the above, both the root file and each sector
of the intermediate level description file have been assumed to hold three
disk addresses each. It is far more common to have upwards of 128 disk
addresses per sector of the description file (or files), and instead of
defining tiny files as those of just 1 sector, the base level data structure
can contain the addresses of up to 8 or 10 sectors.
This kind of tree structure does not impose any physical organization on
the sectors of a file, and this is both an asset and a liability. It is
an asset because it means that a file system organized using this kind of
indexing is not subject to serious external fragmentation problems.
It is a liability because it means that sequential access to files on such
a system may be subject to serious latency problems imposed by poor
organization of the file sectors.
A good file system using this type of organization will typically maintain
free-space data structures in such a way that space for files can be allocated
in the same cylinder as the space used to index the sectors of those files,
and so that consecutive sectors of a file are stored in the same cylinder
whenever possible.
Tree structured file systems are quite common, but they are rarely
implemented in the straightforward way suggested above. Instead, they
are usually implemented using fairly awkward special purpose code.
Unix I-nodes
The widely used Unix file system is an example. There, each open file
is represented in memory by a data structure called an I-node.
The I usually stands for either the words information or index.
The classic version of the I-node data structure contains:
- The access rights.
- Owner, date of creation, etc.
- The disk addresses of the first 8 sectors. This allows fast access
to the first few sectors of a file.
- The disk address of the sector containing pointers to the next 128 sectors
of the file.
- The disk address of the sector containing pointers to the sectors
containing pointers to the next 128x128 sectors of the file.
Given that the classic Unix sector size is 512 bytes, this allowed accessing
files up to about 8 megabytes on classic Unix systems. This was sufficient
for machines with 16 bit words, and it was sufficient for the disk technology
available in the early 1970's, but by the late 1970's, it was obviously too
small.
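A C rendering of this classic layout might look roughly as follows; the exact field names and widths varied from version to version, so treat this as an illustration rather than the historical declaration:
    #define N_DIRECT 8

    struct classic_inode {
        unsigned short mode;              /* access rights and file type */
        unsigned short owner;             /* owner, group, and so on */
        unsigned long size;               /* file size in bytes */
        unsigned long direct[N_DIRECT];   /* addresses of the first 8 sectors */
        unsigned long indirect;           /* sector holding 128 sector addresses */
        unsigned long double_indirect;    /* sector holding 128 indirect sectors */
    };
    /* maximum file size: (8 + 128 + 128*128) sectors of 512 bytes each,
       a little over 8 megabytes */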
Modern Unix systems overcome the deficiency of the data structure
outlined above by adding one more disk address to each I-node that
supports a three-level tree, the so-called large-model file system.
This extends naturally to what came to be called the huge model, with a
4-level tree.
The complexity of the Unix I-node data structure is a result of the memory
limited context of early Unix systems, combined with a desire to make the
first few sectors of even the largest files easy to access. Fast access
to the first sector of a file is particularly important under Unix and
related systems because, when a file is executed, the "magic number"
included in the first bytes of the first sector determines whether the
file is object code or is intended for execution by some interpreter.
Fundamental Problems
The two fundamental problems that a well engineered file system must address
are as follows:
Most files are small. Small shell scripts, bits of E-mail, single procedures
and other bits and pieces make up the majority of the files on most systems.
Many of these are significantly smaller than a single sector.
Therefore, a well engineered file system should be able to store a large
number of tiny files.
Most accesses to files made by running programs are to large files. Image
files, sound files, gigantic databases, and similar applications are important!
Therefore, a well engineered file system should support
efficient access to very large files.
File Names and Directory Managers
The primary job of a directory manager is to maintain a mapping between
textual file names, as presented to the open function, and the information
needed to build an open file description record.
The directory itself is the data structure on disk used by the directory
manager in performing this mapping.
The simplest approach to directory management was used on many early computer
systems, including most early floppy-disk based systems. On such systems,
a fixed part of each disk was reserved for the directory, which took the
form of a flat table, for example:
Directory = array [1..64] of record
    name: packed array [0..7] of char;
    start-sector: disk-address;
    sectors: integer;
end record;
Here, as in many crude file systems, a linear additive mapping scheme is used
to describe the disk space occupied by a file. It is trivial to extend such
a directory to include such features as the date of file creation, the length
of the file, in bytes, and similar details.
Hierarchically structured file names
In general, users don't like flat name spaces, and they quickly start
building hierarchic spaces using simple conventions such as the now almost
universal
filename.extension
format. In fact, the name of this file,
/csf/jones/.public-html/opsys/notes/13.html
could as easily be treated as a single long character string in a flat
name space as it can be treated as a description of a path through a
tree. Flat directory structures become unwieldy, though, so most modern
systems support some variant on the tree-structured directory.
Tree-structured directories appear to have originated with proposals put
forth by workers at project Mac for what eventually became the Multics
system. These ideas were published in a paper by Dennis and Van Horn, in
Communications of the ACM, March 1966.
The key idea is to store directory data structures such as that outlined
for a simple flat directory system in files. One bit attached to each
file determines whether that file contains data or a directory. Whether
it is a data file or a directory, all other details of storage allocation
for the file are typically the same. Only the interpretation and the
set of allowed operations on the file's contents changes.
The Unix file system
The Unix file system is a variant on the theme of hierarchically structured
directories. Underlying the user-level directory manager is a primitive
flat file system using integers as file names. The integers are simply indices
into an array of I-nodes, the I-table, allocated out near the center of the
(linearized) range of disk addresses:
     ___________________________________________________
    |super|                |         |                  |
    |block|                | I-table |                  |
    |_____|________________|_________|__________________|
The superblock at the start of the disk contains the disk address of the
I-table and its size, along with free-space data structures. In order to
provide a degree of crash protection, the superblock is actually stored
redundantly. The I-table is in the middle of the disk in order to minimize
seek times in moving the heads from I-table to data and back.
Each I-node has one bit indicating whether the associated file is a directory
or a data file. Other than that, all other attributes such as how storage
is allocated for files are the same, whether it's a directory or a data file.
Under the Unix system, each directory is a sequence of ordered
pairs <file-name, I-number>, where file-name is a variable-length
string and
I-number is an integer. Each directory contains two special entries, one
called "." (dot), which contains the I-number of that directory, and
one called ".." (dot dot), containing the I-number of the parent directory.
Other than these two special entries, the directories of a Unix system are
strictly tree structured.
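In the classic implementation, each directory entry was a small fixed-size record, roughly as follows (later systems moved to variable-length entries):
    /* one entry in a classic Unix directory; an I-number of 0 marks an
       unused slot */
    struct dir_entry {
        unsigned short i_number;   /* index into the I-table */
        char name[14];             /* file name, not necessarily null terminated */
    };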
Unix does allow data files to be referenced from multiple directories.
Thus, the same file may appear under different names in two or more
directories.
The redundancy of the Unix directory structure (self-links and back links)
allows a program called fsck (file system check) to reconstruct the directory
structure in the face of many possible failures. On most Unix systems,
fsck is routinely run as part of the startup script. The redundancy of the
Unix directory structure was not designed with support for fsck in mind;
other systems, notably the Xerox Distributed File System (XDFS), have provided
far greater redundancy, so that the entire directory system can be
reconstructed from the data files themselves.
Under XDFS, the prefix on each sector of each file contains the sector
number of that file relative to the file it is in, along with the file
number of the file (analogous to a Unix I-node number). As a result, the
entire data structure describing mappings from [sector-number,file-number]
pairs to actual files on disk can be reconstructed by scanning the headers
on all the sectors on disk. In Unix terms, the I-table and all subsidiary
data structures can be reconstructed from the representations of the files
themselves.
Under XDFS, sector zero of each file contains the file number of the
directory that references that file, as well as the name of the file and
any related information for the file. Thus, by finding sector zero of each
file on the system (by scanning the entire disk) a strictly tree-structured
directory for the file system can be reconstructed. The net result is
that if some sectors on a disk are damaged, the only possible data loss
is to the data in those sectors. No localized damage will ever destroy the
entire file system. Sadly, few modern file systems have incorporated this
degree of redundancy in order to allow a similar degree of damage survival,
so today, it is common to lose entire files or directories because of very
localized disk failures.
Under Unix, the prohibition on directory structures other than a tree
with back links (..) and self-links (.) is imposed in order to avoid problems
with reclamation of storage when links are deleted. Each I-node contains
a count of the number of incoming links; this is incremented when a link to
that I-node is added, and it is decremented when a link to that I-node is
removed. If the count reaches zero, the file is deleted.
The problem with this scheme is that it can't detect when a directory
becomes inaccessible because the circular and self-links prevent the count
for the directory from going to zero. As a result, Unix uses one command
to remove links to data files (rm) and another command to delete directories
from the directory tree (rmdir). The Cambridge File System (see
Wilkes and Needham, The Cambridge Cap Computer and its Operating System)
eliminates this problem by using dynamic garbage collection to reclaim
unreachable files or directories. Again, this solution, while well proven,
has not been adopted by any commercial file systems in widespread use.
Performance Problems
A successful file system must support large numbers of small files while
supporting fast access to large files. Large numbers of small files requires
allocation of files in relatively small units such as the sector, and this
encourages use of translation tables, N-ary trees, or other similar data
structures to hold the data needed to translate between addresses within a
file and physical disk addresses.
To support fast access to large files, contiguous allocation of disk space
is essential. Many file systems do this, preferentially selecting among
available free blocks so that, when a file is enlarged, new sectors are
accessible with minimum latency. This requires that, when a file is
enlarged, free sectors in the same cylinder are used, if they are available,
and if not, free sectors in adjacent cylinders are used.
Protection
Memory Protection
Memory protection is a key element in
protecting the resources of a single-CPU or shared memory
machine. It is essential that users not be able to arbitrarily
manipulate the variables or code of the system; if I/O is
memory-mapped, memory protection can even be used to control
access to I/O devices.
If access to the memory management unit is protected so that only
the system can change the memory management registers, then the memory
management registers can be considered to be a protected part of the user's
process state.
This trivially allows each user to be given access to a disjoint region of the
system's virtual memory, and it is noteworthy that
this trivial policy can be supported with almost any MMU hardware.
Here's an illustration, using a paged system:
Trivial policy
USER 1 USER 2
____ ____
MMU mapping | o-|----- | o-|-----
registers | o-|--- | | o-|--- |
| o-|- | | | o-|- | |
|____| | | | |____| | | |
| | | | | |
_______ | | | _______ | | |
Pages on | |<- | | | |<- | |
disk or in |_______| | | |_______| | |
main memory _______ | | _______ | |
| |<- | | |<- |
|_______| | |_______| |
_______ | _______ |
| |<- | |<-
|_______| |_______|
To switch between the two users shown here, the system must not only
change their CPU registers (Accumulator, Program counter, etc),
but it must change the contents of the MMU registers.
If the user address space doesn't contain any system code, then
the MMU must have a second set of mapping registers that apply
in system state, or alternately, the MMU must be turned off
in system state. The latter is simple, on machines that allow it,
but it eliminates the possibility of having parts of the system run
in virtual memory.
The former (with two sets of mapping registers) is
easy -- but it may make it difficult for system code to address
arguments to system calls, since they're in the wrong address
space. The alternative of allowing system and user code to both
address the same memory requires that there be some mechanism
that changes the addressability of system pages when system
state is entered -- for example, the MMU could declare that
access to system pages is legal only when the program is in
system state.
Sharing Memory Between Users
A paged memory management unit can do much more than provide each user
with a distinct address space. For example, consider the following:
USER 1 USER 2
____ ____
MMU mapping | o-|----- | o-|-----
registers | o-|--- | | o-|--- |
| o-|- | | | o-|- | |
|____| | | | |____| | | |
| | | | | |
_______ | | | _______ | | |
Pages on | |<- | ->|Shared!|<- | |
disk or in |_______| | |_______| | |
main memory _______ | _______ | |
| |<- | |<- |
|_______| |_______| |
_______ |
| |<-
|_______|
How do users arrange to share pages (or segments)? What if shared
pages are not at the same virtual address? This simple picture does
not answer these questions!
In most Unix systems, read-only code segments are always shared between all
processes that are executing the same program. Shared read-write segments
are allowed under System V Unix, using a scheme borrowed from the 1966
vintage Berkeley Timesharing System. This required the following two system
calls:
Create shared segment ( pages, name, security )
Insert shared segment ( pages, name, security, address )
When a shared segment is created, it is given a name. Unfortunately,
this name is in a flat name-space (it is unstructured) and there
are no tools provided to prevent users from selecting names that
conflict.
Sharable segments have a size, an identifying name, and some
security attributes (on the Berkeley system, a password; on Unix,
a short access control list with the same format as that used for
files).
When a process needs to address a shared segment, the user specifies
the segment, the number of pages desired (which may be less than
the size), and the virtual address where the segment should appear
in the user's address space.
If two different processes insert the same segment at different
virtual addresses, then pointers that refer to part of the segment
from within the address space of one process will not refer to
the same place from within the address space of the other
process.
The Unix System V shmat() kernel call attaches a shared segment
using this model. Under System V, shared segments have 32-bit names, usually
interpreted as integers, selected from a global name space. As a result,
programmers must generally allocate names dynamically because of the
possibility that some other program has already created a segment with some
particular identifier.
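As a rough illustration of how these calls are typically used, here is a
small C fragment; the key value and segment size are arbitrary choices for
the example, not anything required by the interface.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* 0x1234 is an arbitrary key chosen for the example; real programs
           often derive keys with ftok() to reduce the chance of collision. */
        int shmid = shmget((key_t) 0x1234, 4096, IPC_CREAT | 0600);
        if (shmid < 0) { perror("shmget"); return 1; }

        /* Attach the segment; passing NULL lets the system choose the
           virtual address, sidestepping the agreed-address problem. */
        char *p = shmat(shmid, NULL, 0);
        if (p == (char *) -1) { perror("shmat"); return 1; }

        p[0] = 'x';     /* any process attaching the same key sees this */
        shmdt(p);       /* detach when done                             */
        return 0;
    }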
Alternative Approaches to Sharing Memory - share with children
On the now discontinued Encore Multimax computer under the UMAX operating
system, a variant of Unix, there was only one system service used for
memory sharing:
Share( range of addresses )
mark the pages holding the indicated virtual addresses as shared.
Marking a page as shared had no immediate effect. Instead, the effect
was delayed until the process executed the fork()
system call. At that point, all non-shared pages were copied (at least,
that was the user's impression of what happened), while pages marked as
shared remained accessible to both the parent and child.
This scheme allowed pages to be shared between processes only
if the common ancestor of those processes established the
sharing. This was sufficient to allow multiple processes to
cooperate in solving some problem on behalf of a single user,
but it didn't allow sharing between users.
This scheme guarantees that a shared page will occur in the
same place in all address spaces that share it.
On demand paged Unix systems, the trivial implementation of
Fork is expensive -- copying all non-shared pages of a process
requires that all pages be pulled into memory and copied, a
job that may require many disk accesses in a row. To avoid
this, most virtual memory Unix systems use what is called lazy
fork or copy-on-write semantics.
After a fork, all pages are physically shared
but marked as read-only. When a user tries to write to
such a page, the page-fault trap service routine replicates the page
on behalf of the process that tried to change it, making a private
read-write copy for that process. This is yet another type of
soft page fault. The result of this design is that
the expense of the fork operation
is distributed over some number of page faults, and the only
pages that get copied are those that are actually modified.
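A sketch of the decision made by such a page-fault handler is given below.
The page-table fields and the reference count are invented for the example;
a real kernel does considerably more bookkeeping.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct pte {                    /* invented page-table entry          */
        char *frame;                /* the physical page                  */
        int   writable;             /* hardware write permission          */
        int   cow;                  /* software flag: copy on write       */
        int  *refcount;             /* number of sharers of the page      */
    };

    /* Called when a process traps on a write to a read-only page. */
    void write_fault(struct pte *p)
    {
        if (!p->cow)
            return;                 /* a genuine protection violation;
                                       a real kernel would signal it      */
        if (*p->refcount > 1) {
            /* replicate the page for the process that tried to change it */
            char *copy = malloc(PAGE_SIZE);
            memcpy(copy, p->frame, PAGE_SIZE);
            (*p->refcount)--;
            p->frame = copy;
            p->refcount = malloc(sizeof (int));
            *p->refcount = 1;
        }
        p->writable = 1;            /* page is private now: allow writes  */
        p->cow = 0;
    }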
Alternative Approaches to Sharing Memory -- share pages of files
The Multics system had only one operation that supported sharing:
OPEN( disk file name, virtual address )
Map the pages of the indicated file into the process's
address space at the indicated address. Once a disk
file is open, the entire file becomes addressable as
if it had been read into a giant buffer in memory.
This relies on the file system to determine what users are allowed
to open which files.
The Multics system was built on the GE 600, a 36 bit machine, by
a consortium of MIT, Bell Labs and GE. Honeywell bought out GE's computer
division and continued selling Multics systems through the 1970's.
There may still be a few Multics systems running, and other
manufacturers, notably Prime, have copied most of the good ideas
from Multics.
The name Unix is a pun on Multics. Unix was written at Bell Labs
after Bell dropped out of the Multics consortium.
Multics used a paged-segmented model of virtual memory, and each
segment could address one file. Mapping a file into the virtual
address space of a process was the only way to read or write that
file. (Sequential I/O to terminals, magnetic tape and other non-random-access
devices was done through different mechanisms.)
If two users opened the same file at different virtual addresses
(different segment numbers), then major problems would arise with
data structures containing pointers in the segment they shared.
Shared memory under Windows NT is based on the Multics model, in the
sense that all memory sharing under Windows NT is done by mapping shared
files into the virtual memory address spaces of the processes that wish
to share memory. The BSD Unix mmap() system call supports this
model, and this has since been incorporated into System V and other
more modern standards for Unix.
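A minimal sketch of this style of sharing follows, assuming an ordinary
file named shared.dat already exists and is at least a page long; any two
processes that map the same file with MAP_SHARED see each other's stores.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        /* Map the first 4096 bytes of the file into this address space;
           MAP_SHARED makes stores visible to other processes that map
           the same file. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 'x';                  /* update the shared page directly   */
        msync(p, 4096, MS_SYNC);     /* force the change back to the file */
        munmap(p, 4096);
        close(fd);
        return 0;
    }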
Protection Theory
From a theoretical point of view, we can talk about any kind of
object and any kind of user. It is worth asking if
users are
themselves a kind of object, but what matters is that users
act on objects using operations, where the type of each object
may determine the set of applicable operations.
Here, we are not concerned with the nature of objects, other than the
fact that objects may be operated on by users using operations selected
from a finite set. If we adopt the terminology of object-oriented
programming, the set of operations applicable to an object is the
set of methods defined by the object's class.
Files fit this model, with the operations read and
write. Directories fit this model, with the operations make-directory-entry,
lookup-directory-entry and delete-directory-entry. Memory pages or segments
also fit this model, with the operations load and store.
We are also not concerned with the nature of users. From the point
of view of the theory, users may be humans, programs, processes or anything
else that may invoke operations on objects.
For example purposes, it is useful to think of files as the objects and humans
as the users, but it is equally useful to think of pages or segments as the
objects and processes as the users.
Static Protection Models
The access matrix is a universal tool for expressing models of the
state of a protection system at some instant in time.
It shows, for that instant, all users, all
objects, and the set of operations each user is allowed to apply to each
object in the system. Pictorially, access matrices are generally presented
as follows:
Objects
| Cake | Tea |
------+------+-------|
Fred | Eat | Drink |
Users ------+------+-------|
Lucy | Bake | Drink |
| | Brew |
--------------------- Allowed
Operations
The above example illustrates the (rather stereotypical) interaction of
two users, Fred and Lucy, centered around two objects, Cake and Tea.
The operations Eat and Bake apply to cake; Fred can Eat Cake, but not
Bake it, and Lucy can Bake Cake, but not Eat it. The operations Drink
and Brew apply to Tea. From the point of view of the access matrix, neither
the names of users, the names of objects, nor the names of operations
have any meaning other than the explicit relations between users, objects
and operations detailed in the matrix.
Access matrices may be used to control access to any resource,
by any kind of user. Their use is not limited to computers;
for example, they have been used in the Russian Gulag to control
the access of prisoner/laborers to rooms in a prison/factory.
The access matrix describes allowed actions, but it does not
describe an enforcement mechanism. In the case of the Russian
prison, the rules of the matrix were enforced by guards with
guns.
On a computer, one could store the access matrix inside the
operating system, and have the system check the matrix whenever
a user tried to access a resource. The problem with this is
that the matrix is usually quite sparse -- that is, on most
systems, most matrix entries will say "no access".
It is important to note that the access matrix is a static
description of who can do what operations on which resources.
The model must be extended to describe how access rights are
changed, and these extensions frequently differ markedly depending
on how the access matrix is implemented.
Access Control Lists
There are two practical ways to implement the access matrix,
access control lists and capability lists. Each of these
allows efficient storage of the information in the access matrix.
The access control list implementation of the access matrix given above
is:
Access Control Lists
-------- --------
Objects | Cake | | Tea |
-------- --------
| |
-------- --------
Access | Fred | | Fred |
Control | Eat | | Drink |
Lists |--------| |--------|
| Lucy | | Lucy |
| Bake | | Drink |
-------- | Brew |
--------
An access control list, or ACL, is what results when the access
matrix is stored column-wise: a column of the matrix is stored with
each protected object. Each entry in an access control list specifies
a user and the operations that user may perform on that object.
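As a rough sketch, in C with invented types, the column-per-object
representation and the check it supports might look like this:

    #define MAXOPS 8                  /* assumed bound on operations      */

    struct acl_entry {
        int user;                     /* the user this entry applies to   */
        int allowed[MAXOPS];          /* allowed[op] != 0 if op permitted */
    };

    struct object {
        struct acl_entry *acl;        /* one column of the access matrix  */
        int nentries;
    };

    /* Check one cell of the access matrix: may this user apply this
       operation to this object? */
    int permitted(const struct object *obj, int user, int op)
    {
        for (int i = 0; i < obj->nentries; i++)
            if (obj->acl[i].user == user)
                return obj->acl[i].allowed[op];
        return 0;                     /* an absent entry means no access  */
    }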
One problem with access control lists is that they may be quite
large. It is easy to imagine the access control lists for some
widely used files occupying more storage than the data in the
file. The usual solution to this is to introduce named sets of
users -- for example, OTHERS to specify the rights given to all
users not explicitly mentioned in the list.
Example: the Multics file system
Under the classic Multics system, whose file protection scheme was also adopted
by the PRIMOS operating system, each file has an ACL, with access rights such
as the following:
- Read
- Write
- Traverse -- for directories
- Execute -- for data
- EditACL
The read and write rights should be obvious.
The traverse right is used in the context of hierarchical directory
structures.
A user who has no right to read or write a directory can mention that
directory in the path name of a file as long as the user has traverse
rights to that directory -- the user is just passing through, but
may not open the directory to list its contents.
The right to edit an access control list illustrates one approach
to allowing the contents of the access control matrix to be modified.
In a sense, all users with the right to edit the access control list
of an object can be thought of as co-owners of the object. If one
user (the legitimate owner of an object) grants the right to edit
that object's ACL to another user, it is like granting the other
user a power of attorney over the object.
Example: the Unix file system
Under Unix, each file has an ACL, but the ACL is strictly limited in structure.
Each Unix ACL has 3 fixed entries:
- An entry for the owner of the file
- An entry for others in the same group as the owner
- An entry for the general public
The rights associated with each entry are:
- Read
- Write
- Execute/Traverse
The Multics project originated as a joint project between MIT,
GE and Bell Labs in the mid 1960's. In the late 1960's, Bell
pulled out of the Multics project, largely because the technical
staff of the computing group had concluded that Multics was getting
too complicated. Unix was born from this frustration, and the
name Unix is a pun on Multics -- suggesting a scaled down system
designed to meet a scaled down set of goals.
The simple-to-implement ACL scheme of Unix is an example of this.
Unlike the Multics ACL, it is not general, but it meets the needs
of most users of small computer systems. As a result, every
Unix ACL fits exactly in 9 bits of the I-node of the file it
describes!
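Those 9 bits are the familiar rwxrwxrwx permission bits. The following is
a simplified sketch of the check the system performs; a real Unix kernel
also consults the process's supplementary group list and treats the
super-user specially.

    #include <sys/stat.h>
    #include <sys/types.h>

    /* Return nonzero if the given user may read the file described by s. */
    int may_read(const struct stat *s, uid_t uid, gid_t gid)
    {
        if (uid == s->st_uid)                    /* owner entry   */
            return (s->st_mode & S_IRUSR) != 0;
        if (gid == s->st_gid)                    /* group entry   */
            return (s->st_mode & S_IRGRP) != 0;
        return (s->st_mode & S_IROTH) != 0;      /* others entry  */
    }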
Multics was intended to meet the needs of a general information
utility, and as such, it had to deal with the problems of large
numbers of users, including users from competing organizations.
The generality of the Multics access control list is well justified
by this environment, and the success of PRIMOS in meeting the needs
of companies like Compuserve or The Source (modern examples of
computer utilities) amply illustrates the appropriateness of the
Multics design.
When Unix systems are extended beyond single departments, the
short ACL of Unix becomes a problem. Distributed versions of Unix
such as the Apollo Domain operating system and its successors
have been forced to add general ACL's back onto the Unix file model
in order to meet the demands of the large user base that such a
system can serve.
Capability Lists
As mentioned above, capability lists are the alternative to
access control lists.
The capability list implementation of the sample access matrix given above
is as follows:
------ --------------------
| Fred |-| Tea | Cake |
------ | Drink | Eat |
--------------------
------ --------------------
| Lucy |-| Tea | Cake |
------ | Drink | Bake |
| Brew | |
--------------------
Users C-lists
A Capability List is the dual (in the topological sense) of an
access control list. Instead of storing one column of the access
matrix with a protected object, a row of the access matrix is stored
with the user.
A capability list entry, known as a capability, consists of the
name of an object and the rights one user has to that object.
This does not change the rights a user has to a resource, and it
does not change the set of policies that can be implemented, but
it does change the economics of certain operations.
With a capability-list based protection mechanism, it is clear that
users should not be granted the right to arbitrarily edit their own
capability lists! Instead, it is common to consider the capability
list itself to be an object to which rights can be granted. Thus,
Fred in the above example might have the right to edit Lucy's
capability list. Typically, users might have the right to move
capabilities from any list they can read to any list they can write,
but users would not be able to modify a capability, except by
removing access rights from it.
Capability Based Addressing
The most important application of capability lists in modern computer
systems is to addressing. With capability-based addressing, each address
is formed from two parts, as follows:
ADDRESS
__________________
|_______|__________|
| |
| Index into an addressed object
|
Index into a capability list
used to select an object
Capability based addressing, as such, was invented for the Chicago
Magic Number computer, a paper machine that was described by people at
the University of Chicago in the late 1960's. Capability lists
were invented by people involved with the Multics project, although
Multics was not itself thought of as being based on this idea.
Capability based addressing solves some interesting problems.
For example, on Unix, users can frequently name objects that they
may not access, and users may find themselves in the odd situation
where they own a file (and pay for the storage it occupies) yet
they can't address it because they have no rights to the directory
where that file is listed.
Capability based addressing solves this problem by associating
the right to access an object with the ability to name it. Users'
names for objects are always relative to some capability list, and
any global absolute name for an object is hidden inside the
capability and is never visible to the user.
Examples
The most common use of capability based addressing is in paged or
segmented virtual memory address translation. Each entry in a page
table clearly indicates both the object to which that entry grants
access and the access rights granted, and the page table associated
with each user is the page table for that user.
VIRTUAL ADDRESS
__________________
|_______|__________|
| |
| Word in page
|
Index into the process's C-list for pages
the objects referenced by capabilities are pages
Note that it is difficult to imagine an efficient implementation of paged
virtual address translation using one access control list per page. This
would require either an associative memory or a search of some kind with
each memory reference. In contrast, a simple table lookup suffices to
check the access rights in a conventional (capability list based) virtual
address translation system.
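A sketch of that table lookup, with invented field names, is shown below.
A real MMU does this in hardware, but the logic is the same: the high bits
of the address select a capability, the rights are checked, and the low
bits index into the object.

    #include <stdint.h>

    #define PAGE_BITS 12                    /* assumed 4096-byte pages    */
    #define PAGE_SIZE (1u << PAGE_BITS)

    enum { RIGHT_READ = 1, RIGHT_WRITE = 2 };

    struct capability {                 /* one C-list (page table) entry  */
        uint32_t frame;                 /* physical frame of the object   */
        uint32_t rights;                /* access rights granted          */
    };

    /* Translate a virtual address using the process's C-list.  Returns
       the physical address, or -1 if the access is not permitted. */
    int64_t translate(const struct capability *clist, uint32_t nentries,
                      uint32_t vaddr, uint32_t need)
    {
        uint32_t index  = vaddr >> PAGE_BITS;       /* which capability    */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* where in the object */

        if (index >= nentries)                      /* no such capability  */
            return -1;
        if ((clist[index].rights & need) != need)   /* insufficient rights */
            return -1;
        return (int64_t) clist[index].frame * PAGE_SIZE + offset;
    }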
Another common example of the use of a capability list is in accessing
open files. Even when access control lists are used to control the right
to open a file, once the file is open, it is common to simply enter the
file description into a list of open files attached to a process. Files
in this list are referenced not by name, but by their position in the
list, and the entries in this list indicate not only what file is involved
but how it was opened -- that is, what access rights apply. Clearly,
the list of open files is a capability list.
Read( f, buffer, address )
| |
| address of a block in the file
|
index into the process's open file C-list
the objects referenced by capabilities are files
These two examples illustrate a problem with the way capabilities
are used in modern systems. Instead of a single capability list
per user, where that list describes all the resources the user is
allowed to access, we tend to give users a collection of lists,
one list for each type of resource. This is awkward.
Some systems have been built that solve this problem, notably
the IBM AS/400 (and its predecessor, the System/38). Another commercial system
in this class is the Plessey System 250. Experimental systems that
have solved this problem include the Cambridge CAP system and the
Carnegie-Mellon Hydra and C.mmp systems. These will be discussed
later.
It is worth noting that, on Multics, when a file is opened, the
sectors of that file become visible as a segment in the virtual
address space of the process that requested the open operation.
Thus, for Multics, there is no distinction between the capability
list describing the segments a process may access and the capability
list describing the files the process may access.
Domains and Domain Switching
The domain of a process is the set of all objects that process
may manipulate. In a simple capability based system,
Domain = C-list
In the abstract, domain switching involves a change of the
current working C-list.
Domain switching is very important! It is very common, and systems
must control it.
For example, when a trap occurs, the system typically changes from
a user domain to the system kernel's domain, and on return from trap,
the system switches back to the user's domain.
The entire reason for introducing two states in order to build a
simple protected system is to create two domains, a system domain,
that includes all objects, and a user domain, that includes only
some objects.
One of the most complex domain switching mechanisms was introduced
on the Multics system. On this system, each process has a level
between 0 and 16, with level 0 being the most privileged. In addition,
each segment has a level, and when a process attempts a memory access,
the hardware checks the process level against the level of the accessed segment.
A Multics memory access is legal if the process is running at a level at
least as privileged as the segment (that is, at a numerically lower or
equal level), and illegal if the process is less privileged than the segment.
The Multics hierarchy of domains is a generalization of the two
domain model of the primitive systems discussed previously. Multics
allowed the operating system itself to be subdivided, so that only
the process manager operated at the most privileged level, level 0. The
virtual memory manager was at level 1, accounting was at a more
restricted level, things like line-printer spoolers were even more
restricted, leaving a variety of levels available to users.
A user might run undebugged code at level 15, so that bugs in the
code wouldn't damage other parts of the user's program.
To further complicate the Multics model, some pages can be marked
as gateways. A procedure call to address zero in a gateway page
is the only legal memory reference from a process running at a
numerically higher level.
This causes the current level of the process to be
pushed on the stack along with the return address.
The called procedure then runs at the level of the
segment containing it, and the level is reset to the
original level when the procedure returns.
A Multics gateway is the generalization of a trap in a two-level
security system. In addition to allowing system calls to run at
the system level, it allows users to divide their code into more
secure and less secure segments.
For example, consider a system where Joe, one of the users, owns
a private database (say Chem Abstracts), and wishes to sell access
to that database on a per-use basis. This means that Joe can't
allow direct access to the database, he only wants other users to
access it through his access code (which produces an appropriate
billing record every time someone calls it).
Joe would simply arrange for his access code to run at a low
numbered level (also called an inner ring), and he would require
that users wishing to run his code do so from a higher numbered
level (or outer ring), crossing a gateway as they call his code.
The above kind of consideration was a natural outcome of thinking
about a computer utility -- the motivation behind Multics.
Note, however, that the strictly hierarchic scheme of Multics does
not solve some problems. For example, it fails to solve the problems
of two mutually suspicious users. Either one or the other must be
given greater access rights if code from both users must share
an address space.
Deadlock
Introduction
Deadlock occurs when two or more processes
are blocked, where each process is blocked
awaiting some action by another blocked
process.
A process might be blocked awaiting a message from another process,
or it might be blocked awaiting a resource currently held by another
process. If all the processes in a deadlocked group of processes
are blocked awaiting messages, the deadlock is described as a
communication deadlock; if all processes in a deadlocked group
are awaiting resources, it is a resource deadlock. Mixed cases
where some processes in the group are awaiting resources and others
are awaiting messages are quite possible.
In either case, whether resources or messages (or both) are involved,
the key element in any deadlock is that there is a cyclic wait.
In some older texts, the possibility of message passing is completely
ignored, and deadlocks are defined entirely in terms of resource
contention.
Resource Deadlocks
Deadlocks resulting from resource allocation have been widely studied
since the late 1960's. Three distinct issues have been raised in this
context:
- Deadlock Detection
- Deadlock Resolution
- Deadlock Avoidance
Once a deadlock is detected, it must be somehow resolved. Deadlock
resolution is frequently destructive -- only fault tolerant
applications can survive it. Therefore, if it is possible, deadlock
avoidance is better.
Deadlock Detection
Since a deadlock involves a cycle of processes waiting for
something, the standard algorithms for deadlock detection convert
the problem into graph theoretic terms and then use various cycle
finding algorithms on the resulting graph.
The following notation is traditionally used in graph theoretic
formulations of the deadlock detection problem:
Allocated Process that
resource has the resource
_ _
|_|------------->(_)
Waiting Resource the
process process waits for
_ _
(_)------------->|_|
Waiting Process that
process should send message
_ _
(_)------------->(_)
The graph has vertices for each process (round) and resource (square).
The graph has edges representing the waits for relation. A process
can wait for a resource, a resource can wait to be released by a
process, and (in message passing systems) a process can wait for a
message from some other process. The latter is a modern extension
to this old notation.
Here is an example "wait for graph" describing 4 processes, A, B, C
and D, and 4 resources, X, Y, Z and W. A is waiting for X, but B has
both X and Z. C has W and is waiting for Y, and D has Y and is waiting
for W. Does a deadlock exist?
Processes Resources
_ _
A(_)--------->|_|X
__/
/
_ / _
B(_)<=== ->|_|Y
\/ /
/\ /
_ ____/ \/ _
C(_)<- /\_|_|Z
\ /
\/
_ /\____ _
D(_)<-- -->|_|W
\_____/
There is a deadlock involving processes C and D and resources W and Y.
Detecting a cycle in a graph involves traversing the cycle! Thus,
the cost of deadlock detection is related to the number of vertices
in the cycle, which could be as large as the number of vertices in
the graph. As a result, deadlock detection algorithms scale poorly,
growing in expense as the system grows in size.
As long as the only wait operations involve processes waiting on
resources, the deadlock graph is formally a bipartite graph, that
is, a graph where there are two sets of vertices (round and square,
here), and where all edges either connect a vertex in one set with
a vertex in the other. If processes may wait for messages, the
graph is not bipartite, but this has no effect on the relevant
algorithms.
A Review of Elementary Graph Algorithms
Basic graph algorithms are a standard part of undergraduate
courses on data structures!
A graph consists of a set of vertices (processes and resources, in our
example), and a set of edges (the arrows connecting processes and resources
in our example). In general, edges need not have direction associated
with them, but in our example application, we are only interested in directed
graphs, where edges have direction.
Many of the basic questions we will need to resolve about graphs rest
on the same basic meta-algorithm for graph traversal:
Initially, all vertices and edges are unmarked
R is some vertex, to be used as the root
Mark R
S = { R }
Repeat
Remove a vertex V from S
For each V' directly reachable from V,
If V' is not marked
mark V'
put V' in S
mark the edge <V,V'>
endif
endfor
Until S is empty
At any point during the execution of this meta-algorithm, the set S
contains the vertices "on the frontier" of the graph being explored.
Marked vertices that are no longer in S have been fully explored, while
unmarked vertices not yet in S remain to be visited.
The nature of the markings put on vertices and edges leads to variations
in the results obtained from this meta algorithm. The order in which
vertices are removed from S is the other primary variable.
If S is managed on a LIFO basis, this algorithm performs a depth-first
traversal of the graph, following the longest possible path from the root
before returning to visit other parts of the graph. If S is managed on
a FIFO basis, this algorithm performs a breadth-first traversal of the
graph, visiting all vertices reachable from vertices already in S before
visiting others beyond them. It is also possible to manage S as a
priority queue, using, for example, the value of the marking on each
vertex as a priority.
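As a concrete, if simplified, rendering of the meta-algorithm, here is a
sketch in C over an adjacency-matrix graph. The set S is kept as an
explicit stack, so this particular version does a depth-first traversal;
replacing the stack with a queue would make it breadth-first.

    #define MAXV 64                   /* assumed bound on vertex count  */

    int adj[MAXV][MAXV];              /* adj[v][w] != 0 iff edge v -> w */
    int nvertices;

    void traverse(int root)
    {
        int marked[MAXV] = { 0 };
        int stack[MAXV];              /* the set S, managed LIFO        */
        int top = 0;

        marked[root] = 1;             /* mark R                         */
        stack[top++] = root;          /* S = { R }                      */

        while (top > 0) {             /* until S is empty               */
            int v = stack[--top];     /* remove a vertex V from S       */
            for (int w = 0; w < nvertices; w++) {
                if (adj[v][w] && !marked[w]) {
                    marked[w] = 1;    /* mark V'                        */
                    stack[top++] = w; /* put V' in S                    */
                    /* the edge <v,w> joins the spanning tree here      */
                }
            }
        }
    }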
The set of marked edges produced by this algorithm is always a tree that
spans every vertex in that part of the graph reachable from the root
vertex R. That is, it is a spanning tree of the reachable part of
the graph. Not all parts of all graphs are reachable from any vertex,
so this is not always a spanning tree of the whole graph. Note that,
depending on the traversal scheme used (that is, depending on the
management of the set S) many different spanning trees may be produced
by the same meta-algorithm (unless, of course, the graph is itself tree
structured, in which case there is never more than one possible spanning tree,
and that is the graph itself).
NOTE: Other variants on the above meta-algorithm will show up
repeatedly in the remainder of this course! The subject of
spanning trees will also show up again and again!
Given a directed graph rooted at vertex R, it is not hard
to modify the depth-first recursive tree traversal algorithm to traverse the
graph and search for cycles. The modified algorithm only marks those edges
and vertices on the path from the root to the leaf currently being examined
-- thus, as it follows a path through the tree, it marks edges and vertices,
and as it backtracks in its traversal, it unmarks edges and vertices. If
this modified algorithm ever encounters a marked vertex as it walks deeper
into the graph (as opposed to encountering such a vertex on backtracking),
it has found a cycle reachable from the root R.
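A sketch of that modification follows: vertices on the current path carry
a mark that is removed on backtracking, and meeting a marked vertex while
walking deeper means a cycle is reachable from the root.

    #define MAXV 64

    int adj[MAXV][MAXV];              /* adj[v][w] != 0 iff edge v -> w   */
    int nvertices;
    int on_path[MAXV];                /* marked iff on the current path   */

    /* Returns 1 if a cycle is reachable from vertex v, 0 otherwise. */
    int has_cycle(int v)
    {
        on_path[v] = 1;                       /* mark on the way down     */
        for (int w = 0; w < nvertices; w++) {
            if (!adj[v][w])
                continue;
            if (on_path[w])                   /* a marked vertex, reached */
                return 1;                     /* while walking deeper     */
            if (has_cycle(w))
                return 1;
        }
        on_path[v] = 0;                       /* unmark on backtracking   */
        return 0;
    }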
Deadlock Graphs
It is worth noting that the details of the system being studied determine
the nature of the graph algorithm needed for deadlock detection!
In a resource model, where processes
wait for resources, if no process may wait for more than one resource at a
time, the number of outgoing edges from any vertex will never exceed one, so
the deadlock graph has a trivial structure -- it consists of a number of
chains or loops. This special case is called the single resource model.
Running
process
_
(_) Running
process
Waiting that owns
process Resource resource
_ _ _
(_)-------->|_|-------->(_)
Waiting Running
process process
Waiting that owns Another everyone
process Resource resource resource waits for
_ _ _ _ _
(_)-------->|_|-------->(_)-------->|_|-------->(_)
In this simple case, deadlock detection doesn't involve a general graph
algorithm. The constraint that no process waits for more than one resource
at a time translates to the constraint that no process vertex in the graph
has more than one outgoing edge. Similarly, the constraint that no resource
is owned by more than one process translates to the constraint that no
resource vertex has more than one outgoing edge. As a result, we can
detect deadlocks by following the single chain of outgoing edges from a
blocked process, marking vertices as we go; if the chain has an open end,
the process is not deadlocked; if we encounter a marked vertex, there is
a deadlock. In any case, we always finish by unmarking all the vertices
we marked.
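A sketch of this chain-following check is given below, with invented
tables recording who waits for what and who holds what; a local "seen"
array plays the role of the marks.

    #define NPROC 32
    #define NRES  32

    int waits_for[NPROC];    /* resource process p waits for, or -1       */
    int owner[NRES];         /* process that holds resource r, or -1      */

    /* Returns 1 if blocked process p is part of a deadlock, 0 otherwise.
       Each vertex has at most one outgoing edge, so the graph algorithm
       degenerates into following a single chain. */
    int deadlocked(int p)
    {
        int seen[NPROC] = { 0 };

        while (p >= 0) {
            if (seen[p])
                return 1;             /* the chain closed on itself       */
            seen[p] = 1;
            int r = waits_for[p];     /* resource p is blocked on         */
            if (r < 0)
                return 0;             /* p is running: the chain is open  */
            p = owner[r];             /* follow the chain to the holder   */
        }
        return 0;                     /* chain ended at a free resource   */
    }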
Note that the unconstrained resource model allows a process to
wait for a number of resources, for example, using the operation
block P until all resources in set S can be claimed. If the set
S contains resources X and Y, we can replace this general wait operation
by block P until X can be claimed followed by block P until Y can
be claimed. This allows many more general deadlock problems to be
translated into single-resource deadlock problems where deadlocks can be
detected using this trivial chain-following deadlock detection algorithm.
Unfortunately, in some systems of processes, the straightforward translation
from the more general primitive to the single-resource primitive will
sometimes convert a system that is deadlock free into a system that is prone
to deadlock. The reason for this is that the
block P until all resources in set S can be claimed primitive
claims none of the resources in S until it can claim all of them, while
the sequential claim one at a time approach will have some states
in which a process holds a claim on some subset of the resources in S.
If two processes are competing for the same set of resources and they claim
them in different orders, there will be a deadlock under the single resource
model that would not have been present under the more general model.
The most common application of the single resource model in real systems is
in database systems, where resources correspond to database records that must
be claimed before some update can be performed. Some database systems use
exactly this trivial chain following algorithm to detect deadlocks.
Deadlock Resolution
Once a deadlock is detected, an important question must be answered:
what should be done about it? The classic solution for resource models
is to break the cycle of the deadlock by aborting one of the processes and
freeing its resources. This is only applicable to systems
using a resource model, and of course, it is only applicable in systems
where we can afford to abort processes!
What process should be aborted?
Ideally, the process to be aborted should be the one where that act
will cause the least damage, that is, the least critical process.
Few systems provide information sufficient to determine how critical
a process is, however. The common alternative of aborting the
process with the lowest priority approximates this, since high
priority processes are frequently more critical than low priority
processes, but this isn't necessarily so. Priority depends on real-time
constraints! The audio entertainment process must meet tight real time
constraints but it isn't very critical compared to monitoring the level of
fuel in the fuel tank, something that is only done occasionally but is very
important.
An alternative is to delete the process that closed the cycle, the last
process to block in the group of processes involved in the deadlock.
Unfortunately, just because a process was the one that completed the cycle
does not mean that it is the least critical or the most appropriate victim.
If a process can be coded to anticipate the possibility of allocation
failures, for example, by having it deallocate resources it has already
claimed and begin again, we can have the resource claiming operation return
a failure code if that operation would have closed the cycle in the resource
graph, leading to deadlock. This approach is taken in several database
managers.
The classic solution of aborting a process is clearly not oriented towards
communication deadlocks! Aborting a process that is waiting for a message
does nothing to eliminate the deadlock, since the only way to resolve the
deadlock may be for that same process to send a message.
Aborting a process is also very undesirable in critical applications,
where all processes are expected to run forever or until the power fails.
Deadlock Avoidance
If deadlock recovery is difficult, we are far better off if we can construct
our applications in such a way that we can prove that they are deadlock free.
This may be done statically, or it may be done, in some more complex settings,
by dynamic algorithms.
There is a trivial solution to the problem of deadlock avoidance in a pure
resource allocation context:
All you need to do is prohibit processes from incrementally allocating
resources.
This trivial solution has been widely used, particularly back in
the days of batch operating systems taking input from a stream
of punched cards. These systems typically required
each process to declare all resources it might want as a field
of the job card -- the head card on the card deck. This is
clumsy, and it requires that the process hold all resources it
might need for the duration of its run even
if it only needs some of them briefly.
Claiming Resources in a Strict Order Prevents Deadlock
A less trivial solution to the deadlock avoidance problem is to require that
all resources be acquired in a strictly predetermined order. For example,
any process that requires both the modem and the laser printer must acquire
the modem first and the printer second. As long as this rule is enforced, it
will be possible to guarantee that the system is deadlock free with regard
to these two resources because it will be impossible for one process to
claim the modem and then request the printer while the other claims the
printer and then requests the modem.
It is always possible to impose a total ordering over all resources.
Any ordering will suffice, so long as all processes adhere to it. For
example, resources may be ordered in alphabetical order by resource class,
and then in numerical order, within each resource class, by resource address.
Many applications can be rewritten to comply with such an ordering
constraint, but many others cannot.
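When no more natural ordering is available, one common trick is to order
mutual exclusion locks by their addresses; here is a sketch using POSIX
threads.

    #include <pthread.h>
    #include <stdint.h>

    /* Claim two locks in a globally consistent order (by address), so
       that no two processes can claim them in opposite orders.  This is
       a sketch of the ordering rule, not a complete application. */
    void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        if ((uintptr_t) a < (uintptr_t) b) {
            pthread_mutex_lock(a);
            pthread_mutex_lock(b);
        } else {
            pthread_mutex_lock(b);
            pthread_mutex_lock(a);
        }
    }

    void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        pthread_mutex_unlock(a);      /* release order does not matter    */
        pthread_mutex_unlock(b);
    }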
Atomic transactions
Introduction
The usual assumption surrounding such assignments as
A := 0; A := -1
is that assignment is atomic, that is, it either happens or it
doesn't. If the above two statements are executed in
parallel, or if one statement is interrupted in order to complete
the other, a programmer will typically expect only two possible
outcomes. These are expressed by the following postcondition:
(A = 0) or (A = -1)
What if the variable A is 64 bytes long on a machine where all loads and
stores are in terms of 32 bit quantities? Or, on a small microcontroller,
what if the variable A is 16 bits long, on a machine where all load and
store operations operate on 8 bit quantities? Suddenly, there
are a number of other possible outcomes! Each of the following outcomes
is now possible on the 8/16 bit microcontroller:
(A = 0) or (A = -1)
or (A = 255)
or (A = -256)
If the low half of the word A is A0 and the high half is A1, the following
sequence of assignments will give the value 255:
Process 1 Process 2
A0 := 0;
A0 := 255;
A1 := 255;
A1 := 0;
The usual solution to this is to use some kind of mutual exclusion mechanism,
so that the assignments to A0 and A1 are done in a critical section, but
this cannot prevent failures. Consider the following:
Process 1 Process 2
A0 := 0;
A1 := 0;
A0 := 255;
< A1 := 255; > -- fails
Such a failure, affecting one process but not the other, is easily imagined
on a multiprocessor, but even on a uniprocessor, if two different users are
running code that shares a memory segment, and then one user hits control-C
to kill a process, this kind of consequence is possible.
The consequence of this partial failure is that the illegal value 255
still appears in the shared variable, so that, even if the mutual exclusion
mechanism can recover from the failure of one process inside the critical
section, and grant entry to another, the shared data is corrupted.
This problem shows up whenever the value of a shared variable can
only be updated one piece at a time -- for example, if A is a logical
record on disk that consists of multiple physical records.
The problem may occur not only on parallel computers, but also
on purely sequential machines when there is asynchronous
context switching.
Analogous problems can occur in the absence of context switching, if a
sequence of assignments is interrupted by a failure. Consider:
A0 := 0;
A1 := 0;
... -- some time later
A0 := 255;
< A1 := 255; > -- fails
Here, if the data being updated is on disk or in "persistent RAM" such as
RAM with battery backup or old-fashioned core memory, the failure can
leave an illegal value in memory, a value that was only partially updated.
Pointer Assignment
One way to assure that a shared variable is updated on an all or none basis
is to perform the updates "off line". This requires a way to atomically
move the updated copy into public view. Consider the following scheme:
The publicly visible copy of a composite shared variable is pointed to
by a pointer; a process wishing to update the composite variable updates
a copy, and then atomically updates the pointer. We assume that updates
to pointers are atomic -- that is, that pointers occupy exactly one hardware
word, where a word is defined as the unit of atomic update in memory.
P a pointer to the shared value
Inspect(V):
V := P^; -- return the value pointed to by P
Update(V):
new( temp );
temp^ := V;
P := temp; -- make P point to the desired value
The above code uses a newly allocated cell from the heap for each update
to the atomically updated variable. This generality is unnecessary.
If N processes compete for access to the shared variable, it must be possible
for each process to independently update its own private copy of the
variable; thus, the atomically updatable pointer must have at least
N possible values, corresponding to the private variables of each process.
Therefore, if the machine allows atomic updates of n bit quantities in
memory, this scheme allows up to 2^n processes to compete.
In fact, we do not need a completely general heap! If each process has
two cells able to hold a single value each, and if each process uses these
cells in turn, updating the least recently updated cell each time an update
is required, and finishing the update with a pointer assignment, this scheme
will work. Therefore, pointer assignment schemes can support atomic update
on a 4-bit machine with as many as 8 processes, or on an 8-bit machine with
as many as 128 processes.
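On a modern machine the same idea can be expressed directly with C11
atomics. The sketch below uses the general heap version for brevity, and
it deliberately leaks the old copy rather than introduce a reclamation
scheme such as the two-cells-per-process arrangement just described.

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    struct value { char data[64]; };       /* a large composite variable */

    static struct value * _Atomic P;       /* pointer to the public copy */

    /* Inspect: return a pointer to a consistent, complete value. */
    struct value *inspect(void)
    {
        return atomic_load(&P);
    }

    /* Update: build the new value off line, then publish it with one
       atomic pointer store.  Readers see either the old or the new copy,
       never a half-written mixture. */
    void update(const struct value *v)
    {
        struct value *temp = malloc(sizeof *temp);
        if (temp == NULL)
            abort();
        memcpy(temp, v, sizeof *temp);
        atomic_store(&P, temp);            /* make P point to the value  */
    }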
Stable Storage
How do you make an assignments to a multi-byte object look atomic without
using pointer assignment and without using memory proportional to the number
of processes?
This is easy if there is no problem with failure, but it is harder when
failures are possible!
Leslie Lamport developed algorithms for updating a shared variable
in the face of failures. This assumes that a mutual exclusion algorithm
guarantees that only one process tries to update the variable at a time,
and in this context, it guarantees that the result of the update will
be correct, in the face of failure.
The basic operations offered by Lamport's
stable storage algorithms are:
inspect( V ) returns value (or V.inspect() )
update( V, value ) ( V.update( value ) )
Lamport's stable storage rests on a new and redundant
representation for the stored value and it rests on two
procedures (or protocols) for inspecting and updating the
stored value.
Conceptually, it is reasonable to use a client-server world
view and imagine the inspect and update procedures as
being the services offered by the server to clients. If the
server fails, we can easily start a new server, as long as the
variable itself is stored in such a way that it survives the
failure.
A stable variable V is represented as a record where every field
is duplicated:
Copy1 -- the value stored
Time1 -- the time it was stored
Sum1 -- a checksum over the value and time
Copy2
Time2
Sum2
There are two copies of the value, and for each copy, there is a record
of the time at which the value was last updated and a record of the checksum
computed as of the last update.
The fault tolerance of Lamport's scheme improves if the two
(or more) copies of the <value, time, checksum> tuple are
stored in such a way that failures only destroy one copy at
a time. Thus, they should be in different memory modules, or
on different disk drives. If the system is physically
distributed, they should be geographically separated and
connected to different computer systems.
The update and inspect operations must be performed inside critical
sections, and if failure is to be survived, these critical sections
must use some algorithm that can detect failure of a process inside
a critical section and release any associated mutual exclusion semaphores.
Procedure Update( V, value )
begin
update time
V.Copy1 := value
V.Time1 := time
V.Sum1 := Checksum( value, time )
-- wait for the assignment to really finish
V.Copy2 := value
V.Time2 := time
V.Sum2 := Checksum( value, time )
-- wait for the assignments to really finish
end
The utility of this code relies on keeping two (or more)
copies of the value, where no failure will corrupt more than
one copy at a time. The wait for one update to finish before
starting another update is very important, in this regard.
The assignment statements shown above will not, typically,
execute instantaneously. On a cache machine, for example, there
may be a delay before the values are written back to main memory.
If disks are involved, there may be a considerable time between
the issuing of a write request and the actual write to disk.
If disk cache software is involved, it is even possible that
the write to disk will never happen.
This illustrates something that was pointed out earlier, in the
context of the discussion of disk caches. In general, fancy
caching algorithms can improve performance, but they can get in
the way when fault tolerance is the goal! We must have a way
to force cached values out into the real memory before we can
rely on this stable storage algorithm. In systems derived from
UNIX, the kernel primitive that does this is fsync().
Procedure Inspect( V )
begin
-- optionally start by reading all fields from disk
if V.Sum1 = checksum( V.Copy1, V.Time1 )
if V.Sum2 = checksum( V.Copy2, V.Time2 )
if V.Time1 > V.Time2
value = V.Copy1
else
value = V.Copy2
endif
else
value = V.Copy1
endif
else
if V.Sum2 = checksum( V.Copy2, V.Time2 )
value = V.Copy2
else
-- failure --
endif
endif
return value
end
This code is fairly simple -- there are four cases to
consider:
1) There were no errors
In this case, the checksums on both copies will be valid,
both will be the same, and they will have the same
timestamp.
In this case, either copy may be returned.
2) There was a failure such that update managed to write one
updated copy of the data, but it didn't get to start
writing the other updated copy.
In this case, the checksums on both copies will be valid,
but their timestamps will differ. Return the copy with
the most recent timestamp.
3) There was a failure during one of the updates, or there
was a failure that corrupted one of the stored values between
updates.
In this case, the checksum on the copy in question will
be invalid. The other copy should still be OK and can be
returned.
If the failure occurs during the write of the second copy,
it will be as if the write was successful because the first
copy will have already been written and will be available
for use.
If the failure occurs during the write of the first copy,
it will be as if the write never occurred.
4) A failure or failures destroy all copies. There is nothing that
can be done about this, but it takes multiple failures to cause
this kind of damage, and the probability of this multiple failure
can be made arbitrarily low by storing more copies in more widely
distributed locations.
Note that the stable storage update procedure may be made to be
even more reliable if it does a write-read-verify cycle on each
copy it writes out, that is, it writes each to memory and then reads
it back and verifies correct storage before continuing.
Also, note that no checksum algorithm can detect all possible errors in
the stored data. Checksums can be made arbitrarily error resistant,
but they cannot be made perfect. For any checksum algorithm, there
is some combination of errors that will defeat it!
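A sketch of the two procedures in C is given below. The checksum is
invented for the example, and force_out() stands in for "wait for the
assignment to really finish"; for copies kept on disk it would be a write
followed by fsync(), with the two copies on different devices.

    #include <stdint.h>

    struct copy {
        int32_t  value;              /* the value stored                  */
        int32_t  time;               /* the time it was stored            */
        uint32_t sum;                /* checksum over value and time      */
    };

    struct stable { struct copy c1, c2; };   /* the duplicated record     */

    /* An invented checksum; any reasonable checksum will do. */
    static uint32_t checksum(int32_t value, int32_t time)
    {
        return 0x9e3779b9u ^ (uint32_t) value ^ ((uint32_t) time << 16);
    }

    /* Stand-in for forcing the copy out to its storage device. */
    static void force_out(struct copy *c) { (void) c; }

    void update(struct stable *v, int32_t value, int32_t time)
    {
        v->c1.value = value;  v->c1.time = time;
        v->c1.sum = checksum(value, time);
        force_out(&v->c1);           /* copy 1 is safely stored first     */
        v->c2.value = value;  v->c2.time = time;
        v->c2.sum = checksum(value, time);
        force_out(&v->c2);
    }

    /* Returns 0 and sets *value on success, -1 if both copies are bad. */
    int inspect(const struct stable *v, int32_t *value)
    {
        int ok1 = v->c1.sum == checksum(v->c1.value, v->c1.time);
        int ok2 = v->c2.sum == checksum(v->c2.value, v->c2.time);
        if (ok1 && ok2)
            *value = (v->c1.time > v->c2.time) ? v->c1.value : v->c2.value;
        else if (ok1)
            *value = v->c1.value;
        else if (ok2)
            *value = v->c2.value;
        else
            return -1;               /* multiple failures: nothing usable */
        return 0;
    }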
Transactions
Atomic assignment and inspection are not enough to guarantee that
a shared variable is used correctly through possible fault sequences! For
example, consider the problem of transferring money from one checking
account to another:
Jones pays Smith some Amount:
Jones := Jones - Amount
Smith := Smith + Amount
To avoid accidentally creating or destroying money, either both updates
must be made before an error occurs, or neither must be made!
In the general case, there may be three machines here, one belonging
to a bank where Smith has an account, one belonging to a bank
where Jones has an account, and a third machine that mediates the
transaction.
The term transaction comes from the financial world, as is
suggested by the above example. Transactions can be quite complex:
For example, if I deposit a check in a bank, the transaction involves
debiting the account of the check writer, crediting my account, and
crediting the sum representing the current day's checks received.
Each transaction typically involves three sets of variables:
V -- the set of variables
I in V -- the subset of V inspected
U in V -- the subset of V updated
In the following, we will make the slightly stronger assumption:
U in I -- updated variables are all inspected
Typically, we assume that every variable is associated with a lock,
a binary mutual exclusion semaphore; this allows processes to claim
exclusive use of that variable. If a variable is to be updated but
does not need inspection, we lock it as if we needed to inspect it first.
In one sense, atomic transactions are trivial to implement. The
entire database needed for the transaction could be stored as a
single variable using Lamport's stable storage. A single mutual
exclusion semaphore suffices to guard access to this variable.
This trivial solution is unacceptable! Most interesting databases
cannot be claimed totally for each update. Instead, the process
performing an update must claim only part of the structure, locking
only those variables needed while allowing other processes to operate
on other parts of the structure at the same time.
Consider, for example, the Bank of America. This is a large bank
with branches all over California. It would be impossible to
operate such a bank if each transaction required giving the teller
in charge of that transaction exclusive use of the bank's entire
database for the duration of the transaction.
The sets I (inspected variables) and U (updated variables) are not
known in advance.
Typically, some subset of I must be inspected in order
to determine the other variables in I and U.
This leads to the following two-phase model of transactions.
Phase 1
Claim locks and inspect variables
until all needed variables are locked.
All variables must be locked prior to inspection.
If multiple variables are locked, deadlock is possible. Deadlocks
can be resolved by aborting and restarting transactions because
they always occur in phase 1, before the transaction has modified
any shared variables.
Deadlocks can be avoided by claiming locks in a standard order, but
this is not always practical, in the general case. For example,
if each database record contains pointers to other records that may need to
be locked and inspected, you must claim the record containing the pointer
in order to follow the pointer, and any cycles in the pointer structure are
potential sources of deadlock.
Phase 2
Update variables and release locks!
The 2 phase model can be improved by using more interesting locks.
Instead of 2-state locks -- where the two states are free and in-use,
for example, 3-state locks can be used, with the following states:
free
read-only
in-use (read-write)
If a process only needs to read a variable, it may claim the variable
by setting the lock to read-only. This is legal if the previous
state of the lock was free or read-only. If a process may need to
write a variable, it must claim the lock as in-use, a claim that can
only be made if the lock was free. Thus, a variable may have multiple
readers. Locks that support this are said to solve the
readers-writers problem.
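POSIX threads provide exactly this kind of 3-state lock as
pthread_rwlock_t. Here is a small sketch of the three states in use; this
fragment releases its locks immediately, so it illustrates only the lock
itself, not the full two-phase discipline.

    #include <pthread.h>

    pthread_rwlock_t account_lock = PTHREAD_RWLOCK_INITIALIZER;
    long account_balance;

    /* Inspect only: many readers may hold the lock at the same time. */
    long read_balance(void)
    {
        pthread_rwlock_rdlock(&account_lock);   /* claim as read-only     */
        long b = account_balance;
        pthread_rwlock_unlock(&account_lock);
        return b;
    }

    /* Update: the write lock excludes readers and writers alike. */
    void credit(long amount)
    {
        pthread_rwlock_wrlock(&account_lock);   /* claim as in-use        */
        account_balance += amount;
        pthread_rwlock_unlock(&account_lock);
    }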
The 2 phase model guarantees that things will work, but it is overly
restrictive. There are specific problems where a greater degree of
concurrent access can be allowed by violation of the two-phase model.
For example, consider a shared lexicographic binary tree. The
two-phase model requires that a user of any node in this tree must
lock all nodes on the path from the root to the node in question
before releasing any of the locks.
In fact, the data structure will work just as well if the user only
holds a few locks at a time. Specifically, the user first claims the
lock on the root, and then claims the lock on the child of the root
that is on the path to the target node in the tree. As soon as it
is determined that the root is not the direct parent of the target,
the root can be released and the process can be continued recursively.
In the end, the process holds a lock on the target node and on the
parent of the target node.
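This discipline is often called lock coupling or hand-over-hand locking.
The sketch below searches such a tree with one mutex per node; it releases
the parent's lock once the target is found, which is enough for a pure
lookup, while an insert or delete would keep the parent locked as
described above.

    #include <pthread.h>
    #include <stddef.h>

    struct node {
        pthread_mutex_t lock;
        int key;
        struct node *left, *right;
    };

    /* Search a shared binary search tree, never holding more than a few
       locks at a time: the current node, its parent, and briefly the next
       child.  The root pointer itself is assumed stable for this sketch. */
    struct node *search(struct node *root, int key)
    {
        struct node *parent = NULL;
        struct node *cur = root;

        if (cur == NULL)
            return NULL;
        pthread_mutex_lock(&cur->lock);
        while (cur != NULL && cur->key != key) {
            struct node *next = (key < cur->key) ? cur->left : cur->right;
            if (next != NULL)
                pthread_mutex_lock(&next->lock);     /* lock the child first */
            if (parent != NULL)
                pthread_mutex_unlock(&parent->lock); /* then drop the        */
            parent = cur;                            /* grandparent's lock   */
            cur = next;
        }
        if (parent != NULL)
            pthread_mutex_unlock(&parent->lock);
        return cur;   /* if non-NULL, returned with cur->lock still held  */
    }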
Note: The above definitions of the two-phase model of transactions
do not guarantee atomicity. They only describe the mutual exclusion model
that atomic transaction systems provide for the user. The implementation of
atomicity in this model will be the subject of the next lecture!