This glossary is intended as a brief description of some of the acronyms and terms you may hear
during discussion of the Linux kernel. Some general use terms such as UTSL are included
for convenience. As Officially Clueless (TM) myself, some of these definitions are
probably technically incorrect or misleading, so caveat emptor. Even more of them are incomplete.
This glossary is heavily x86 biased, as I have no knowledge of other architectures. Where relevant,
the entries are based on v2.3/2.4.
Entries containing "FIXME" are either wrong, or incomplete, or both. Further apologies if some of this
is a bit patronising in tone; I'm not the world's greatest writer (alas).
Comments/additions/corrections are very much welcome. If you have trouble understanding something,
I would especially like an email (if you're having trouble, most probably others are as well).
For other computing terms, the Free On-Line Dictionary Of Computing
the Dictionary of Algorithms, Data Structures, and Problems
Memory Management Glossary
the Jargon File (http://www.tuxedo.org/~esr/jargon/ or
may be able to help.
A B C D E F
G H I J K L
M N O P Q R
S T U V W X
(Last modified 06/08/2001)
- 2Q algorithm
- MM algorithm based on two areas, one managed as a FIFO queue, and one as an LRU list.
- 8259 PIC
- Outdated interrupt controller present on Intel hardware.
- Application Binary Interface; the interface of passed structures
between the user processes (and libraries) and the kernel. For compatibility,
it is important that these remain as static as possible (i.e. making sure
that variables and structure members have the same bytesize as before, and in
the same ordering). Occasionally breakage is necessary, requiring re-compilation
of the user-space sources (note that this does not affect source-compatibility;
that is a separate issue).
- Advanced Configuration and Power Interface - replacement for APM that has the advantage of allowing
O/S control of power management facilities.
- Address Generation Interlocking, on x86. When execution of an instruction requires an address resulting from a non-completed
instrcution, the CPU must wait - this is known as an AGI stall.
- Accelerated Graphics Port, on x86 boxes.
- Generally, used for something which doesn't have the usual associated
object. For example an anonymous address space is not interested in user address
space (that is, no process context). Some common ones are :
- Anonymous page
- A page of memory that is not associated with a file on a file system. This can come from
expanding the process's data segment with brk(), shared memory segments, or mmap()
with a MAP_ANON or MAP_PRIVATE flag. MAP_PRIVATE, although it maps in
data from a file, is considered anonymous because any changes do not get written back to the file
(any dirty pages have to be moved to swap if the page is freed from main memory).
- Anonymous buffer
- The buffer cache contains buffers of data on their way to/from the disk. An anonymous buffer
is not associated with a file. One example of this is data from a deleted file - it will not be written
to any file, but is kept around until it is flushed.
- See local APIC and IO-APIC.
- Advanced Power Management, power management standard superceded by ACPI. APM and SMP just don't mix.
- This is an acronym for the Address Resolution Protocol and this
is how a network machine associates an IP Address with a
- Abstract Syntax Notation, a protocol for structured data, used, for example, in the Q.3 managament protocol.
- Professor Andrew S. Tanenbaum, writer of MINIX and several essential O/S books.
- ATA Packet Interface, used by most CD-ROMs, and other devices.
- Technique used in the VM code, referring to balancing various parameters such as the number of pages currently free, to avoid
thrashing and other bad memory capacity artefacts. See zones, kswapd bug.
- Base Address Registers, for PCI devices.
- Binary-Coded Decimal - see a textbook.
- See highmem.
- big lock
- kernel_lock, which locks the entire kernel from entry (no other task may run
in the kernel code). It is recursive per process
and dropped automatically when a process gives up the CPU, then regained on wake-up, in contrast
to other spinlocks.
- bit error
- Used colloquially to mean a single bit error in some memory address. Often due to faulty memory (ECC memory
can correct single bit errors). Often results in fake oopsen, with addresses like 0x0008000. Also
seen are values some small offset from zero, plus a bit error, which is where the value passed a NULL check due to
the bit error, and then the kernel tried to access a structure member by means of the pointer, leading to the offset.
- block bitmap
- In UNIX-like filesystems, the usage of disks blocks is recorded in the block bitmap, where each set bit indicates a specific allocated
- bottom-half handler
- A set of standard kernel threads that execute tasks
on a queue that have been registered with that type
of bottom-half handler for execution. The code is run on return to user space or at the
end of a hardware interrupt. In 2.3.43 a more general solution with softirqs
and tasklets was implemented.
Sometimes abbreviated to "bh", which should
not be confused with buffer head, which
is also abbreviated to "bh".
- bounce buffer
- An intermediate buffer. Used for example, in "faking" alignment to a client from non-aligned resources.
- Big-reader locks, used when there are many contending for read access to a resource, and very few contending
for writes (thus the balance is towards very fast read locking, and very slow write locking).
- BootStrap Processor, or the CPU which enables the other CPUs in an SMP system.
- Block storage segment. This is the memory mapping section containing the data allocated for a binary image at execution time.
Also known as "Block Started by Symbol" and "Bull-Shit Storage".
- Branch Target Buffer - on x86 processors, the cache of recent conditional jump results.
- buddy allocator
- The memory allocation scheme used in the kernel. A vector of lists of free pages is kept,
ordered by the size of the chunk (in powers of two). When a chunk is allocated, it is removed
from the relevant list. When a chunk is freed back to the free pages pool, it is placed
in the relevant list, starting from the top. If it is physically contiguous with a present
chunk, they are merged and placed in the list above (i.e. where the chunks are twice the size),
and this operation percolates up the vector. As regions are merged whenever possible, this
design helps to reduce memory fragmentation. FIXME
- buffer cache
- The buffer cache is a hash table
of buffers, indexed by device and block number. LRU lists are maintained for the
buffers in the various states, with separate lists for buffers of different sizes.
With 2.3's unification of the buffer and page caches, each buffer head points to part or all of a page structure, through which the buffer's actual contents are available. FIXME
- buffer head
- A structure containing information on I/O for some page in real memory. A buffer
can be locked during I/O, or in several other states depending
on its usage or whether it is free. Each buffer is associated with one page, but every
page may have several buffers (consider the floppy on x86, where the I/O blocksize is
512 bytes, but each page is commonly 4096 bytes).
- Used in kernel code in tests for "impossible" conditions. Signify a kernel bug
or faulty hardware.
- bus mastering
- Giving a card on a bus (e.g. ISA,PCI) the ability to read/write directly to main memory. This is how DMA is performed
on PCI busses.
- byte sex
- cache affinity
- Where the cache of a CPU represents the current memory set used by a task,
there is said to be cache affinity with that task. A good thing if the task is regularly
scheduled on that CPU. See processor affinity.
- cache coherency
- On an SMP system, ensuring that the local memory cache of each CPU is consistent
with respect to the values which may be stored in other CPUs' caches, avoiding coherency problems
such as the "lost update". This is acheived by the hardware in concert with the operating system.
- cache line
- A section of the hardware cache, around 32 bytes large. Kernel structures are often
designed such that the commonly-accessed members all fit into one cache-line, which
reduces cache pollution. Structures such as this are cache line aligned.
- cache ping-pong
A hardware phenomenon in an SMP system, where two tasks on different CPUs are both
accessing the same physical memory in a cache line. This means as each task runs,
when it changes the memory, it must invalidate the other CPU's relevant cache line (to
ensure cache coherency). Then, when the task on the other CPU
runs, it must reload the cache line (as it's set invalid), before changing it. Repeat
ad jocularum. A bad thing (TM). A common reason for putting a lock on a different cache line
than the data mutexed by the lock : then the "other" task can grab and drop the lock without
having to necessarily invalidate the cache line on the first CPU. FIXME
- cache pollution
- Where during execution of a task, another task is scheduled onto that CPU which disrupts
useful lines of the current cache contents, which will be used soon. That is, cache pollution
is a non-optimal situation where the other process would have been bettered scheduled on a different
CPU or at a different time. The aim is to minimise the need to replace cache lines, obviously
- call gate
- x86 hardware support for mode switch to kernel (i.e. system call). In Linux, int 0x80
will trigger the call gate.
- These are defined names of capabilities for specific tasks provided by the kernel, e.g. CAP_SYS_NICE.
- chroot jail
- A process under the aegis of a chroot() syscall is in a chroot jail, and cannot access the file system
above its notion of root directory /.
- x86 assembler instructions for disabling and enabling interrupts, respectively. There are CPU-local
and global variants of these. Code running with interrupts disabled must be fast, for obvious reasons (this is
called interrupt latency).
- Eric Raymond's proposal for a replacement to the current kernel build system. See http://www.tuxedo.org/~esr/kbuild.
- cold cache
- A cache whose content is invalid or irrelevant with respect to some task to be run.
- completion ports
- I/O interface used in O/S's such as Windows NT. Userspace notifies the kernel of each file descriptor the
program is interested. The O/S uses a callback for each fd to indicate that I/O is ready.
- Where two tasks each want an exclusive resource. You may hear talk of, for example, spinlock
contention, which is where one or more tasks is commonly busy-waiting for a spinlock to become
unlocked, as it is being taken by other tasks.
- context switch
- Refers to the changes necessary in the CPU when the scheduler schedules a different process to run on the CPU.
This involves invalidating the TLB, loading the registers with
the saved values, etc.
There is an associated cost with such a switch, so it
is best to avoid un-necessary context switch when possible. Note that the division of kernel-mode and user-mode
means a similar, but simpler, operation is necessary when a syscall moves into kernel mode. However this is not
called a context switch, as the mode switch doesn't change the current process. See lazy TLB.
One good of feature of Linux is its extremely low context and mode switch cost, compared to an operating system
- Copy-On-Write, efficiency method where a page or other resource is shared until an attempt to
write is made. In that case a copy is made, and the write is done to the copy.
- Current Privilege Level
- critical path
- A vital code path which should be optimised for the common case. Critical paths are executed frequently
and form the important trunk routes of various kernel operations. An example would be buffer head manipulation
during file I/O.
- Code storage segment, aka text section. This is the memory mapping containing the executable code (text) for a binary image.
- Directed Acyclic Graph
- dancing makefiles
- An experimental new Makefile set up for configuring and compiling the kernel, written by Michael Elizabeth Chastain.
- The cache of dentry structures. Under UNIX an entry
in a particular directory must be searched for linearly, so even if the disk block containing the directory entry list
is in-core, there is an associated cost. The dcache stores recent results of these searches which in general speeds
up these disk searches by a large factor. Recent 2.3 work uses the dentries to allow multiple mounting, union mount, and more.
The hardware data cache is usually referred to as the D-cache.
Any of a number of situations where two or more processes cannot proceed because they are both waiting for the other to release some resource.
FIXME(give good references).
- delayed write
- See write behind.
- demand zero
- In demand paging, where the page is to be zeroed when actually created (common case: bss segment of an executable image,
which is uninitialised heap data for the executable). Also called ZFOD.
- Directory entry, in-core structure defining a file's details: inode, parent dentry etc. Cached in a hash table indexed
by hashed filename (see dcache).
- IP packet bit indicating it should not be fragmented. The remote host will return ICMP notifications if the packet
had to be split anyway, and these are used in MTU discovery.
- directory notification
- Provides hooks for notifying tasks when the contents of a directory has changed. Note "contents" can refer to dentries, the file inodes,
or even the file contents themselves (file notification).
- Dial-On-Demand for net connections over POTS.
- drop behind
- In stream I/O conditions, data that has already been read and processed is not needed again. The VM ideally should recognise this and mark the used pages as un-needed, so they can be discarded first. This technique is called "drop behind".
- Data storage segment, aka data section. This is the memory mapping containing the initialised data for a binary image.
- Processors such as the Pentium Pro, that can decode and execute two instructions simultaneously.
- Abbrev. fr. duplication.
- Double word, i.e. 4 bytes on x86.
- See extended attributes.
- eager coalescing
- What the buddy allocator currently does, i.e. merge adjacent blocks as soon as possible.
- edge-triggered interrupt
- The interrupt is triggered by the rising or falling edge of the interrupt line. This
makes IRQ line sharing difficult, as an edge may occur whilst an ISR
is running, and it could be easily missed; to allow sharing level-triggered
interrupts are usually used.
- Extended Instruction Pointer. This register contains the PC value of a task, that is, it points to the
next instruction to be fetched, decoded etc.
- elevator algorithm
- This algorithm, often used in disk accesses, keeps an ordered list of requests. When the current request on the disk (e.g. the disk
block) has been satisfied, the next strictly greater request on the list is dealt with. When a new request arrives, it is inserted
into the ordered list in position (e.g. if the new requested block number is less than the current handled request, it goes
before it in the list). When reaching the end of the list, the elevator changes direction, and the situation is reversed.
- Explicitly-Parallel Instruction set Computing, an instruction set architecture where every dependency for an instruction
is encoded into the instruction itself. This has the potential to be faster as the compiler can encode the data dependencies in
- exponential back-off
- A general algorithm for dealing with contention cases; for example, collisions on a network bus, or contention for a spinlock.
- extended attributes
- Also known as multi-part or multi-stream files, files with extended attributes deviate from the principle of files being a simple single
data stream. An example of extended attributes is the Macintosh's "resource fork", which is associated with a specific file (known as the "data fork").
- fair scheduler
- A scheduler which ensures fairness between users, such that a user's process count and associated cost only impacts
that user, rather than the whole system as currently. Rik van Riel and Borislav Deianov have both produced different patches to implement this.
- false sharing
- On SMP caches, when two parts of single block are accessed, neither of which collide with the other, the cache coherency protocol may not
be able to detect this, and mark the block as "shared" even when it isn't. This is known as false sharing.
- The code path most commonly taken, often optimised heavily at the expense of less frequently-taken blocks of code. This is the
reason you see so many gotos in core functions - it produces common-path code far more efficient than an optimising
compiler can manage.
- file descriptor
- The mapping of a file's contents into memory.
- fixed mmap
- A user-space request for a mmap starting at a fixed virtual address. Generally not useful or guaranteed to work; a notable exception is overlayed mmaps, where a mmaped area has further mmaps of different types at fixed positions in the map.
- Fully-Qualified Domain Name, e.g. martyr.darrenemerson.co.uk.
- For AGP setups, Graphics Aperture Relocation Table.
- GNOME's source documentation system (similar to javadoc). Available by CVS from gnome. Kernel
driver interface descriptions, built from source using gdoc, are currently being written in 2.3.
- Global Descriptor Table. Something to do with x86 memory segmentation I think (FIXME). See ldt.
- In the kernel, often means "get a reference to". This may be as simple as incrementing a usage count,
or it may imply attempting to retrieve an object from a cache of some sort, or allocating kernel memory. See put.
- Generalised Kernel Hook Infrastructure, an IBM patch to implement hooks into the kernel code.
- I just had to point out that lkml is for Linux kernel development discussions. Please please don't engage in any threads concerning licensing
issues, Microsoft, or Richard Stallman. Please.
- group descriptor
- On-disk filesystem structure, containing information for a block group, such as the inode bitmap and
- Human Interface Device (for USB).
- On the x86 platform, memory addresses accessed using Intel's Physical Address Extension, which
allows real memory of up to 64Gb on IA32. Only certain things can be stored in this space (FIXME: only userspace ? or what ? no bh or irqhandler
can use it ...)
- Possibly byte-swapping conversion from host-endian to network-endian format. The network is big-endian.
- A standard for executable image interfacing (syscalls etc.). This is orthogonal to binary image file format standards
such as ELF.
- On lkml, this almost always refers to the inode cache, rather than the hardware instruction cache (I-cache).
- Internet Engineering Task Force, a standards organisation for protocols used on the internet.
- Integrated Kernel Debugger. A patched version of the kernel containing additional facilities for debugging the kernel.
- Instruction-Level Parallelism, i.e. executing more than one instruction at once. See dual-issue.
- What you might expect: code with knowledge of used code's internals beyond the API. This breaks modularity,
increases complexity and causes bugs, and a design goal is to minimise this behaviour.
- As per UNIX, the structure defining a file (except for the filename which is stored in the directory
entry). The inode cache is a closed hash table indexed by superblock and inode number, and is only hit
when the dentry lookup failed.
- inode bitmap
- In UNIX-like filesystems, the usage of on-disk inodes is recorded in the inode bitmap, where each set bit indicates a specific allocated
- Abbrev. fr. "instruction".
- interrupt context
- This is code executing in the kernel in response to a generated
interrupt, without a process context.
For more time-consuming work, a bottom-half handler task
can be added to be executed at some later time.
- interrupt gate
x86 hardware support for hardware interrupts. Interrupt gate facilities are provided to allow a known entry path from
untrusted privilege levels to the interrupt routines.
- Receives interrupts external to a CPU (e.g. from a device) and routes them through the local APIC to the CPU.
- I/O privilege level system call and bit in x86 EFLAGS register.
- Inter-processor Interrupt. On SMP machines this refers to an interrupt sent between
CPUs, indicating some event that the other CPU needs to be aware of, for example smp_invalidate_interrupt,
which invalidates a CPU's TLB.
- On-disk format often used for CDs. See Joliet.
- Interrupt Service Routine, or interrupt handler. Also, on x86 APICs, In-Service Register, confusingly enough.
- Basic packet of kernel time, around 10ms on x86. Related to HZ, the basic resolution of the operating system.
The timer interrupt is raised each 10ms, which then performs some h/w timer related stuff, and marks a couple
of bh's ready to run if applicable.
- Microsoft extension to ISO9660, commonly used.
- A filesystem that writes all changes to some log file, to preserve filesystem consistency (for example,
an inode's link count should equal the number of directory entries pointing to that inode). There are two types;
FIXME log only metadata such as link count, whereas FIXME logs both meta-data and data, effectively writing
all data twice. ext3, reiserfs and XFS are all meta-data loggers, the second type are rare as the same effect
can be acheived using RAID.
- This allows user-space memory access from other contexts, such as in a bh. This
also allows kernel drivers to set up DMA from a device directly to user-space pages.
- Kernel memory allocation routine. See vmalloc. kmalloc() ensures physical address contiguity.
- kswapd bug
- Problems with VM balancing in late 2.3 kernels. People in a variety of situations found that the VM subsystem
was far from optimal, leading to the kernel swapping daemon kswapd taking an inordinate amount of CPU resources; thrashing
was also a common problem. Work is undergoing to fix these scenarios.
- See L2.
- The L2 (level 2) cache is a memory cache in the order of half a megabyte wide, located on the motherboard. The much faster
L1 cache is in-CPU, but correspondingly smaller. x86 has a two-level cache, whilst some architectures such as Alpha AXP have an additional L3 cache.
Often referred to is the memory hierarchy, which goes something like (with increasing
latency and capacity, decreasing cost) registers, L1 cache, L2 cache, L3 cache, main memory, local mass storage device, network.
- lazy TLB
- Some tasks, such as kernel threads like kswapd, don't have a process context,
which renders the costly invalidation of the TLB un-necessary (as the user-level page tables
are not used). The task structure member mm indicates the user-level address space,
which will be NULL for kernel threads. In this case active_mm indicates
the used address space (kernel). These address spaces are known as anonymous.
Note also that certain tasks may occasionally temporarily drop the user-level address space
and have an anonymous address space for a short while.
- Local Descriptor Table. The ldt is a per-process memory management table used on x86. See gdt.
- General term used in unreliably-connected components. A resource is given a lease which eventually expires. This means the connection
can break, without leaving state hanging. NFS, being stateless, does not have the concept of leases.
- level-triggered interrupt
- Also level-sensitive. As used on PCI. PCI's ability to have multiple devices sharing a single interrupt
line is made possible by its use of level-triggered interrupts. If interrupts were edge-triggered,
it is possible that some of the incoming interrupts might be lost, for example if a second interrupt
were to come in while the processor was still in the midst of processing a previous one,
the processor would return from servicing the first interrupt but not be aware
that a second interrupt had been asserted. With level-triggered interrupts,
by contrast, the interrupt line remains active until specifically deactivated by an ISR.
This does mean, however, that if a buggy routine forgets to reset the interrupt line after
agreeing to deal with the interrupt, then the machine may hang in a loop of constant interrupts.
- Large File Support, specifically files larger than 2Gb on 32-bit systems.
- Least Frequently Used, another MM technique. The tension between recency of access and frequency of access of a an area of
memory is one of the crucial parts of any memory management system.
- Remember that the kernel build makes no use of libc !
- Loop Initialization Primitive, related to target IDs on SCSI busses.
- When tasks are fighting for an exclusive resource, but are also ready to run. For example both tasks may be in a loop checking for a resource's availability.
- local APIC
- On-chip interrupt controller provided on P6 and above Intel CPUs. Linux uses the timer interrupt register if a
local APIC is available to provide its timer interrupt (is this true ?). The local APIC is part of a replacement for the old-style
8259 PIC, and receives external interrupts through an IO-APIC if there is one.
- loopback device
- The fake net device representing the local host net interface, lo. This can also refer to file loopback mounts.
- loopback mount
- There are two types of mount called loopback mounting. First is the file loopback mount, where a single file on a filesystem
itself containing a filesystem is mounted (useful in constructing ISO9660 images for one).
The other type, which is possible in 2.4, is "true" loopback mount, where you can do something like mount -t bind /safebin /chrootjail/bin.
- As per the CS literature, Least Recently Used. A selection algorithm where a list is kept,
ordered by last usage of each element. Often used in page replacement algorithms. Also Most Recently Used and LFU.
- LUN blacklist
- This is a blacklist of SCSI devices that do not properly handle probes with a Logical UNit number other than zero.
Some mischevious devices respond to all LUNs, whilst others can hang with a LUN more than zero.
LUNs are used in CD changers amongst other devices.
- Logical Volume Manager. This allows several physical partitions to be represented as a single block device, amongst other things.
- major fault
- A major page fault occurs when an attempt to access a page not currently present in physical memory was made. The page must
be swapped in to physical memory by the fault fixup code.
- Memory barrier; ensures on SMP systems that each CPU's view of memory is the same, i.e. enforces order. rmb()
ensures reads are done in order, and wmb() ensures writes are done in order. Placing a barrier after
some code basically makes sure that all previous memory reads or writes are done (as CPUs can re-order usually
to increase performance). (FIXME: this needs to be more clear).
- medium-often race
- A race condition scenario. FIXME
- memory rusting
- Name for problems with mm code in v2.1, where pages got rusted and stuck in memory when they
should really have been swapped out. Was a problem on small-memory machines. Now fixed (right ?)
- The name for a superset of cache coherency protocols. There are five possible states for a
cacheline to be in :
|Modified||The data in the cacheline has been changed.|
|Owned||The cache owner has exclusive rights to the contents of this data.|
|Exclusive||The cacheline is not present on any other caches.|
|Shared||The cacheline is shared with another cache, but does not need writing back to memory.|
|Invalid||The cacheline is invalid and does not contain usefule data.|
This protocol is described in more detail in the Schimmel book.
- Information about a file, such as permissions, in contrast to the actual data in the file.
- minor fault
- A minor page fault occurs when an attempt to access a page present in physical memory, but without the correct permissions.
An example is the first write to a second reference to a shared page, when the kernel must perform the copy-on-write and allow
the task to update the copied page.
- Memory-mapped I/O, that is I/O memory (memory on hardware devices) accessible through a memory mapping. Also known as an I/O region.
- Technical term, meaning very roughly "new data doesn't affect existing data". Used to mean a wide range of
different things. For example, a monotonic clock is one which never gives a time prior to the time given previously
(that is, sequential reads of the clock always increase).
- Model-Specific Register. For the x86 platform, registers that are not guaranteed as part of the Intel Architecture (i.e.
could be not present in future CPU models).
- The Maximum Segment Size is the largest quantity of data
that can be transmitted at one time over a net interface.
- Memory Type Range Registers
- The Maximum Transmission Unit is a parameter that
determines the largest datagram than can be transmitted by an IP
interface without it needing to be broken down into smaller
units. Typical values are 1500 bytes for an ethernet interface, or 576 bytes
for a SLIP interface.
MTU discovery is the process of discovering the MTU of a remote site that can be used without causing
expensive fragmentation. See DF.
- Used occasionally for "network".
- Nagle's Algorithm
- TCP algorithm for reducing network traffic in a connection. The local machine will accumulate outgoing
data into packets until receiving a packet from the machine, and then
the packets are sent. This way small amounts of outgoing data are amalgamated into the packet's larger payloads.
Setting the TCP option NODELAY disables this behaviour. FIXME
- Network Block Device - presents a local block device to a remote device.
- negative entry
- Meaning an error return rather than the expected entry from some vector or cache. A negative dentry
is used when an NFS client tries to operate on a stale file, and with other situations. Without the negative dentry, a lookup
is done on each request. This would allow a DoS attack by stealing CPU time doing lookups which will
fail. So the dentry is kept around to provide a near-immediate "stale file" response to the client. See fs/dcache.c:d_delete().
Apparently positive dentries are not allowed to become negative. A negative dentry is one with a NULL d_inode field.
- Non-maskable interrupt - highest-priority interrupt that can even interrupt standard h/w interrupt ISRs. The NMI watchdog
is a kernel facility, running off NMIs, which checks CPUs for recent activity (in the last five minutes), to detect
permanently stalled CPUs. NMIs are also delivered by motherboards in the case of serious hardware problems (like bad RAM).
- A routine executed when a virtual address has been accessed when the page is for some reason
not available in real memory. An example would be a swapped out page, hence the nopage operation
would need to bring the page into real memory from the swap file.
- "Number of", e.g. nr_free_pages(). Preferred to the "no_" form as it's not ambiguous
with the negative case (compare no_free_pages()).
- Possibly byte-swapping conversion from network-endian to host-endian format. The network is big-endian.
- Non-Uniform Memory Architecture. Usually on an SMP system, all memory beyond the caches costs an equal
amount to reach for each CPU. In NUMA systems, some memory can be accessed more quickly
than other parts. Later kernels have some NUMA support for preferring to use memory "nearer"
to the CPU, rather than higher-latency distant memory. See Documentation/vm/numa for details.
- Produce a proprietary non-open-source kernel module. Complaints about this module go to NVidia;
such complaints are not welcome on linux-kernel. Also check #nvidia on irc.openprojects.net
- Notation from Computer Science literature for complexity of algorithms. An O(1) algorithm
takes constant time; an O(n) algorithm takes time proportional to N, the number of data, and so on. Also used for
space complexity, the number of memory "units" used in the algorithm.
- Compaq standard for USB controllers.
- one-hand algorithm
- VM algorithm. A one-handed clock algorithm has one "clock hand" sweeping through pages, performing ageing
and other activities.
- In networking Out-Of-Band data. FIXME
- An oops is generated on some kernel errors. This may be due to kernel bugs or faulty hardware (especially bad RAM). Often
the kernel can continue to operate mostly correctly after an oops. More serious or potentially dangerous errors can cause
a kernel panic (the O/S freezes). This is sometimes done to protect a filesystem from incurring further harm.
An oops report will appear in the kernel logs, but is only useful to developers after decoding it with ksymoops (see
Documentation/oops-tracing.txt). Note also that klogd can corrupt oops reports by incorrectly decoding it automatically.
It is recommended to pass the -x option to klogd on startup to prevent this from happening.
The plural form is oopsen.
- out of line code
- Code placed apart from the next set of instructions, like function calls.
- out-of-order execution
- Some processors (such as Intel Pentium Pro and above) can decode and execute more than one instruction
at a time in parallel. This means a sequence of instructions is not necessarily the order in which those
instructions are completed. See mb().
- Physical Address Extension - Intel's 36-bit physical address extensions for highmem.
- page ageing
- Marking a page as "old" in some way, that is, it hasn't been used for some time. Old pages are preferred
for swapping out (although this is a simplification of the page swap selection algorithm), and are aged by
kernel threads via a clock algorithm. Confusingly, page->age==0 indicates an infinitely old page
(FIXME: true ?)
The pte's are also aged for selection for flushing from the page table to the page or swap cache.
- page cache
- The page cache caches pages mapped from files (e.g. mapping in an executable image).
- page colouring
- A system to ensure that virtually-contiguous pages are not mapped to the same cache lines, improving performance.
FreeBSD does this, but Linux, after some discussion, doesn't (yet).
- page stealer
- A task which steals from a process's vm, taking pages for itself. An example is kswapd, which runs through
the pages of the system swapping out pages to keep freepages.min number of free pages.
- By analogy with optimal, the worst situation possible in some context.
- Page Global Directory, or the abstracted top level of the multi-level page tables. See pmd, pte. On x86, there are only
two real levels of page tables; this fact is abstracted away by macros in the source. Each level of page table deals with different sizes of memory.
For example, this global directory may deal with areas 4Mb in size. Each entry will be a pointer to a lower table of a smaller-sized directory.
On x86, the next level (middle) is non present in hardware, and is folded in to the pgd in the kernel code. The
bottom level deals in pages directly (PAGE_SIZE). So the pgd is a directory of page tables. Code which traverses this structure
(such as the bttv driver) is said to walk the page tables.
- Programmable Interrupt Controller. See 8259 PIC, local APIC, IO-APIC.
- The old-fashioned transfer protocol for devices on an IDE bus. Now mostly used is DMA,
and standards like UDMA.
- A pinned page cannot be swapped out of main memory.
- Page Middle Directory. The middle level of page tables.
- Used in the slab cache for debugging purposes. Slabs are filled with a special value after
they have been freed in order to catch double frees.
- A standard specifying semantics and interfaces for a UNIX-like kernel
interface, along with standards for user-space facilities.
There is a core POSIX definition which must be supported by all POSIX-conforming OSes,
and several optional standards for specific facilities; so you may see references
to "POSIX shm" or "POSIX real-time signals".
- PPP (point-to-point protocol, often used in modem up-links) running over Ethernet.
- Involuntary switching of a CPU from one task to another. User-space is pre-empted by interrupts, which can then
either return to the process, or schedule another process (a process switch will also occur when the process "voluntarily" gives up the CPU by e.g. waiting
for a disk block in kernel mode).
Kernel mode tasks are never pre-empted (except by interrupts) - they are guaranteed use of the CPU until they sleep or yield the CPU. Some
kernel code runs with interrupts disabled, meaning nothing except an NMI can interrupt the execution of the code.
- Many pages (such as those for a process's BSS segment) need to be zeroed when accessed. As the kernel only
takes the page on the page fault for these areas, it is possible to pre-zero some amount of pages before
the page faults occur. FreeBSD does this, and there has been some inconclusive talk of including this feature
in Linux. See demand zero.
Obviously, the equivalent of printf for kernel code (remember, you can't use
libc functions in the kernel). Can be called from interrupt context (although
it does take the console spinlock)
but floating-point cannot be used, as throughout the kernel.
- priority inheritance
- See priority inversion.
- priority inversion
A situation where a low-priority task holds a resource which a higher-priority task is waiting for.
This can occur with tasks scheduled with SCHED_OTHER or SCHED_IDLE, where a task
with a higher scheduling priority will always be scheduled before in preference. This leads to deadlock
as the low-priority task cannot be scheduled in order to release the lock, and the high-priority task
can't continue without the lock on the resource. Priority inheritance
is a putative solution, where the low-priority task temporarily takes on the priority
of the other task, enabling its scheduling and release of the lock. However this approach
can itself lead to deadlock (FIXME: is this really true ?).
- processor affinity
- Some OS's have system calls for binding a process to a particular CPU. The kernel ensures that when
the process is re-scheduled it will be run on that processor.
Linux currently doesn't support this and its relative merits are debatable.
- processor ordering
- The ordering of memory read and writes as seen from the CPU's point of view.
- process context
- Code executing in the kernel on behalf of a user process (from a syscall), is said to
have a process context. Code not in a process context is not allowed to sleep or wait (see interrupt
- process group
- A set of user processes sharing the same group pid. One of these processes will be the group leader (same
pid as pgid).
- process suspension
When a VM is capable of completely suspending processes for a while. Also known as process swapping.
- Of a network device, sucks all packets regardless of destination.
- Page Table Entry, or a value containing the physical address of a page along with
associated bits indicating, for example, that the entry is valid and the related page is present in real memory.
- In the kernel, means "drop a reference to". This indicates that whatever task was using the
object previously is no longer interested. This may just decrement a usage count, or perform
a more complicated reaping routine.
- race condition
- Generally, where the ordering of events can affect the outcome of a situation. These bugs are caused
by incomplete mutual exclusion, so a task or interrupt can come in and change something vital under the
feet of another task. They are notoriously hard to debug as they are often marginal (often only occur
in pathological=="very unlikely" situations).
- Redundant Array of Independent/Inexpensive Disks. See the standard literature.
- Reliability, Availability & Serviceability
- rate limit
- To avoid DoS attacks, some facilities have upper limits on how often they can trigger. Most trite example is logging.
- raw I/O
- I/O that does not go through the caches that Linux provides as standard to all block devices. This feature
can be useful to database systems, where the database back-end has a better idea of the data access pattern than
the kernel can, hence increasing performance.
- Where I/O is performed before strictly required in an attempt to improve performance. This is
performed by the Linux kernel, and often by hardware as well.
- Readiness Event Queues
- See completion ports.
- red zone
- Adding protection (e.g. read or write) to kernel pages, for use in detecting stack overflow and other situations.
This often works by adding a guard area at the end of each buffer in the slab allocator,
which is checked to make sure it hasn't been overwritten.
- register spilling
- On out-of-order processors, a retired instruction is one that has been executed and completed.
- ring 0
- ring 0 is the protection level provided by x86 hardware for kernel mode. Ring 3 is used in Linux for user mode,
and rings 1 and 2 are unused. Ring transition to a higher protection level
is only possible through gates. Interrupt gates are
set up for transitions due to interrupts, and call gates are used for system calls.
- ring buffer
- A buffer of data which is of fixed size; when it fills, further data is placed back at
the start of the buffer, overwriting the old data, in a "ring". Commonly used in device drivers.
- root squashing
- An NFS option whereby any actions performed by a remote client's root account are actually attempted by the user "nobody"
on the server.
- Common abbreviation used in net code for "receive".
- An I/O mechanism, whereby devices can write to memory at several physically non-contiguous addresses (scatter), and compose reads
from physically non-contiguous addresses in memory (gather).
- As per the literature, a mutual exclusion or resource management method where tasks sleep waiting for a semaphore
to be granted to them. Later versions of the Linux kernel also provide a more complicated type, the read-write
semaphore, which allows a task to specify its intentions with regard to the resource. Hence several readers
are allowed for the resource, but a writer will have mutual exclusion.
- Sessions are related to terminal handling. Sessions can have a controlling terminal; by becoming a session group leader,
a daemon process can detach itself from any controlling terminal. A session group is a set of processes created by the session
The set_fs() facility enables the copy_*_user routines to access kernel (KERNEL_DS) or user (KERNEL_FS) memory.
This is required in kernel code that uses functions that expect to copy data from user-space, so that the correct segment is used (i.e. copying
from kernel memory rather than user space memory).
- Shared memory. This is memory accessible through more than one processes' virtual address
map. There are standardised shm interfaces derived from System V, as well as POSIX.
- silly delete
- Silly-deleting in NFS is used when somebody tries to delete a file
that is held open by some other process. The deletion is delayed, and
the file is instead renamed to something along the lines of
.nfsxxxxxxxxx. The last process to close the file will then attempt
to delete it.
- slab allocator
- Cache-based kernel memory allocation scheme used by kmalloc(). Bonwick's paper can be
found at ftp://ftp.bitmover.com/pub/slab.ps.
When actually requiring new
pages, __get_free_pages() is called which uses alloc_pages(), with the standard
Interrupts run on return to user space or after a hardware interrupt, and there is a fixed
number available. See tasklet, bottom-half handler.
These methods are usually used to run kernel code resulting from a hardware interrupt of some
kind, where the necessary operations take too long to execute in hardirq context. A notable case
is routines queueing on TIMER_BH which are executed depending on the hardware timer tick.
These are different from DOS soft interrupts as they don't make use of hardware soft interrupt support (int
instruction), although these types of soft interrupt are used in kernel code, for example kernel_thread().
- A busy-wait method of ensuring mutual exclusion for a resource. Tasks waiting on a spin-lock
sit in a busy loop until the spinlock becomes available. On a UP (single processor) system,
spinlocks are not used and are optimised out of the kernel. There are also read-write spinlocks,
where multiple readers are allowed, but only one writer. See Documentation/spinlocks.txt
in the kernel source for details.
- See cli/sti.
- An fs-specific data structure, stored on disk, that holds various bits of
information about the fs on the disk, such as a pointer to the free blocks structure,
the number of free inodes, etc.
In the code, the inode manipulation routines are entered in super_operations
vector of function pointers. Often abbreviated to sb.
- The Single Unix Specification, a standard defined with the aim of
standardisation of Unix-like OS semantics and APIs. There is also a second
version of this standard, referred to as SuS v2.
- System V Release 4, as in the UNIX source that POSIX is mostly based on.
- SYN cookies
- Protection against TCP/IP SYN flooding DoS attack by storing state information about SYN packets from
hosts. See ftp://koobera.math.uic.edu/syncookies.html.
- system calls
- The method by which a user process requests facilities provided by the kernel. Examples are
wait4(2), open(2), and sched_setscheduler(2). By an architecture-specific method, the kernel is executed
on behalf of the process in kernel mode at the entry point of the system call. Generally there is
a convention in the source that syscall entry points are prefixed with sys_,
for example, the entry point to open(2) is the function sys_open in the file
fs/open.c. Code executing in kernel mode (that is, with access to the kernel's memory) from
a syscall entry point is said to have a process context.
On x86, all system calls executed from user mode use int 0x80 and came through the syscall call gate in arch/i386/kernel/entry.S.
- tail merging
Tail merging is a technique which collects tails of multiple files together in one block to save space, reducing internal fragmentation,
especially at larger block sizes.
- The kernel abstraction for every user process, user thread, and kernel thread. All of these are
handled as tasks in the kernel. Every task is described by its task_struct. User processes/threads
have an associated user_struct. When in process context, the process's task_struct is accessible
through the routine get_current, which does assembly magic to access the struct, which is stored
at the bottom of the kernel stack. When running in kernel mode without process context, the struct at the bottom
of the kernel stack refers to the idle task.
"Software interrupt" routines running when the software interrupt is received at a return to user space
or after a hardware interrupt. They do not run concurrently on multiple CPUs, but are dynamically
allocatable. See softirq and bottom-half handler.
- task queue
- A linked list of processes or threads. Tasks, for example, can queue on bottom-half handlers waiting
to be executed. See wait queue.
- Tagged Command Queueing. This facility allows the kernel to send several asynchronous requests at once to a
block device. The device can service the requests as it sees fit; the advantage being that the device can arrange
the transfers as best suits the device hardware. TCQ works for both reads and writes.
- thundering herd
- Consider a set of tasks on a wait queue, waiting for some condition. When the event occurs, it is often the case that
only of the tasks has real work to do. Older kernels always woke up all tasks on the wait queue - this meant that each task
immediately vied for scheduling - the "thundering herd". More recent kernels have added the TASK_EXCLUSIVE flag to tasks in
the wait queue. This ensures that one of the tasks with this flag set will be the only one to be woken up.
- TLB flushing
- Flagging part or all of a CPU's translation lookaside buffer (which is a virtual-address-indexed cache
of physical addresses) as invalid. Depending on hardware, the whole TLB may be flushed at any one time,
or just a part.
- One of those cyclical lkml threads. O/S's such as Solaris have a "tmpfs" filesystem, providing a fs interface using main memory as storage.
This differs from the usual ram disk in that it is swappable. The fs is often used for /tmp, which provides performance gains. Under Linux,
ext2fs is already faster than Solaris' tmpfs, so no-one has ever implemented this.
- The process by which a compiler has special code to deal with code presented as data within
a binary's segment, which is to be executed, amongst other scenarios. A short routine is called
before the real procedure which sets up the stack environment correctly. This facility is used in languages such as
Ada, and complicates provision of non-executable stack support at kernel level (useful for
preventing simplistic buffer over-runs). FIXME
- Time Stamp Counter, a very high resolution counter on the Intel platform.
- two-hand algorithm
- VM algorithm. A two-handed clock algorithm has two "clock hands" sweeping through pages, performing ageing
and other activities. Imagine one hand ageing pages, whereas the later one decides to mark pages as unused.
- Common abbreviation used in net code for "transmit".
- Universal Asynchronous Receiver/Transmitter. UARTs are hardware devices that convert byte-wide data streams in to serial (bit-wide) data streams.
- Unified Buffer Cache - the combination of the buffer and page caches to use the same memory pool (disk I/O and VM use the same pool), and data I/O utilises the page cache.
- Intel standard for USB controllers.
- union mount
- Each mount structure is represented in the kernel by a dentry tree. Sharing of a tree allows
union mount, where two separate mounts are accessible under one mount point.
- For micro-operation, or the basic set of CPU operations which are generated from machine code in microcode
processors. Intel P6 processors and above have the facility for upgrading the microcode (stepping of the processor).
- USB Request Block, or the message struct for Universal Serial Bus.
- user-mode port
- A patched version of the kernel modified to run on top of user-space. This is intended to be used for
testing, experimentation and debugging. See http://user-mode-linux.sf.net.
- Use The Source, Luke !
- Video 4 Linux
- versioned filesystem
- A filesystem that tracks changes to files, for example by storing the entire previous file on a change in data.
- VFS is the filesystem-independent in-kernel interface used to access each particular filesystem supported by
- Very Long Instruction Word, instruction sets with large-sized complex instructions encoded into one instruction.
- Allocates a virtual address space. Can be slow for some purposes as it requires a TLB flush
on all processors. Does not ensure physical address contiguity, but can allocate
abritrary sized areas. Also requires (along with related vfree()) the big kernel lock.
- Commercial virtual O/S software. If the modules won't compile for you, DO NOT complain to linux-kernel ! And see FreeMWare.
- From the C keyword. Used throughout the source and in discussion, signifying a variable or inline asm
that cannot be optimised as usual by the compiler. This usually means the compiler shouldn't store the
variable in a register, as another task can alter the value of the real memory contents at the time. For example,
this is necessary with I/O address spaces, where the hardware device can fill up a buffer asynchronously to a
reading task. Variables which are changed from both interrupts and other contexts are also volatile.
- wait queue
- A linked list of tasks (processes or threads) waiting on some specific condition. For example,
a locked page may have a wait queue of tasks waiting for I/O on that page to
complete (that is, !PageLocked(page)).
- wall time
- Real time, that is, the sort you would expect to find on a clock on your wall ...
- wandering logs
- On journalling filesystems, the metadata must be logged. A wandering
log system spreads log blocks throughout the filesystem for performance, rather than using a static log area on the disk.
- warm cache
- A cache whose contents are not invalid and are useful to the task to be run next. It is important to
benchmark code in both cold-cache and warm-cache circumstances.
- write bomb logic
- Code to protect interactivity from being harmed by large writes. FIXME
- write behind
- A.K.A. delayed write, asynchronously writing after the write request was made.
- External Data Representation, the marshalled form of RPC communications. XDR deals with sizing and endianness issues.
- "Execute in-place". For normal storage devices, executables are loaded into main memory before execution. For devices like
a ram disk or flash system, it is possible to execute the program straight from the device, without duplication.
- Networking where no data is copied to/from userspace during network transmit or receive. This is often touted as a universally
good thing, but in fact it is a trade-off between the cost of the copy, and the cost of setting up page tables etc. such that
user-space can read the data.
- Zero-fill-on-demand, a page filled with zeros for which no space has yet been allocated.
- A completed process whose parent has not waited for it. When a child exits, the return status must remain available to the parent,
as it may later do a wait4(). A zombie holds its place in the process table until the parent waits for it.
- A particular area of free memory. Zoned allocators differentiate between intended uses, and are generally
used to model different characteristics of the memory. For example, on the x86 there is only 16Mb of DMA-able
memory; the zoned allocator will try to save DMA pages for processes specifically requesting ZONE_DMA memory.
There are currently three zone types: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM (see highmem).
Several people have contributed (some indirectly) to this glossary: the #kernelnewbies crew, Cesar Eduardo Barros, Trond Myklebust, and the NET 3 Howto.