Linux Address Space
Linux processes work with virtual memory rather than with physical memory directly. Each process behaves as if it were the only process running on the system and therefore had all of the system’s memory to itself.
Different processes may use the same virtual addresses without colliding, because the kernel maps each process’s virtual addresses to its own physical memory. A process shares its virtual address space with others when it spawns threads (threads of execution).
A process does not have permission to access the parts of the address space that are reserved by the kernel. It can access a memory address only if the address lies in a valid memory area, and each area carries permissions (readable, writable, executable) that the process must respect. If the process violates them, the kernel sends it a SIGSEGV signal, which by default terminates the process; the shell typically reports this as a Segmentation Fault.
Memory areas may have the following content:
- Executable file’s code, which is known as the text section
- Executable file’s initialized global variables, which is known as the data section
- Executable file’s uninitialized global variables, which is known as the bss (block started by symbol) section
- Stack
- Heap
Memory Descriptor:
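In the Linux kernel, a process’s address space is described by the following data structure, the memory descriptor. The listing reflects an older kernel release, and the fields differ in current kernels.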
    struct mm_struct {
        struct vm_area_struct *mmap;  /* list of memory areas */
        struct rb_root mm_rb;  /* red-black tree of VMAs */
        struct vm_area_struct *mmap_cache;  /* last used memory area */
        unsigned long free_area_cache;  /* 1st address space hole */
        pgd_t *pgd;  /* page global directory */
        atomic_t mm_users;  /* address space users */
        atomic_t mm_count;  /* primary usage counter */
        int map_count;  /* number of memory areas */
        struct rw_semaphore mmap_sem;  /* memory area semaphore */
        spinlock_t page_table_lock;  /* page table lock */
        struct list_head mmlist;  /* list of all mm_structs */
        unsigned long start_code;  /* start address of code */
        unsigned long end_code;  /* final address of code */
        unsigned long start_data;  /* start address of data */
        unsigned long end_data;  /* final address of data */
        unsigned long start_brk;  /* start address of heap */
        unsigned long brk;  /* final address of heap */
        unsigned long start_stack;  /* start address of stack */
        unsigned long arg_start;  /* start of arguments */
        unsigned long arg_end;  /* end of arguments */
        unsigned long env_start;  /* start of environment */
        unsigned long env_end;  /* end of environment */
        unsigned long rss;  /* pages allocated */
        unsigned long total_vm;  /* total number of pages */
        unsigned long locked_vm;  /* number of locked pages */
        unsigned long def_flags;  /* default access flags */
        unsigned long cpu_vm_mask;  /* lazy TLB switch mask */
        unsigned long swap_address;  /* last scanned address */
        unsigned dumpable:1;  /* can this mm core dump? */
        int used_hugetlb;  /* used hugetlb pages? */
        mm_context_t context;  /* arch-specific data */
        int core_waiters;  /* thread core dump waiters */
        struct completion *core_startup_done;  /* core start completion */
        struct completion core_done;  /* core end completion */
        rwlock_t ioctx_list_lock;  /* AIO I/O list lock */
        struct kioctx *ioctx_list;  /* AIO I/O list */
        struct kioctx default_kioctx;  /* AIO default I/O context */
    };
The mm_users field counts the processes or threads that share the address space. The mmap and mm_rb fields both hold all of the memory areas in the address space; they contain the same information in two representations. mmap is a linked list, which suits simple traversal of every area, whereas mm_rb is a red-black tree, which suits searching for the area that contains a given address.
The kernel represents a process’s address space with the memory descriptor, and the mm field of the process’s task_struct points to it.
    struct task_struct {
        volatile long state;  /* -1 unrunnable, 0 runnable, >0 stopped */
        long counter;
        long priority;
        unsigned long signal;
        unsigned long blocked;  /* bitmap of masked signals */
        unsigned long flags;  /* per process flags, defined below */
        int errno;
        long debugreg[8];  /* Hardware debugging registers */
        struct exec_domain *exec_domain;
        struct linux_binfmt *binfmt;
        struct task_struct *next_task, *prev_task;
        struct task_struct *next_run, *prev_run;
        unsigned long saved_kernel_stack;
        unsigned long kernel_stack_page;
        int exit_code, exit_signal;
        unsigned long personality;
        int dumpable:1;
        int did_exec:1;
        int pid;
        int pgrp;
        int tty_old_pgrp;
        int session;
        /* boolean value for session group leader */
        int leader;
        int groups[NGROUPS];
        struct task_struct *p_opptr, *p_pptr, *p_cptr,
                           *p_ysptr, *p_osptr;
        struct wait_queue *wait_chldexit;
        unsigned short uid,euid,suid,fsuid;
        unsigned short gid,egid,sgid,fsgid;
        unsigned long timeout, policy, rt_priority;
        unsigned long it_real_value, it_prof_value, it_virt_value;
        unsigned long it_real_incr, it_prof_incr, it_virt_incr;
        struct timer_list real_timer;
        long utime, stime, cutime, cstime, start_time;
        unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
        int swappable:1;
        unsigned long swap_address;
        unsigned long old_maj_flt;  /* old value of maj_flt */
        unsigned long dec_flt;  /* page fault count of the last time */
        unsigned long swap_cnt;  /* number of pages to swap on next pass */
        struct rlimit rlim[RLIM_NLIMITS];
        unsigned short used_math;
        char comm[16];
        int link_count;
        struct tty_struct *tty;
        struct sem_undo *semundo;
        struct sem_queue *semsleeping;
        struct desc_struct *ldt;
        struct thread_struct tss;
        struct fs_struct *fs;
        struct files_struct *files;
        struct mm_struct *mm;
        struct signal_struct *sig;
    #ifdef __SMP__
        int processor;
        int last_processor;
        int lock_depth;
    #endif
    };
current->mm points to the memory descriptor of the current process. During fork(), copy_mm() copies the parent’s memory descriptor to the child, so each process normally receives a unique mm_struct and hence a unique address space. When clone() is called with the CLONE_VM flag set, the child shares the parent’s address space instead, and such processes are known as threads. This is why, to the Linux kernel, a thread is just another process that happens to share some of its resources, such as the address space, with other processes.
When a process exits, it calls exit_mm(), which does some housekeeping, updates statistics, and then calls mmput() to decrement the mm_users count. When mm_users reaches 0, mmdrop() decrements mm_count, and when mm_count also reaches 0, free_mm() releases the mm_struct.
Virtual memory areas
Memory areas are represented in the kernel by the vm_area_struct structure and are also called virtual memory areas (VMAs).
    struct vm_area_struct {
        struct mm_struct *vm_mm;  /* associated mm_struct */
        unsigned long vm_start;  /* VMA start, inclusive */
        unsigned long vm_end;  /* VMA end, exclusive */
        struct vm_area_struct *vm_next;  /* list of VMA's */
        pgprot_t vm_page_prot;  /* access permissions */
        unsigned long vm_flags;  /* flags */
        struct rb_node vm_rb;  /* VMA's node in the tree */
        union {  /* links to address_space->i_mmap or i_mmap_nonlinear */
            struct {
                struct list_head list;
                void *parent;
                struct vm_area_struct *head;
            } vm_set;
            struct prio_tree_node prio_tree_node;
        } shared;
        struct list_head anon_vma_node;  /* anon_vma entry */
        struct anon_vma *anon_vma;  /* anonymous VMA object */
        struct vm_operations_struct *vm_ops;  /* associated ops */
        unsigned long vm_pgoff;  /* offset within file */
        struct file *vm_file;  /* mapped file, if any */
        void *vm_private_data;  /* private data */
    };
A vm_area_struct describes a single memory area over a contiguous interval of the address space, from vm_start (inclusive) to vm_end (exclusive). Each memory area has its own permissions and flags, which also indicate the type of the area, for example a memory-mapped file or the process’s user-space stack.
The vm_mm field points to the mm_struct that the area belongs to, so every VMA is unique to one address space.
Although applications operate on virtual addresses, the processor ultimately accesses physical memory. Therefore, whenever an application accesses a virtual address, the address is first translated to the physical address where the data actually resides. This lookup is done via page tables. The virtual address is split into chunks, and each chunk is used as an index into a table. The indexed entry points either to another table or to the physical page.
In the kernel described here, Linux maintains 3 levels of page tables (newer kernels add further levels). Multiple levels keep the tables sparse, so address ranges that are not in use need no table entries. Linux uses the 3-level layout even on architectures that have no hardware support for it, so that the rest of the kernel can rely on a single scheme.
The top page table is known as the Page Global Directory (PGD), which is an array of pgd_t entries (an unsigned long on most architectures). Each entry in the PGD points to a PMD.
The second page table is known as the Page Middle Directory (PMD), whose entries point to tables of PTEs.
The Page Table Entries (PTEs) point to the actual physical pages.
Every process has its own page tables, and the pgd field of its memory descriptor points to the process’s Page Global Directory.
Even with 3 levels of page tables, a lookup can only be so fast, because every translation has to walk several tables. To improve on this, most processors implement a Translation Lookaside Buffer (TLB), which acts as a hardware cache of virtual-to-physical mappings. On a TLB hit, the physical address is returned directly from the cache; on a miss, the page tables are walked to resolve the mapping.
Most of the material in this article is based on the book Linux Kernel Development by Robert Love. It is a must-read for anybody who wishes to understand the inner workings of the Linux kernel.