Linux Address Space ::

Linux processes work with virtual memory rather than with physical memory directly. Each process is given the illusion that it is the only process running on the system and therefore has the whole address space to itself.

Different processes may use the same virtual addresses without colliding, because the kernel maps each process's virtual addresses to physical memory separately. One case in which an address space is shared is when a process spawns threads (threads of execution).
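As a small illustration of this separation, the sketch below forks a child that writes to a global variable. Both processes use the same virtual address for the variable, yet the parent never sees the child's write. The function and variable names are made up for this example, and it assumes a Linux or other POSIX system.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 1;  /* lives at the same virtual address in parent and child */

/* Returns the parent's view of `value` after a forked child has changed
 * its own copy, or -1 if something went wrong. */
int value_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: same virtual address, but the write lands in the
         * child's own physical page. */
        value = 42;
        _exit(value == 42 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;  /* still 1: the child's write stayed private to the child */
}
```

Calling `value_after_child_write()` from `main` returns 1 even though the child observed 42 at the very same address.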

Parts of the address space are reserved by the kernel, and a process has no permission to access them. A process can access a memory address only if it lies in a valid memory area, and each area carries permissions (readable, writable, executable) that the process must respect. If a process accesses an address outside a valid area, or uses a valid area in a way its permissions forbid, the kernel sends it the SIGSEGV signal, which by default kills the process (the shell then reports "Segmentation fault").

Memory areas may have the following content:

Executable file’s code, which is known as the text section
Executable file’s initialized global variables, which is known as the data section
Executable file’s uninitialized global variables, which is known as the bss (block started by symbol) section
Stack
Heap
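To make the list above concrete, the sketch below collects one address from each kind of area in a running program. The struct and function names are invented for this example. Which area an address falls into is the interesting part; the exact numbers vary from run to run.

```c
#include <stdint.h>
#include <stdlib.h>

int initialized_global = 7;  /* data section */
int uninitialized_global;    /* bss section: zero-filled when the program loads */

struct layout {
    uintptr_t text, data, bss, heap, stack;
};

/* Samples one address from each kind of memory area. */
struct layout sample_layout(void)
{
    int on_stack = 0;
    int *on_heap = malloc(sizeof *on_heap);

    struct layout l = {
        .text  = (uintptr_t)sample_layout,          /* code of this function */
        .data  = (uintptr_t)&initialized_global,
        .bss   = (uintptr_t)&uninitialized_global,
        .heap  = (uintptr_t)on_heap,                /* 0 if malloc failed */
        .stack = (uintptr_t)&on_stack,
    };

    free(on_heap);
    return l;
}
```

Printing the five fields in hexadecimal on a typical Linux system shows the text, data and bss addresses close together, with the heap and stack addresses far away from them.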

Memory Descriptor:

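In the Linux kernel, a process's address space is described by the following data structure, the memory descriptor. The listing is the version found in older 2.6-series kernels; the fields differ in later versions.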

struct mm\_struct {  
        struct vm\_area\_struct  \*mmap;               /\* list of memory areas \*/  
        struct rb\_root         mm\_rb;               /\* red-black tree of VMAs \*/  
        struct vm\_area\_struct  \*mmap\_cache;         /\* last used memory area \*/  
        unsigned long          free\_area\_cache;     /\* 1st address space hole \*/  
        pgd\_t                  \*pgd;                /\* page global directory \*/  
        atomic\_t               mm\_users;            /\* address space users \*/  
        atomic\_t               mm\_count;            /\* primary usage counter \*/  
        int                    map\_count;           /\* number of memory areas \*/  
        struct rw\_semaphore    mmap\_sem;            /\* memory area semaphore \*/  
        spinlock\_t             page\_table\_lock;     /\* page table lock \*/  
        struct list\_head       mmlist;              /\* list of all mm\_structs \*/  
        unsigned long          start\_code;          /\* start address of code \*/  
        unsigned long          end\_code;            /\* final address of code \*/  
        unsigned long          start\_data;          /\* start address of data \*/  
        unsigned long          end\_data;            /\* final address of data \*/  
        unsigned long          start\_brk;           /\* start address of heap \*/  
        unsigned long          brk;                 /\* final address of heap \*/  
        unsigned long          start\_stack;         /\* start address of stack \*/  
        unsigned long          arg\_start;           /\* start of arguments \*/  
        unsigned long          arg\_end;             /\* end of arguments \*/  
        unsigned long          env\_start;           /\* start of environment \*/  
        unsigned long          env\_end;             /\* end of environment \*/  
        unsigned long          rss;                 /\* pages allocated \*/  
        unsigned long          total\_vm;            /\* total number of pages \*/  
        unsigned long          locked\_vm;           /\* number of locked pages \*/  
        unsigned long          def\_flags;           /\* default access flags \*/  
        unsigned long          cpu\_vm\_mask;         /\* lazy TLB switch mask \*/  
        unsigned long          swap\_address;        /\* last scanned address \*/  
        unsigned               dumpable:1;          /\* can this mm core dump? \*/  
        int                    used\_hugetlb;        /\* used hugetlb pages? \*/  
        mm\_context\_t           context;             /\* arch-specific data \*/  
        int                    core\_waiters;        /\* thread core dump waiters \*/  
        struct completion      \*core\_startup\_done;  /\* core start completion \*/  
        struct completion      core\_done;           /\* core end completion \*/  
        rwlock\_t               ioctx\_list\_lock;     /\* AIO I/O list lock \*/  
        struct kioctx          \*ioctx\_list;         /\* AIO I/O list \*/  
        struct kioctx          default\_kioctx;      /\* AIO default I/O context \*/  
};

The mm_users field holds the number of processes (threads) using the address space. The mmap and mm_rb fields both describe the memory areas in the address space: they hold the same information in two representations. mmap is a linked list, whereas mm_rb is a red-black tree. The list allows simple traversal of all the areas, while the tree is better suited to searching for the area that holds a given address.
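The trade-off between the two representations can be sketched in user space. The example below is a simplification, not kernel code: a sorted array stands in for both structures, a plain loop plays the role of walking the list, and a binary search plays the role of the tree lookup. The names `struct area`, `total_size` and `find_area` are hypothetical.

```c
#include <stddef.h>

/* One memory area covering [start, end), like vm_start and vm_end. */
struct area {
    unsigned long start;
    unsigned long end;
};

/* Traversal: visit every area in order, as a walk over the list would. */
unsigned long total_size(const struct area *areas, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += areas[i].end - areas[i].start;
    return total;
}

/* Search: find the area containing addr in O(log n) steps.
 * Assumes the areas are sorted by address and do not overlap. */
const struct area *find_area(const struct area *areas, size_t n,
                             unsigned long addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < areas[mid].start)
            hi = mid;               /* look in the lower half */
        else if (addr >= areas[mid].end)
            lo = mid + 1;           /* look in the upper half */
        else
            return &areas[mid];     /* start <= addr < end */
    }
    return NULL;                    /* addr falls in a hole */
}
```

With areas `[0x1000, 0x3000)`, `[0x5000, 0x6000)` and `[0x8000, 0xc000)`, looking up `0x5000` returns the second area, while `0x3000` returns `NULL` because the end address is exclusive.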

The kernel represents a process's address space with the memory descriptor. Each process's task_struct points to its memory descriptor through the mm field. The task_struct listing below comes from an early kernel version.

struct task\_struct {

  volatile long        state;          /\* -1 unrunnable, 0 runnable, >0 stopped \*/  
  long                 counter;  
  long                 priority;  
  unsigned             long signal;  
  unsigned             long blocked;   /\* bitmap of masked signals \*/  
  unsigned             long flags;     /\* per process flags, defined below \*/  
  int errno;  
  long                 debugreg\[8\];    /\* Hardware debugging registers \*/  
  struct exec\_domain   \*exec\_domain;

  struct linux\_binfmt  \*binfmt;  
  struct task\_struct   \*next\_task, \*prev\_task;  
  struct task\_struct   \*next\_run,  \*prev\_run;  
  unsigned long        saved\_kernel\_stack;  
  unsigned long        kernel\_stack\_page;  
  int                  exit\_code, exit\_signal;

  unsigned long        personality;  
  int                  dumpable:1;  
  int                  did\_exec:1;  
  int                  pid;  
  int                  pgrp;  
  int                  tty\_old\_pgrp;  
  int                  session;  
  /\* boolean value for session group leader \*/  
  int                  leader;  
  int                  groups\[NGROUPS\];

  struct task\_struct   \*p\_opptr, \*p\_pptr, \*p\_cptr,   
                       \*p\_ysptr, \*p\_osptr;  
  struct wait\_queue    \*wait\_chldexit;    
  unsigned short       uid,euid,suid,fsuid;  
  unsigned short       gid,egid,sgid,fsgid;  
  unsigned long        timeout, policy, rt\_priority;  
  unsigned long        it\_real\_value, it\_prof\_value, it\_virt\_value;  
  unsigned long        it\_real\_incr, it\_prof\_incr, it\_virt\_incr;  
  struct timer\_list    real\_timer;  
  long                 utime, stime, cutime, cstime, start\_time;

  unsigned long        min\_flt, maj\_flt, nswap, cmin\_flt, cmaj\_flt, cnswap;  
  int swappable:1;  
  unsigned long        swap\_address;  
  unsigned long        old\_maj\_flt;    /\* old value of maj\_flt \*/  
  unsigned long        dec\_flt;        /\* page fault count of the last time \*/  
  unsigned long        swap\_cnt;       /\* number of pages to swap on next pass \*/

  struct rlimit        rlim\[RLIM\_NLIMITS\];  
  unsigned short       used\_math;  
  char                 comm\[16\];

  int                  link\_count;  
  struct tty\_struct    \*tty;  
  struct sem\_undo      \*semundo;  
  struct sem\_queue     \*semsleeping;  
  struct desc\_struct \*ldt;  
  struct thread\_struct tss;  
  struct fs\_struct     \*fs;  
  struct files\_struct  \*files;  
  struct mm\_struct     \*mm;  
  struct signal\_struct \*sig;  
#ifdef \_\_SMP\_\_  
  int                  processor;  
  int                  last\_processor;  
  int                  lock\_depth;       
#endif     
};

current->mm points to the memory descriptor of the running process. During fork(), copy_mm() copies the parent’s memory descriptor to the child, so each process normally receives its own mm_struct and hence its own address space. When the address space is shared instead, the processes are known as threads; this is done by calling clone() with the CLONE_VM flag set. This is why, to the Linux kernel, a thread is just another process that happens to share its address space (and some other resources) with another process.
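The effect of a shared address space is easy to see from user space. The sketch below uses `pthread_create()`, which on Linux is built on clone() with CLONE_VM among other flags, rather than calling clone() directly. The function and variable names are made up for this example.

```c
#include <pthread.h>
#include <stddef.h>

static int value = 1;  /* one copy, visible to every thread of the process */

static void *writer(void *arg)
{
    (void)arg;
    value = 42;  /* writes into the address space shared with the creator */
    return NULL;
}

/* Returns the creating thread's view of `value` after another thread
 * has written to it, or -1 on error. */
int value_after_thread_write(void)
{
    pthread_t thread;
    if (pthread_create(&thread, NULL, writer, NULL) != 0)
        return -1;
    if (pthread_join(thread, NULL) != 0)  /* join also orders the accesses */
        return -1;
    return value;  /* 42: the write is visible because the memory is shared */
}
```

Unlike a child created with fork(), the new thread's write is seen by its creator. On older glibc versions the program may need to be linked with `-pthread`.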

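When the process exits, exit_mm() is called. It does some housekeeping and statistics updates and then calls mmput(), which decrements the mm_users count. When mm_users reaches zero, mmdrop() decrements mm_count, and when that count also reaches zero, free_mm() releases the mm_struct.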
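The release path relies on the two counters from the memory descriptor, mm_users and mm_count. In the kernels described by the book cited at the end of this article, all users of an address space together hold a single reference in mm_count, and the descriptor is freed only when mm_count drops to zero. The toy model below mimics that scheme with plain integers. The `toy_` names are hypothetical, and the real kernel uses atomic operations.

```c
#include <stdbool.h>

struct toy_mm {
    int  users;      /* stands in for mm_users */
    int  count;      /* stands in for mm_count */
    bool torn_down;  /* memory areas have been released */
    bool freed;      /* the descriptor itself has been released */
};

void toy_mm_init(struct toy_mm *mm)
{
    mm->users = 1;
    mm->count = 1;   /* all users together hold one reference */
    mm->torn_down = false;
    mm->freed = false;
}

/* Drops one reference on the descriptor itself. */
void toy_mmdrop(struct toy_mm *mm)
{
    if (--mm->count == 0)
        mm->freed = true;        /* the point where free_mm() would run */
}

/* Drops one user of the address space. */
void toy_mmput(struct toy_mm *mm)
{
    if (--mm->users == 0) {
        mm->torn_down = true;    /* last user gone: release the areas */
        toy_mmdrop(mm);          /* give back the users' shared reference */
    }
}
```

With two users, the first `toy_mmput()` changes nothing visible and the second one tears down and frees the descriptor. If something else holds an extra `count` reference, the descriptor survives until the matching `toy_mmdrop()`.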

Virtual memory areas

Memory areas are represented in the kernel code by the vm_area_struct structure and are also called virtual memory areas (VMAs).

struct vm\_area\_struct {  
        struct mm\_struct             \*vm\_mm;        /\* associated mm\_struct \*/  
        unsigned long                vm\_start;      /\* VMA start, inclusive \*/  
        unsigned long                vm\_end;        /\* VMA end , exclusive \*/  
        struct vm\_area\_struct        \*vm\_next;      /\* list of VMA's \*/  
        pgprot\_t                     vm\_page\_prot;  /\* access permissions \*/  
        unsigned long                vm\_flags;      /\* flags \*/  
        struct rb\_node               vm\_rb;         /\* VMA's node in the tree \*/  
        union {         /\* links to address\_space->i\_mmap or i\_mmap\_nonlinear \*/  
                struct {  
                        struct list\_head        list;  
                        void                    \*parent;  
                        struct vm\_area\_struct   \*head;  
                } vm\_set;  
                struct prio\_tree\_node prio\_tree\_node;  
        } shared;  
        struct list\_head             anon\_vma\_node;     /\* anon\_vma entry \*/  
        struct anon\_vma              \*anon\_vma;         /\* anonymous VMA object \*/  
        struct vm\_operations\_struct  \*vm\_ops;           /\* associated ops \*/  
        unsigned long                vm\_pgoff;          /\* offset within file \*/  
        struct file                  \*vm\_file;          /\* mapped file, if any \*/  
        void                         \*vm\_private\_data;  /\* private data \*/  
};

Each vm_area_struct describes a single memory area over a contiguous interval of the address space. Each memory area has associated permissions and flags that denote the type of the area, for example a memory-mapped file or the process’s user-space stack.

The vm_mm field points to the mm_struct that the area belongs to, so each VMA is tied to exactly one address space. Even if two processes map the same file, each of them gets its own vm_area_struct for it.
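A process can inspect its own memory areas through `/proc/self/maps`, where each line describes one area with its start address, end address and permissions. The sketch below scans that file for the area containing a given address. The function name is made up for this example, and it assumes a Linux system with `/proc` mounted.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Looks up the memory area of the calling process that contains addr.
 * On success returns 1 and fills in the area's bounds and its permission
 * string (for example "rw-p"); returns 0 if no area was found. */
int find_own_area(const void *addr, unsigned long *start,
                  unsigned long *end, char perms[5])
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return 0;

    unsigned long target = (unsigned long)(uintptr_t)addr;
    char line[1024];
    int found = 0;

    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long s, e;
        char p[5];
        /* Each line begins with "start-end perms", addresses in hex. */
        if (sscanf(line, "%lx-%lx %4s", &s, &e, p) != 3)
            continue;
        if (target >= s && target < e) {  /* end is exclusive, like vm_end */
            *start = s;
            *end = e;
            memcpy(perms, p, sizeof p);
            found = 1;
            break;
        }
    }

    fclose(maps);
    return found;
}
```

Passing the address of a local variable is expected to find the stack area, whose permission string starts with `rw`.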

Although applications operate on virtual addresses, the processor ultimately has to access physical memory. Therefore, whenever an application accesses a virtual address, it is first translated to a physical address, i.e. to where the data actually resides. This lookup is done via page tables. The virtual address is split into chunks, and each chunk is used as an index into a table. The entry found there points either to another table or to the physical page.

In the kernels described here, Linux uses three levels of page tables, which keeps the tables compact even when the address space is sparsely populated. The three-level scheme is used even on architectures whose hardware has fewer levels; there the unused level is folded away at compile time. Newer kernels add further levels on top of the same scheme.

The top-level page table is known as the Page Global Directory (PGD), which contains an array of unsigned long entries. Each entry in the PGD points to a PMD.

The second-level page table is known as the Page Middle Directory (PMD), whose entries in turn point to tables of PTEs.

The Page Table Entries (PTE) point to the actual physical pages.

Every process has its own page tables, and the pgd field of the memory descriptor points to the process's PGD.
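The three-level lookup can be sketched as a toy walk in user space. The layout used here is an assumption chosen for illustration (4 KiB pages and 512 entries per table, so a 12-bit offset and three 9-bit indices); real layouts depend on the architecture. All type and function names are hypothetical, and a zero entry stands in for "not present", where real entries carry flag bits.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                 /* assumed: 4 KiB pages */
#define LEVEL_BITS 9                  /* assumed: 512 entries per table */
#define ENTRIES    (1u << LEVEL_BITS)

/* Index into the table at a given level: 0 = PTE, 1 = PMD, 2 = PGD. */
#define INDEX(va, level) \
    (((va) >> (PAGE_SHIFT + (level) * LEVEL_BITS)) & (ENTRIES - 1))

struct pte_table { uint64_t frame[ENTRIES]; };          /* page base, 0 = none */
struct pmd_table { struct pte_table *pte[ENTRIES]; };
struct pgd_table { struct pmd_table *pmd[ENTRIES]; };

/* Translates a virtual address by walking PGD -> PMD -> PTE.
 * Returns the physical address, or 0 if some level has no entry. */
uint64_t toy_walk(const struct pgd_table *pgd, uint64_t va)
{
    const struct pmd_table *pmd = pgd->pmd[INDEX(va, 2)];
    if (pmd == NULL)
        return 0;

    const struct pte_table *pte = pmd->pte[INDEX(va, 1)];
    if (pte == NULL)
        return 0;

    uint64_t frame = pte->frame[INDEX(va, 0)];
    if (frame == 0)
        return 0;

    /* Physical page base plus the offset within the page. */
    return frame | (va & ((1ull << PAGE_SHIFT) - 1));
}
```

For the address built from indices 3, 5 and 7 and offset `0x123`, a walk through tables that map this page to the frame at `0x42000` yields `0x42123`. Only the tables that are actually used need to exist, which is what makes the multi-level scheme economical for sparse address spaces.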

Even with three levels of page tables, walking them on every memory access would be slow. To avoid this, most processors implement a Translation Lookaside Buffer (TLB), a hardware cache of virtual-to-physical mappings. On a TLB hit, the physical address is returned directly from the cache; on a miss, the page tables are walked to resolve the mapping.
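The hit and miss paths can be modelled in a few lines. The sketch below is a toy software model, not a description of any real processor: it assumes a tiny direct-mapped cache of 16 entries and 4 KiB pages, and `slow_walk()` is a made-up stand-in for the real page table walk.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12   /* assumed: 4 KiB pages */
#define TLB_ENTRIES 16   /* assumed: tiny direct-mapped cache */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;        /* virtual page number */
    uint64_t pfn;        /* physical frame number */
};

struct tlb {
    struct tlb_entry entry[TLB_ENTRIES];
    unsigned long hits, misses;
};

/* Stand-in for the page table walk: an arbitrary fixed mapping. */
static uint64_t slow_walk(uint64_t vpn)
{
    return vpn + 0x100;
}

/* Translates va, consulting the cache first and walking only on a miss. */
uint64_t translate(struct tlb *tlb, uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb->entry[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn) {
        tlb->hits++;                 /* hit: no walk needed */
    } else {
        tlb->misses++;               /* miss: walk, then cache the result */
        e->valid = true;
        e->vpn = vpn;
        e->pfn = slow_walk(vpn);
    }
    return (e->pfn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
}
```

Two accesses to the same page cost one miss and one hit. An access to a different page that maps to the same slot evicts the earlier entry, so returning to the first page misses again.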

Most of the material in this article is drawn from the book Linux Kernel Development by Robert Love. It is a must-read for anybody who wishes to understand the inner workings of the Linux kernel.