This is a tiny writeup (mostly for me, such that I don’t forget about it :) about how a simple, 64-bit Linux process is laid out in memory and how a process’s memory consumption can be analyzed. A future article will then focus on the native memory layout and consumption of a Java process.
The basics
The process’s memory space layout is platform specific. On current x86_64 CPUs the memory is laid out according to the "virtual memory map with 4 level page tables" which is specified in the Linux kernel documentation under Documentation/x86/x86_64/mm.txt:
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
hole caused by [47:63] sign extension
ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory (1)
...
ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
As you can see, this gives us a virtual address space of 48 bits, of which 47 bits can be used by user space programs to address at most 128 TB of memory. Because this article focuses on the user space, I’ve omitted most of the predefined kernel space regions. It’s just interesting to see ➊ that the kernel maps the complete physical memory into the kernel address space for convenience. Notice that current x86_64 CPUs can only address a maximum of 64 TB because they limit physical addresses to 46 bits (i.e. a program can use a virtual address space of 128 TB but only 64 TB out of the 128 can be physically mapped). Future versions of x86_64 CPUs will be able to use 57 bits for the virtual address space (resulting in a 56 bit or 64 PiB user space) and up to 52 bits for physical addresses (resulting in up to 4 PiB of physical memory). This new hardware generation requires an extended, 5-level page table as described in Intel’s "5-Level Paging and 5-Level EPT" white paper and recently implemented in the Linux 4.12 kernel. The resulting, new virtual memory map is described in the previously mentioned mm.txt kernel documentation file as well.
Hello world…
As a baseline example we take a slightly modified "Hello world" program as shown in Listing 1. We simply add a call to getchar() at the end of the program ➊ such that we can easily analyze its memory layout.
#include <stdio.h>

int main(int argc, char **argv) {
  printf("Hello world\n");
  getchar(); (1)
}
As demonstrated in Listing 2, we will use the pmap utility to display the memory layout of a process. pmap is actually just a front-end for the memory map of a process as recorded in the smaps file of the proc file system.
$ pmap -X `pidof hello`
14792: ./examples/c/hello
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous Swap Locked Mapping
00400000 r-xp 00000000 00:26 2118 4 4 4 4 0 0 0 hello
00600000 r--p 00000000 00:26 2118 4 4 4 4 4 0 0 hello
00601000 rw-p 00001000 00:26 2118 4 4 4 4 4 0 0 hello
7ffff7a11000 r-xp 00000000 08:01 8106 1784 952 5 952 0 0 0 libc-2.19.so
7ffff7bcf000 ---p 001be000 08:01 8106 2048 0 0 0 0 0 0 libc-2.19.so
7ffff7dcf000 r--p 001be000 08:01 8106 16 16 16 16 16 0 0 libc-2.19.so
7ffff7dd3000 rw-p 001c2000 08:01 8106 8 8 8 8 8 0 0 libc-2.19.so
7ffff7dd5000 rw-p 00000000 00:00 0 20 12 12 12 12 0 0
7ffff7dda000 r-xp 00000000 08:01 8176 140 140 0 140 0 0 0 ld-2.19.so
7ffff7fdd000 rw-p 00000000 00:00 0 12 12 12 12 12 0 0
7ffff7ff4000 rw-p 00000000 00:00 0 16 12 12 12 12 0 0
7ffff7ff8000 r--p 00000000 00:00 0 8 0 0 0 0 0 0 [vvar] (4)
7ffff7ffa000 r-xp 00000000 00:00 0 8 4 0 4 0 0 0 [vdso] (3)
7ffff7ffc000 r--p 00022000 08:01 8176 4 4 4 4 4 0 0 ld-2.19.so
7ffff7ffd000 rw-p 00023000 08:01 8176 4 4 4 4 4 0 0 ld-2.19.so
7ffff7ffe000 rw-p 00000000 00:00 0 4 4 4 4 4 0 0
7ffffffde000 rw-p 00000000 00:00 0 136 8 8 8 8 0 0 [stack]
7ffffffff000 ---p 00000000 00:00 0 1 0 0 0 0 0 0 [kernel-guard-page] (1)
ffffffffff600000 r-xp 00000000 00:00 0 4 0 0 0 0 0 0 [vsyscall] (2)
==== ==== === ========== ========= ==== ======
4224 1188 97 1188 88 0 0 KB
If called only with a PID argument, the output of pmap will be more compact and only contain the Address, Perm, Size and Mapping columns. If called with -X or -XX, it will display more of / all of the information exposed by the kernel through /proc/PID/smaps. This content may vary across kernel versions. In our concrete example (with -X) the Address column contains the start address of a mapping. Perm displays the permission bits (r = read, w = write, x = execute, p/s = private/shared mapping). Offset contains the offset into the mapped file and will be zero for all non-file (i.e. anonymous) mappings. The same holds true for the Device and Inode columns which are relevant for file mappings only. The Size column contains the size of the memory mapping in kilobytes (notice however that the kernel can only map memory in chunks which are a multiple of the "kernel page size" - normally 4 KB on x86). The RSS ("Resident Set Size") column displays the amount of memory which is actually in RAM while the PSS ("Proportional Set Size") column is the amount of memory in RAM divided by the number of processes sharing it.
As you can see, all the lines excluding the last one describe memory regions in user space. Notice that the last page of the user space (i.e. 0x7ffffffff000-0x7fffffffffff ➊) will neither be displayed by pmap nor is it recorded in the corresponding /proc/<pid>/smaps file. As described in the kernel sources, this special guard page (i.e. 4 KB) is an implementation detail only required on x86_64 CPUs (and included here for completeness).
The last line (i.e. 0xffffffffff600000…[vsyscall] ➋) refers to a 4 KB page in the kernel space which has been mapped into the user space. It is a relic of an early and now deprecated implementation of so-called "virtual system calls". Virtual system calls are used to speed up the implementation of certain system calls like gettimeofday() which only need read access to certain kernel variables. The idea is to implement these system calls such that they don’t have to switch into kernel mode, because the kernel maps the required data read-only into the user space memory. This is exactly what the [vsyscall] mapping is used for: it contains the code and the variables for the mentioned virtual system calls.
Unfortunately the size of the [vsyscall] region is restricted to 4 KB and its location is fixed, which is nowadays considered an unnecessary security risk. The vsyscall implementation has therefore been deprecated in favour of the new vDSO (i.e. "virtual Dynamic Shared Object") implementation. This is a small shared library which is exported by the kernel (see vdso(7)) and mapped into the user address space ➌. With vDSO the kernel variables are mapped into an extra, read-only memory region called [vvar] ➍. Both these regions have no size limitations and are subject to ASLR.
In order to get comparable results, we have to switch off Address Space Layout Randomization (ASLR) by executing sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space". ASLR was added to the Linux kernel back in version 2.6. It is a technique for hardening the OS against various attacks (e.g. the "return to libc" attack) which exploit memory corruption vulnerabilities in order to jump to well-known code in the user’s address space. Randomizing the addresses where shared libraries, the stack, the heap and even the executable itself (if compiled with -fPIE and linked with -pie) are being loaded makes it much harder to effectively execute such attacks. But for our analysis, as well as during development (e.g. debugging), it is of course much more convenient if the addresses of an executable remain constant across different runs.
RSS
Files are read by the Linux kernel with a certain read-ahead which is device dependent and which can be configured with the blockdev(8) command.
$ sudo blockdev --getra --getfra /dev/sda1
256
256
According to the man page, the output is in 512-byte sectors so a value of 256 means a read-ahead of 128 KB.
Where it all starts…
Let’s now start our journey with the execution of the standard C library function execve() which in turn executes the execve system call. execve is usually called right after fork() and replaces (i.e. overlays) the old program’s stack, heap and data segments with those of the new program (see the man page of execve(2)).