This is a tiny writeup (mostly for me, such that I don’t forget about it :) about how a simple, 64-bit Linux process is laid out in memory and how a process’s memory consumption can be analyzed. A future article will then focus on the native memory layout and consumption of a Java process.
The basics
The process’s memory space layout is platform specific. On current x86_64 CPUs the memory is laid out according to the "virtual memory map with 4 level page tables" which is specified in the Linux kernel documentation under Documentation/x86/x86_64/mm.txt:
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
hole caused by [47:63] sign extension
ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory (1)
...
ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
As you can see, this gives us a virtual address space of 48 bits, of which 47 bits can be used by user space programs to address at most 128 TB of memory. Because this article focuses on the user space, I’ve omitted most of the predefined kernel space regions. It’s just interesting to see ➊ that the kernel maps the complete physical memory into the kernel address space for convenience. Notice that current x86_64 CPUs can only address a maximum of 64 TB because they limit physical addresses to 46 bits (i.e. a program can use a virtual address space of 128 TB but only 64 TB out of the 128 can be physically mapped). Future versions of x86_64 CPUs will be able to use 57 bits for the virtual address space (resulting in a 56 bit or 64 PiB user space) and up to 52 bits for physical addresses (resulting in up to 4 PiB of physical memory). This new hardware generation requires an extended, 5-level page table as described in Intel’s "5-Level Paging and 5-Level EPT" white paper and recently implemented in the Linux 4.12 kernel. The resulting, new virtual memory map is described in the previously mentioned mm.txt kernel documentation file as well.
Hello world…
As a baseline example we take a slightly modified "Hello world" program as shown in Listing 1. We simply add a call to getchar() at the end of the program ➊ such that we can easily analyze its memory layout.
#include <stdio.h>

int main(int argc, char **argv) {
  printf("Hello world\n");
  getchar(); (1)
}
As demonstrated in Listing 2, we will use the pmap utility to display the memory layout of a process. pmap is actually just a front-end for the memory map of a process as recorded in the smaps file of the proc file system.
$ pmap -X `pidof hello`
14792: ./examples/c/hello
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous Swap Locked Mapping
00400000 r-xp 00000000 00:26 2118 4 4 4 4 0 0 0 hello
00600000 r--p 00000000 00:26 2118 4 4 4 4 4 0 0 hello
00601000 rw-p 00001000 00:26 2118 4 4 4 4 4 0 0 hello
7ffff7a11000 r-xp 00000000 08:01 8106 1784 952 5 952 0 0 0 libc-2.19.so
7ffff7bcf000 ---p 001be000 08:01 8106 2048 0 0 0 0 0 0 libc-2.19.so
7ffff7dcf000 r--p 001be000 08:01 8106 16 16 16 16 16 0 0 libc-2.19.so
7ffff7dd3000 rw-p 001c2000 08:01 8106 8 8 8 8 8 0 0 libc-2.19.so
7ffff7dd5000 rw-p 00000000 00:00 0 20 12 12 12 12 0 0
7ffff7dda000 r-xp 00000000 08:01 8176 140 140 0 140 0 0 0 ld-2.19.so
7ffff7fdd000 rw-p 00000000 00:00 0 12 12 12 12 12 0 0
7ffff7ff4000 rw-p 00000000 00:00 0 16 12 12 12 12 0 0
7ffff7ff8000 r--p 00000000 00:00 0 8 0 0 0 0 0 0 [vvar] (4)
7ffff7ffa000 r-xp 00000000 00:00 0 8 4 0 4 0 0 0 [vdso] (3)
7ffff7ffc000 r--p 00022000 08:01 8176 4 4 4 4 4 0 0 ld-2.19.so
7ffff7ffd000 rw-p 00023000 08:01 8176 4 4 4 4 4 0 0 ld-2.19.so
7ffff7ffe000 rw-p 00000000 00:00 0 4 4 4 4 4 0 0
7ffffffde000 rw-p 00000000 00:00 0 136 8 8 8 8 0 0 [stack]
7ffffffff000 ---p 00000000 00:00 0 1 0 0 0 0 0 0 [kernel-guard-page] (1)
ffffffffff600000 r-xp 00000000 00:00 0 4 0 0 0 0 0 0 [vsyscall] (2)
==== ==== === ========== ========= ==== ======
4224 1188 97 1188 88 0 0 KB
If called only with a PID argument, the output of pmap will be more compact and only contain the Address, Perm, Size and Mapping columns. If called with -X or -XX, it will display more of / all of the information exposed by the kernel through /proc/PID/smaps. This content may vary across kernel versions. In our concrete example (with -X) the Address column contains the start address of a mapping. Perm displays the permission bits (r = read, w = write, x = execute, p/s = private/shared mapping). Offset contains the offset into the mapped file and will be zero for all non-file (i.e. anonymous) mappings. The same holds true for the Device and Inode columns which are relevant for file mappings only. The Size column contains the size of the memory mapping in kilobytes (notice however that the kernel can only map memory in chunks which are a multiple of the "kernel page size" - normally 4 KB on x86). The RSS ("Resident Set Size") column displays the amount of memory which is actually in RAM while the PSS ("Proportional Set Size") column is the amount of memory in RAM divided by the number of processes sharing it.
As you can see, all the lines excluding the last one describe memory regions in user space. Notice that the last page of the user space (i.e. 0x7ffffffff000-0x7fffffffffff ➊) will neither be displayed by pmap nor is it recorded in the corresponding /proc/<pid>/smaps file. As described in the kernel sources, this special guard page (i.e. 4 KB) is an implementation detail only required on x86_64 CPUs (and included here for completeness).
The last line (i.e. 0xffffffffff600000…[vsyscall] ➋) refers to a 4 KB page in the kernel space which has been mapped into the user space. It is a relic of an early and now deprecated implementation of so-called "virtual system calls". Virtual system calls are used to speed up the implementation of certain system calls like gettimeofday() which only need read access to certain kernel variables. The idea is to implement these system calls such that they don’t have to switch into kernel mode, because the kernel maps the required data read-only into the user space memory. This is exactly what the [vsyscall] mapping is used for: it contains the code and the variables for the mentioned virtual system calls.
Unfortunately the size of the [vsyscall] region is restricted to 4 KB and its location is fixed, which is nowadays considered an unnecessary security risk. The vsyscall implementation has therefore been deprecated in favour of the new vDSO (i.e. "virtual Dynamic Shared Object") implementation. This is a small shared library which is exported by the kernel (see vdso(7)) and mapped into the user address space ➌. With vDSO the kernel variables are mapped into an extra, read-only memory region called [vvar] ➍. Both these regions have no size limitations and are subject to ASLR.
In order to get comparable results, we have to switch off Address Space Layout Randomization (ASLR) by executing sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space". ASLR was added to the Linux kernel back in version 2.6. It is a technique for hardening the OS against various attacks (e.g. the "return to libc" attack) which exploit memory corruption vulnerabilities in order to jump to well-known code in the user’s address space. Randomizing the addresses where shared libraries, the stack, the heap and even the executable itself (if compiled with -fPIE and linked with -pie) are being loaded makes it much harder to effectively execute such attacks. But for our analysis, as well as during development (e.g. debugging), it is of course much more convenient if the addresses of an executable remain constant across different runs.
RSS
Files are read by the Linux kernel with a certain read-ahead which is device dependent and which can be configured with the blockdev(8) command.
$ sudo blockdev --getra --getfra /dev/sda1
256
256
According to the man page, the output is in 512-byte sectors so a value of 256 means a read-ahead of 128 KB.
Where it all starts…
Let’s now start our journey with the execution of the standard C library function execve() which in turn executes the execve system call. execve is usually called right after fork() and replaces (i.e. overlays) the old program’s stack, heap and data segments with those of the new program (see the man page of execve(2)).