Recently, I was solving one of the Codewars.com problems in NASM (x86_64/amd64 Assembly) where I had to use some of the C functions available in the standard C library. During the code refactoring and optimization, I came across a segmentation fault while trying to run the executable. The program broke at this instruction:
1 |
movaps XMMWORD PTR [rsp+0x50],xmm0 |
The code looked correct, so finding the bug took some time. Finally, I was able to find the root of this behavior. The problem was in the stack alignment.
Minimal working example
To illustrate this issue, we can reproduce a similar minimal working example. Consider a program, which asks for two numbers (long integers) and prints their sum. The prototype in C looks like this:
1 2 3 4 5 6 7 8 9 10 11 |
#include <stdio.h> int int1, int2; int main(){ printf("Enter the first number: "); scanf("%i", &int1); printf("Enter the second number :"); scanf("%i", &int2); printf("The sum is equal to: %i\n", int1 + int2); return 0; } |
Let's compile it and test it to see its functionality:
1 2 3 4 5 |
[johndoe@ArchLinux]% gcc add_two_numbers.c -o add_two_numbers [johndoe@ArchLinux]% ./add_two_numbers Enter the first number: 125 Enter the second number :-43 The sum is equal to: 82 |
Now let's rewrite it in NASM (x86_64):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
; Declare external functions found in glibc extern printf extern scanf SECTION .data msg_in1: db "Enter the first number: ", 0 msg_in2: db "Enter the second number: ", 0 fmt_in: db "%i", 0 msg_out: db `The sum is equal to: %i\n`, 0 int1: dd 0 ; 32-bit integer = 4 bytes int2: dd 0 ; 32-bit integer = 4 bytes SECTION .text global main main: push rbp mov rbp, rsp mov rdi, msg_in1 ; first argument to printf call printf mov rdi, fmt_in ; first argument to scanf mov rsi, int1 ; second argument to scanf (i.e. &int1) call scanf ; get 1st integer mov rdi, msg_in2 ; first argument to printf call printf mov rdi, fmt_in ; first argument to scanf mov rsi, int2 ; second argument to scanf (i.e. &int2) call scanf ; get 2nd integer mov eax, DWORD [int1] ; move 1st integer to register rax add eax, DWORD [int2] ; add two integers mov rdi, msg_out ; first argument to printf mov rsi, rax ; second argument to printf (the sum value) call printf ; print the sum xor rax, rax ; return value is zero mov rsp, rbp pop rbp ret |
Integers int1 and int2 are global variables and are stored in the .data section of the ELF object file. Entry point here is not "_start" but "main" as in the C function. Let's compile the code using NASM and link it with glibc using gcc:
1 2 3 4 5 6 |
[johndoe@ArchLinux]% nasm add_two_numbers.asm -f elf64 -o add_two_numbers.o [johndoe@ArchLinux]% gcc add_two_numbers.o -o add_two_numbers -no-pie [johndoe@ArchLinux]% ./add_two_numbers Enter the first number: 123 Enter the second number: -64 The sum is equal to: 59 |
Now try to add two instructions: one before the first printf and the other - after it:
1 2 3 4 5 6 7 |
... mov rdi, msg_in1 ; first argument to printf push rbx ; => this breaks stack alignment call printf pop rbx ; => this breaks stack alignment mov rdi, fmt_in ; first argument to scanf ... |
Saving the value of RBX on the stack before executing printf and restoring it by popping it out from the stack is a perfectly logical thing to do, since printf may modify the content of RBX and we don't want to loose it.
Recompile and run the program:
1 2 3 4 |
[johndoe@ArchLinux]% nasm add_two_numbers.asm -f elf64 -o add_two_numbers.o [johndoe@ArchLinux]% gcc add_two_numbers.o -o add_two_numbers -no-pie [johndoe@ArchLinux]% ./add_two_numbers [1] 148308 segmentation fault (core dumped) ./add_two_numbers |
The program execution stops and throws a segmentation fault. More detailed examination reveals the following memory state of the program just before the segmentation fault:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
gdb-peda$ [----------------------------------registers-----------------------------------] RAX: 0x401140 (<main>: push rbp) RBX: 0x4011c0 (<__libc_csu_init>: endbr64) RCX: 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 RDX: 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh") RSI: 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers") RDI: 0x404038 ("Enter the first number: ") RBP: 0x7fffffffdd50 --> 0x0 RSP: 0x7fffffffdc68 --> 0x7ffff7fe6808 (<init_cpu_features.constprop.0+1016>: mov rsi,rbp) RIP: 0x7ffff7e0a5bb (<printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0) R8 : 0x0 R9 : 0x7ffff7fdc070 (<_dl_fini>: endbr64) R10: 0x404038 ("Enter the first number: ") R11: 0x7ffff7e0a590 (<printf>: endbr64) R12: 0x401050 (<_start>: endbr64) R13: 0x0 R14: 0x0 R15: 0x0 EFLAGS: 0x10202 (carry parity adjust zero sign trap INTERRUPT direction overflow) [-------------------------------------code-------------------------------------] 0x7ffff7e0a5b2 <printf+34>: mov QWORD PTR [rsp+0x48],r9 0x7ffff7e0a5b7 <printf+39>: test al,al 0x7ffff7e0a5b9 <printf+41>: je 0x7ffff7e0a5f2 <printf+98> => 0x7ffff7e0a5bb <printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0 0x7ffff7e0a5c0 <printf+48>: movaps XMMWORD PTR [rsp+0x60],xmm1 0x7ffff7e0a5c5 <printf+53>: movaps XMMWORD PTR [rsp+0x70],xmm2 0x7ffff7e0a5ca <printf+58>: movaps XMMWORD PTR [rsp+0x80],xmm3 0x7ffff7e0a5d2 <printf+66>: movaps XMMWORD PTR [rsp+0x90],xmm4 [------------------------------------stack-------------------------------------] 0000| 0x7fffffffdc68 --> 0x7ffff7fe6808 (<init_cpu_features.constprop.0+1016>: mov rsi,rbp) 0008| 0x7fffffffdc70 --> 0x90000 ('') 0016| 0x7fffffffdc78 --> 0x800 0024| 0x7fffffffdc80 --> 0x8 0032| 0x7fffffffdc88 --> 0x40 ('@') 0040| 0x7fffffffdc90 --> 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers") 0048| 0x7fffffffdc98 --> 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh") 0056| 0x7fffffffdca0 --> 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 [------------------------------------------------------------------------------] Legend: code, data, rodata, value Stopped reason: SIGSEGV 0x00007ffff7e0a5bb in printf () from /usr/lib/libc.so.6 |
The value of RSP is 0x7fffffffdc68 which is not a multiple of 16 because the value of RSP was decreased by 8 bytes after pushing RBX to the stack before calling printf function in the modified version of the code.
This becomes crucial when instructions such as "movaps" (move packed single-precision floating-point values from xmm2/m128 to xmm1) require 16 byte alignment of the memory operand. Misaligned stack together with "movaps" instruction in the printf function cause segmentation fault.
For this reason, the System V Application Binary Interface AMD64 provides the stack alignment requirement in subsection "The Stack Frame":
"The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point."
The value of RSP before calling any function has to be in the format of 0x???????????????0 where ? represents any 4-bit hexadecimal number. Zero value of the least significant nibble provides 16 byte alignment of the stack. The layout of the stack should look like this:
Position | Contents | Frame |
---|---|---|
lower addresses | ||
rsp - 128 | red zone | Current |
rsp | local variable at higher boundary | |
... | ||
rbp - 8 | local variable at lower boundary | |
rbp | previous rbp value | |
rbp + 8 | return address | |
rbp + 16 | memory argument eightbyte 0 | Previous |
... | ||
rbp + 16 - 8*n | memory argument eightbyte n | |
higher addresses |
Furthermore, one may think about solving this issue manually with preserving the functionality of the program. If pushing the value of a 64-bit register to the stack (e.g. "PUSH RBX") decreases the stack by 8 bytes, we can realign the stack by pushing another 64-bit register (e.g. RCX) just after the "PUSH RBX":
1 2 3 4 5 6 7 8 9 |
... mov rdi, msg_in1 ; first argument to printf push rbx ; => this breaks stack alignment push rcx ; => but this realigns the stack back call printf pop rcx ; => order is important! rcx should be popped before rbx pop rbx mov rdi, fmt_in ; first argument to scanf ... |
and popping its stored value from the stack just before restoring the value of RBX.
Recompiling and running the program yields:
1 2 3 4 5 6 |
[johndoe@ArchLinux]% nasm add_two_numbers.asm -f elf64 -o add_two_numbers.o [johndoe@ArchLinux]% gcc add_two_numbers.o -o add_two_numbers -no-pie [johndoe@ArchLinux]% ./add_two_numbers Enter the first number: 4321 Enter the second number: -42 The sum is equal to: 4279 |
There is no segmentation fault observed, since the stack is aligned:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
gdb-peda$ [----------------------------------registers-----------------------------------] RAX: 0x401140 (<main>: push rbp) RBX: 0x4011c0 (<__libc_csu_init>: endbr64) RCX: 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 RDX: 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh") RSI: 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers") RDI: 0x404038 ("Enter the first number: ") RBP: 0x7fffffffdd50 --> 0x0 RSP: 0x7fffffffdc60 --> 0x8000 RIP: 0x7ffff7e0a5bb (<printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0) R8 : 0x0 R9 : 0x7ffff7fdc070 (<_dl_fini>: endbr64) R10: 0x404038 ("Enter the first number: ") R11: 0x7ffff7e0a590 (<printf>: endbr64) R12: 0x401050 (<_start>: endbr64) R13: 0x0 R14: 0x0 R15: 0x0 EFLAGS: 0x202 (carry parity adjust zero sign trap INTERRUPT direction overflow) [-------------------------------------code-------------------------------------] 0x7ffff7e0a5b2 <printf+34>: mov QWORD PTR [rsp+0x48],r9 0x7ffff7e0a5b7 <printf+39>: test al,al 0x7ffff7e0a5b9 <printf+41>: je 0x7ffff7e0a5f2 <printf+98> => 0x7ffff7e0a5bb <printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0 0x7ffff7e0a5c0 <printf+48>: movaps XMMWORD PTR [rsp+0x60],xmm1 0x7ffff7e0a5c5 <printf+53>: movaps XMMWORD PTR [rsp+0x70],xmm2 0x7ffff7e0a5ca <printf+58>: movaps XMMWORD PTR [rsp+0x80],xmm3 0x7ffff7e0a5d2 <printf+66>: movaps XMMWORD PTR [rsp+0x90],xmm4 [------------------------------------stack-------------------------------------] 0000| 0x7fffffffdc60 --> 0x8000 0008| 0x7fffffffdc68 --> 0x7ffff7fe6808 (<init_cpu_features.constprop.0+1016>: mov rsi,rbp) 0016| 0x7fffffffdc70 --> 0x90000 ('') 0024| 0x7fffffffdc78 --> 0x800 0032| 0x7fffffffdc80 --> 0x8 0040| 0x7fffffffdc88 --> 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers") 0048| 0x7fffffffdc90 --> 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh") 0056| 0x7fffffffdc98 --> 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 [------------------------------------------------------------------------------] Legend: code, data, rodata, value 0x00007ffff7e0a5bb in printf () from /usr/lib/libc.so.6 |
The value of RSP is 0x7fffffffdc60 which is a multiple of 16 because the value of RSP was decreased by 16 bytes after pushing RBX and RCX to the stack before calling printf function in the modified version of the code.
Local variables in the stack frame
Now let's try to replace global variables with local variables, i.e. implement this C function in NASM:
1 2 3 4 5 6 7 8 9 10 11 |
#include <stdio.h> int main(){ int int1, int2; printf("Enter the first number: "); scanf("%i", &int1); printf("Enter the second number :"); scanf("%i", &int2); printf("The sum is equal to: %i\n", int1 + int2); return 0; } |
We need to allocate at least 8 bytes (2 integers of 4 bytes) on the stack. However, in order to comply with the stack alignment requirement, the allocated local variable space has to be a multiple of 16 bytes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
; Declare external functions found in glibc extern printf extern scanf SECTION .data msg_in1: db "Enter the first number: ", 0 msg_in2: db "Enter the second number: ", 0 fmt_in: db "%i", 0 msg_out: db `The sum is equal to: %i\n`, 0 int1: dd 0 ; 32-bit integer = 4 bytes int2: dd 0 ; 32-bit integer = 4 bytes SECTION .text global main main: push rbp mov rbp, rsp sub rsp, 16 ; this cannot be equal to 8 although 8 bytes are enough to store two integers mov rdi, msg_in1 ; first argument to printf call printf mov rdi, fmt_in ; first argument to scanf lea rsi, [rbp - 8] ; second argument to scanf call scanf ; get 1st integer mov rdi, msg_in2 ; first argument to printf call printf mov rdi, fmt_in ; first argument to scanf lea rsi, [rbp - 4] ; second argument to scanf call scanf ; get 2nd integer mov eax, DWORD [rbp - 8] ; move 1st integer to register rax add eax, DWORD [rbp - 4] ; add 2nd integer to the 1st one mov rdi, msg_out ; first argument to printf mov rsi, rax ; second argument to printf (the sum) call printf ; print the sum xor rax, rax ; return value is zero mov rsp, rbp pop rbp ret |
Subtracting 16 from the stack pointer (RSP) will allocate 16 bytes space for local variables.
If we use 8 bytes instead:
1 2 3 4 5 6 |
... main: push rbp mov rbp, rsp sub rsp, 8 ; this results in segmentation fault ... |
then the compiled code will throw a segmentation fault.
Aligning stack inside a callee function
In the assembly listings generated by some compilers, one can see the following construct in the stack frame initialization:
1 2 3 4 5 |
... push rbp mov rbp, rsp and rsp, 0xFFFFFFFFFFFFFFF0 ; i.e. all bits in the mask are equal to logical 1, except last four bits ... |
or
1 2 3 4 5 |
... push rbp mov rbp, rsp and rsp, -16 ; decimal "-16" corresponds to 0xFFFFFFFFFFFFFFF0 in two's-complement number system ... |
If the least significant nibble of stack pointer is not equal to zero, its value (say N) is subtracted from itself yielding zero which corresponds to aligning the stack to 16 bytes boundary. In other words, it can be considered as allocating N bytes of memory for local variables although this space will not be used for this purpose.
In this manner, the compiler tries to align the stack which was misaligned in the caller (parent) function.
Stack alignment for memory-based arguments of a function
Another way to misalign the stack is to pass more than 6 arguments to a function so that the total number arguments is an odd number. These arguments should be of class INTEGER, which includes _Bool, char, short, int, long, long long and pointer data types. We are not considering SSE, SSEUP, X87, X87UP, COMPLEX_X87 and NO_CLASS arguments (e.g. floating point numbers) because they are passed through other special registers (e.g. XMM).
According to the AMD64 Linux ABI, first six INTEGER class input arguments are passed through registers:
- rdi, rsi, rdx, rcx, r8, r9
The 7th input argument and beyond are passed through stack as in the 32-bit x86 Linux ABI. Look at the "Stack frame with base pointer" table above, the cell "memory argument eightbyte 0" corresponds to the 7th input argument. Since it takes 8 bytes of memory on the stack, we have a stack alignment problem. To illustrate it, consider the C code below:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
#include <stdio.h> void calc_sum(int a, int b, int c, int d, int e, int f, int g){ printf("The sum is equal to: %i\n", a + b + c + d + e + f + g); } int main(){ int a; printf("Enter a number: "); scanf("%i", &a); calc_sum(a, 11, 12, 13, 14, 15, 16); return 0; } |
The variable "a" in the main function contains the value of an integer entered by a user from a command line. Without this "randomness", the C compiler will optimize the code with respect to the constant values of the 7 summands and pass the calculated deterministic value to the printf function inside calc_sum().
Compile the code to generate the assembly listing:
1 |
[johndoe@ArchLinux]% gcc add_seven_numbers.c -S -masm=intel -fno-asynchronous-unwind-tables -fno-stack-protector -O0 |
The generated assembly code looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
.file "add_seven_numbers.c" .intel_syntax noprefix .text .section .rodata .LC0: .string "The sum is equal to: %i\n" .text .globl calc_sum .type calc_sum, @function calc_sum: push rbp mov rbp, rsp sub rsp, 32 ; although 6 local variables (integers) are used (i.e. 24 bytes) ; 32 bytes are allocated because of the stack alignment requirement mov DWORD PTR -4[rbp], edi ; 1st argument mov DWORD PTR -8[rbp], esi ; 2nd argument mov DWORD PTR -12[rbp], edx ; 3rd argument mov DWORD PTR -16[rbp], ecx ; 4th argument mov DWORD PTR -20[rbp], r8d ; 5th argument mov DWORD PTR -24[rbp], r9d ; 6th argument mov edx, DWORD PTR -4[rbp] mov eax, DWORD PTR -8[rbp] add edx, eax ; start adding integers mov eax, DWORD PTR -12[rbp] add edx, eax mov eax, DWORD PTR -16[rbp] add edx, eax mov eax, DWORD PTR -20[rbp] add edx, eax mov eax, DWORD PTR -24[rbp] add edx, eax mov eax, DWORD PTR 16[rbp] ; 7th argument add eax, edx ; the sum is stored in eax mov esi, eax ; print integer stored in eax (the sum) lea rdi, .LC0[rip] mov eax, 0 call printf@PLT nop leave ret .size calc_sum, .-calc_sum .section .rodata .LC1: .string "Enter a number: " .LC2: .string "%i" .text .globl main .type main, @function main: push rbp mov rbp, rsp sub rsp, 16 lea rdi, .LC1[rip] mov eax, 0 call printf@PLT lea rax, -4[rbp] ; &a, address of variable "a" mov rsi, rax lea rdi, .LC2[rip] mov eax, 0 call __isoc99_scanf@PLT mov eax, DWORD PTR -4[rbp] ; variable "a" is moved to eax, see 1st argument below sub rsp, 8 ; stack alignment, together with the next instruction, 16 bytes of the stack memory will be used push 16 ; 7th argument to calc_sum(), note that it's "pushq" instruction ("q" for quad, i.e. 64 bits=8 bytes) mov r9d, 15 ; 6th argument to calc_sum() mov r8d, 14 ; 5th argument to calc_sum() mov ecx, 13 ; 4th argument to calc_sum() mov edx, 12 ; 3rd argument to calc_sum() mov esi, 11 ; 2nd argument to calc_sum() mov edi, eax ; 1st argument to calc_sum() call calc_sum add rsp, 16 mov eax, 0 leave ret .size main, .-main .ident "GCC: (GNU) 10.2.0" .section .note.GNU-stack,"",@progbits |
I added comments to show the arguments which are passed to the calc_sum() function. Since no optimization was used, the generated assembly code is straightforward. Input arguments a, b, c, d, e, f are passed through registers rdi, rsi, rdx, rcx, r8, r9. The 7th argument g is passed through the stack:
1 2 3 4 |
... sub rsp, 8 ; stack alignment, together with the next instruction, 16 bytes of the stack memory will be used push 16 ; 7th argument to calc_sum(), note that it's "pushq" instruction ("q" for quad, i.e. 64 bits=8 bytes) ... |
Apparently, the C compiler aligns the stack to 16-byte boundary by subtracting 8 from the stack pointer before pushing the 7th argument.
The compiled executable works as expected:
1 2 3 4 |
[johndoe@ArchLinux]% gcc add_seven_numbers.c -fno-asynchronous-unwind-tables -fno-stack-protector -O0 -o add_seven_numbers [johndoe@ArchLinux]% ./add_seven_numbers Enter a number: 10 The sum is equal to: 91 |
where 91 is the sum of all integers in the range 10 to 16 inclusive.
Conclusion
Read the Linux ABI which is only 128 pages long. Investing time into education will save time in the future by spending less time on debugging.
When you use other C functions or libraries in your assembly code, obviously you have to comply not just with the calling convention, but also with the other specifications of the ABI.
Sooner or later your assembly code, which uses external C functions, will start to resemble the code generated by a C compiler. If you have many local variables, the number of available registers will not be enough to store them and you will have to use the stack for this purpose. These local variables will be accessible at a fixed offset from the base pointer (RBP).
The default stack alignment in recent gcc versions (e.g. v10.2.0) is 16 bytes because the default value for -mpreferred-stack-boundary parameter is 4 which corresponds to 2^4=16 bytes.
References
Michael Matz, Jan Hubicka, Andreas Jaeger, Mark Mitchell, "System V Application Binary Interface AMD64 Architecture Processor Supplement".
Chris Wellons, "Raw Linux Threads via System Calls".
Eli Bendersky, Stack frame layout on x86-64
Multiple 16-byte stack alignment violations when calling C code from assembly #231