Stack alignment when mixing assembly and C code

Recently, I was solving one of the Codewars.com problems in NASM (x86_64/amd64 Assembly) where I had to use some of the C functions available in the standard C library. While refactoring and optimizing the code, I ran into a segmentation fault when running the executable. The program crashed at this instruction:

movaps XMMWORD PTR [rsp+0x50],xmm0

The code looked correct, so finding the bug took some time. Finally, I was able to find the root of this behavior. The problem was in the stack alignment.

Minimal working example

To illustrate this issue, we can reproduce it in a minimal working example. Consider a program that asks for two numbers (32-bit integers) and prints their sum. The prototype in C looks like this:

#include <stdio.h>
int int1, int2;

int main(){
        printf("Enter the first number: ");
        scanf("%i", &int1);
        printf("Enter the second number :");
        scanf("%i", &int2);
        printf("The sum is equal to: %i\n", int1 + int2);
        return 0;
}

Let's compile it and test it to see its functionality:

[johndoe@ArchLinux]% gcc add_two_numbers.c -o add_two_numbers
[johndoe@ArchLinux]% ./add_two_numbers
Enter the first number: 125
Enter the second number :-43
The sum is equal to: 82

Now let's rewrite it in NASM (x86_64):

; Declare external functions found in glibc
extern printf
extern scanf

SECTION .data
msg_in1: db "Enter the first number: ", 0
msg_in2: db "Enter the second number: ", 0
fmt_in: db "%i", 0
msg_out: db `The sum is equal to: %i\n`, 0
int1: dd 0 ;  32-bit integer = 4 bytes
int2: dd 0 ;  32-bit integer = 4 bytes

SECTION .text
global main
main:
  push rbp
  mov rbp, rsp
  mov rdi, msg_in1 ; first argument to printf
  call printf
  mov rdi, fmt_in  ; first argument to scanf
  mov rsi, int1    ; second argument to scanf (i.e. &int1)
  call scanf       ; get 1st integer

  mov rdi, msg_in2 ; first argument to printf
  call printf
  
  mov rdi, fmt_in  ; first argument to scanf
  mov rsi, int2    ; second argument to scanf (i.e. &int2)
  call scanf       ; get 2nd integer
  
  mov eax, DWORD [int1] ; move 1st integer to register eax
  add eax, DWORD [int2] ; add two integers
  
  mov rdi, msg_out ; first argument to printf
  mov rsi, rax     ; second argument to printf (the sum value)
  call printf      ; print the sum
  xor rax, rax     ; return value is zero
  mov rsp, rbp
  pop rbp
  ret

Integers int1 and int2 are global variables and are stored in the .data section of the ELF object file. The entry point defined here is "main" rather than "_start", just as in a C program, because the C runtime linked in by gcc provides "_start" and calls "main". Let's assemble the code using NASM and link it with glibc using gcc:

[johndoe@ArchLinux]% nasm add_two_numbers.asm -f elf64 -o add_two_numbers.o
[johndoe@ArchLinux]% gcc  add_two_numbers.o -o add_two_numbers -no-pie
[johndoe@ArchLinux]% ./add_two_numbers
Enter the first number: 123
Enter the second number: -64
The sum is equal to: 59

Now add two instructions, one before the first call to printf and one after it:

...
  mov rdi, msg_in1 ; first argument to printf
  push rbx         ; => this breaks stack alignment
  call printf
  pop rbx          ; => this restores RBX and moves RSP back up by 8 bytes
  mov rdi, fmt_in  ; first argument to scanf
  ...

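Saving the value of RBX on the stack before calling printf and restoring it afterwards by popping it from the stack looks like a perfectly harmless thing to do. (Strictly speaking, RBX is a callee-saved register in the System V ABI, so printf has to preserve it anyway and we would not lose it, but the push/pop pair is enough to demonstrate the problem.)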

Recompile and run the program:

[johndoe@ArchLinux]% nasm add_two_numbers.asm -f elf64 -o add_two_numbers.o
[johndoe@ArchLinux]% gcc  add_two_numbers.o -o add_two_numbers -no-pie
[johndoe@ArchLinux]% ./add_two_numbers
[1]    148308 segmentation fault (core dumped)  ./add_two_numbers

The program execution stops and throws a segmentation fault. More detailed examination reveals the following memory state of the program just before the segmentation fault:

gdb-peda$
[----------------------------------registers-----------------------------------]
RAX: 0x401140 (<main>:    push   rbp)
RBX: 0x4011c0 (<__libc_csu_init>: endbr64)
RCX: 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 
RDX: 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh")
RSI: 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers")
RDI: 0x404038 ("Enter the first number: ")
RBP: 0x7fffffffdd50 --> 0x0 
RSP: 0x7fffffffdc68 --> 0x7ffff7fe6808 (<init_cpu_features.constprop.0+1016>:  mov    rsi,rbp)
RIP: 0x7ffff7e0a5bb (<printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0)
R8 : 0x0 
R9 : 0x7ffff7fdc070 (<_dl_fini>:  endbr64)
R10: 0x404038 ("Enter the first number: ")
R11: 0x7ffff7e0a590 (<printf>:    endbr64)
R12: 0x401050 (<_start>:  endbr64)
R13: 0x0 
R14: 0x0 
R15: 0x0
EFLAGS: 0x10202 (carry parity adjust zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x7ffff7e0a5b2 <printf+34>:    mov    QWORD PTR [rsp+0x48],r9
   0x7ffff7e0a5b7 <printf+39>:    test   al,al
   0x7ffff7e0a5b9 <printf+41>:    je     0x7ffff7e0a5f2 <printf+98>
=> 0x7ffff7e0a5bb <printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0
   0x7ffff7e0a5c0 <printf+48>:    movaps XMMWORD PTR [rsp+0x60],xmm1
   0x7ffff7e0a5c5 <printf+53>:    movaps XMMWORD PTR [rsp+0x70],xmm2
   0x7ffff7e0a5ca <printf+58>:    movaps XMMWORD PTR [rsp+0x80],xmm3
   0x7ffff7e0a5d2 <printf+66>:    movaps XMMWORD PTR [rsp+0x90],xmm4
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffdc68 --> 0x7ffff7fe6808 (<init_cpu_features.constprop.0+1016>: mov    rsi,rbp)
0008| 0x7fffffffdc70 --> 0x90000 ('')
0016| 0x7fffffffdc78 --> 0x800 
0024| 0x7fffffffdc80 --> 0x8 
0032| 0x7fffffffdc88 --> 0x40 ('@')
0040| 0x7fffffffdc90 --> 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers")
0048| 0x7fffffffdc98 --> 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh")
0056| 0x7fffffffdca0 --> 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x00007ffff7e0a5bb in printf () from /usr/lib/libc.so.6

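The value of RSP is 0x7fffffffdc68, which is not a multiple of 16: pushing RBX before the call to printf in the modified code moved RSP down by 8 bytes.
This matters because "movaps" (move aligned packed single-precision floating-point values) requires its 128-bit memory operand to be aligned on a 16-byte boundary. Otherwise the CPU raises a general-protection exception, which Linux reports to the process as SIGSEGV. The misaligned stack together with the "movaps" instructions in printf therefore causes the segmentation fault. Note the "test al,al" and "je" pair just before the faulting instruction: printf saves the XMM registers only when AL is non-zero. For variadic functions the ABI uses AL to pass an upper bound on the number of vector registers used. Our code never sets it, and RAX still holds 0x401140 (the address of main), so AL is 0x40 and the "movaps" path is taken.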
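
The alignment check itself is simple arithmetic. As a minimal sketch in C (the helper names is_aligned and movaps_target are made up for this illustration and are not part of any library), the two RSP values seen in the debugger sessions of this article can be tested like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when addr is a multiple of alignment.
   alignment must be a power of two (16 for movaps memory operands). */
bool is_aligned(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* Address written by "movaps XMMWORD PTR [rsp+0x50],xmm0"
   (hypothetical helper, the 0x50 offset is taken from the gdb listing). */
uint64_t movaps_target(uint64_t rsp)
{
    return rsp + 0x50;
}
```

With RSP = 0x7fffffffdc68 the target address ends in the nibble 8 and is_aligned() returns false, while with RSP = 0x7fffffffdc60 it ends in 0 and the check passes.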

For this reason, the System V Application Binary Interface AMD64 provides the stack alignment requirement in subsection "The Stack Frame":

"The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point."

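In practice this means that, immediately before every call instruction, the value of RSP has to have the form 0x???????????????0, where ? stands for any hexadecimal digit (4 bits). The zero in the least significant nibble gives the 16-byte alignment. The call instruction then pushes the 8-byte return address, so RSP + 8 is a multiple of 16 at the entry point of the callee, as the ABI demands. The layout of the stack should look like this: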

Stack frame with base pointer
Position          Contents                              Frame
(lower addresses)
rsp - 128         red zone                              Current
rsp               local variables (lowest address)      Current
...               ...
rbp - 8           local variables (highest address)     Current
rbp               previous rbp value                    Current
rbp + 8           return address                        Current
rbp + 16          memory argument eightbyte 0           Previous
...               ...
rbp + 16 + 8*n    memory argument eightbyte n           Previous
(higher addresses)

One way to fix this while keeping the behaviour of the program is to realign the stack by hand. Since pushing a 64-bit register (e.g. "PUSH RBX") moves the stack pointer down by 8 bytes, pushing another 64-bit register (e.g. RCX) right after the "PUSH RBX" brings RSP back to a 16-byte boundary:

...
  mov rdi, msg_in1 ; first argument to printf
  push rbx         ; => this breaks stack alignment
  push rcx         ; => but this realigns the stack back
  call printf
  pop rcx          ; => order is important! rcx should be popped before rbx
  pop rbx
  mov rdi, fmt_in  ; first argument to scanf
  ...

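RCX is popped from the stack just before the value of RBX is restored. Subtracting 8 from RSP before the call (and adding 8 after it) instead of pushing a dummy register would work equally well.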

Recompiling and running the program yields:

[johndoe@ArchLinux]% nasm add_two_numbers.asm -f elf64 -o add_two_numbers.o
[johndoe@ArchLinux]% gcc  add_two_numbers.o -o add_two_numbers -no-pie
[johndoe@ArchLinux]% ./add_two_numbers
Enter the first number: 4321
Enter the second number: -42
The sum is equal to: 4279

There is no segmentation fault observed, since the stack is aligned:

gdb-peda$
[----------------------------------registers-----------------------------------]
RAX: 0x401140 (<main>:    push   rbp)
RBX: 0x4011c0 (<__libc_csu_init>: endbr64)
RCX: 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 
RDX: 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh")
RSI: 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers")
RDI: 0x404038 ("Enter the first number: ")
RBP: 0x7fffffffdd50 --> 0x0 
RSP: 0x7fffffffdc60 --> 0x8000 
RIP: 0x7ffff7e0a5bb (<printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0)
R8 : 0x0 
R9 : 0x7ffff7fdc070 (<_dl_fini>:  endbr64)
R10: 0x404038 ("Enter the first number: ")
R11: 0x7ffff7e0a590 (<printf>:    endbr64)
R12: 0x401050 (<_start>:  endbr64)
R13: 0x0 
R14: 0x0 
R15: 0x0
EFLAGS: 0x202 (carry parity adjust zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x7ffff7e0a5b2 <printf+34>:    mov    QWORD PTR [rsp+0x48],r9
   0x7ffff7e0a5b7 <printf+39>:    test   al,al
   0x7ffff7e0a5b9 <printf+41>:    je     0x7ffff7e0a5f2 <printf+98>
=> 0x7ffff7e0a5bb <printf+43>: movaps XMMWORD PTR [rsp+0x50],xmm0
   0x7ffff7e0a5c0 <printf+48>:    movaps XMMWORD PTR [rsp+0x60],xmm1
   0x7ffff7e0a5c5 <printf+53>:    movaps XMMWORD PTR [rsp+0x70],xmm2
   0x7ffff7e0a5ca <printf+58>:    movaps XMMWORD PTR [rsp+0x80],xmm3
   0x7ffff7e0a5d2 <printf+66>:    movaps XMMWORD PTR [rsp+0x90],xmm4
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffdc60 --> 0x8000 
0008| 0x7fffffffdc68 --> 0x7ffff7fe6808 (<init_cpu_features.constprop.0+1016>: mov    rsi,rbp)
0016| 0x7fffffffdc70 --> 0x90000 ('')
0024| 0x7fffffffdc78 --> 0x800 
0032| 0x7fffffffdc80 --> 0x8 
0040| 0x7fffffffdc88 --> 0x7fffffffde48 --> 0x7fffffffe18c ("/home/johndoe/Programming/NASM/add_two_numbers/add_two_numbers")
0048| 0x7fffffffdc90 --> 0x7fffffffde58 --> 0x7fffffffe1d1 ("SSH_AUTH_SOCK=/run/user/1000/keyring/ssh")
0056| 0x7fffffffdc98 --> 0x7ffff7f73598 --> 0x7ffff7f75960 --> 0x0 
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
0x00007ffff7e0a5bb in printf () from /usr/lib/libc.so.6

The value of RSP is 0x7fffffffdc60, which is a multiple of 16: pushing both RBX and RCX before the call to printf moved RSP down by 16 bytes in total.
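
The bookkeeping behind the three variants of the program (no extra push, one extra push, two extra pushes) can be modelled with a few lines of C. This is only a sketch of the arithmetic, not a real stack: the function names are invented for this example, and the entry value 0x7fffffffdd58 is inferred from the debugger output (RBP = 0x7fffffffdd50 right after "push rbp").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* RSP after pushing `count` 64-bit registers: each push subtracts 8. */
uint64_t rsp_after_pushes(uint64_t rsp, unsigned count)
{
    return rsp - 8u * (uint64_t)count;
}

/* True when a call instruction executed with this RSP satisfies the ABI. */
bool call_site_ok(uint64_t rsp)
{
    return rsp % 16 == 0;
}

/* Alignment at the call to printf in main, given the number of pushes
   made after the "push rbp" in the prologue. */
bool printf_call_ok(uint64_t rsp_at_main_entry, unsigned extra_pushes)
{
    /* The ABI guarantees (rsp + 8) % 16 == 0 at the entry of main. */
    assert((rsp_at_main_entry + 8) % 16 == 0);
    uint64_t rsp = rsp_after_pushes(rsp_at_main_entry, 1); /* push rbp */
    rsp = rsp_after_pushes(rsp, extra_pushes);              /* push rbx [, push rcx] */
    return call_site_ok(rsp);
}
```

printf_call_ok(0x7fffffffdd58, 1) is false, which corresponds to the crashing version with a single "push rbx", whereas 0 or 2 extra pushes give true.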

Local variables in the stack frame

Now let's try to replace global variables with local variables, i.e. implement this C function in NASM:

#include <stdio.h>

int main(){
    int int1, int2;
    printf("Enter the first number: ");
    scanf("%i", &int1);
    printf("Enter the second number :");
    scanf("%i", &int2);
    printf("The sum is equal to: %i\n", int1 + int2);
    return 0;
}

We need to allocate at least 8 bytes (2 integers of 4 bytes) on the stack. However, in order to comply with the stack alignment requirement, the allocated local variable space has to be a multiple of 16 bytes:

; Declare external functions found in glibc
extern printf
extern scanf

SECTION .data
msg_in1: db "Enter the first number: ", 0
msg_in2: db "Enter the second number: ", 0
fmt_in: db "%i", 0
msg_out: db `The sum is equal to: %i\n`, 0

SECTION .text
global main
main:
  push rbp
  mov rbp, rsp
  sub rsp, 16 ; this cannot be equal to 8 although 8 bytes are enough to store two integers
  mov rdi, msg_in1 ; first argument to printf
  call printf

  mov rdi, fmt_in     ; first argument to scanf
  lea rsi, [rbp - 8]  ; second argument to scanf
  call scanf          ; get 1st integer

  mov rdi, msg_in2 ; first argument to printf
  call printf
  
  mov rdi, fmt_in     ; first argument to scanf
  lea rsi, [rbp - 4]  ; second argument to scanf
  call scanf          ; get 2nd integer
  
  mov eax, DWORD [rbp - 8] ; move 1st integer to register eax
  add eax, DWORD [rbp - 4] ; add 2nd integer to the 1st one
  
  mov rdi, msg_out ; first argument to printf
  mov rsi, rax     ; second argument to printf (the sum)
  call printf      ; print the sum
  xor rax, rax     ; return value is zero
  mov rsp, rbp
  pop rbp
  ret

Subtracting 16 from the stack pointer (RSP) will allocate 16 bytes space for local variables.
If we use 8 bytes instead:

...
main: 
    push rbp
    mov rbp, rsp
    sub rsp, 8 ; this results in segmentation fault
...

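then RSP is no longer a multiple of 16 at the call instructions (it is 16-byte aligned right after "push rbp", and subtracting 8 leaves it 8 bytes off), and the assembled program will throw a segmentation fault.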
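
The size to subtract from RSP can be computed by rounding the space needed for local variables up to the next multiple of 16. A small sketch in C, assuming RSP is 16-byte aligned right after the "push rbp" prologue as in the listings above (frame_size is a name chosen for this example):

```c
#include <assert.h>
#include <stddef.h>

/* Rounds the bytes needed for local variables up to the next multiple of
   `alignment` (a power of two), giving a value suitable for "sub rsp, N". */
size_t frame_size(size_t bytes_needed, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bytes_needed + alignment - 1) & ~(alignment - 1);
}
```

For the two 4-byte integers, frame_size(8, 16) gives 16, which is the value used in the working listing.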

Aligning stack inside a callee function

In the assembly listings generated by some compilers, one can see the following construct in the stack frame initialization:

...
push rbp
mov rbp, rsp
and rsp, 0xFFFFFFFFFFFFFFF0 ; i.e. all bits in the mask are equal to logical 1, except last four bits 
...

or, equivalently:

...
push rbp
mov rbp, rsp
and rsp, -16 ; decimal "-16" corresponds to 0xFFFFFFFFFFFFFFF0 in two's-complement number system
...

The AND instruction clears the four least significant bits of the stack pointer. If the least significant nibble holds a non-zero value (say N), the effect is the same as subtracting N from RSP, which rounds the stack pointer down to the nearest 16-byte boundary. In other words, it can be considered as allocating N bytes of memory for local variables, although this space will not be used for this purpose.

In this manner, the compiler tries to align the stack which was misaligned in the caller (parent) function.
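
The effect of the mask can be reproduced in C on plain integers. The sketch below only mirrors the arithmetic of "and rsp, -16" and does not touch the real stack pointer; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of "and rsp, -16": clears the four least significant bits. */
uint64_t align_down_16(uint64_t rsp)
{
    const uint64_t mask = (uint64_t)-16;
    /* -16 in two's complement is the same bit pattern as the hex mask. */
    assert(mask == 0xFFFFFFFFFFFFFFF0ULL);
    return rsp & mask;
}

/* Number of bytes the AND implicitly skips (the N from the text). */
uint64_t bytes_skipped(uint64_t rsp)
{
    return rsp - align_down_16(rsp);
}
```

Applied to the misaligned value 0x7fffffffdc68 from the debugger session, align_down_16() yields 0x7fffffffdc60 and bytes_skipped() yields 8. An already aligned value is left unchanged.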

Stack alignment for memory-based arguments of a function

Another way to misalign the stack is to pass more than 6 arguments to a function so that an odd number of them ends up on the stack (for example, 7 arguments in total). The arguments considered here are of class INTEGER, which includes _Bool, char, short, int, long, long long and pointer data types. We are not considering SSE, SSEUP, X87, X87UP, COMPLEX_X87 and NO_CLASS arguments (e.g. floating-point numbers) because they are passed differently, e.g. through XMM registers.

According to the AMD64 Linux ABI, the first six INTEGER class input arguments are passed through registers:

rdi, rsi, rdx, rcx, r8, r9
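
The assignment of INTEGER class arguments can be written down as a small lookup in C. This is a sketch under the assumption that every argument is of class INTEGER and occupies one eightbyte, and that the callee sets up a frame with "push rbp" and "mov rbp, rsp". The names int_arg_register and stack_arg_offset are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Registers used for the first six INTEGER class arguments, in order. */
static const char *const INT_ARG_REGS[6] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/* Name of the register carrying INTEGER argument `index` (0-based),
   or NULL when the argument is passed on the stack. */
const char *int_arg_register(unsigned index)
{
    return index < 6 ? INT_ARG_REGS[index] : NULL;
}

/* Offset from RBP, inside the callee, of a stack-passed INTEGER argument.
   Only meaningful for index >= 6 (the 7th argument and beyond). */
long stack_arg_offset(unsigned index)
{
    assert(index >= 6);
    return 16 + 8L * (long)(index - 6);
}
```

For the 7th argument (index 6) int_arg_register() returns NULL and stack_arg_offset() returns 16, i.e. the argument is found at rbp + 16, the "memory argument eightbyte 0" cell of the stack frame table.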

The 7th input argument and beyond are passed through the stack, as in the 32-bit x86 Linux ABI. In the "Stack frame with base pointer" table above, the cell "memory argument eightbyte 0" corresponds to the 7th input argument. Since each such argument takes 8 bytes of memory on the stack, an odd number of them breaks the 16-byte alignment unless the caller compensates for it. To illustrate it, consider the C code below:

#include <stdio.h>

void calc_sum(int a, int b, int c, int d, int e, int f, int g){
    printf("The sum is equal to: %i\n", a + b + c + d + e + f + g);
}

int main(){
    int a;
    printf("Enter a number: ");
    scanf("%i", &a);
    calc_sum(a, 11, 12, 13, 14, 15, 16);
    return 0;
}

The variable "a" in the main function holds an integer entered by the user at run time. Without this "randomness", an optimizing C compiler could fold the 7 constant summands into a single value at compile time and pass this precomputed value to the printf function inside calc_sum().

Compile the code to generate the assembly listing:

[johndoe@ArchLinux]% gcc add_seven_numbers.c -S -masm=intel -fno-asynchronous-unwind-tables  -fno-stack-protector -O0

The generated assembly code looks like this:

.file   "add_seven_numbers.c"
        .intel_syntax noprefix
        .text
        .section        .rodata
.LC0:
        .string "The sum is equal to: %i\n"
        .text
        .globl  calc_sum
        .type   calc_sum, @function
calc_sum:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32 ; although 6 local variables (integers) are used (i.e. 24 bytes)
                    ; 32 bytes are allocated because of the stack alignment requirement
        mov     DWORD PTR -4[rbp], edi   ; 1st argument 
        mov     DWORD PTR -8[rbp], esi   ; 2nd argument 
        mov     DWORD PTR -12[rbp], edx  ; 3rd argument 
        mov     DWORD PTR -16[rbp], ecx  ; 4th argument 
        mov     DWORD PTR -20[rbp], r8d  ; 5th argument 
        mov     DWORD PTR -24[rbp], r9d  ; 6th argument 
        mov     edx, DWORD PTR -4[rbp]
        mov     eax, DWORD PTR -8[rbp]   
        add     edx, eax                 ; start adding integers
        mov     eax, DWORD PTR -12[rbp]
        add     edx, eax
        mov     eax, DWORD PTR -16[rbp]
        add     edx, eax
        mov     eax, DWORD PTR -20[rbp]
        add     edx, eax
        mov     eax, DWORD PTR -24[rbp]
        add     edx, eax
        mov     eax, DWORD PTR 16[rbp]  ; 7th argument
        add     eax, edx                ; the sum is stored in eax
        mov     esi, eax                ; print integer stored in eax (the sum)
        lea     rdi, .LC0[rip]          
        mov     eax, 0
        call    printf@PLT
        nop
        leave
        ret
        .size   calc_sum, .-calc_sum
        .section        .rodata
.LC1:
        .string "Enter a number: "
.LC2:
        .string "%i"
        .text
        .globl  main
        .type   main, @function
main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        lea     rdi, .LC1[rip]
        mov     eax, 0
        call    printf@PLT
        lea     rax, -4[rbp]  ; &a, address of variable "a"
        mov     rsi, rax
        lea     rdi, .LC2[rip]
        mov     eax, 0
        call    __isoc99_scanf@PLT
        mov     eax, DWORD PTR -4[rbp] ; variable "a" is moved to eax, see 1st argument below
        sub     rsp, 8    ; stack alignment, together with the next instruction, 16 bytes of the stack memory will be used
        push    16   ; 7th argument to calc_sum(), note that it's "pushq" instruction ("q" for quad, i.e. 64 bits=8 bytes)
        mov     r9d, 15   ; 6th argument to calc_sum()
        mov     r8d, 14   ; 5th argument to calc_sum()
        mov     ecx, 13   ; 4th argument to calc_sum()
        mov     edx, 12   ; 3rd argument to calc_sum()
        mov     esi, 11   ; 2nd argument to calc_sum()
        mov     edi, eax  ; 1st argument to calc_sum()
        call    calc_sum
        add     rsp, 16
        mov     eax, 0
        leave
        ret
        .size   main, .-main
        .ident  "GCC: (GNU) 10.2.0"
        .section        .note.GNU-stack,"",@progbits

I added comments to show the arguments which are passed to the calc_sum() function. Since no optimization was used, the generated assembly code is straightforward. Input arguments a, b, c, d, e, f are passed through registers rdi, rsi, rdx, rcx, r8, r9 (as 32-bit values in edi, esi, edx, ecx, r8d, r9d). The 7th argument g is passed through the stack:

...
sub rsp, 8 ; stack alignment, together with the next instruction, 16 bytes of the stack memory will be used
push 16    ; 7th argument to calc_sum(), note that it's "pushq" instruction ("q" for quad, i.e. 64 bits=8 bytes)
...

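Apparently, the C compiler keeps the stack aligned to a 16-byte boundary by subtracting 8 from the stack pointer before pushing the 7th argument. The padding and the pushed argument together take 16 bytes, which "add rsp, 16" releases after the call.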
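
The amount of padding follows from the number of stack-passed arguments. Below is a rough model in C, assuming that all arguments are of class INTEGER and that RSP is 16-byte aligned before the caller starts setting up the arguments. The function names are invented for this sketch, and a real compiler may lay out its frame differently.

```c
#include <assert.h>

/* Number of INTEGER class arguments that do not fit into the six registers. */
unsigned stack_arg_count(unsigned total_args)
{
    return total_args > 6 ? total_args - 6 : 0;
}

/* Padding in bytes inserted before pushing the stack arguments so that
   RSP is a multiple of 16 again at the call instruction. */
unsigned arg_padding(unsigned total_args)
{
    unsigned arg_bytes = 8 * stack_arg_count(total_args);
    return (16 - arg_bytes % 16) % 16;
}

/* Total number of bytes released by "add rsp, N" after the call. */
unsigned arg_cleanup(unsigned total_args)
{
    unsigned n = 8 * stack_arg_count(total_args) + arg_padding(total_args);
    assert(n % 16 == 0);
    return n;
}
```

For calc_sum() with its 7 arguments, arg_padding(7) is 8 (the "sub rsp, 8") and arg_cleanup(7) is 16 (the "add rsp, 16"). With 8 arguments no padding would be needed.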

The compiled executable works as expected:

[johndoe@ArchLinux]% gcc add_seven_numbers.c -fno-asynchronous-unwind-tables  -fno-stack-protector -O0 -o add_seven_numbers
[johndoe@ArchLinux]% ./add_seven_numbers
Enter a number: 10
The sum is equal to: 91

where 91 is the sum of all integers in the range 10 to 16 inclusive.

Conclusion

Read the Linux ABI, which is only 128 pages long. The time invested in studying it pays off later in less time spent on debugging.

When you use other C functions or libraries in your assembly code, obviously you have to comply not just with the calling convention, but also with the other specifications of the ABI.

Sooner or later your assembly code, which uses external C functions, will start to resemble the code generated by a C compiler. If you have many local variables, the number of available registers will not be enough to store them and you will have to use the stack for this purpose. These local variables will be accessible at a fixed offset from the base pointer (RBP).

The default stack alignment in recent gcc versions (e.g. v10.2.0) is 16 bytes because the default value of the -mpreferred-stack-boundary option is 4, which corresponds to 2^4=16 bytes.

References

Michael Matz, Jan Hubicka, Andreas Jaeger, Mark Mitchell, "System V Application Binary Interface AMD64 Architecture Processor Supplement".

Chris Wellons, "Raw Linux Threads via System Calls".

Eli Bendersky, "Stack frame layout on x86-64".

Multiple 16-byte stack alignment violations when calling C code from assembly #231

Altynbek Isabekov

Machine Learning and Embedded Systems
