Stack alignment when mixing assembly and C code

Recently, I was solving one of the Codewars.com problems in NASM (x86_64/amd64 Assembly) where I had to use some of the C functions available in the standard C library. During the code refactoring and optimization, I came across a segmentation fault while trying to run the executable. The program broke at this instruction:

The code looked correct, so finding the bug took some time. Finally, I was able to find the root of this behavior. The problem was in the stack alignment.

Minimal working example

To illustrate this issue, we can reproduce a similar minimal working example. Consider a program, which asks for two numbers (long integers) and prints their sum. The prototype in C looks like this:

Let's compile it and test it to see its functionality:

Now let's rewrite it in NASM (x86_64):

Integers int1 and int2 are global variables and are stored in the .data section of the ELF object file. Entry point here is not "_start" but "main" as in the C function. Let's compile the code using NASM and link it with glibc using gcc:

Now try to add two instructions: one before the first printf and the other - after it:

Saving the value of RBX on the stack before executing printf and restoring it by popping it out from the stack is a perfectly logical thing to do, since printf may modify the content of RBX and we don't want to loose it.

Recompile and run the program:

The program execution stops and throws a segmentation fault. More detailed examination reveals the following memory state of the program just before the segmentation fault:

The value of RSP is 0x7fffffffdc68 which is not a multiple of 16 because the value of RSP was decreased by 8 bytes after pushing RBX to the stack before calling printf function in the modified version of the code.
This becomes crucial when instructions such as "movaps" (move packed single-precision floating-point values from xmm2/m128 to xmm1) require 16 byte alignment of the memory operand. Misaligned stack together with "movaps" instruction in the printf function cause segmentation fault.

For this reason, the System V Application Binary Interface AMD64 provides the stack alignment requirement in subsection "The Stack Frame":

"The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point."

The value of RSP before calling any function has to be in the format of 0x???????????????0 where ? represents any 4-bit hexadecimal number. Zero value of the least significant nibble provides 16 byte alignment of the stack. The layout of the stack should look like this:

Stack frame with base pointer
Position Contents Frame
lower addresses
rsp - 128 red zone Current
rsp local variable at higher boundary
...
rbp - 8 local variable at lower boundary
rbp previous rbp value
rbp + 8 return address
rbp + 16 memory argument eightbyte 0 Previous
...
rbp + 16 - 8*n memory argument eightbyte n
higher addresses

Furthermore, one may think about solving this issue manually with preserving the functionality of the program. If pushing the value of a 64-bit register to the stack (e.g. "PUSH RBX") decreases the stack by 8 bytes, we can realign the stack by pushing another 64-bit register (e.g. RCX) just after the "PUSH RBX":

and popping its stored value from the stack just before restoring the value of RBX.

Recompiling and running the program yields:

There is no segmentation fault observed, since the stack is aligned:

The value of RSP is 0x7fffffffdc60 which is a multiple of 16 because the value of RSP was decreased by 16 bytes after pushing RBX and RCX to the stack before calling printf function in the modified version of the code.

Local variables in the stack frame

Now let's try to replace global variables with local variables, i.e. implement this C function in NASM:

We need to allocate at least 8 bytes (2 integers of 4 bytes) on the stack. However, in order to comply with the stack alignment requirement, the allocated local variable space has to be a multiple of 16 bytes:

Subtracting 16 from the stack pointer (RSP) will allocate 16 bytes space for local variables.
If we use 8 bytes instead:

then the compiled code will throw a segmentation fault.

Aligning stack inside a callee function

In the assembly listings generated by some compilers, one can see the following construct in the stack frame initialization:

or

If the least significant nibble of stack pointer is not equal to zero, its value (say N) is subtracted from itself yielding zero which corresponds to aligning the stack to 16 bytes boundary. In other words, it can be considered as allocating N bytes of memory for local variables although this space will not be used for this purpose.

In this manner, the compiler tries to align the stack which was misaligned in the caller (parent) function.

Stack alignment for memory-based arguments of a function

Another way to misalign the stack is to pass more than 6 arguments to a function so that the total number arguments is an odd number. These arguments should be of class INTEGER, which includes _Bool, char, short, int, long, long long and pointer data types. We are not considering SSE, SSEUP, X87, X87UP, COMPLEX_X87 and NO_CLASS arguments (e.g. floating point numbers) because they are passed through other special registers (e.g. XMM).

According to the AMD64 Linux ABI, first six INTEGER class input arguments are passed through registers:

  • rdi, rsi, rdx, rcx, r8, r9

The 7th input argument and beyond are passed through stack as in the 32-bit x86 Linux ABI. Look at the "Stack frame with base pointer" table above, the cell "memory argument eightbyte 0" corresponds to the 7th input argument. Since it takes 8 bytes of memory on the stack, we have a stack alignment problem. To illustrate it, consider the C code below:

The variable "a" in the main function contains the value of an integer entered by a user from a command line. Without this "randomness", the C compiler will optimize the code with respect to the constant values of the 7 summands and pass the calculated deterministic value to the printf function inside calc_sum().

Compile the code to generate the assembly listing:

The generated assembly code looks like this:

I added comments to show the arguments which are passed to the calc_sum() function. Since no optimization was used, the generated assembly code is straightforward. Input arguments a, b, c, d, e, f are passed through registers rdi, rsi, rdx, rcx, r8, r9. The 7th argument g is passed through the stack:

Apparently, the C compiler aligns the stack to 16-byte boundary by subtracting 8 from the stack pointer before pushing the 7th argument.

The compiled executable works as expected:

where 91 is the sum of all integers in the range 10 to 16 inclusive.

Conclusion

Read the Linux ABI which is only 128 pages long. Investing time into education will save time in the future by spending less time on debugging.

When you use other C functions or libraries in your assembly code, obviously you have to comply not just with the calling convention, but also with the other specifications of the ABI.

Sooner or later your assembly code, which uses external C functions, will start to resemble the code generated by a C compiler. If you have many local variables, the number of available registers will not be enough to store them and you will have to use the stack for this purpose. These local variables will be accessible at a fixed offset from the base pointer (RBP).

The default stack alignment in recent gcc versions (e.g. v10.2.0) is 16 bytes because the default value for -mpreferred-stack-boundary parameter is 4 which corresponds to  2^4=16 bytes.

References

Michael Matz,  Jan Hubicka, Andreas Jaeger, Mark Mitchell, "System V Application Binary Interface AMD64 Architecture Processor Supplement".

Chris Wellons, "Raw Linux Threads via System Calls".

Eli Bendersky, Stack frame layout on x86-64

Multiple 16-byte stack alignment violations when calling C code from assembly #231