The thorny path of Hello World

astrotycoon 2019-01-18


The inspiration for this article came from reading a similar publication for the x86 architecture [1].
This material will help those who want to understand how programs are put together internally, what happens before main is entered, and why all of this is done. I will also show how to use some features of the glibc library. At the end, as in the original article [1], the path travelled will be presented visually. Most of the article is an analysis of the glibc library.
So, let's start our trip. We will use Linux x86-64, with lldb as the debugger. Occasionally we will also disassemble the program with objdump.
The source is an ordinary Hello, world (hello1.cpp):

#include <iostream>

int main()

{

std::cout << "Hello, world!" << std::endl;

}

Just in case, information about the system and programs

* Clang -- 4.0.1

* lldb -- 4.0.1

* glibc -- 2.25

* `uname -r` -- 4.12.10-1-ARCH

We compile the code and start debugging:

clang++ -stdlib=libc++ hello1.cpp -g -o hello1.out

lldb hello1.out

Note: Most of the code in the program hardly depends on the chosen compiler and C++ library. It just so happens that I am more familiar with the LLVM infrastructure than with gcc, so I will use the clang compiler with the libc++ library. Again, there is not much difference, because most of the code examined here comes from glibc.
A program started from bash (and not only from bash) is born when the shell calls fork to create a new process and then execve to load the program into it, passing the command-line arguments. The standard input and output descriptors (STDIN, STDOUT, STDERR) are set up before control reaches the first instruction of the executable. With dynamic linking, the dynamic linker then loads and initializes the libraries and calls the functions of the ".preinit_array" section. Only after all this is the first function in the executable itself called (not counting the ".preinit_array" section). It is traditionally named _start and is considered the beginning of the program. With static linking, the work of the dynamic linker, for example processing the ".preinit_array" section, is done inside the executable, and the functions involved differ slightly from those in dynamically linked programs. We will consider dynamically linked programs.
The entry point of the executable is specified in its header:

readelf -h hello1.out | grep Entry

Next, check which function is at this address using objdump -d hello1.out. It is the already mentioned _start function; we put a breakpoint on it and start debugging.

b _start

r

A little about the ABI. Wikipedia's definition:
An ABI (application binary interface) is a set of conventions for an application's access to the operating system and other low-level services, designed for portability of executable code between machines that have compatible ABIs. This is unlike an API, which governs compatibility at the source code level. An ABI can be seen as a set of rules that let the linker combine compiled modules without recompiling all the code, while defining the binary interface.
The ABI level is hidden from C/C++ programmers; all the work at this level is done by the compiler and the standard C library. In my case, the clang compiler and the glibc library follow all the ABI rules. The ABI rules for Linux x86-64 are specified in the System V AMD64 ABI document [2]. Solaris, Linux, FreeBSD and OS X follow the conventions of this document. Microsoft uses its own, different ABI. The first chapter of this document [2] says that the architecture also follows the ABI rules for 32-bit processors [3]. These are therefore the two basic documents that developers of low-level libraries such as glibc rely on.
According to the ABI, when the program starts, all registers are undefined except for:

%rdx: A pointer to a function that must be called before the program terminates.
%rsp: The stack is aligned on a 16-byte boundary and contains the number of arguments, the arguments themselves and the environment:
0(%rsp) argc
8(%rsp) argv[0]
...
8*(argc+1)(%rsp) NULL
8*(argc+2)(%rsp) envp[0]
...
8*(argc+k+2)(%rsp) envp[k]
NULL
auxiliary vector entries
...
NULL
NULL

The auxiliary vector contains information about the current machine. You can see its values by running

LD_SHOW_AUXV=1 ./hello1.out

The values obtained are described fairly well in [4].
Indeed,

x $rsp -s8 -fu -c1

prints the number of program arguments, and

p *(char **)($rsp + 8)

prints the name of the program. Further up the stack come the remaining program arguments, the NULL delimiter, the environment variables and the auxiliary vector.
In addition, the flags register is set and SSE and x87 are configured (§3.4.1 [2]).
You can see that the arguments are almost ready for the user-defined main function; only the right pointers remain to be set up. But besides setting up pointers, a lot more work has to be done before main is entered. From here on, the description of each function is accompanied by the location of its source and of the function itself in the binary, presented as a tooltip, for example: main.
Let's look at the _start function. It is small, and its main task is to transfer control to the __libc_start_main function.
We will disassemble the current function with di (the output here and below is formatted for clarity):

_start:

xor %ebp, %ebp

mov %rdx, %r9

pop %rsi

mov %rsp, %rdx

and $-0x10, %rsp

push %rax

push %rsp

lea 0x1aa(%rip), %r8 ; __libc_csu_fini

lea 0x133(%rip), %rcx ; __libc_csu_init

lea 0xec(%rip), %rdi ; main

call *0x200796(%rip) ; __libc_start_main

hlt

The _start function is linked into our program by the linker from the object file Scrt1.o. There are several varieties of crt1 object files (gcrt1, Scrt1, Mcrt1) that perform similar functions but are used in different cases. For example, Scrt1.o is used when generating PIC code [5]. You can verify which object file was chosen by compiling the program with the -v switch. Note that the object file does not contain the offsets of __libc_csu_fini, __libc_csu_init and main, because the offsets of these functions become known only at link time.
According to the ABI, %ebp must be zeroed to mark the outermost frame, which is what the xor %ebp, %ebp instruction does.
Next comes the preparation for calling __libc_start_main, whose signature looks like this:

int __libc_start_main(int (*main) (int, char **, char **),

int argc, char **argv,

__typeof (main) init,

void (*fini) (void),

void (*rtld_fini) (void), void *stack_end)

According to the ABI, the arguments of the function must be placed in the appropriate locations:

Argument	Location	Description
main	%rdi	The program's main function
argc	%rsi	The number of program arguments
argv	%rdx	Array of arguments. After the arguments come the environment variables, and after those the auxiliary vector
init	%rcx	The global object constructor, called before main. The type of this function is the same as that of main.
fini	%r8	The global object destructor, called after main
rtld_fini	%r9	The destructor of the dynamic linker. Unloads the dynamically loaded libraries
stack_end	on the stack (pushed from %rsp)	The current position of the aligned stack

The ABI requires the stack to be aligned on a 16-byte boundary when a function is called (sometimes 32 or 64 bytes, depending on the argument types). This requirement is met by the and $-0x10, %rsp instruction. The point of this alignment is that some SIMD instructions (aligned SSE loads and stores, for example) work only with aligned data, and scalar instructions read and write aligned data faster.
To preserve the 16-byte alignment, %rax, which holds an undefined value, is pushed onto the stack before the call to __libc_start_main. This stack slot will never be read.
The program must never return from __libc_start_main, and the hlt instruction is there to flag such incorrect behaviour. The peculiarity of this instruction is that in protected mode it can be executed only in protection ring 0, that is, only by the operating system. We are in ring 3, so an attempt to execute an instruction the program has no right to run results in a segmentation fault.
After the hlt instruction there is one more instruction, nopl 0x0(%rax,%rax,1), which is needed to align the next function on a 16-byte boundary. The ABI does not require this, but compilers align the start of functions to improve performance.
So, let's go further

b __libc_start_main

c

The source code of __libc_start_main shows that different code is generated for the static and the shared builds of the library. You can check what the function looks like in libc.so.6 with gdb or with lldb:

lldb libc.so.6 -b -o 'di -n __libc_start_main'

A bit about __glibc_[un]likely. The glibc code contains many occurrences of __glibc_likely and __glibc_unlikely; a large number of conditions are wrapped in these macros. The macros eventually expand to the following built-in functions:

# define __glibc_unlikely(cond) __builtin_expect ((cond), 0)

# define __glibc_likely(cond) __builtin_expect ((cond), 1)

__builtin_expect is an optimization hint that helps the compiler lay out pieces of code in memory. We tell the compiler which branch is most likely to be executed, and the compiler places that code immediately after the comparison instruction, improving instruction caching; the other branch, if there is one, is moved to the end of the function.
The __libc_start_main function is rather bulky, so it is enough to briefly describe its main actions:

registering rtld_fini with __cxa_atexit
call __libc_csu_init
create cancellation point
main
exit

__cxa_atexit

Unlike atexit, which is a wrapper around it, __cxa_atexit can pass an argument to the registered function, but it should not be called directly from user code. The reason is that the function takes a DSO identifier, which is known only to the compiler. It is needed so that after a call __cxa_atexit(f, p, d), the function f(p) is called when the DSO d is unloaded [8].
Still, here is an example of using __cxa_atexit to pass an argument to the registered function:

#include <cstdio>

extern "C" int __cxa_atexit (void (*func) (void *), void *arg, void *d);

extern void* __dso_handle;

void printArg(void *a)

{

int arg = *static_cast<int*>(a);

printf("%d\n",arg);

delete (int*)a;

}

int main()

{

int *k = new int(17);

__cxa_atexit(printArg, k, __dso_handle);

}

I recommend this trick for educational purposes only. To run cleanup code when the program exits, it is safer to use a standard mechanism such as atexit.
rtld_fini is a pointer to the dynamic linker's function _dl_fini. And yes, the dynamic linker is part of the glibc library. The _dl_fini function deinitializes and unloads all loaded libraries.

__libc_csu_init

It is possible to get into the function __libc_csu_init in the same way as we got to the previous one. __libc_csu_init calls _init and function pointers in the .init_array section.

_init

The _init function lives entirely in the .init section. Its code is divided into two parts: the introduction and the epilogue. The introduction consists of a prologue and an attempt to call the __gmon_start__ function.

_init

subq $0x8, %rsp

leaq 0x105(%rip), %rax ; __gmon_start__

testq %rax, %rax

je 0x5555555548a2 ; je to addq instruction

callq *%rax

addq $0x8, %rsp

retq

The main task of the _init function is to initialize the gprof profiler. The instruction leaq 0x105(%rip), %rax takes the address of __gmon_start__, the function that initializes the profiler. If the profiler is not present, %rax will hold 0 and the je jump will be taken. The instructions subq $0x8, %rsp and addq $0x8, %rsp align the stack and then restore it to its original state. This alignment is necessary because calling the function pushed a return address onto the stack, and on x86-64 that address is 8 bytes in size.
You can add your own code to the .init section. Consider the hello2.cpp example:

#include <cstdio>

extern "C" void my_init()

{

puts("Hello from init");

}

__asm__(

".section .init\n"

"call my_init"

);

int main(){}

Consider how _init looks now:

subq $0x8, %rsp

movq 0x200835(%rip), %rax

testq %rax, %rax

je 0x5555555547ba

callq *%rax

callq 0x555555554990 ; ::my_init()

addq $0x8, %rsp

retq

As you can see from the listing, the instruction callq 0x555555554990 was added between the introduction and the epilogue of the function; it calls my_init. Apparently _init is implemented this way precisely so that you can easily add your own initialization for parts of the program.
An interesting fact: an attentive reader has probably noticed that hello2.cpp prints with the puts function. If you print via cout instead, then with the libstdc++ library there will be a segmentation fault, while with libc++ the message will be printed normally. Why does this happen? The point is that in libstdc++ cout is initialized like an ordinary global object, and global objects are initialized somewhat later. In the case of libc++, the initialization happens while the libraries are being loaded, in the _dl_init function of ld-linux-x86-64.so.2. That function is called from _dl_start_user just before control is passed to _start.
Each method has its advantages and disadvantages. When you link against libc++, the constructors are called in any case, even if you do not use C++ standard output such as cout. With libstdc++, even if optimization flags are enabled, the constructor is called as many times as the iostream header is included. Naturally, the constructor itself takes into account that it may be called several times and skips re-initialization. This will not noticeably slow down program initialization, but it is still unpleasant. Apparently for this reason, many high-performance projects do not use, do not recommend, or even forbid including the iostream header and, as a result, create their own input/output interfaces.

.init_array

Next, the functions whose pointers are stored in the .init_array section are called.
Let's check the contents of the section:

objdump hello1.out -s -j .init_array

In my case, the contents of .init_array are a00f0000 00000000, which is the address 0x0fa0 on a 64-bit system with little-endian byte order. The function frame_dummy is located at this address.

frame_dummy

Interestingly, frame_dummy is part of gcc's runtime support code.
What does gcc have to do with it? We are using the clang compiler, after all! Do not forget that the gcc project is very large and has put down deep roots in Linux systems. The gcc project contains not only the compiler but also files needed for building programs. For example, crt files such as crtbeginS.o and crtendS.o are used when linking.
Therefore it is not possible to get rid of the gcc project completely; at the very least, the helper crt files have to stay. Unix operating systems that do not use gcc as their main compiler do exactly that.
frame_dummy looks like this:

pushq %rbp

movq %rsp, %rbp

popq %rbp

jmp 0x555555554cc0 ; register_tm_clones

nopw (%rax,%rax)

The task of frame_dummy is to set up the arguments and start the register_tm_clones function. This layer is needed only to set up the arguments. In this case no arguments are set, but as the source code shows, this is not always so; it depends on the architecture. Interestingly, the first two instructions are a prologue and the third is an epilogue. The jmp instruction is a tail-call optimization. And as usual, alignment padding comes at the end.
The register_tm_clones function is needed to activate transactional memory.

Initializing global objects

Global objects, if any, are initialized here.
When global objects are present, the address of the function

_GLOBAL__sub_I_<compiled file name>

is added to the .init_array section.
Let's consider an example of initializing global variables:
global1.cpp:

#include <cstdio>

int k = printf("Hello from .init_array");

The variable will be initialized as follows:

push %rbp

mov %rsp, %rbp

lea 0xf59(%rip), %rdi ; + 4

mov $0x0, %al

call 0x555555554e80 ; symbol stub for: printf

mov %eax, 0x202130(%rip) ; k

pop %rbp

ret

The first two instructions are a prologue. Next we prepare to call printf, putting a pointer to our string in %rdi and setting %al to zero. According to the ABI [2], functions with a variable number of arguments take a hidden parameter in %al that gives an upper bound on the number of vector registers used for the variable arguments. Most likely this is meant for optimization; printf, for instance, uses this information to decide whether to move data from the vector registers to the stack.
After the call to printf, its result is stored in the memory of the variable k and the epilogue is executed.
global2.cpp:
Suppose we have some class Global with a non-trivial constructor and destructor:

Global g;

Then the initialization will look like this:

push %rbp

mov %rsp, %rbp

sub $0x10, %rsp

lea 0x202175(%rip), %rdi ; g

call 0x5555555550e0 ; Global::Global()

lea 0x1c5(%rip), %rdi ; Global::~Global()

lea 0x202162(%rip), %rsi ; g

lea 0x202147(%rip), %rdx ; __dso_handle

call 0x555555554f10 ; symbol stub for: __cxa_atexit

mov %eax, -0x4(%rbp)

add $0x10, %rsp

pop %rbp

ret

Here we see how, after calling the global constructor, the destructor is registered with __cxa_atexit. This is implemented according to Itanium ABI [8].

Initializing function call

From glibc, the initialization is invoked as follows:

(*__init_array_start [i]) (argc, argv, envp);

Note that the initializing function receives the same parameters as main, so we can use them. The gcc and clang compilers have the constructor attribute, by means of which a function is called before the initialization of objects.
We can receive these arguments in such a function. Check the output of a program that uses the following global function:

void __attribute__((constructor)) hello(int argc, char **argv, char **env)

{

printf("#args = %d\n", argc);

printf("filename = %s\n", argv[0]);

}

This can be used for more practical purposes (hello3.cpp):

#include <cstdio>

class C

{

public:

C(int i)

{

printf("Program has %d argument(s)\n", i);

}

};

int constructorArg;

const C c(constructorArg);

void __attribute__((constructor (65535))) hello(int argc, char ** argv, char **env){

constructorArg = argc;

}

int main(){}

The parameter of the constructor attribute is the priority of the call.
As you have probably guessed, the program prints the correct number of arguments and, most interestingly, the object c is a constant. The main disadvantages of this approach are that it is not supported by the standard and, as a consequence, is not portable. This code also depends heavily on the libc library in use.
I would like to add that global variables such as int x = 1 + 2 * 3; are not initialized at run time at all; the compiler writes their values into the program image. If you want variables initialized by simple functions, such as int s = sum(4, 5), to be handled the same way, add the constexpr specifier from the C++11 standard to the sum function.

Creating cancellation point

The cancellation point is created by calling setjmp and setting a global variable.
Saving the context with setjmp is necessary to set up the cancellation buffer, so that the main thread can terminate correctly when it is cancelled.
Example of cancelling the main thread, file cancel.cpp:

#include <pthread.h>

pthread_t g_thr = pthread_self();

void * thread_start(void *)

{

pthread_cancel(g_thr);

return 0;

}

int main()

{

pthread_t thr;

pthread_create(&thr, NULL, thread_start, NULL);

pthread_detach(thr);

while (1)

{

pthread_testcancel();

}

}

In the cancel.cpp example, the main thread is terminated by cancellation from the secondary thread, and then the exit function is called. Moreover, if the thread we created were still alive after the main thread was cancelled, the thread counter would show that the process still has threads, and then only the main thread would terminate while the auxiliary one continued to exist.
You can verify that the context is really restored by putting a breakpoint immediately after the setjmp call:

br set -n __libc_start_main -R 162

The program stops twice: the first time when the program is initialized, the second time after the main thread is canceled.
Instead of the expected setjmp, the call goes to __GI__setjmp. This function is an alias of the former, and such an alias is created for every symbol used inside the library. This is done primarily to keep the library working when global symbols are replaced and to increase performance [7]. Performance improves because the call is direct rather than through the PLT.

main

The source code of the function consists of one line.

std::cout << "Hello, world!" << std::endl;

The same can be written as:

operator<<(std::cout, "Hello, world!").operator<<(std::endl);

or as two separate calls:

operator<<(std::cout, "Hello, world!");

std::cout.operator<<(std::endl);

In C++, operator<< is an ordinary overloaded function. It can be implemented as a class member, as in the second call, or outside the class, as for the output of a string. There are even some rules on how to implement the overloads.

endl is a function in both libc++ and libstdc++ and looks like this:

ostream & endl (ostream &);

To call this function, the ostream class declares an overload of operator<< that takes a function pointer; it is implemented in the spirit of the visitor design pattern.
The first call, operator<<(std::cout, "Hello, world!"), begins by computing the length of the string. In my case strlen is an IFUNC symbol, that is, its implementation is chosen at run time between __strlen_avx2 and __strlen_sse2. In your case, a strlen that does not use vector registers may also be selected.
When data is written to stdout for the first time, memory for the internal buffer is allocated with malloc in the _IO_file_doallocate function; in my case the buffer size is 1 KB. Streams can have different buffering policies, which are configured with setvbuf.
The string is placed in the stdout buffer, so it is not displayed immediately. The second call adds a newline to the buffer and calls flush, which empties the internal stdout buffer and sends its contents to the screen.
In my case, after the newline is added, even before flush is called, control passes to fwrite, which calls __libc_write, which in turn makes the system call as follows (the function is slightly simplified for clarity):

ssize_t __libc_write (int fd, const void *buf, size_t nbytes)

{

return ({

unsigned long int resultvar =

({ unsigned long int resultvar;

long int __arg3 = (long int) (nbytes);

long int __arg2 = (long int) (buf);

long int __arg1 = (long int) (fd);

register long int _a3 asm ("rdx") = __arg3;

register long int _a2 asm ("rsi") = __arg2;

register long int _a1 asm ("rdi") = __arg1;

asm volatile ( "syscall\n\t"

: "=a" (resultvar)

: "0" (1) , "r" (_a1), "r" (_a2), "r" (_a3)

: "memory", "cc", "r11", "cx"); (long int) resultvar; });

resultvar;

});

}

The library makes heavy use of statement expressions, a gcc language extension:

int l = ({int b = 4; int c = 8; c += b;});

The value of such an expression is the value of its last statement, that is, c += b, so in this example l == 12.
The __libc_write function (called __GI___libc_write in the disassembly for the same reasons as setjmp) uses the system call interface to write to the console, and issues the syscall instruction through gcc's inline assembly extension. The system call number is put in register %rax. In the code this is done with register constraints: "=a" means the result is returned in rax, and "0" (1) requires that the value 1 (sys_write) be in that same register before the instruction executes.
System calls, like ordinary calls, can take arguments. For sys_write, for example, the arguments are a file descriptor, a data buffer and the size of the data to write.
The kernel calling convention, according to the ABI [2], differs from the ordinary one. For the kernel, the parameters must be placed, in order, in the following registers: %rdi, %rsi, %rdx, %r10, %r8, %r9.
The system call tables for the x86 and x86-64 architectures are different!
When debugging, you may have noticed that the first time the main program enters any shared library function, there are jumps and calls to various functions that we did not write. This is how function addresses are resolved with PIC technology.

exit

The exit function consists of the following calls:

__call_tls_dtors - calls the destructors of thread-local storage; in our case we did not use any.
The functions registered with atexit are run:
    _dl_fini - the same function that was passed to _start in register %r9 and that must be called before the program terminates.
    Destructors of global objects (in our case there are none).
The functions of the __libc_atexit section are called:
    _IO_cleanup - flushes and frees the stream buffers.
_exit - terminates execution of all threads of the process.

The _exit function makes system call 231 (sys_exit_group), which takes the program's return value as a parameter in the %rdi register. This completes the program.
Linux also has a sys_exit system call. The difference between the two is that sys_exit terminates only the current thread, while sys_exit_group terminates all threads of the process. For a single-threaded process the two calls are equivalent, but when a multi-threaded program ends with sys_exit, the other threads keep running and the system does not tear down the process until all its threads have finished [6].
This is the usual path the processor travels every time you run a "Hello, World!!!" written in C/C++ with the glibc library. Much has been left outside this article: the work of the loader, the initialization of transactional memory, the implementation of the functions setjmp, atexit ...
The path travelled is easier to grasp as a graph, generated with dot.