In the first article in this two-part series, published in the August 2024 issue of OSFY, we discussed the role of the C library in system call execution. We saw how the C library loads system call arguments into architecture-specific registers and executes the syscall instruction, which switches the CPU from user mode to kernel mode. In this final part we will discuss what happens next: how the kernel handles and executes the system call request on behalf of the user space application, and sends the return value of the system call back to the user application.
We have already discussed the role of the C library (glibc) in handling system call requests from user applications in the first part of this series. We learnt how glibc sets the system call arguments in the CPU registers and executes the syscall instruction, which transfers control to entry code in the kernel, much as a software interrupt would. I will now discuss in detail how system call execution is handled from the perspective of the kernel.
System call table
Once the syscall instruction is executed, control passes to entry code in the kernel. But how does the CPU know which kernel entry code to run for the syscall instruction, and how does that entry code know which system call handler to invoke? I will explain this, but first let's discuss the system call table.
The Linux kernel contains a table called the system call table, which is represented by the sys_call_table array. This array is defined in arch/x86/entry/syscall_64.c. Given below are some of the important implementation details of sys_call_table.
arch/x86/entry/syscall_64.c

    #define __SYSCALL_64(nr, sym, compat) extern asmlinkage void sym(void);
    #include <asm/syscalls_64.h>
    #undef __SYSCALL_64

    #define __SYSCALL_64(nr, sym, compat) [nr] = sym,

    typedef void (*sys_call_ptr_t)(void);

    extern void sys_ni_syscall(void);

    const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        /*
         * Smells like a compiler bug -- it doesn't work
         * when the & below is removed.
         */
        [0 ... __NR_syscall_max] = &sys_ni_syscall,
    #include <asm/syscalls_64.h>
    };
There are a few important points to be noted in the code given above.
- sys_call_table is an array of size __NR_syscall_max+1. __NR_syscall_max is a macro, #define __NR_syscall_max 547 (the value may differ in newer kernels), in the include/generated/asm-offsets.h header file, which is generated when the kernel is built.
- The element type of the sys_call_table array is sys_call_ptr_t, which is defined as typedef void (*sys_call_ptr_t)(void);
- Initially, every element of the array points to sys_ni_syscall. The sys_ni_syscall function only returns -ENOSYS.
- The most important thing to note is the #include <asm/syscalls_64.h> line just before the closing brace of the array initialiser.
- The <asm/syscalls_64.h> header file is generated from arch/x86/entry/syscalls/syscall_64.tbl by the script arch/x86/entry/syscalls/syscalltbl.sh.
- The generated header file <asm/syscalls_64.h> contains macro invocations similar to those given below:
    __SYSCALL_COMMON(0, sys_read, sys_read)
    __SYSCALL_COMMON(1, sys_write, sys_write)
    __SYSCALL_COMMON(2, sys_open, sys_open)
    __SYSCALL_COMMON(3, sys_close, sys_close)
    __SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
- The __SYSCALL_COMMON macro is defined in arch/x86/entry/syscall_64.c as below:
    #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
    #define __SYSCALL_64(nr, sym, compat) [nr] = sym,
- Finally, after expansion of the macros, sys_call_table will look like this:
    asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        [0 ... __NR_syscall_max] = &sys_ni_syscall,
        [0] = sys_read,
        [1] = sys_write,
        [2] = sys_open,
        ...
    };
- Each index of sys_call_table that equals a system call number now contains the address of the corresponding system call handler, whereas the indexes that correspond to non-implemented system calls still hold the address of sys_ni_syscall, which returns -ENOSYS.
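The table-building trick described above can be tried out in user space. The following is only a minimal sketch, not kernel code: every demo_* name is hypothetical, an X-macro list stands in for the generated <asm/syscalls_64.h>, and the list is expanded twice (once to declare the handlers, once to fill the table), just as the kernel includes its generated header twice. Unlike the kernel, which pre-fills every slot with sys_ni_syscall using a GNU range designator, this sketch leaves unset slots NULL and checks for that in the dispatcher, so that it stays within standard C.

```c
#include <errno.h>
#include <stddef.h>

/* All demo_* names are hypothetical; none of them exist in the kernel. */
#define DEMO_NR_SYSCALL_MAX 7

/* Plays the role of the generated <asm/syscalls_64.h>: one entry per call. */
#define DEMO_SYSCALLS(X) \
    X(0, demo_read)      \
    X(1, demo_write)     \
    X(3, demo_close)

typedef long (*demo_call_ptr_t)(void);

/* First expansion: declare every handler named in the list. */
#define DEMO_DECLARE(nr, sym) static long sym(void);
DEMO_SYSCALLS(DEMO_DECLARE)

/* Second expansion: turn each entry into "[nr] = sym,". */
#define DEMO_ENTRY(nr, sym) [nr] = sym,
static const demo_call_ptr_t demo_call_table[DEMO_NR_SYSCALL_MAX + 1] = {
    DEMO_SYSCALLS(DEMO_ENTRY)
};

/* Toy handlers; each returns a recognisable value instead of doing work. */
static long demo_read(void)  { return 100; }
static long demo_write(void) { return 101; }
static long demo_close(void) { return 103; }

/* Look up the handler by number, as the entry code does with rax. */
long demo_dispatch(unsigned long nr)
{
    /* Out-of-range number or unset slot: behave like sys_ni_syscall. */
    if (nr > DEMO_NR_SYSCALL_MAX || demo_call_table[nr] == NULL)
        return -ENOSYS;
    return demo_call_table[nr]();
}
```

With this table, demo_dispatch(1) reaches demo_write, while demo_dispatch(2) and demo_dispatch(8) both return -ENOSYS, the first because slot 2 was never filled and the second because the number is beyond the end of the table.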
Execution of system call handler
We now understand that the kernel maintains a sys_call_table array that contains the address of each system call handler function. But how are these system call handler functions invoked when a user space application executes a syscall instruction? In order to understand this, check the description in the box below.
From the Intel manual: SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR. IA32_LSTAR is a model-specific register (MSR). An MSR is any of various control registers in the x86 instruction set used for debugging, program execution tracing, computer performance monitoring, and toggling certain CPU features. These registers are read and written with the rdmsr and wrmsr instructions, respectively. As these are privileged instructions, they can be executed only by the operating system.
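To make the description in the box more concrete, here is a small C model of what the processor does when it executes the syscall instruction. It is a sketch under stated assumptions, not real hardware or kernel code: struct demo_cpu and demo_syscall are hypothetical names, and only the parts relevant to this article are modelled (the return address going to RCX, the flags going to R11, the flags being masked with IA32_FMASK, and RIP being loaded from IA32_LSTAR).

```c
#include <stdint.h>

/* Hypothetical, heavily simplified model of the CPU state. */
struct demo_cpu {
    uint64_t rip;        /* address of the syscall instruction itself */
    uint64_t rflags;
    uint64_t rcx;
    uint64_t r11;
    int      cpl;        /* current privilege level: 3 = user, 0 = kernel */
    uint64_t msr_lstar;  /* models IA32_LSTAR: kernel entry point */
    uint64_t msr_fmask;  /* models IA32_FMASK: flag bits to clear on entry */
};

/* Model of the syscall instruction (opcode 0F 05, two bytes long). */
void demo_syscall(struct demo_cpu *cpu)
{
    cpu->rcx = cpu->rip + 2;          /* return address is saved in RCX */
    cpu->r11 = cpu->rflags;           /* user flags are saved in R11 */
    cpu->rflags &= ~cpu->msr_fmask;   /* masked flag bits are cleared */
    cpu->rip = cpu->msr_lstar;        /* jump to the kernel entry code */
    cpu->cpl = 0;                     /* now running at privilege level 0 */
}
```

Note that the instruction does not touch the stack pointer at all, which is why the kernel entry code has to switch to the kernel stack by itself.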
It's now clear that the kernel must write to the IA32_LSTAR MSR the address of the kernel entry code that needs to be executed when a user space application triggers a system call request (by executing the syscall instruction). The IA32_LSTAR MSR is written with the address of the system call entry point code during kernel initialisation. Here are some of the important code snippets for system call handler execution by the Linux kernel. Check the code comments marked in red to understand the details of the implementation.
<span style="color: red;">/* MSR_LSTAR is written with the address of the system call entry code, entry_SYSCALL_64, which is defined in arch/x86/entry/entry_64.S. This means that after a syscall instruction in user mode, entry_SYSCALL_64 will be executed in kernel mode. */</span>
arch/x86/kernel/cpu/common.c

    void syscall_init(void)
    {
        wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
        wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
        .
        .
    }
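The wrmsr() call above passes the MSR value as two 32-bit halves (low, then high), so the segment selectors end up in the upper half of MSR_STAR. The following user space sketch shows how that 64-bit value is composed and which selectors the CPU derives from it. The demo_* names are hypothetical, and the selector values 0x10 and 0x23 are assumptions based on the usual x86-64 Linux GDT layout; check asm/segment.h for the kernel you are reading.

```c
#include <stdint.h>

/* Assumed selector values for x86-64 Linux (see asm/segment.h). */
#define DEMO_KERNEL_CS 0x10u
#define DEMO_USER32_CS 0x23u

/* wrmsr takes the value as two 32-bit halves: low (EAX) and high (EDX). */
uint64_t demo_msr_value(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* The value written to MSR_STAR by the syscall_init() snippet. */
uint64_t demo_star_value(void)
{
    return demo_msr_value(0, (DEMO_USER32_CS << 16) | DEMO_KERNEL_CS);
}

/* SYSCALL loads the kernel CS from STAR[47:32]. */
uint16_t demo_syscall_cs(uint64_t star)
{
    return (uint16_t)(star >> 32);
}

/* 64-bit SYSRET loads CS from STAR[63:48] + 16, with RPL forced to 3. */
uint16_t demo_sysret_cs(uint64_t star)
{
    return (uint16_t)(((star >> 48) + 16) | 3);
}

/* SYSRET loads SS from STAR[63:48] + 8, with RPL forced to 3. */
uint16_t demo_sysret_ss(uint64_t star)
{
    return (uint16_t)(((star >> 48) + 8) | 3);
}
```

Under these assumptions the selectors that SYSRET derives are 0x33 for the code segment and 0x2b for the stack segment, which would correspond to the __USER_CS and __USER_DS values that the entry code pushes onto the kernel stack.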
<span style="color: red;">/* Now let's look at entry_SYSCALL_64 in detail */</span>
arch/x86/entry/entry_64.S
    ENTRY(entry_SYSCALL_64)
        .
        .
        /*
         * The SWAPGS_UNSAFE_STACK macro expands to the swapgs instruction,
         * which exchanges the user GS base with the kernel GS base held in
         * MSR_KERNEL_GS_BASE. This gives the kernel access to its per-CPU
         * data, which it needs in order to locate the kernel stack.
         */
        SWAPGS_UNSAFE_STACK

        /*
         * Move the old (user) stack pointer to a per-CPU variable
         * and set up the stack pointer to point
         * to the top of the kernel stack for the current processor:
         */
    GLOBAL(entry_SYSCALL_64_after_swapgs)
        movq    %rsp, PER_CPU_VAR(rsp_scratch)
        movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
        /*
         * Construct struct pt_regs on the kernel stack. This is the stack
         * frame for the system call handler: it holds the general-purpose
         * register state at the moment the system call was made from user
         * space. We push the user stack segment, the user stack pointer,
         * the flags (saved in r11), the user code segment, the return
         * address to user space (saved in rcx), the system call number,
         * and then the system call arguments.
         */
        TRACE_IRQS_OFF

        /* Construct struct pt_regs on stack */
        pushq   $__USER_DS                  /* pt_regs->ss */
        pushq   PER_CPU_VAR(rsp_scratch)    /* pt_regs->sp */
        pushq   %r11                        /* pt_regs->flags */
        pushq   $__USER_CS                  /* pt_regs->cs */
        pushq   %rcx                        /* pt_regs->ip */
        pushq   %rax                        /* pt_regs->orig_ax */
        pushq   %rdi                        /* pt_regs->di */
        pushq   %rsi                        /* pt_regs->si */
        pushq   %rdx                        /* pt_regs->dx */
        pushq   %rcx                        /* pt_regs->cx */
        pushq   $-ENOSYS                    /* pt_regs->ax */
        pushq   %r8                         /* pt_regs->r8 */
        pushq   %r9                         /* pt_regs->r9 */
        pushq   %r10                        /* pt_regs->r10 */
        pushq   %r11                        /* pt_regs->r11 */
        sub     $(6*8), %rsp                /* pt_regs->bp, bx, r12-15 not saved */
        …
        …
        …
    entry_SYSCALL_64_fastpath:
        /*
         * Easy case: enable interrupts and issue the syscall. If the syscall
         * needs pt_regs, we'll call a stub that disables interrupts again
         * and jumps to the slow path.
         */
        TRACE_IRQS_ON
        ENABLE_INTERRUPTS(CLBR_NONE)
        /*
         * Here we compare the system call number (in rax) with
         * __NR_syscall_max. Remember __NR_syscall_max and sys_call_table,
         * which we discussed earlier. If the number is out of range, we
         * jump ahead and -ENOSYS, already stored in pt_regs->ax, is returned.
         */
    #if __SYSCALL_MASK == ~0
        cmpq    $__NR_syscall_max, %rax
    #else
        andl    $__SYSCALL_MASK, %eax
        cmpl    $__NR_syscall_max, %eax
    #endif
        ja      1f      /* return -ENOSYS (already in pt_regs->ax) */
        /*
         * Move the fourth argument from r10 to rcx. This is needed because
         * the syscall instruction convention passes the fourth argument in
         * r10 (we discussed this in the first part), whereas the C calling
         * convention expects it in rcx.
         * Then execute the call instruction with the address of the system
         * call handler. rax contains the system call number, and each
         * element of sys_call_table is an 8-byte pointer to a system call
         * handler, so *sys_call_table(, %rax, 8) selects the entry for the
         * given system call number. With this, the system call handler
         * is called.
         */
        movq    %r10, %rcx
        call    *sys_call_table(, %rax, 8)
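The stack frame built by the pushq sequence in the listing above can be pictured as a C structure. The sketch below is an illustration only: struct demo_pt_regs is a hypothetical stand-in that mirrors the field order implied by the pushes (the last value pushed ends up at the lowest address, so the fields appear in reverse push order), and demo_call_handler models, in C rather than in registers, how the six system call arguments are handed to a handler, with the fourth argument coming from r10.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the frame; lowest address first. */
struct demo_pt_regs {
    /* Space reserved by "sub $(6*8), %rsp"; not saved on the fast path. */
    uint64_t r15, r14, r13, r12, bp, bx;
    /* Pushed one by one, in the reverse of this order. */
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
    uint64_t orig_ax;               /* system call number */
    uint64_t ip, cs, flags, sp, ss; /* user context needed for the return */
};

/* 15 pushed values plus 6 reserved slots, 8 bytes each. */
_Static_assert(sizeof(struct demo_pt_regs) == 21 * 8, "unexpected frame size");

typedef long (*demo_handler_t)(uint64_t, uint64_t, uint64_t,
                               uint64_t, uint64_t, uint64_t);

/* Pass the arguments in system call order: di, si, dx, r10, r8, r9. */
long demo_call_handler(demo_handler_t fn, const struct demo_pt_regs *regs)
{
    return fn(regs->di, regs->si, regs->dx, regs->r10, regs->r8, regs->r9);
}

/* Toy handler that simply reports its fourth argument. */
long demo_fourth_arg(uint64_t a1, uint64_t a2, uint64_t a3,
                     uint64_t a4, uint64_t a5, uint64_t a6)
{
    (void)a1; (void)a2; (void)a3; (void)a5; (void)a6;
    return (long)a4;
}
```

In this layout the ax slot sits at byte offset 80 from the start of the frame, which is the slot that is pre-loaded with -ENOSYS on entry and later overwritten with the handler's return value.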
Exiting from system call and resuming execution in user space
We have seen how the kernel executes the entry code for the system call handler, saves the user space program context onto the kernel stack, prepares the stack frame for the system call handler and, finally, executes the system call. Let's now find out how the kernel sends the return value of the system call to the user program and resumes execution in user mode.
Once the system call handler finishes execution, control returns to arch/x86/entry/entry_64.S, to the instruction right after the call to the system call handler: call *sys_call_table(, %rax, 8). Now let's examine the code given below to understand the steps for returning from a system call.
arch/x86/entry/entry_64.S
        /*
         * The system call handler leaves its return value in the rax
         * register. This value is stored in the slot of the kernel stack
         * frame where the user mode rax is saved.
         * After this, all the general-purpose registers are restored except
         * rcx and r11. rcx is loaded with the return address in the user
         * mode program, i.e., the ip saved on the kernel stack when the
         * user mode program executed the syscall instruction.
         * r11 is loaded with the saved flags register of the user mode
         * program.
         */
        movq    %rax, RAX(%rsp)
        DISABLE_INTERRUPTS(CLBR_ANY)
        TRACE_IRQS_OFF
        movq    PER_CPU_VAR(current_task), %r11
        testl   $_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
        jnz     1f
        LOCKDEP_SYS_EXIT
        TRACE_IRQS_ON           /* user mode is traced as IRQs on */
        movq    RIP(%rsp), %rcx
        movq    EFLAGS(%rsp), %r11
        RESTORE_C_REGS_EXCEPT_RCX_R11
        movq    RSP(%rsp), %rsp
<span style="color: red;">/* Finally, the USERGS_SYSRET64 macro is invoked. It expands to the swapgs instruction, which exchanges the kernel GS base and the user GS base again, followed by sysretq. (The user mode stack pointer has already been restored by the preceding movq RSP(%rsp), %rsp.) sysretq returns to user mode: the kernel exits from system call handling and the user mode program resumes execution. */</span>
        USERGS_SYSRET64

    #define USERGS_SYSRET64 \
        swapgs;             \
        sysretq;
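The return sequence above can also be modelled in a few lines of C. This is a simplified sketch with hypothetical names (struct demo_frame, struct demo_user_cpu, demo_syscall_return); it ignores the pending-work check and the swapgs step, and only follows the values that matter to the user program: the return value in rax, the saved instruction pointer, the saved flags and the saved stack pointer.

```c
#include <stdint.h>

/* Hypothetical subset of the saved frame on the kernel stack. */
struct demo_frame {
    uint64_t ax;     /* pre-loaded with -ENOSYS on entry */
    uint64_t ip;     /* user return address, saved from rcx */
    uint64_t flags;  /* user flags, saved from r11 */
    uint64_t sp;     /* user stack pointer */
};

/* Hypothetical subset of the CPU state seen by the user program. */
struct demo_user_cpu {
    uint64_t rax, rcx, r11, rsp, rip, rflags;
    int      cpl;    /* 3 = user mode */
};

void demo_syscall_return(struct demo_frame *frame, long handler_ret,
                         struct demo_user_cpu *cpu)
{
    frame->ax = (uint64_t)handler_ret;  /* movq %rax, RAX(%rsp) */
    cpu->rcx  = frame->ip;              /* movq RIP(%rsp), %rcx */
    cpu->r11  = frame->flags;           /* movq EFLAGS(%rsp), %r11 */
    cpu->rax  = frame->ax;              /* restored with the other registers */
    cpu->rsp  = frame->sp;              /* movq RSP(%rsp), %rsp */

    /* sysretq: RIP comes from RCX, RFLAGS from R11, and CPL becomes 3. */
    cpu->rip    = cpu->rcx;
    cpu->rflags = cpu->r11;
    cpu->cpl    = 3;
}
```

The sketch also shows why user space code must treat rcx and r11 as clobbered across a system call: on return they hold the return address and the flags, not whatever the program had placed in them before executing syscall.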
To sum up, we have taken a deep dive into the Linux system call execution model, from the user space application and the syscall instruction issued by the C library to the handling of the system call inside the kernel. Here are the important points.
- The user space application or glibc fills the general-purpose registers with the system call number and the system call arguments, and executes the syscall instruction.
- The CPU switches from user mode to kernel mode and starts executing the system call entry code, entry_SYSCALL_64.
- entry_SYSCALL_64 switches to the kernel stack and saves the general-purpose registers, user mode stack pointer, user mode code segment, flags, etc, onto the kernel stack.
- entry_SYSCALL_64 checks the system call number in the rax register, looks up the system call handler in sys_call_table and calls it.
- After the system call handler finishes its work, the entry code restores the general-purpose registers, user mode stack pointer, flags and return address, and exits with the sysretq instruction. This resumes the execution of the user mode program.