This two-part series of articles focuses on the system call execution model in Linux-based operating systems. This first part explains what system calls are, why they are required and the role of the glibc wrapper in system call execution. It then touches on the system call execution model from the Linux kernel perspective.
A system call is a request by a user space program (application) for a service provided by the OS kernel, which executes in kernel space. A user application invokes a system call when it needs a service that can only be accessed from a higher privilege mode, for example, creating a new task, doing network I/O or file I/O, or accessing hardware resources. These operations cannot be performed directly by the user space application; hence, operating systems like Linux provide a set of routines called system calls, which are basically C functions executing in kernel space.
When a user space program invokes a system call, it triggers a software interrupt (on x86-64, the dedicated syscall instruction is now used instead for faster system call entry), and the mode switches from user space to kernel space (or more precisely, the privilege mode changes from lower to higher). The system call handler in kernel space then performs the required operation on behalf of the user space application and returns the result to it.
Later sections show in detail how the switch from user space to kernel space happens and how the kernel space system call handlers are invoked. But first let’s examine the role of the standard C library in the execution of system calls.
Note: x86-64 CPUs have a concept called privilege levels. A privilege level in the x86 instruction set controls the access of the program currently running on the processor to resources such as memory regions, I/O ports, and special instructions. There are four privilege levels ranging from 0, which is the most privileged, to 3 which is least privileged. Most modern operating systems use level 0 for the kernel/executive and use level 3 for application programs. Hence, the kernel runs at the most privileged level, called Ring 0, and user programs run at a lesser level, typically Ring 3.
Any resource available to level n is also available to levels 0 to n, so the privilege levels are rings. When code running at a less privileged level tries to access a resource reserved for a more privileged level, a general protection fault exception is reported to the OS.
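As a small illustration of the note above, a program can observe the ring it is running in. The sketch below assumes an x86-64 build with GCC or Clang style inline assembly; the function name current_privilege_level is hypothetical and not part of any library. It reads the CS segment register, whose low two bits hold the current privilege level.

```c
/* Returns the current privilege level (0 to 3), or -1 when the code is
   not built for x86-64. On x86-64 the low two bits of the CS segment
   selector hold the current privilege level. */
int current_privilege_level(void)
{
#if defined(__x86_64__)
    unsigned long cs;

    /* Copy the CS selector into a general-purpose register. */
    __asm__ volatile ("mov %%cs, %0" : "=r" (cs));
    return (int) (cs & 3);
#else
    return -1;   /* assumption: other architectures are out of scope here */
#endif
}
```

Called from an ordinary Linux process on x86-64, this function is expected to return 3, matching the statement that user programs run in Ring 3.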
Role of the C library
When we say C library, the most commonly and widely distributed C library on Linux-based operating systems is glibc, the GNU C library. This C library implements standard C functions and APIs like printf(), scanf(), malloc(), fopen(), strcpy(), etc. These standard functions may or may not invoke system calls internally; for example, printf() internally invokes the write(2) system call. However, all these internal invocations of system calls are hidden from the user space application.
Now let’s examine the role of the C library when a user space application invokes a system call. For the sake of simplicity, let’s take the example of the open(2) system call.
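    #include <stdlib.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <errno.h>

    int main(void)
    {
        /* O_CREAT requires the third (mode) argument: the permissions
           given to the file if it has to be created */
        int fd = open("/home/shwetabh/wrapper.txt", O_CREAT | O_RDWR, 0644);
        if (fd == -1) {
            perror("open");
            return EXIT_FAILURE;
        }
        close(fd);
        return 0;
    }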
In the above code snippet, the user application invokes the open(2) system call. But as we already discussed, system calls are implemented in kernel space and cannot be invoked directly from user programs. So how does a user space application invoke the open(2) system call? Actually, the open() in the above code snippet is a wrapper function implemented in the C library (glibc). This wrapper function internally invokes the actual system call implemented in the kernel. The C library provides wrappers for almost all the system calls implemented in Linux. Therefore, from an application program's point of view, invoking a system call is similar to calling a C function.
We now understand that glibc provides the system call wrapper and that the user application invokes it, which in turn invokes the system call. Let’s now examine the major steps performed by the wrapper function for the execution of system calls. These are:
- Setting up arguments
- Calling system calls
- Checking the return values
Setting up arguments
The wrapper function performs the validation, initialisation (in some cases) and error checking of the arguments provided by the application program. In case of an error, it returns directly to the user program with the relevant ‘errno’. After successful validation, the wrapper function loads the system call number and the arguments into specific CPU registers, as specified by the Linux kernel. The table below shows the order in which the wrapper function or user space program should load the arguments into registers before invoking the system call.
From the System V AMD64 ABI documentation (Linux kernel calling conventions):
The Linux AMD64 kernel uses internally the same calling conventions as user-level applications. User-level applications that like to call system calls should use the functions from the C library. The interface between the C library and the Linux kernel is the same as for the user-level applications with the following differences:
- User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
- A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
- The number of the syscall has to be passed in register %rax.
- System-calls are limited to six arguments, no argument is passed directly on the stack.
- Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
- Only values of class INTEGER or class MEMORY are passed to the kernel.
Class INTEGER: This class consists of integer types that fit into one of the general-purpose registers.
| Register | Argument (user space) | Argument (kernel space) |
|----------|-----------------------|-------------------------|
| %rax     | Not used              | System call number      |
| %rdi     | Argument 1            | Argument 1              |
| %rsi     | Argument 2            | Argument 2              |
| %rdx     | Argument 3            | Argument 3              |
| %r10     | Not used              | Argument 4              |
| %r8      | Argument 5            | Argument 5              |
| %r9      | Argument 6            | Argument 6              |
| %rcx     | Argument 4            | Destroyed               |
| %r11     | Not used              | Destroyed               |
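To make the register assignment listed above concrete, here is a minimal sketch of a three-argument system call issued with inline assembly. It assumes GCC or Clang on x86-64 Linux; the name raw_syscall3 is hypothetical and this is not how glibc itself is written. On other architectures the sketch simply falls back to the generic syscall() library function.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Issues a system call with up to three arguments.
   On x86-64 the result is the raw value left in %rax by the kernel,
   so an error shows up as a negative errno value. */
long raw_syscall3(long number, long arg1, long arg2, long arg3)
{
#if defined(__x86_64__)
    long result;

    __asm__ volatile (
        "syscall"
        : "=a" (result)              /* return value comes back in %rax  */
        : "0" (number),              /* system call number goes in %rax  */
          "D" (arg1),                /* argument 1 in %rdi               */
          "S" (arg2),                /* argument 2 in %rsi               */
          "d" (arg3)                 /* argument 3 in %rdx               */
        : "rcx", "r11", "memory");   /* the kernel destroys %rcx, %r11   */
    return result;
#else
    /* Assumption: on other architectures use the library function, which
       returns -1 and sets errno on failure instead of a raw value. */
    return syscall(number, arg1, arg2, arg3);
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) should return the same process ID as getpid(); getpid(2) takes no arguments, so the three values are ignored.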
Calling system calls
As we have already discussed, system calls or, more precisely, system call handlers are part of the kernel code and are executed in kernel mode (the highly privileged Ring 0). Therefore, it is not possible for user space applications or the glibc wrapper to execute kernel code directly. To change the mode from user to kernel space, the glibc wrapper function must somehow signal to the kernel that a system call handler should be executed.
In older versions of x86 processors, the Linux kernel and glibc, system call wrappers used to trigger a software interrupt (exception), and the system would switch to kernel mode and execute the exception handler. The exception handler, in this case, is the system call handler. The defined software interrupt on x86 is interrupt number 128, which is raised via the int $0x80 instruction. It triggers a switch to kernel mode and the execution of exception vector 128, which is the system call handler.
However, the int assembly language instruction is inherently slow because it performs several consistency and security checks. Modern x86 processors provide a faster way of making system calls through the syscall instruction (on the x86-64 architecture). Newer versions of the glibc system call wrappers execute the syscall instruction to switch from user space to kernel space and to start the system call handler.
We now understand that the glibc system call wrapper places the arguments into the specified registers (as directed by the kernel) and executes the syscall instruction to hand over the work to kernel space. With this knowledge, let’s browse the glibc wrapper of the open(2) system call to check how things are actually implemented in the C library. Here’s the code snippet that gives the details of the implementation. Check the inline code comments.
sysdeps/unix/sysv/linux/generic/open.c:

    int
    __libc_open (const char *file, int oflag, ...)
    {
      int mode = 0;

      if (__OPEN_NEEDS_MODE (oflag))
        {
          va_list arg;
          va_start (arg, oflag);
          mode = va_arg (arg, int);
          va_end (arg);
        }

      if (SINGLE_THREAD_P)
        return INLINE_SYSCALL (openat, 4, AT_FDCWD, file, oflag, mode);

      int oldtype = LIBC_CANCEL_ASYNC ();

      /* Internally glibc is invoking the openat(2) system call.
       * The INLINE_SYSCALL macro is used for invoking the system call. */
      int result = INLINE_SYSCALL (openat, 4, AT_FDCWD, file, oflag, mode);

      LIBC_CANCEL_RESET (oldtype);

      return result;
    }
    libc_hidden_def (__libc_open)
    weak_alias (__libc_open, __open)
    libc_hidden_weak (__open)
    weak_alias (__libc_open, open)

sysdeps/unix/sysv/linux/x86_64/sysdep.h:

    # undef INLINE_SYSCALL
    # define INLINE_SYSCALL(name, nr, args...) \
      ({ \
        /* Further INTERNAL_SYSCALL macro is being invoked */ \
        unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args); \
        if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, ))) \
          { \
            __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, )); \
            resultvar = (unsigned long int) -1; \
          } \
        (long int) resultvar; })

    #define INTERNAL_SYSCALL(name, err, nr, args...) \
      internal_syscall##nr (SYS_ify (name), err, args)

    #undef SYS_ify
    #define SYS_ify(syscall_name) __NR_##syscall_name

    /* The SYS_ify() macro converts the name into the system call number
     * __NR_syscall_name. These system call numbers are defined in
     * /usr/include/x86_64-linux-gnu/asm/unistd_64.h
     *   #define __NR_read  0
     *   #define __NR_write 1
     *   #define __NR_open  2
     *   #define __NR_close 3
     *   #define __NR_stat  4
     * Based on the number of arguments (nr) required by the system call,
     * nr is appended and the macro expands to one of
     * internal_syscall0, internal_syscall1 ... internal_syscall6.
     * openat is a 4-argument system call, hence the macro expands to
     * internal_syscall4. */

    /* In the asm statement at the end of this macro, the system call number
     * and arguments are loaded and the syscall instruction is executed,
     * which ultimately changes the mode to kernel space and leads to the
     * invocation of the kernel space system call handler. */
    #undef internal_syscall4
    #define internal_syscall4(number, err, arg1, arg2, arg3, arg4) \
    ({ \
        unsigned long int resultvar; \
        TYPEFY (arg4, __arg4) = ARGIFY (arg4); \
        TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
        TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
        TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
        register TYPEFY (arg4, _a4) asm ("r10") = __arg4; \
        register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
        register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
        register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
        asm volatile ( \
        "syscall\n\t" \
        : "=a" (resultvar) \
        : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4) \
        : "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
        (long int) resultvar; \
    })
So now we understand how the glibc system call wrapper prepares the arguments and executes the syscall instruction to change the mode from user space to kernel space. We can see that the system call wrapper hides from the user space application program all the architecture-level assembly complexity related to argument preparation and execution of the syscall instruction. To invoke a particular system call, the user space application only needs to call the wrapper function provided by glibc like a normal C function or API. There is no need to bother about the underlying architecture-specific assembly code or instructions.
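The listing above shows glibc's open() wrapper issuing the openat(2) system call with AT_FDCWD as its first argument. The following sketch compares the two paths from an application's point of view. The function names open_with_wrapper and open_with_syscall are hypothetical, and the second one relies on glibc's generic syscall() library function, which takes a system call number followed by its arguments.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Opens path read-only through the ordinary glibc wrapper. */
int open_with_wrapper(const char *path)
{
    return open(path, O_RDONLY);
}

/* Opens path read-only by naming the openat system call explicitly,
   with the same arguments the wrapper listing passes: AT_FDCWD, the
   path, the flags and a mode (unused here because O_CREAT is not set). */
int open_with_syscall(const char *path)
{
    return (int) syscall(SYS_openat, AT_FDCWD, path, O_RDONLY, 0);
}
```

Both functions should return a valid file descriptor for an existing, readable path, or -1 with errno set on failure. The first form is what application code normally uses.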
Calling system calls with the syscall(2) interface
We have seen how the glibc wrapper hides the complexity of system call invocation from the user space application program. But what if a glibc wrapper doesn’t exist for a particular system call (such as futex(2))? To invoke such system calls from user space applications, glibc provides a generic library function called syscall(2). This syscall(2) function is implemented in assembly in glibc (as it needs to take care of architecture-specific assembly instructions). The implementation of the syscall(2) function in glibc for x86-64 is given below. Please check the inline comments.
    #include <sysdep.h>

    /* Please consult the file sysdeps/unix/sysv/linux/x86-64/sysdep.h for
       more information about the value -4095 used below.  */

    /* Usage: long syscall (syscall_number, arg1, arg2, arg3, arg4, arg5, arg6)
       We need to do some arg shifting, the syscall_number will be rax.  */
            .text
    ENTRY (syscall)
            movq %rdi, %rax         /* Syscall number -> rax.  */
            movq %rsi, %rdi         /* shift arg1 - arg5.  */
            movq %rdx, %rsi
            movq %rcx, %rdx
            movq %r8, %r10
            movq %r9, %r8
            movq 8(%rsp),%r9        /* arg6 is on the stack.  */
            syscall                 /* Do the system call.  */
            cmpq $-4095, %rax       /* Check %rax for error.  */
            jae SYSCALL_ERROR_LABEL /* Jump to error handler if error.  */
            ret                     /* Return to caller.  */

    PSEUDO_END (syscall)

From the man page of syscall(2):

    #include <sys/syscall.h>      /* Definition of SYS_* constants */
    #include <unistd.h>

    long syscall(long number, ...);

    /* example program to invoke write(2) system call via syscall(2) */
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        syscall(SYS_write, 1, "Hello syscall", 13);
        return 0;
    }
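Another case where syscall(2) is handy is a system call that the installed glibc does not wrap. A common example is gettid(2), for which glibc only gained a wrapper in version 2.30. The sketch below is one possible way to call it on any glibc version; the function name thread_id_via_syscall is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the kernel thread ID of the calling thread by passing the
   system call number SYS_gettid to the generic syscall() function.
   gettid(2) takes no arguments, so none are passed. */
pid_t thread_id_via_syscall(void)
{
    return (pid_t) syscall(SYS_gettid);
}
```

In the main thread of a process, the value returned is equal to the process ID returned by getpid(), which gives a quick way to check the result.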
Checking the return value
In x86-64 architecture, before resuming user space, the Linux kernel puts the system call handler return value in the rax register. Hence the glibc system call wrapper function (after executing the syscall instruction) reads the system call return value from the rax register.
On success the kernel normally places a non-negative value (zero or positive) in the rax register, while on error it places a negative value. This negative value is the negation of one of the ‘errno’ constants, such as those defined in /usr/include/asm-generic/errno-base.h. For a successful call, the glibc wrapper function passes the value in rax back to the user application program unchanged; for example, open(2) returns the new file descriptor and write(2) returns the number of bytes written, while calls such as close(2) return 0. For a value in the error range (-4095 to -1), the glibc wrapper returns -1 (failure) to the user application program and sets ‘errno’ by negating the value in the rax register, so errno holds a positive value equal to one of the errno constants. Given below are some code snippets and comments from the glibc code related to the handling of the return value from the system call. Please check the inline code comments.
sysdeps/unix/sysv/linux/x86_64/sysdep.h:

    /* Linux uses a negative return value to indicate syscall errors,
       unlike most Unices, which use the condition codes' carry flag.

       Since version 2.1 the return value of a system call might be
       negative even if the call succeeded. E.g., the `lseek' system call
       might return a large offset. Therefore, we must not anymore test
       for < 0, but test for a real error by making sure the value in %eax
       is a real error number. Linus said he will make sure the no syscall
       returns a value in -1 .. -4095 as a valid result so we can savely
       test with -4095.  */

    # undef INLINE_SYSCALL
    # define INLINE_SYSCALL(name, nr, args...) \
      ({ \
        /* Further INTERNAL_SYSCALL macro is being invoked */ \
        unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args); \
        /* This macro checks the range of the return value */ \
        if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, ))) \
          { \
            /* set errno by negating the return value */ \
            __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, )); \
            resultvar = (unsigned long int) -1; \
          } \
        (long int) resultvar; })

    /* Range of error value is -1 to -4095 */
    # undef INTERNAL_SYSCALL_ERROR_P
    # define INTERNAL_SYSCALL_ERROR_P(val, err) \
      ((unsigned long int) (long int) (val) >= -4095L)

    /* negating the value */
    # undef INTERNAL_SYSCALL_ERRNO
    # define INTERNAL_SYSCALL_ERRNO(val, err) (-(val))

cat /usr/include/asm-generic/errno-base.h

    #ifndef _ASM_GENERIC_ERRNO_BASE_H
    #define _ASM_GENERIC_ERRNO_BASE_H

    #define EPERM    1  /* Operation not permitted */
    #define ENOENT   2  /* No such file or directory */
    #define ESRCH    3  /* No such process */
    #define EINTR    4  /* Interrupted system call */
    #define EIO      5  /* I/O error */
    #define ENXIO    6  /* No such device or address */
    #define E2BIG    7  /* Argument list too long */
    #define ENOEXEC  8  /* Exec format error */
    #define EBADF    9  /* Bad file number */
    #define ECHILD  10  /* No child processes */
    #define EAGAIN  11  /* Try again */
    #define ENOMEM  12  /* Out of memory */
    #define EACCES  13  /* Permission denied */
    #define EFAULT  14  /* Bad address */
    #define ENOTBLK 15  /* Block device required */
    #define EBUSY   16  /* Device or resource busy */
    #define EEXIST  17  /* File exists */
    #define EXDEV   18  /* Cross-device link */
    #define ENODEV  19  /* No such device */
    #define ENOTDIR 20  /* Not a directory */
    #define EISDIR  21  /* Is a directory */
    #define EINVAL  22  /* Invalid argument */
    #define ENFILE  23  /* File table overflow */
    #define EMFILE  24  /* Too many open files */
    #define ENOTTY  25  /* Not a typewriter */
    #define ETXTBSY 26  /* Text file busy */
    #define EFBIG   27  /* File too large */
    #define ENOSPC  28  /* No space left on device */
    #define ESPIPE  29  /* Illegal seek */
    #define EROFS   30  /* Read-only file system */
    #define EMLINK  31  /* Too many links */
    #define EPIPE   32  /* Broken pipe */
    #define EDOM    33  /* Math argument out of domain of func */
    #define ERANGE  34  /* Math result not representable */

    #endif
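The error check quoted above can be restated as a small standalone function. This is only a sketch of the same range test, not glibc code; the name decode_syscall_result is hypothetical. It takes a raw value as the kernel would leave it in rax and converts it to the convention seen by applications.

```c
#include <errno.h>

/* Converts a raw kernel return value into the C library convention.
   Raw values from -4095 to -1 are negated errno codes: errno is set and
   -1 is returned. Every other value is a valid result and is returned
   unchanged, even if it happens to be negative. */
long decode_syscall_result(long raw)
{
    /* Compared as unsigned, -4095..-1 are the 4095 largest values,
       so a single comparison covers the whole error range. */
    if ((unsigned long) raw >= (unsigned long) -4095L) {
        errno = (int) -raw;
        return -1;
    }
    return raw;
}
```

With this function, a raw value of 3 (for instance a new file descriptor) is returned as 3, while a raw value of -2 yields -1 with errno set to ENOENT. A value just outside the range, such as -4096, is passed through untouched.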
With this, I conclude the first article in the two-part series. We have discussed the system call execution model from the user mode perspective as well as the role of the C library wrapper function in system call execution. We now understand how the C library loads system call arguments into architecture-specific registers and executes the syscall instruction to switch the mode from user space to kernel space.
In the next and final part we will discuss system call execution from the perspective of the kernel mode.