Linux Syscall

System calls on the Intel architecture involve a special trap instruction. This instruction modifies both the flow of control and the privilege state of the processor. After the trap, the processor is running in kernel mode, with access to the entire physical memory, with the return information saved on a special stack reserved for kernel operations, and executing a special trap-handling function.

Setting the target of this trap is a very deep operation which is performed by the operating system during boot. The processor has a register which needs to be loaded with the base address of a table. The table is a sequence of entries, each giving the address of a function for handling one of the various interrupts and exceptions, including the trap. This table is called the interrupt vector (Intel's name for it is the Interrupt Descriptor Table). Both the table and the processor register are initialized during boot, and together they determine the one and only entry point into the kernel for the purpose of system services. This trap, together with the function that receives it, is called the syscall.

Setting the kernel stack is more complicated. When a trap occurs, it is important that the kernel have a private and trusted stack to handle the trap. The kernel cannot continue to run on the user's stack, for many reasons. There is a choice here: either have a kernel stack associated with each process or a single kernel stack used universally. I believe Linux chooses to have a stack per process, which means that the trap configuration for the processor must be modified with each context switch.

The syscall code is found at line 364 of entry.S (for the i386 architecture). In order that this one function serve the entire API of kernel services, a differentiator is passed as a required argument to the syscall. In Linux, and certainly many other operating systems, it is an integer. Each service of the API is assigned a number. The syscall function then looks in a table, the syscall_table, at the location offset by this number, and finds there the address of the function implementing the system call. The actual vectored call to the service is at line 377 of entry.S.

More on traps

Sorry about this digression. I'm keen that you know this, but this isn't really the right moment for it. So skip over this section if you want.

Oh good, you didn't skip over.

A trap is one of a family of trap-like operations or events. The names and taxonomy will vary from processor to processor, however they all must break down according to the same broad principles. Intel defines Interrupts and Exceptions. Interrupts are signals to change processor state that are caused by outside events, whereas exceptions are caused by some operation internal to the processor. Interrupts can be maskable or non-maskable, meaning they can be ignored or cannot be ignored. Exceptions are classified as programmed or non-programmed. A programmed exception is really like a call because it occurs by running a special instruction. Unlike a call, it changes processor state while vectoring through a special register or memory location. These are also known as software interrupts, and the table entries they vector through are called gates.

The non-programmed exceptions are, in some sense, errors. They are classified as Faults, Traps, or Aborts. A Fault requires that the faulting instruction be retried to continue program flow. A Trap requires that program flow continue at the instruction following the one that trapped. An Abort means that the instruction flow cannot be continued. The aborting instruction completed partially and cannot be restarted or continued.

So I'm lying. The syscall is not a trap, it's a programmed exception. But that doesn't sound nearly as cool. Now you wish you had skipped over.

System services

The syscall and the syscall table are the fundamental mechanism for getting system services. Bookending this mechanism is a layer of code on the user side that presents the syscall in a user-friendly way, and, on the other side, a kernel function which implements the service.

For example, the syscall nice is number 34, and is entry 34 of the syscall_table (counting from zero), on line 36. It points to sys_nice, the function which implements the syscall. There must be a convention for the passing of arguments and return values. The argument passing is signaled by the asmlinkage keyword, which on i386 tells the compiler that the function finds its arguments on the stack. The compiler takes care of the rest. The return convention is that a return of 0 means the syscall completed successfully, and a negative value signals an error.

The other bookend concerns the user code. This is not part of the kernel, and is therefore not part of Linux. Rather this would be GNU, and in particular, the C library, libc. The system call numbers are defined in the common and public header file unistd.h. There is a macro defined to make the system call, which is in fact an int 0x80 instruction.

Code References