CHERIoT Programmers' Guide

David Chisnall

3. The RTOS Core

The core of the RTOS is a set of privilege-separated components. Each core component runs with some privileges that mean that it is (at least partially) in the trusted computing base (TCB) for other things.

3.1. Starting the system with the loader

The loader runs on system startup. It reads the compartment headers and populates each compartment with the set of capabilities that it needs. The loader exists so that the system can be started from a firmware image that does not embed capabilities. This is a useful property even if a particular target has persistent storage (non-volatile RAM) that can hold capabilities because it ensures that there is an on-device pointer provenance flow for the firmware.

If a device has non-volatile storage that holds tags, you will typically run the loader once at install time or on first boot of a new firmware image. This ensures that the image contains only capabilities handed to it. This, in turn, enables multi-stage boot where some functionality, such as attestation, secure key storage, and so on, is provided by a bootloader. These abstractions can all be built from capabilities and so, unlike systems based on protection rings such as TrustZone, an arbitrary number can be nested.

If a compartment contains a global that is a pointer, initialised to point to another global, the loader will initialise the pointer by deriving a capability from one of the compartment's code or data capabilities. Again, this enforces provenance properties, this time within a firmware image. A malicious compartment may provide a relocation that points to a global outside its own memory, but the loader will attempt to derive the capability only from the compartment's initial pcc (code) and cgp (globals) regions and so will fail.
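The relocation check described above can be modelled on a host machine with plain integers. The following is a minimal sketch, not the loader's code: the `region` structure and the `relocation_source` function are hypothetical names invented for illustration, and on real hardware the check is performed by the capability bounds-setting operation rather than by integer comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the memory covered by one capability. */
struct region
{
	uintptr_t base;
	size_t    length;
};

/* True if [address, address + size) lies entirely inside the region.
 * Written to avoid overflow in the address arithmetic. */
static bool region_contains(struct region r, uintptr_t address, size_t size)
{
	return address >= r.base && size <= r.length &&
	       address - r.base <= r.length - size;
}

/* Model of the loader's decision for one relocation: pick the region
 * (code or globals) from which the pointer can be derived, or return
 * NULL if the target is outside both, in which case derivation fails. */
static const struct region *relocation_source(const struct region *code,
                                              const struct region *globals,
                                              uintptr_t            target,
                                              size_t               size)
{
	if (region_contains(*code, target, size))
	{
		return code;
	}
	if (region_contains(*globals, target, size))
	{
		return globals;
	}
	return NULL;
}
```

A relocation whose target straddles the end of the globals region is rejected in the same way as one that points entirely outside the compartment.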

The loader must also provide all capabilities to compartments that allow them to communicate outside of their own private space. This includes access to memory-mapped I/O (MMIO) regions, capabilities for pre-shared objects, for software-defined capabilities, and any capabilities for calling entry points exposed by other compartments or libraries. The loader also creates the stacks and trusted stacks for each thread and creates their initial entry points.

The loader is the most privileged component in the system. When a CHERIoT CPU boots, it will have a small set of root capabilities in registers. These, between them, convey the full set of rights that can be granted by a capability. Every capability in the running system is derived (often via many steps) from one of these. As such, the loader is able to do anything.

In a system with a multi-stage boot, the initial capabilities provided to the loader may be restricted, rather than the omnipotent set from CPU boot. For example, an early loader may implement A/B booting by providing the RTOS loader with capabilities to only half of persistent memory.

The risk from the loader is mitigated by the fact that it does not run on untrusted data. The loader operates only on the instructions generated by the linker and so it is possible to audit precisely what it will do (see Chapter 11. Auditing firmware images). It is also possible to validate this by running the loader in a simulator and capturing the precise memory state after it has run.

The loader enforces some of the guarantees in the initial state. It is structured to be able to enforce some of these by construction. For example, only stacks and trusted stacks (accessible only by the switcher, see Section 3.2. Changing trust domain with the switcher) have store-local permission and these do not have global permission. The loader derives these from a capability that has store-local but not global permissions and derives all other capabilities from one that has the store-local permission removed.
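Enforcement by construction relies on capability derivation being monotonic: a derived capability can have fewer permissions than its parent, never more. The sketch below models this with a bitmask. The permission names and bit values are hypothetical and do not match the real CHERIoT encoding; only the shape of the argument is intended to carry over.

```c
#include <stdint.h>

/* Hypothetical permission bits, for illustration only. */
enum
{
	PermGlobal     = 1,
	PermStoreLocal = 2,
	PermLoad       = 4,
	PermStore      = 8,
	PermAll        = 15
};

/* Derivation can only clear permission bits, never set them. */
static uint32_t derive(uint32_t parent, uint32_t requested)
{
	return parent & requested;
}

/* Root used for stacks and trusted stacks: store-local, not global. */
static uint32_t stack_root(void)
{
	return derive(PermAll, PermAll & ~PermGlobal);
}

/* Root used for everything else: store-local removed. */
static uint32_t general_root(void)
{
	return derive(PermAll, PermAll & ~PermStoreLocal);
}
```

Because every non-stack capability descends from `general_root()`, no sequence of `derive` calls can produce one with store-local permission, which is the sense in which the property holds by construction.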

Before starting the system, the loader erases its stack and almost all of its code (leaving only the stub that handles this erasure) and clears its registers. The last bit of the loader's code becomes the idle thread (a wait-for-interrupt loop). The loader's stack is used for the scheduler stack. The memory that held the loader's code is used for heap memory.

3.2. Changing trust domain with the switcher

The switcher is the most privileged component that runs after the system finishes booting. It is responsible for transitions between threads (context switches) and between compartments (cross-compartment calls and returns). The switcher is a very small amount of code—under 500 instructions—that is expected to be amenable to formal verification.

Work is underway to formally verify the security properties of the switcher, but is still in early stages.

The switcher is the only component in a running CHERIoT system that has access-system-registers permission. It uses this primarily to access a single reserved register that holds the trusted stack. The trusted stack is a region of memory containing the register save area for context switches and a small frame for every cross-compartment call that allows safe return even if the callee has corrupted all state that it has access to.

Trusted stacks are set up by the loader. The loader passes the scheduler (see Section 3.3. Time slicing with the scheduler) a sealed capability to each of these on initialisation. The switcher holds the only permit-unseal capability for the type used to seal trusted stacks.

The context switch path spills all registers to the current trusted stack's save area and then invokes the scheduler, which returns a sealed capability to the next thread to run. It then restores the register file from this thread and resumes. If the scheduler returns an invalid capability (one not sealed with the correct type) then the switcher will raise a fault. If the exception program counter capability on exception entry is within the switcher's capability, the switcher will terminate.

On the cross-compartment call path, the switcher is responsible for unsealing the capability that refers to the export table of the callee, clearing unused argument registers, pushing the information about the return to the trusted stack, subsetting the bounds of the stack, and zeroing the part of the stack passed to the callee. On return, it zeroes the stack again, zeroes unused return registers, and restores the caller's state.

This means that the switcher is the only component that has access to either two threads', or two compartments', state at the same time. As such, it is in the TCB for both compartment and thread isolation. This risk is mitigated in several ways:

The switcher appears to the rest of the system as a library. It can expose functions for inspecting or, in a small number of cases, modifying state. These are defined in switcher.h. For example, prior to performing a cross-compartment call, you may want to check that there is sufficient space on the trusted stack for the number of calls that it will need to make. The trusted_stack_has_space function exposed by the switcher lets you query if the trusted stack has enough space for a specified number of cross-compartment calls. The amount of (normal) stack space is directly visible in a compartment and so normal stack checks do not require the switcher to be involved.

Documentation for the trusted_stack_has_space function
_Bool trusted_stack_has_space(int requiredFrames)

Returns true if the trusted stack contains at least requiredFrames frames past the current one, false otherwise.

Note: This is faster than calling either trusted_stack_index or trusted_stack_size and so should be preferred in guards.
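A guard built on this function might look like the following sketch. So that it compiles away from a CHERIoT target, it supplies a stand-in for trusted_stack_has_space backed by a simulated counter; on a real system the function comes from switcher.h. The `guarded_entry` and `do_nested_work` functions and the -1 error value are hypothetical.

```c
#include <stdbool.h>

/* Host-side stand-in for the switcher's implementation. On CHERIoT,
 * this function is provided by the switcher and declared in switcher.h. */
static int simulatedFreeFrames = 1;

static bool trusted_stack_has_space(int requiredFrames)
{
	return simulatedFreeFrames >= requiredFrames;
}

/* Hypothetical operation that makes nested cross-compartment calls. */
static int do_nested_work(void)
{
	return 42;
}

/* Entry point that refuses to start work that it cannot finish.
 * Returns -1 (a hypothetical error value) if the trusted stack cannot
 * hold the two further cross-compartment calls that the work needs. */
static int guarded_entry(void)
{
	if (!trusted_stack_has_space(2))
	{
		return -1;
	}
	return do_nested_work();
}
```

Checking before starting means that the failure is reported to the caller as an ordinary error rather than surfacing part-way through the operation.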

The switcher also implements the thread_id_get function, which provides a fast way for compartments to determine which thread they are currently running on. This function is used in the implementation of priority-inheriting locks (see Section 7.7. Inheriting priorities). Implementing efficient priority-inheriting locks requires a fast mechanism for getting the current thread ID so that it can be stored in the lock.

Documentation for the thread_id_get function
uint16_t thread_id_get()

Return the thread ID of the current running thread. This is mostly useful where one compartment can run under different threads and it matters which thread entered this compartment.

This is implemented in the switcher.

3.3. Time slicing with the scheduler

When the switcher receives an interrupt (including an explicit yield), it delegates the decision about what to run next to the scheduler. The scheduler has direct access to the interrupt controller but, in most respects, is just another compartment.

The switcher also holds a capability to a small stack for use by the scheduler. This is not quite a full thread. It cannot make cross-compartment calls and is not independently schedulable. When the switcher handles an interrupt, it invokes the scheduler's entry point on this stack.

The scheduler also exposes other entry points that can be invoked by cross-compartment calls. These fulfil a role similar to system calls on other operating systems, for example waiting for external events or performing inter-thread communication. The scheduler implements blocking operations by moving the current thread from a run queue to a sleep queue and then issuing a software interrupt instruction to branch to the switcher. When the switcher then invokes the scheduler to make a scheduling decision, it will discover that the current thread is no longer runnable and pick another. Once the thread becomes runnable again, the switcher resumes the thread from the point where it yielded, at which point it can return from the scheduler.

The scheduler is, by definition, in the TCB for availability. It is the component that decides which threads run and which do not. A bug in the scheduler (with or without an active attacker) can result in a thread failing to run.

It is not, however, in the TCB for confidentiality or integrity. The scheduler has no mechanism to inspect the state of an interrupted thread. When invoked explicitly, it is called with a normal cross-compartment call and so has no access to anything other than the arguments.

As with the switcher, the scheduler mitigates these risks by being small (though larger than the switcher). It currently compiles to under 4 KiB of object code. This small size is accomplished by providing only a small set of features that can be used as building blocks for other tasks.

For example, some embedded operating systems provide features such as message queues in their kernel. In CHERIoT RTOS, these are provided by a separate library, which relies on the futex (see Section 7.5. Waiting for events with futexes) facility exposed by the scheduler to allow a producer to block when the queue is full and allow consumers to block when the queue is empty.

Futexes are the only mechanism that the scheduler provides for blocking. Interrupts are mapped to futexes and so threads wait for hardware or software events in exactly the same way. This narrow interface and clear separation of concerns helps improve overall system security.

3.4. Sharing memory from the allocator

The final core component is the memory allocator, which provides the heap, which is used for all dynamic memory allocations. This is discussed in detail in Chapter 8. Memory management in CHERIoT RTOS. Sharing memory between compartments in CHERIoT requires nothing more than passing pointers (until you start to add availability requirements in the presence of mutual distrust). This means that you can allocate objects (or complex object graphs) from a few bytes up to the entire memory of the system and share them with other compartments.

The allocator has access to the shadow bitmap and hardware revocation engine that enforce temporal safety for the heap, and is responsible for setting bounds on allocated memory. It is therefore trusted for confidentiality and integrity of memory allocated from the heap. If it incorrectly sets bounds, a compartment may gain access to memory belonging to another allocation. If it incorrectly configures revocation state or reuses memory too early then a use-after-free bug may become exploitable.

The allocator is not able to bypass capability permissions; it simply holds a capability that spans the whole of heap memory. As such, it is in the TCB only with respect to heap allocations. It cannot access globals (or code) held in other compartments and so a compartment that does not use the heap does not need to trust the allocator.

The allocator also provides a rich set of mechanisms (described in Chapter 8. Memory management in CHERIoT RTOS) for two mutually distrusting compartments to ensure that memory is not deallocated at inconvenient times.

3.5. Building a C/C++ environment

So far, this chapter has discussed the components of the system that provide distinct trust domains. CHERIoT RTOS also provides several shared libraries. Remember that (as discussed in Section 2.7. Sharing code with libraries) CHERIoT shared libraries do not have private globals: they provide functions that can be invoked from multiple compartments but have no mutable state of their own.

A C freestanding environment needs a small set of standard-library functions to exist. These are memcpy, memset, and so on. The compiler may insert calls to these for things like struct assignment or initialisation, and any C code may assume that they exist.

These functions are fairly small (less than 0.5 KiB), but we do not want every compartment that uses them to have to include its own copy because that small size can add up quickly when you have very large numbers of compartments. Instead, the RTOS includes a freestanding library that provides these functions.

C++ expects slightly more from a freestanding environment. CHERIoT RTOS does not provide support for exceptions, but does support thread-safe initialisation of statics. In C++, function-local static variables that have non-trivial constructors are initialised lazily, the first time that execution reaches their declaration. The compiler emits a guard word that is used to mark whether the object is initialised and to act as a lock protecting initialisation. The compiler also emits a branch on the initialised bit at the point where the variable comes into scope. In the uninitialised case, the compiler inserts a call to __cxa_guard_acquire to acquire the lock before running the constructor and then a call to __cxa_guard_release to release the lock afterwards. The acquire function also checks whether another thread has initialised the variable between the initial check and the call, preventing double initialisation. These functions are provided by the cxxrt library.

In environments that support exceptions, there is a third function for static initialisation. The __cxa_guard_abort function is called when a constructor throws an exception. This is not supported on CHERIoT.
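The guarded-initialisation sequence can be written out by hand in C to show roughly what the compiler generates for a C++ declaration such as `static int value = expensive_constructor();`. This is a single-threaded sketch under stated assumptions: the guard-word layout is hypothetical, and `guard_acquire_sketch` and `guard_release_sketch` are simplified stand-ins for __cxa_guard_acquire and __cxa_guard_release that do not block.

```c
#include <stdint.h>

/* Hypothetical guard-word layout: one bit for each state. */
enum
{
	GuardInitialised = 1,
	GuardLocked      = 2
};

/* Counts constructor runs so that double initialisation is observable. */
static int constructorCalls;

/* Stand-in for __cxa_guard_acquire. Returns 1 if the caller must run
 * the constructor, 0 if the object was initialised in the meantime.
 * A real implementation would block here while the lock bit is held. */
static int guard_acquire_sketch(uint32_t *guard)
{
	if (*guard & GuardInitialised)
	{
		return 0;
	}
	*guard |= GuardLocked;
	return 1;
}

/* Stand-in for __cxa_guard_release: mark initialised and unlock. */
static void guard_release_sketch(uint32_t *guard)
{
	*guard = GuardInitialised;
}

static int expensive_constructor(void)
{
	constructorCalls++;
	return 123;
}

/* Hand-written equivalent of a function containing a lazily
 * initialised function-local static. */
static int get_value(void)
{
	static uint32_t guard;
	static int      value;
	/* Fast path: the branch on the initialised bit. */
	if (!(guard & GuardInitialised))
	{
		if (guard_acquire_sketch(&guard))
		{
			value = expensive_constructor();
			guard_release_sketch(&guard);
		}
	}
	return value;
}
```

After the first call, later calls take only the fast-path branch and never reach the guard functions.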

For both C and C++ (and other languages that use the same compiler back end) there are some sequences that compilers prefer to generate as calls to helper functions. For example, if you divide one 64-bit number by another, this is a single operator in C. On most 64-bit processors, it's a single divide instruction but on 32-bit processors it requires a much longer sequence. Similarly, GCC and Clang provide a population count built-in function that counts the number of bits in an integer that are set to one. This is a single instruction on most modern application processors, but requires some shifting and masking on microcontrollers.
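As an illustration of the kind of helper involved, the following sketch counts set bits using only shifts, masks, and additions. It is a well-known bit-parallel technique and is not necessarily the routine that the compiler's runtime library uses; the function name is hypothetical.

```c
#include <stdint.h>

/* Population count for a 32-bit value without a dedicated instruction. */
static uint32_t popcount32_sketch(uint32_t x)
{
	/* Each 2-bit field now holds the number of set bits in that pair. */
	x = x - ((x >> 1) & 0x55555555u);
	/* Sum adjacent pairs into 4-bit fields. */
	x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
	/* Sum adjacent nibbles into 8-bit fields. */
	x = (x + (x >> 4)) & 0x0F0F0F0Fu;
	/* Fold the four byte counts into the low byte. */
	x += x >> 8;
	x += x >> 16;
	/* The result is at most 32, so six bits are enough. */
	return x & 0x3Fu;
}
```

Even this compact version is around a dozen operations, which is why sharing one copy in a library is preferable to inlining it at every use.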

CHERIoT RTOS provides the crt library to implement these functions. Not every compartment will need all of them, but the library is around 1.5 KiB of code and so duplicating it across every compartment that does need them would be likely to increase code size significantly.

Although the code sizes in this section may look small, moving a function to a separate compartment typically adds only a handful of bytes to the final binary size. It's quite feasible to have a compartment that is on the order of a hundred bytes in total size.

3.6. Supporting atomic operations

C11 and C++11 introduced atomic operations. In C, these use the _Atomic qualifier; in C++, they use the std::atomic<> template. Recently, these were somewhat unified so that _Atomic(T) can be a macro that expands to std::atomic<T> in C++. In both languages, these are often referred to as the C++11 model because the C++ version was introduced first and then ported to C.

Compilers implement these with a set of built-in functions that, on most application cores, are lowered in the back end to atomic instructions. Most microcontrollers are single core and so may lack atomic instructions. The CHERIoT Ibex, for example, does not implement any atomic instructions. This does not matter because, on a single-core system, atomic operations need to be atomic only with respect to other code running on the same core.

On single-core systems, atomic operations can be implemented by simply disabling interrupts, performing the read-modify-write sequence, and then enabling interrupts. As we saw in Section 2.5. Controlling interrupt status with sentries, CHERIoT has a simple mechanism for allowing a single function to run with interrupts disabled, without granting it the power to arbitrarily disable interrupts. This mechanism makes it possible to implement the atomic helpers as trivial C functions that just do the non-atomic operations but run with interrupts disabled. Each atomic operation on 1, 2, 4, and 8-byte values is defined as a separate function.
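The shape of such a helper is sketched below for a 4-byte fetch-and-add. This is a host-side model, not the RTOS implementation: the `interruptsEnabled` flag is a hypothetical stand-in for the hardware interrupt state, and the function name is invented. On CHERIoT, the function body would contain only the plain read-modify-write, with interrupts disabled on entry and restored on return by the sentry mechanism rather than by explicit code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the core's interrupt-enable state. */
static bool interruptsEnabled = true;

/* Model of a fixed-size atomic helper: returns the old value and
 * stores old + value, with no interrupt possible in between. */
static uint32_t atomic_fetch_add_4_sketch(volatile uint32_t *ptr,
                                          uint32_t           value)
{
	/* Entry: on CHERIoT, calling through an interrupt-disabling
	 * sentry has this effect. */
	bool wasEnabled   = interruptsEnabled;
	interruptsEnabled = false;

	/* The operation itself is an ordinary non-atomic sequence. */
	uint32_t old = *ptr;
	*ptr         = old + value;

	/* Exit: returning restores the caller's interrupt state. */
	interruptsEnabled = wasEnabled;
	return old;
}
```

Because the critical section is a handful of instructions, the time for which interrupts are disabled is short and bounded.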

On larger cores, these simple helpers are unnecessary, but C and C++ specify that any type can be atomic. Operations such as load, store, exchange, and compare-and-exchange are expected to work even if the object is enormous. The compiler expects functions that implement these operations on arbitrary-sized data.

For types larger than a capability, CHERIoT's representations of _Atomic(T) and std::atomic<T> differ. The C version will use the variable-sized atomic helpers that run with interrupts disabled. The C++ version uses an inline lock in the object and performs atomic operations by acquiring and releasing the lock as needed. This prevents the C++ version from monopolising the CPU and breaking realtime guarantees but breaks interoperability between the two languages. In general, it is a good idea to avoid _Atomic(T) for any type that is larger than a pointer in C.

Many firmware images use only a small subset of these sizes. 32-bit atomic values are common, and are the size supported for the futex operations (see Section 7.5. Waiting for events with futexes). The RTOS provides a family of libraries for implementing atomics. The atomic1, atomic2, atomic4 and atomic8 libraries each provide atomic operations for one fixed-size type. As a convenience, there is also an atomic_fixed pretend library that simply acts as if you'd depended on all of the fixed-sized versions. Finally, there is a general atomic library, which also introduces the variable-sized types. The last of these is very rarely used.

3.7. Adding more standard-library functions

A freestanding C environment is the minimum for an embedded system but it's far from a pleasant development environment. The C string.h header contains functions such as strlen and strcpy, which do not depend on any operating-system functionality but are also not required by a freestanding environment. On CHERIoT RTOS, these are provided by the strings library.

Similarly, the stdio library provides a minimal implementation of a subset of stdio.h that is useful for debugging. In most C implementations there is a single libc or similar that provides all of the standard library. CHERIoT prefers to decompose this so that firmware images can adopt useful subsets.

Most of the C++ standard library that we provide is header-only, but where shared implementations are useful these will be similarly decomposed.

3.8. Exploring other RTOS features

The lib directory in the RTOS SDK contains all of the libraries. Some of these have been discussed already because they are part of a core environment that you might expect to be available for any compartment to use.

Others correspond more to operating-system features and will be discussed in later chapters. For example, those related to locking, message queues, and so on, will be discussed in Chapter 7. Communicating between threads. The features for debugging are discussed in Chapter 9. Features for debug builds.

Others, such as the port of the Microvium JavaScript interpreter, are not discussed in the book at all. This book does not aim to provide an exhaustive list of everything that the RTOS provides as libraries and more will be added after the book is published. Please look in the RTOS repository to see what has been added.