2. The RTOS Core
The core of the RTOS is a set of privilege-separated components. Each core component runs with privileges that place it (at least partially) in the trusted computing base (TCB) for other components.
2.1. Starting the system with the loader
The loader runs on system startup. It reads the compartment headers and populates each compartment with the set of capabilities that it needs. The loader exists so that the system can be started from a firmware image that does not embed capabilities. This is a useful property even if a particular target has persistent storage (non-volatile RAM) that can hold capabilities because it ensures that there is an on-device pointer provenance flow for the firmware.
If a device has non-volatile storage that holds tags, you will typically run the loader once at install time or on first boot of a new firmware image. This ensures that the image contains only capabilities handed to it. This, in turn, enables multi-stage boot where some functionality, such as attestation, secure key storage, and so on, is provided by a bootloader. These abstractions can all be built from capabilities and so, unlike systems based on protection rings such as TrustZone, an arbitrary number can be nested.
If a compartment contains a global that is a pointer, initialised to point to another global, the loader will initialise the pointer by deriving a capability from one of the compartment's code or data capabilities. Again, this enforces provenance properties, this time within a firmware image. A malicious compartment may provide a relocation that points to a global outside its own memory, but the loader will attempt to derive the capability only from the compartment's initial pcc (code) and cgp (globals) regions and so the derivation will fail.
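The relocation check described above can be sketched as a small host-side model. This is not the loader's code: `ModelCap`, `model_derive`, and `model_relocate` are hypothetical names, and a capability is reduced to a base, a length, and a validity flag standing in for the hardware tag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a capability: an address range plus a validity flag. */
typedef struct
{
    uintptr_t base;   /* First address covered. */
    size_t    length; /* Number of bytes covered. */
    bool      valid;  /* Stands in for the hardware tag bit. */
} ModelCap;

/* Derive a sub-range of `parent`. The result is valid only if the
 * requested range lies entirely inside the parent's bounds. */
static ModelCap model_derive(ModelCap parent, uintptr_t address, size_t size)
{
    ModelCap result = {address, size, false};
    if (parent.valid && address >= parent.base && size <= parent.length &&
        address - parent.base <= parent.length - size)
    {
        result.valid = true;
    }
    return result;
}

/* Resolve a relocation using only the compartment's own globals (cgp)
 * and code (pcc) regions. Anything else yields an invalid result. */
static ModelCap model_relocate(ModelCap pcc, ModelCap cgp,
                               uintptr_t target, size_t size)
{
    ModelCap fromGlobals = model_derive(cgp, target, size);
    if (fromGlobals.valid)
    {
        return fromGlobals;
    }
    return model_derive(pcc, target, size);
}
```

In this model, a relocation whose target lies in another compartment's memory produces an invalid result, because neither of the two permitted parents covers it.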
The loader must also provide all capabilities to compartments that allow them to communicate outside of their own private space. This includes access to memory-mapped I/O (MMIO) regions, capabilities for pre-shared objects, for software-defined capabilities, and any capabilities for calling entry points exposed by other compartments or libraries. The loader also creates the stacks and trusted stacks for each thread and creates their initial entry points.
The loader is the most privileged component in the system. When a CHERIoT CPU boots, it will have a small set of root capabilities in registers. These, between them, convey the full set of rights that can be granted by a capability. Every capability in the running system is derived (often via many steps) from one of these. As such, the loader is able to do anything.
In a system with a multi-stage boot, the initial capabilities provided to the loader may be restricted, rather than the omnipotent set from CPU boot. For example, an early loader may implement A/B booting by providing the RTOS loader with capabilities to only half of persistent memory.
The risk from the loader is mitigated by the fact that it does not run on untrusted data. The loader operates only on the instructions generated by the linker and so it is possible to audit precisely what it will do (see Chapter 10. Auditing firmware images). It is also possible to validate this by running the loader in a simulator and capturing the precise memory state after it has run.
The loader enforces some of the guarantees in the initial state. It is structured to be able to enforce some of these by construction. For example, only stacks and trusted stacks (accessible only by the switcher, see Section 2.2. Changing trust domain with the switcher) have store-local permission and these do not have global permission. The loader derives these from a capability that has store-local but not global permissions and derives all other capabilities from one that has the store-local permission removed.
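As an illustration of enforcing this by construction, the following sketch models permissions as a bit set that can only shrink. The names and bit values are hypothetical and chosen only for this example. The store rule modelled here is the usual CHERI one: a capability without global permission can be stored only through a capability that has store-local permission.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission bits; the values are arbitrary for this model. */
enum
{
    PERM_GLOBAL      = 1u << 0,
    PERM_STORE_LOCAL = 1u << 1,
    PERM_LOAD        = 1u << 2,
    PERM_STORE       = 1u << 3,
};

/* Permissions can only be removed: the result is a subset of the parent. */
static uint32_t model_and_perms(uint32_t parent, uint32_t mask)
{
    return parent & mask;
}

/* Root used for stacks and trusted stacks: store-local, not global. */
static uint32_t model_stack_root(uint32_t root)
{
    return model_and_perms(root, ~(uint32_t)PERM_GLOBAL);
}

/* Root used for everything else: store-local removed. */
static uint32_t model_general_root(uint32_t root)
{
    return model_and_perms(root, ~(uint32_t)PERM_STORE_LOCAL);
}

/* May a capability with permissions `stored` be written through a
 * capability with permissions `via`? */
static bool model_can_store_cap(uint32_t via, uint32_t stored)
{
    if (!(via & PERM_STORE))
    {
        return false;
    }
    return (stored & PERM_GLOBAL) || (via & PERM_STORE_LOCAL);
}
```

Because every stack capability descends from `model_stack_root` and everything else from `model_general_root`, a pointer to the stack can be spilled to the stack but can never be stored in a global or on the heap in this model.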
Before starting the system, the loader erases almost all of its code (leaving the stub that handles this erasure) and its stack, and clears its registers. The last bit of the loader's code becomes the idle thread (a wait-for-interrupt loop). The loader's stack is used for the scheduler stack. The memory that held the loader's code is used for heap memory.
2.2. Changing trust domain with the switcher
The switcher is the most privileged component that runs after the system finishes booting. It is responsible for transitions between threads (context switches) and between compartments (cross-compartment calls and returns). The switcher is a very small amount of code (under 500 instructions) that is expected to be amenable to formal verification.
Work is underway to formally verify the security properties of the switcher, but is still in early stages.
The switcher is the only component in a running CHERIoT system that has access-system-registers permission. It uses this primarily to access a single reserved register that holds the trusted stack. The trusted stack is a region of memory containing the register save area for context switches and a small frame for every cross-compartment call that allows safe return even if the callee has corrupted all state that it has access to.
Trusted stacks are set up by the loader. The loader passes the scheduler (see Section 2.3. Time slicing with the scheduler) a sealed capability to each of these on initialisation. The switcher holds the only permit-unseal capability for the type used to seal trusted stacks.
The context switch path spills all registers to the current trusted stack's save area and then invokes the scheduler, which returns a sealed capability to the next thread to run. It then restores the register file from this thread and resumes. If the scheduler returns an invalid capability (one not sealed with the correct type) then the switcher will raise a fault. If the exception program counter capability on exception entry is within the switcher's capability, the switcher will terminate.
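The context switch path can be sketched as a host-side model. All names here are hypothetical, the register count is arbitrary, and sealing is reduced to a type field that must match before the target can be used. The scheduler's choice is passed in as an argument rather than obtained by invoking a scheduler.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum
{
    MODEL_REGISTER_COUNT     = 4, /* Arbitrary size for the model. */
    MODEL_TRUSTED_STACK_TYPE = 1, /* Hypothetical sealing type. */
    MODEL_OTHER_TYPE         = 2,
};

/* Only the register save area of a trusted stack is modelled. */
typedef struct
{
    uint32_t savedRegisters[MODEL_REGISTER_COUNT];
} ModelTrustedStack;

/* A sealed reference: opaque unless unsealed with the right type. */
typedef struct
{
    ModelTrustedStack *target;
    uint32_t           type;
} ModelSealedCap;

/* Unsealing succeeds only for the trusted-stack type. */
static ModelTrustedStack *model_unseal(ModelSealedCap sealed)
{
    return (sealed.type == MODEL_TRUSTED_STACK_TYPE) ? sealed.target : NULL;
}

/* Spill registers to the current save area, then resume the thread the
 * scheduler chose. Returns false (modelling a fault) if the scheduler's
 * answer was not sealed with the expected type. */
static bool model_context_switch(uint32_t          *registers,
                                 ModelTrustedStack *current,
                                 ModelSealedCap     schedulerChoice)
{
    memcpy(current->savedRegisters, registers,
           sizeof(current->savedRegisters));
    ModelTrustedStack *next = model_unseal(schedulerChoice);
    if (next == NULL)
    {
        return false;
    }
    memcpy(registers, next->savedRegisters, sizeof(next->savedRegisters));
    return true;
}
```

The point of the model is that the scheduler only ever handles the sealed value: it can choose which thread runs next, but it cannot read or modify the saved registers.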
On the cross-compartment call path, the switcher is responsible for unsealing the capability that refers to the export table of the callee, clearing unused argument registers, pushing the information about the return to the trusted stack, subsetting the bounds of the stack, and zeroing the part of the stack passed to the callee. On return, it zeroes the stack again, zeroes unused return registers, and restores the caller's state.
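The stack handling on the call and return paths can be sketched as follows. This is a deliberately reduced model with hypothetical names: it covers only the trusted stack frame and the zeroing of the stack region handed to the callee, and omits export table unsealing and register clearing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum
{
    MODEL_STACK_BYTES = 64, /* Arbitrary sizes for the model. */
    MODEL_MAX_FRAMES  = 4,
};

/* What the trusted stack remembers about each cross-compartment call. */
typedef struct
{
    size_t callerStackPointer;
} ModelFrame;

typedef struct
{
    uint8_t    stack[MODEL_STACK_BYTES]; /* Grows down from the end. */
    size_t     stackPointer; /* Bytes below this index are unused. */
    ModelFrame frames[MODEL_MAX_FRAMES];
    size_t     frameCount;
} ModelThread;

/* Call path: record how to return, then zero the unused part of the
 * stack, which is the only part the callee receives. */
static bool model_call(ModelThread *thread)
{
    if (thread->frameCount == MODEL_MAX_FRAMES)
    {
        return false; /* No trusted stack space left. */
    }
    thread->frames[thread->frameCount++].callerStackPointer =
      thread->stackPointer;
    memset(thread->stack, 0, thread->stackPointer);
    return true;
}

/* Return path: zero everything the callee could have written and
 * restore the caller's stack pointer from the trusted stack. */
static bool model_return(ModelThread *thread)
{
    if (thread->frameCount == 0)
    {
        return false;
    }
    size_t callerStackPointer =
      thread->frames[--thread->frameCount].callerStackPointer;
    memset(thread->stack, 0, callerStackPointer);
    thread->stackPointer = callerStackPointer;
    return true;
}
```

The return information lives in `frames`, which the callee cannot reach in this model, so the return works even if the callee overwrites every byte of stack that it was given.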
This means that the switcher is the only component that has access to either two threads', or two compartments', state at the same time. As such, it is in the TCB for both compartment and thread isolation. This risk is mitigated in several ways:
- The switcher is small. It contains a similar number of instructions to the amount of unverified code in seL4.
- The switcher is defensive. Most errors simply forcibly unwind to the previous trusted stack frame, so a compartment that attempts to attack the switcher exits to its caller.
- Like everything else in the system, it must follow the capability rules. Unlike an operating system running in a privileged mode on mainstream hardware, it does not get to opt out of memory protection: it is not able to access beyond the bounds of capabilities passed to it or access any memory to which it does not have an explicit capability.
- It is largely stateless: all state that it modifies is held in the trusted stack for the current thread.
The switcher appears to the rest of the system as a library. It can expose functions for inspecting or, in a small number of cases, modifying state. These are defined in switcher.h. For example, prior to performing a cross-compartment call, you may want to check that there is sufficient space on the trusted stack for the number of calls that it will need to make. The trusted_stack_has_space function exposed by the switcher lets you query if the trusted stack has enough space for a specified number of cross-compartment calls. The amount of (normal) stack space is directly visible in a compartment and so normal stack checks do not require the switcher to be involved.
_Bool trusted_stack_has_space(int requiredFrames)
Returns true if the trusted stack contains at least requiredFrames frames past the current one, false otherwise.
Note: This is faster than calling either trusted_stack_index or trusted_stack_size and so should be preferred in guards.
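A guard built on this function might look like the sketch below. So that the example is self-contained on a host machine, `trusted_stack_has_space` is replaced by a stand-in with the same signature, backed by a hypothetical `modelFreeFrames` variable. On a real system the function comes from switcher.h. The operation `do_nested_work` and the result codes are also hypothetical.

```c
#include <stdbool.h>

/* Host-side stand-in for the switcher function, for illustration only. */
static int   modelFreeFrames = 2;
static _Bool trusted_stack_has_space(int requiredFrames)
{
    return requiredFrames <= modelFreeFrames;
}

/* Hypothetical result codes for this example. */
enum
{
    RESULT_OK               = 0,
    RESULT_NO_TRUSTED_STACK = -1,
};

/* Counts the cross-compartment calls the model pretends to make. */
static int modelCallsMade;

/* Hypothetical operation that needs two nested cross-compartment calls.
 * It checks for space first so that it can fail cleanly, before any
 * state has been modified. */
static int do_nested_work(void)
{
    if (!trusted_stack_has_space(2))
    {
        return RESULT_NO_TRUSTED_STACK;
    }
    modelCallsMade += 2; /* Stands in for the real calls. */
    return RESULT_OK;
}
```

Checking up front means that the function either runs to completion or reports an error without having started, rather than failing part-way through a sequence of calls.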
The switcher also implements the thread_id_get function, which provides a fast way for compartments to determine which thread they are currently running on. This function is used in the implementation of priority-inheriting locks (see Section 6.7. Inheriting priorities). Implementing efficient priority-inheriting locks requires a fast mechanism for getting the current thread ID so that it can be stored in the lock.
uint16_t thread_id_get()
Return the thread ID of the current running thread. This is mostly useful where one compartment can run under different threads and it matters which thread entered this compartment.
This is implemented in the switcher.
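The sketch below shows why a lock wants a cheap thread ID: the owner is recorded in the lock itself. It is a single-threaded host-side model, not the RTOS lock implementation. `thread_id_get` is replaced by a stand-in backed by a hypothetical `modelCurrentThread` variable, the model assumes that 0 is never a valid thread ID, and a real lock would update the word atomically.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side stand-in for the switcher function, for illustration only. */
static uint16_t modelCurrentThread = 1;
static uint16_t thread_id_get(void)
{
    return modelCurrentThread;
}

/* Lock state: 0 means unlocked (assumption: 0 is not a thread ID),
 * any other value is the ID of the owning thread. */
typedef struct
{
    uint16_t owner;
} ModelLock;

/* Acquire the lock if it is free, recording the caller as the owner. */
static bool model_try_lock(ModelLock *lock)
{
    if (lock->owner != 0)
    {
        return false;
    }
    lock->owner = thread_id_get();
    return true;
}

/* Only the owning thread may release the lock. */
static bool model_unlock(ModelLock *lock)
{
    if (lock->owner != thread_id_get())
    {
        return false;
    }
    lock->owner = 0;
    return true;
}
```

With the owner stored in the lock, a waiter can tell which thread is holding it, which is the information that priority inheritance needs.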
2.3. Time slicing with the scheduler
When the switcher receives an interrupt (including an explicit yield), it delegates the decision about what to run next to the scheduler. The scheduler has direct access to the interrupt controller but, in most respects, is just another compartment.
The switcher also holds a capability to a small stack for use by the scheduler. This is not quite a full thread. It cannot make cross-compartment calls and is not independently schedulable. When the switcher handles an interrupt, it invokes the scheduler's entry point on this stack.
The scheduler also exposes other entry points that can be invoked by cross-compartment calls. These fulfil a role similar to system calls on other operating systems, for example waiting for external events or performing inter-thread communication. The scheduler implements blocking operations by moving the current thread from a run queue to a sleep queue and then issuing a software interrupt instruction to branch to the switcher. When the switcher then invokes the scheduler to make a scheduling decision, it will discover that the current thread is no longer runnable and pick another. Once the thread becomes runnable again, the switcher resumes the thread from the point where it yielded, at which point it can return from the scheduler.
The scheduler is, by definition, in the TCB for availability. It is the component that decides which threads run and which do not. A bug in the scheduler (with or without an active attacker) can result in a thread failing to run.
It is not, however, in the TCB for confidentiality or integrity. The scheduler has no mechanism to inspect the state of an interrupted thread. When invoked explicitly, it is called with a normal cross-compartment call and so has no access to anything other than the arguments.
As with the switcher, the scheduler mitigates these risks by being small (though larger than the switcher). It currently compiles to under 4 KiB of object code. This small size is accomplished by providing only a small set of features that can be used as building blocks for other tasks.
For example, some embedded operating systems provide features such as message queues in their kernel. In CHERIoT RTOS, these are provided by a separate library, which relies on the futex (see Section 6.5. Waiting for events with futexes) facility exposed by the scheduler to allow a producer to block when the queue is full and allow consumers to block when the queue is empty.
Futexes are the only mechanism that the scheduler provides for blocking. Interrupts are mapped to futexes and so threads wait for hardware or software events in exactly the same way. This narrow interface and clear separation of concerns helps improve overall system security.
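To make the building-block idea concrete, here is a single-threaded model of how a bounded queue can be layered on a compare-and-block primitive. It does not use the RTOS futex API. `model_futex_wait` is a hypothetical function that only reports whether a waiter would sleep, and the queue functions return false at the points where a real implementation would block.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    MODEL_WAIT_RETURNS_AT_ONCE,
    MODEL_WAIT_WOULD_SLEEP,
} ModelWaitResult;

/* Core futex idea: sleep only if the word still holds the value that the
 * caller expects. This model reports the decision instead of sleeping. */
static ModelWaitResult model_futex_wait(const uint32_t *address,
                                        uint32_t        expected)
{
    return (*address == expected) ? MODEL_WAIT_WOULD_SLEEP
                                  : MODEL_WAIT_RETURNS_AT_ONCE;
}

enum
{
    MODEL_QUEUE_CAPACITY = 2, /* Arbitrary size for the model. */
};

typedef struct
{
    uint32_t count; /* The word that producers and consumers wait on. */
    uint32_t head;  /* Index of the oldest element. */
    int      items[MODEL_QUEUE_CAPACITY];
} ModelQueue;

/* A producer waits while the count equals the capacity (queue full). */
static bool model_queue_send(ModelQueue *queue, int value)
{
    if (model_futex_wait(&queue->count, MODEL_QUEUE_CAPACITY) ==
        MODEL_WAIT_WOULD_SLEEP)
    {
        return false; /* A real producer would block here. */
    }
    queue->items[(queue->head + queue->count) % MODEL_QUEUE_CAPACITY] = value;
    queue->count++; /* A real queue would now wake waiting consumers. */
    return true;
}

/* A consumer waits while the count equals zero (queue empty). */
static bool model_queue_receive(ModelQueue *queue, int *value)
{
    if (model_futex_wait(&queue->count, 0) == MODEL_WAIT_WOULD_SLEEP)
    {
        return false; /* A real consumer would block here. */
    }
    *value      = queue->items[queue->head];
    queue->head = (queue->head + 1) % MODEL_QUEUE_CAPACITY;
    queue->count--; /* A real queue would now wake waiting producers. */
    return true;
}
```

All of the queue logic lives outside the scheduler in this arrangement. The only thing the queue needs from the scheduler is the ability to sleep until a word changes.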
2.4. Sharing memory from the allocator
The final core component is the memory allocator. It provides the heap, which is used for all dynamic memory allocations. This is discussed in detail in Chapter 7. Memory management in CHERIoT RTOS. Sharing memory between compartments in CHERIoT requires nothing more than passing pointers (until you start to add availability requirements in the presence of mutual distrust). This means that you can allocate objects (or complex object graphs) from a few bytes up to the entire memory of the system and share them with other compartments.
The allocator has access to the shadow bitmap and hardware revocation engine that enforce temporal safety for the heap, and is responsible for setting bounds on allocated memory. It is therefore trusted for confidentiality and integrity of memory allocated from the heap. If it incorrectly sets bounds, a compartment may gain access to memory belonging to another allocation. If it incorrectly configures revocation state or reuses memory too early then a use-after-free bug may become exploitable.
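The two responsibilities named above, setting bounds and marking freed memory for revocation, can be sketched with a toy bump allocator. Every name here is hypothetical, the sizes are arbitrary, and the revocation check is reduced to a lookup on the pointer's base. The real allocator and hardware are considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

enum
{
    MODEL_HEAP_BYTES = 64, /* Arbitrary sizes for the model. */
    MODEL_GRANULE    = 8,
    MODEL_GRANULES   = MODEL_HEAP_BYTES / MODEL_GRANULE,
};

/* A heap pointer: bounds plus a validity flag standing in for the tag. */
typedef struct
{
    size_t base;
    size_t length;
    bool   valid;
} ModelPointer;

typedef struct
{
    size_t next;                    /* Bump-allocation cursor. */
    bool   revoked[MODEL_GRANULES]; /* Stands in for the shadow bitmap. */
} ModelHeap;

/* Allocate and set bounds to exactly the rounded-up allocation size. */
static ModelPointer model_alloc(ModelHeap *heap, size_t size)
{
    ModelPointer pointer = {0, 0, false};
    if (size == 0 || size > MODEL_HEAP_BYTES)
    {
        return pointer;
    }
    size_t rounded = (size + MODEL_GRANULE - 1) / MODEL_GRANULE * MODEL_GRANULE;
    if (rounded > MODEL_HEAP_BYTES - heap->next)
    {
        return pointer;
    }
    pointer.base   = heap->next;
    pointer.length = rounded;
    pointer.valid  = true;
    heap->next += rounded;
    return pointer;
}

/* Freeing marks every granule of the allocation as revoked. */
static void model_free(ModelHeap *heap, ModelPointer pointer)
{
    for (size_t g = pointer.base / MODEL_GRANULE;
         g < (pointer.base + pointer.length) / MODEL_GRANULE; g++)
    {
        heap->revoked[g] = true;
    }
}

/* Loading a pointer whose base is in revoked memory invalidates it. */
static ModelPointer model_load(const ModelHeap *heap, ModelPointer pointer)
{
    if (pointer.valid && heap->revoked[pointer.base / MODEL_GRANULE])
    {
        pointer.valid = false;
    }
    return pointer;
}

/* An access is permitted only through a valid pointer, within bounds. */
static bool model_in_bounds(ModelPointer pointer, size_t address)
{
    return pointer.valid && address >= pointer.base &&
           address - pointer.base < pointer.length;
}
```

Both failure modes from the text map onto this model: a wrong `length` in `model_alloc` would let one allocation reach into its neighbour, and a missing mark in `model_free` would leave a stale pointer usable after the memory is reused.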
The allocator is not able to bypass capability permissions; it simply holds a capability that spans the whole of heap memory. As such, it is in the TCB only with respect to heap allocations. It cannot access globals (or code) held in other compartments, and so a compartment that does not use the heap does not need to trust the allocator.
The allocator also provides a rich set of mechanisms (described in Chapter 7. Memory management in CHERIoT RTOS) for two mutually distrusting compartments to ensure that memory is not deallocated at inconvenient times.