2. The RTOS Core
The core of the RTOS is a set of privilege-separated components. Each core component runs with some privileges that mean that it is (at least partially) in the trusted computing base (TCB) for other things.
2.1. Starting the system with the loader
The loader runs on system startup. It reads the compartment headers and populates each compartment with the set of capabilities that it needs. If a compartment contains a global that is a pointer, initialised to point to another global, the loader will initialise pointer by deriving a capability from one out of the compartment's code or data capabilities. In addition, cross-compartment and cross-library calls and imported MMIO regions all require capabilities to be set up, as does the initial register file for each thread.
The loader is the most privileged component in the system. When a CHERIoT CPU boots, it will have a small set of root capabilities in registers. These, between them, convey the full set of rights that can be granted by a capability. Every capability in the running system is derived (often via many steps) from one of these. As such, the loader is able to do anything.
The risk from the loader is mitigated by the fact that it does not run on untrusted data. The loader operates only on the instructions generated by the linker and so it is possible to audit precisely what it will do. It is also possible to validate this by running the loader in a simulator and capturing the precise memory state after it has run.
The loader enforces some of the guarantees in the initial state. It is structured to be able to enforce some of these by construction. For example, only stacks and trusted stacks (accessible only by the switcher, see Section 2.2. Changing trust domain with the switcher) have store-local permission and these do not have global permission. The scheduler derives these from a capability that has store-local but not global permissions and derives all other capabilities from one that has the store-local permission removed.
Before starting the system, the loader erases almost all of its code (leaving the stub that handles this erasure), its stack, and clears its registers. The last bit of the loader's code becomes the idle thread (a wait-for-interrupt loop). The loader's stack is used for the scheduler stack. The memory that held loader's code is used for heap memory.
2.2. Changing trust domain with the switcher
The switcher is the most privileged component that runs after the system finishes booting. It is responsible for transitions between threads (context switches) and between compartments (cross-compartment calls and returns). The switcher is a very small amount of code—under 500 instructions—that is expected to be amenable to formal verification.
The switcher is the only component in a running CHERIoT system that has access-system-registers permission. It uses this primarily to access a single reserved register that holds the trusted stack. The trusted stack is a region of memory containing the register save area for context switches and a small frame for every cross-compartment call that allows safe return even if the callee has corrupted all state that it has access to.
Trusted stacks are set up by the loader. The loader passes the scheduler (see Section 2.3. Time slicing with the scheduler) a sealed capability to each of these on initialisation. The switcher holds the only permit-unseal capability for the type used to seal trusted stacks.
The context switch path spills all registers to the current trusted stack's save area and then invokes the scheduler, which returns a sealed capability to the next thread to run. It then restores the register file from this thread and resumes. If the scheduler returns an invalid capability (one not sealed with the correct type) then the switcher will raise a fault. If the exception program counter capability on exception entry is within the switcher's capability, the switcher will terminate.
On the cross-compartment call path, the switcher is responsible for unsealing the capability that refers to the export table of the callee, clearing unused argument registers, pushing the information about the return to the trusted stack, subsetting the bounds of the stack, and zeroing the part of the stack passed to the callee. On return, it zeroes the stack again, zeroes unused return registers, and restores the callee's state.
This means that the switcher is the only component that has access to either two threads', or two compartments', state at the same time. As such, it is in the TCB for both compartment and thread isolation. This risk is mitigated in several ways:
- The switcher is small. It contains a similar number of instructions to the amount of unverified code in seL4.
- The switcher is defensive. Most errors simply forcibly unwind to the previous trusted stack frame, so a compartment that attempts to attack the switcher exits to its caller.
- Like everything else in the system, it must follow the capability rules. Unlike an operating system running in a privileged mode on mainstream hardware, it does not get to opt out of memory protection, it is not able to access beyond the bounds of capabilities passed to it or access any memory to which it does not have an explicit capability.
- It is largely stateless, all state that it modifies is held in the trusted stack.
2.3. Time slicing with the scheduler
When the switcher receives an interrupt (including an explicit yield), it delegates the decision about what to run next to the _scheduler_. The scheduler has direct access to the interrupt controller but, in most respects, is just another compartment.
The switcher also holds a capability to a small stack for use by the scheduler. This is not quite a full thread. It cannot make cross-compartment calls and is not independently schedulable. When the switcher takes an interrupt, it invokes the switcher's entry point on this stack.
The switcher also exposes other entry points that can be invoked by cross-compartment calls. These fulfil a role similar to system calls on other operating systems, for example waiting for external events or performing inter-thread communication. The scheduler implements blocking operations by moving the current thread from a run queue to a sleep queue and then issuing a software interrupt instruction to branch to the switcher. When the switcher then invokes the scheduler to make a scheduling decision, it will discover that the current thread is no longer runnable and pick another. Once the thread becomes runnable again, the switcher resumes the thread from the point where it yielded, at which point it can return from the scheduler.
The scheduler is, by definition, in the TCB for availability. It is the component that decides which threads run and which do not. A bug in the scheduler (with or without an active attacker) can result in a thread failing to run.
It is not; however, in the TCB for confidentiality or integrity. The scheduler has no mechanism to inspect the state of an interrupted thread. When invoked explicitly, it is called with a normal cross-compartment call and so has no access to anything other than the arguments.
2.4. Sharing memory from the allocator
The final core component is the memory allocator, which provides the heap, which is used for all dynamic memory allocations. This is discussed in detail in Chapter 6. Memory management in CHERIoT RTOS. Sharing memory between compartments in CHERIoT requires nothing more than passing pointers (until you start to add availability requirements in the presence of mutual distrust). This means that you can allocate objects (or complex object graphs) from a few bytes up to the entire memory of the system and share them with other compartments.
The allocator has access to the shadow bitmap and hardware revocation engine that enforce temporal safety for the heap, and is responsible for setting bounds on allocated memory. It is therefore trusted for confidentiality and integrity of memory allocated from the heap. If it incorrectly sets bounds, a compartment may gain access to memory belonging to another allocation. If it incorrectly configures revocation state or reuses memory too early then a use-after-free bug may become exploitable.
The allocator is not able to bypass capability permissions, it simply holds a capability that spans the whole of heap memory. As such, it is in the TCB only with respect to heap allocations. It cannot access globals (or code), held in other compartments and so a compartment that does not use the heap does not need to trust the allocator.
2.5. Building firmware images
CHERIoT RTOS uses the xmake build system. Xmake is a build system implemented in Lua. It was chosen because it is easy to add new kinds of build targets.
In a typical system that uses the compile-link process invented by Mary Allen Wilkes in the '60s, you compile source files to object code and then link object code to produce executables. You may have an intermediate step that produces libraries.
The CHERIoT build process was designed to enable separate compilation and binary distribution of components. Each source file is compiled either for use in a shared library or for use in a specific compartment. This means that, when building compartments, the compiler invocation must know the compartment in which the object file will be used.
Next, compartments and libraries are linked. This requires a special invocation of the linker that produces a relocatable object file with the correct structure. At this point, the only exported symbols are those for exported functions and the only undefined symbols should be those for MMIO regions or exports from other compartments (see Chapter 4. Compartments and libraries for more information).
The build system produces a .library or .compartment file for each shared library and each compartment.
The final link step produces a firmware image. It also produces the JSON report that describes all cross-compartment interactions and is used for auditing.
Using the RTOS build system involves writing an xmake.lua file that describes the build. This starts with some boilerplate:
set_project("CHERIoT example")
sdkdir = os.getenv("CHERIOT_SDK") or
"../../../rtos-source/sdk/"
includes(sdkdir)
set_toolchains("cheriot-clang")
The set_project call gives a name to the project.
Lines 7–9 import the RTOS SDK. This first tries to use the CHERIOT_SDK environment variable and, if not, tries a relative file. The sdkdir variable should point to the location of the sdk directory from the RTOS repository. Finally, line 11 selects the CHERIoT toolchain. Ideally this line would not be needed, but xmake's scoping rules require it to be provided here.
This boilerplate snipped will exist at the top of most xmake.lua files for CHERIoT. Only the name of the project (and possibly the path to the SDK) will be different.
The SDK file provides rules for building the various kinds of CHERIoT components (compartments, libraries, and firmware) and also includes all of the libraries that are part of the RTOS. These libraries include the core definitions for a freestanding C implementation (memcpy and friends), the atomic helpers for cores without atomic instructions, and the C runtime things that are called from compiler builtins. See the lib directory in the SDK for a full list.
If you want your firmware built to support running on more than one CHERIoT implementation then you will typically want to make the board to target an option, as shown in Listing 2. This exposes a --board option at the configure stage, we'll see how this is used later.
option("board")
set_default("sail")
-- end#begin
-- compartments#begin
-- An example compartment that we can call
compartment("example_compartment")
add_files("compartment.cc")
-- Our entry-point compartment
compartment("hello")
add_files("hello.cc")
-- compartments#end
-- firmware#begin
-- Firmware image for the example.
firmware("hello_world")
-- RTOS-provided libraries
add_deps("freestanding", "stdio")
-- Our compartments
add_deps("hello", "example_compartment")
on_load(function(target)
-- The board to target
target:values_set("board", "$(board)")
-- Threads to select
target:values_set("threads", {
{
compartment = "hello",
priority = 1,
entry_point = "entry",
stack_size = 0x400,
trusted_stack_frames = 2
}
}, {expand = false})
end)
-- firmware#end
You can set a default and we use "sail" here for the simulator build from our Sail formal model of the ISA. This refers to a board description file (see Chapter 10. Adding a new board). If you're usually targeting a particular hardware platform, setting the default here allows users to avoid specifying it manually on every build. If you're always targeting a particular hardware platform then you can avoid this entirely.
Next, you need to add any compartments and libraries that are specific to this firmware image. In most cases, you can do this in just two lines, the first providing the name of the compartment and the second providing the list of files, as shown in Listing 3. For this example, we'll have two compartments. One is our entry point, the other is a function that we'll use as a simple example of a cross-compartment call.
-- An example compartment that we can call
compartment("example_compartment")
add_files("compartment.cc")
-- Our entry-point compartment
compartment("hello")
add_files("hello.cc")
The name of the compartment in the xmake.lua must match the name used for the exported function as described in Section 4.6. Exporting functions from libraries and compartments. If they do not match, the compiler will raise an error that a function is defined in the wrong compartment.
This example is going to do a cross-compartment call and then print the result using printf, which is provided by the stdio library from the RTOS. The cross-compartment call is exposed from hello.h as shown in Listing 4. The only difference between this and a normal C/C++ function prototype is the __cheri_compartment macro. This is explained in detail in Section 4.6. Exporting functions from libraries and compartments.
/**
* Example of a function in a compartment.
*/
__cheri_compartment("example_compartment")
int exported_function(void);
The implementation of this function is trivial (see Listing 5), it just returns 42. Note that, aside from the annotation from the function prototype, we don't need any changes to expose this for use from other compartments. The same is true on the caller's side, as shown in Listing 6. Functions exported from a compartment are called just like any other C function.
#include "hello.h"
int exported_function()
{
return 42;
}
/// Thread entry point.
void __cheri_compartment("hello") entry()
{
printf("compartment returned %d\n",
exported_function());
}
The entry function is also annotated as a function exported from a compartment. This is because it's a thread entry point, a function that is called at the start of a thread. In CHERIoT RTOS, threads are statically defined. This is described in more detail in Chapter 5. Communicating between threads.
Returning to the build system Listing 7 shows how the firmware block defines everything that's combined together to create a firmware image. First, the add_deps lines are defining the compartments and libraries that are linked. The first two are libraries provided by the RTOS, implementing the core functions for a freestanding C environment and a minimal subset of stdio.h functions, respectively. The next two are the compartments that we defined earlier.
Not all of the metadata that we set can be defined in the declarative syntax of xmake and so we have to implement a function using the on_load hook to set the remaining properties. The "board" property is set from the option that we declared. This is where, if you don't need to support multiple targets, you could directly specify the board that you wish to target.
-- Firmware image for the example.
firmware("hello_world")
-- RTOS-provided libraries
add_deps("freestanding", "stdio")
-- Our compartments
add_deps("hello", "example_compartment")
on_load(function(target)
-- The board to target
target:values_set("board", "$(board)")
-- Threads to select
target:values_set("threads", {
{
compartment = "hello",
priority = 1,
entry_point = "entry",
stack_size = 0x400,
trusted_stack_frames = 2
}
}, {expand = false})
end)
The "threads" property is set to an array (as a Lua array literal) of thread descriptions. Each thread must set five properties:
- compartment
- The compartment in which this thread starts.
- entry_point
- The name of the function for this thread's entry point. This must be a function that takes and returns void, exported from the compartment specified by the compartment key.
- priority
- The priority of this thread. Higher numbers indicate higher priorities.
- stack_size
- The number of bytes of stack space that this thread has allocated.
- trusted_stack_frames
- The number of trusted stack frames. Each cross-compartment call pushes a new frame onto this stack and so this defines the maximum number of compartments call depth (including the entry point) for this thread.