9. Writing a device driver
CHERIoT aims to be small and easy to customize. It does not have a generic device driver interface but it does have a number of tools that make it possible to write modular device drivers.
9.1. What is a device?
From the perspective of the CPU, a device is something that you communicate with via a memory-mapped I/O (MMIO) interface, which may (optionally) generate interrupts. There are several devices that the core parts of the RTOS interact with:
- The UART, which is used for writing debug output during development.
- The core-local interrupt controller, which is used for managing timer interrupts.
- The platform interrupt controller, which is used for managing external interrupts.
- The revoker, which scans memory for dangling capabilities (pointers) and invalidates them.
Most embedded systems on chip will include additional devices. These range from very simple interfaces, such as general-purpose I/O (GPIO) pins that are mapped to a bit in a register, up to entire wireless network interfaces with rich sets of functionality.
CHERIoT, like most modern systems, makes heavy use of MMIO. Device registers are exposed as if they are memory locations. To read from a device register, you simply execute a load instruction on the CPU. Similarly, to write to a device register, you execute a store instruction.
This model is very convenient for CHERI systems because CHERI capabilities already allow you to restrict access to ranges of memory. This means that we don't need to define a new protection model for device access on CHERIoT. Capabilities can grant access to MMIO ranges just as they do to real memory. You can provide a read-only capability to a device range, or even to a single register.
A read-only capability to a device's MMIO region may convey more rights than you expect. For example, a device register may be the end of a memory-mapped FIFO, in which case reading it would remove the front entry. More generally, reads of device memory might have side effects. You will generally know this per device, but don't assume that read-only means may-not-affect-device-state when providing capabilities to device memory.
This abstraction also works at the C level or higher. A device's MMIO region is referred to by a volatile pointer to a structure representing the device's registers. Reading or writing the device's registers then becomes simple field access in C (or C++, or some higher-level language).
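As a minimal sketch of this field-access model, the following uses a hypothetical three-register layout and an ordinary global standing in for the device so that it can run on a host. The names ExampleDeviceMMIO, example_device, and the register fields are assumptions for illustration only; a real driver would obtain the pointer from the RTOS rather than from a global.

```c
#include <stdint.h>

/* Hypothetical register layout. A real layout comes from the device's
 * documentation. */
struct ExampleDeviceMMIO
{
    uint32_t control;
    uint32_t status;
    uint32_t data;
};

/* Assumption: ordinary RAM stands in for the device's MMIO region. */
static struct ExampleDeviceMMIO simulatedDevice;

/* Returns a volatile pointer so that the compiler performs every access. */
static volatile struct ExampleDeviceMMIO *example_device(void)
{
    return &simulatedDevice;
}

/* Writing a field compiles to a store to the corresponding register. */
static void example_device_enable(void)
{
    example_device()->control = 1;
}

/* Reading a field compiles to a load from the corresponding register. */
static uint32_t example_device_control(void)
{
    return example_device()->control;
}
```

The volatile qualifier matters here: without it, the compiler would be free to merge or remove accesses that look redundant for normal memory.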
This abstraction is sufficient for polling, where you query the device periodically to see if it has anything ready to process. Polling may be sufficient for devices with a simple request-response interface. For example, if you send some plaintext to an AES engine and then read the ciphertext back, you always know that there's data ready (or that there will be within a few cycles). In contrast, for something like a UART, Ethernet interface, or USB controller, there will be long periods where the device has no data for you to process. Querying every such device in a loop would be inefficient in terms of both power and performance, and so devices can raise interrupts. At a high level, interrupts are asynchronous events that come from a device; Section 9.5. Handling interrupts discusses them in more detail. When a frame is ready on an Ethernet controller, for example, it will send an interrupt letting the CPU know, so that some software can handle the incoming frame.
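The polling approach described above can be sketched as a bounded loop over a status register. The STATUS_READY bit and the poll_until_ready helper are hypothetical, and the status word is passed in as a pointer so that the sketch can be exercised with ordinary memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 'data ready' bit in a device's status register. */
#define STATUS_READY 0x1u

/*
 * Reads the status register up to maxPolls times. Returns true as soon as
 * the ready bit is observed, or false if it never was.
 */
static bool poll_until_ready(volatile const uint32_t *status,
                             uint32_t                 maxPolls)
{
    for (uint32_t i = 0; i < maxPolls; i++)
    {
        if ((*status & STATUS_READY) != 0)
        {
            return true;
        }
    }
    return false;
}
```

Bounding the number of polls keeps a misbehaving device from hanging the caller, which is one reason polling suits only devices that respond within a predictable time.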
Between MMIO regions and interrupts, you have the building blocks for interfacing with any CHERIoT device.
9.2. Why do device drivers exist?
Device drivers are software interfaces to devices. In general (not limited to CHERIoT) they exist for two reasons:
- Abstraction
- A device driver allows software to be written to interface with a device class rather than a single device.
- Multiplexing
- A device may need to be accessed by different software components.
If you want to write a USB protocol stack or a network stack, you need some generic interface to USB controllers or Ethernet interfaces. This code should not have to be specific to each device. Instead, it needs an abstract way of talking to any device of the correct class.
Often, the device multiplexing is delegated to a higher-level piece of software. For example, a disk interface may provide a generic block device abstraction but then give exclusive access to the next layer up in a storage stack. This may be a volume manager, which presents a set of logical block devices to other things in a kernel. On top of this, you'd often run a filesystem driver, which provides a way of naming variable-sized virtual disks ('files') and allows different users and different programs to store data independently.
The multiplexing and abstraction features may be entangled. On most operating systems, the common interface to the network is a socket, not a time slice in a network device. The TCP/IP stack is responsible for both providing abstractions (TCP and UDP sockets) and for multiplexing (different components can have different sockets and treat them as if they had unique access to a network device).
On a CHERIoT system, the correct structure for any device driver depends on the trust model. This determines how (or if) you should build multiplexing on top of abstraction.
For example, if you have a GPIO device that controls some LEDs, you may simply want to delegate direct access to that device to a compartment that wants to control them. Alternatively, you may want to provide an interface that allows individual compartments to have control over a single LED, or allows compartments to monotonically set any of them but requires a different permission for clearing them.
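The policy of allowing anyone to set LEDs while restricting who may clear them can be sketched as follows. This is only an illustration of the logic: the permission bits, function names, and the simulated register are all assumptions. On a real CHERIoT system the caller's rights would more likely be carried by a sealed software capability than by a plain integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission bits held by a caller. */
#define LED_PERMIT_SET 0x1u
#define LED_PERMIT_CLEAR 0x2u

/* Assumption: ordinary memory stands in for the GPIO output register. */
static uint32_t simulatedLedRegister;

/* Turns on the LEDs in mask if the caller may set LEDs. */
static bool leds_set(uint32_t permissions, uint32_t mask)
{
    if ((permissions & LED_PERMIT_SET) == 0)
    {
        return false;
    }
    simulatedLedRegister |= mask;
    return true;
}

/* Turns off the LEDs in mask, which requires a separate permission. */
static bool leds_clear(uint32_t permissions, uint32_t mask)
{
    if ((permissions & LED_PERMIT_CLEAR) == 0)
    {
        return false;
    }
    simulatedLedRegister &= ~mask;
    return true;
}
```

A caller holding only LED_PERMIT_SET can turn LEDs on but cannot turn them off again, which gives the monotonic behaviour described above.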
For more complex devices, such as SPI, Ethernet, or USB, you will want a low-level device driver that provides a generic interface to the device. This driver will be wrapped in something that provides a richer interface to other compartments.
9.3. Specifying a device's locations
Devices are specified in the board description file, which is described in detail in Chapter 11. Adding a new board. The two relevant parts are the devices node, which specifies the memory-mapped I/O devices, and the interrupts section, which describes how external interrupts should be configured. For example, our initial FPGA prototyping platform had sections like this describing its Ethernet device:
"devices" : {
    "ethernet" : {
        "start" : 0x98000000,
        "length": 0x204
    },
    ...
},
"interrupts": [
    {
        "name": "Ethernet",
        "number": 16,
        "priority": 3
    }
],
The first part says that the ethernet device's MMIO space is 0x204 bytes long and starts at address 0x98000000. The second says that interrupt number 16 is used for the ethernet device.
9.4. Accessing the memory-mapped I/O region
The MMIO_CAPABILITY macro is used to get a pointer to memory-mapped I/O devices. This takes two arguments. The first is the C/C++ type of the pointer, the second is the name from the board configuration file. For example, to get a pointer to the memory-mapped I/O space for the ethernet device above, we might do something like:
struct EthernetMMIO
{
    // Control register layout here:
    ...
};

__always_inline volatile struct EthernetMMIO *ethernet_device()
{
    return MMIO_CAPABILITY(struct EthernetMMIO, ethernet);
}
This macro must be used in code; it cannot be used for static initialisation. The macro expands to a load from the compartment's import table. Assigning the result of it to a global is an antipattern: you will get smaller code using it directly. The helper shown here will be inlined and expand to a single capability load.
Now that you have a pointer to a volatile object representing the device's MMIO region, you can access its control registers directly. Any device can be accessed from any compartment in this way, but that access will appear in the linker audit report.
Any compartment that accesses this device will have an entry in the audit report that looks like this:
{ "kind": "MMIO", "length": 516, "start": 2550136832 },
There is no generic policy for device access because the right policy depends on the device and the SoC. Consider a device that has two GPIO pins, one connected to an LED used to indicate a fault in the device and the other to trigger the sprinkler system for the building. You would probably write a policy that allows most compartments to indicate a fault, but restricts access to the sprinkler control to a single compartment. From the perspective of both the SoC and the RTOS, the two pins are identical.
You can then audit whether a firmware image enforces whatever policy you want (for example, no compartment other than a device driver may access the device directly). Note that the linker reports will always provide the addresses and lengths in decimal, because they are standard JSON. We support a small number of extensions to JSON in the files that we consume, to improve usability, but don't use these in files that we produce, to improve interoperability.
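As a small illustration of matching a report entry against the board description, the sketch below checks that a reported range lies inside a device's range. The MMIORange type and range_within helper are hypothetical and are not part of any auditing tool; the numbers are the ones used in this chapter, where 0x98000000 is 2550136832 in decimal and 0x204 is 516.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation of an MMIO range. */
struct MMIORange
{
    uint32_t start;
    uint32_t length;
};

/* Returns true if entry lies entirely within device. */
static bool range_within(struct MMIORange entry, struct MMIORange device)
{
    /* Use 64-bit arithmetic so that start + length cannot wrap. */
    uint64_t entryEnd  = (uint64_t)entry.start + entry.length;
    uint64_t deviceEnd = (uint64_t)device.start + device.length;
    return (entry.start >= device.start) && (entryEnd <= deviceEnd);
}
```

A policy check could use a comparison like this to decide whether an import in the report refers to a particular device, regardless of whether the values were written in hexadecimal or decimal.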
There is no requirement to expose a device as a single MMIO region. You may wish to define multiple regions, which can be as small as a single byte, so that you can privilege-separate your device driver.
Some devices have a very large control structure. For example, the platform-level interrupt controller's register space is many KiB. We don't define a C structure that covers every single field for this and instead just use uint32_t as the type for MMIO_CAPABILITY, which lets us treat the space as an array of 32-bit control registers.
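Treating a region as an array of 32-bit registers usually means converting a documented byte offset into an array index. The helpers below are a hypothetical sketch of that conversion, assuming that every offset is a multiple of four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads the 32-bit register at byteOffset from the start of the region. */
static uint32_t register_read(volatile const uint32_t *base,
                              size_t                   byteOffset)
{
    return base[byteOffset / sizeof(uint32_t)];
}

/* Writes the 32-bit register at byteOffset from the start of the region. */
static void register_write(volatile uint32_t *base,
                           size_t             byteOffset,
                           uint32_t           value)
{
    base[byteOffset / sizeof(uint32_t)] = value;
}
```

Keeping the offsets in bytes lets the driver use the same numbers as the hardware documentation.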
9.5. Handling interrupts
Interrupts are asynchronous notifications from devices. On most modern systems, including CHERIoT, external interrupts are multiplexed by an interrupt controller. The RISC-V platform-level interrupt controller (PLIC) handles all interrupts coming from devices and forwards them to the core. When the core is running with interrupts disabled (or, more accurately, deferred), interrupts are still received by the PLIC and recorded. Similarly, if two interrupts fire at the same time, the PLIC ensures that they are not lost.
When the PLIC delivers an interrupt to the core, it will trigger the switcher to save the current thread's state. The switcher will then invoke the scheduler, which will query the PLIC to see which interrupts have fired and wake any threads that were waiting for them.
CHERIoT has a unified event model, where futexes are the only event source that can block. This means that the same waiting mechanism is used for both hardware- and software-generated events. In both cases, you will wait for a futex (see Section 6.5. Waiting for events with futexes) and then run code when the scheduler wakes you.
To be able to handle interrupts, you must have a software capability (see Section 5.13. Building software capabilities with sealing) that authorises access to the interrupt. This capability allows you to get a pointer to the futex word associated with the interrupt. Futexes are building blocks for a variety of different synchronisation primitives. For interrupts, the futex word contains a counter that is incremented each time the interrupt fires. Section 9.6. Waiting for an interrupt discusses how to wait on this futex.
Before you can wait for interrupts using a futex, you must get the pointer to the futex word. This will be a read-only capability to a 32-bit value. For the Ethernet device that we've been using as an example, you would request the associated interrupt futex with this macro invocation:
DECLARE_AND_DEFINE_INTERRUPT_CAPABILITY(ethernetInterruptCapability, Ethernet, true, true);
If you wish to share this between multiple compilation units, you can use the separate DECLARE_ and DEFINE_ forms (see interrupt.h) but the combined form is normally most convenient. This macro takes four arguments:
- The name that we're going to use to refer to this capability. The name ethernetInterruptCapability is arbitrary, you can use whatever makes sense to you.
- The name of the interrupt, from the board description file (Ethernet, in this case).
- Whether this capability authorises waiting for this interrupt (this will almost always be true).
- Whether this capability authorises acknowledging the interrupt so that it can fire again. This will almost always be true in device drivers but should generally be true for only one compartment (for each interrupt), whereas multiple compartments may wish to observe interrupts for monitoring.
As with the MMIO capabilities, sealed objects appear in compartment reports. For example, the above macro produces this entry in the final report:
{
    "contents": "10000101",
    "kind": "SealedObject",
    "sealing_type": {
        "compartment": "sched",
        "key": "InterruptKey",
        "provided_by": "build/cheriot/cheriot/release/example-firmware.scheduler.compartment",
        "symbol": "__export.sealing_type.sched.InterruptKey"
    }
},
The sealing type tells you that this is an interrupt capability (it's sealed with the InterruptKey type, provided by the scheduler). The contents field lets you audit what this authorises. The first two bytes are a 16-bit integer (little-endian on all currently supported targets) containing the interrupt number, so the leading hex digits 1000 (bytes 0x10 and 0x00) mean 16, our Ethernet interrupt number. The next two bytes are boolean values reflecting the last two arguments to the macro, so this authorises both waiting for and acknowledging the interrupt. Again, this can form part of your firmware auditing.
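The decoding described above can be sketched as a small helper for an auditing script. It assumes the layout stated in the text (a little-endian 16-bit interrupt number followed by two boolean bytes); the structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of an interrupt capability's contents. */
struct InterruptCapabilityContents
{
    uint16_t interruptNumber;
    bool     mayWait;
    bool     mayComplete;
};

/* Returns the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes an 8-digit hex string such as "10000101". */
static bool decode_interrupt_contents(const char                         *hex,
                                      struct InterruptCapabilityContents *out)
{
    uint8_t bytes[4];
    if (strlen(hex) != 8)
    {
        return false;
    }
    for (int i = 0; i < 4; i++)
    {
        int high = hex_digit_value(hex[2 * i]);
        int low  = hex_digit_value(hex[2 * i + 1]);
        if (high < 0 || low < 0)
        {
            return false;
        }
        bytes[i] = (uint8_t)((high << 4) | low);
    }
    /* Little-endian: the first byte is the least significant. */
    out->interruptNumber = (uint16_t)(bytes[0] | (bytes[1] << 8));
    out->mayWait         = bytes[2] != 0;
    out->mayComplete     = bytes[3] != 0;
    return true;
}
```

For the entry shown above, this yields interrupt number 16 with both waiting and acknowledging authorised.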
9.6. Waiting for an interrupt
Now that you're authorised to handle interrupts, you will need something that you can wait on. Most real-time operating systems allow you to register interrupt-service routines (ISRs) directly. CHERIoT RTOS does not allow this because ISRs run with access to the state of the interrupted thread. On Arm M-profile, some registers are banked but the others are visible; on RISC-V, all registers of the interrupted thread are visible. This means that an ISR runs with access to the thread and compartment that are interrupted. Not only would this potentially break compartment isolation, it would be difficult to use safely because the ISR would inherit an (untrusted) stack from the interrupted thread and have access to the interrupted compartment's globals instead of its own.
Instead, CHERIoT RTOS maps interrupts onto events that threads can wait on. A single thread with the highest priority that blocks waiting on an interrupt will be run as soon as the switcher and scheduler finish handling the interrupt. The switcher will spill the interrupted thread's state, the scheduler will wake the sleeping thread and note that it is now the highest-priority runnable thread, and then the switcher will resume that thread. This sequence takes around 1,000 cycles on Ibex, giving an interrupt latency of 50 µs at 20 MHz or 10 µs at 100 MHz.
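The latency figures above follow directly from dividing the cycle count by the clock frequency. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
#include <stdint.h>

/* Converts a cycle count to whole microseconds at the given clock rate. */
static uint64_t cycles_to_microseconds(uint64_t cycles, uint64_t frequencyHz)
{
    return (cycles * 1000000u) / frequencyHz;
}
```

With 1,000 cycles this gives 50 at 20 MHz and 10 at 100 MHz, so you can estimate the latency for another clock rate by substituting its frequency.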
A future version of the CHERIoT architecture is expected to include extensions to the interrupt controller to allow direct context switch to a high-priority thread.
Each interrupt is mapped to a futex word, which can be used with scheduler waiting primitives. Futexes are discussed in detail in Section 6.5. Waiting for events with futexes but, for the purpose of interrupt handling you can think of them as counters with a compare-and-wait operation. You can get the word associated with an interrupt by passing the authorising capability to the interrupt_futex_get function exported by the scheduler:
const uint32_t *ethernetFutex =
    interrupt_futex_get(STATIC_SEALED_VALUE(ethernetInterruptCapability));
The ethernetFutex pointer is now a read-only capability (attempting to store through it will trap) that contains a number that is incremented every time the ethernet interrupt fires. You can now query whether any interrupts have fired since you last checked by comparing it against a previous value and you can wait for an interrupt with futex_wait, for example:
// Declared outside the loop body so that it is in scope in the condition.
uint32_t last;
do
{
    last = *ethernetFutex;
    // Handle interrupt here
} while (futex_wait(ethernetFutex, last) == 0);
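Comparing the futex word against a previously read value can also tell you how many times the counter has advanced. The helper below is a hypothetical sketch; it assumes that the counter wraps as an ordinary unsigned 32-bit value, in which case unsigned subtraction gives the right answer even across the wrap.

```c
#include <stdint.h>

/*
 * Returns how far the counter has advanced since 'last' was read.
 * Unsigned arithmetic is modulo 2^32, so this is correct across a wrap.
 */
static uint32_t interrupts_since(uint32_t last, uint32_t current)
{
    return current - last;
}
```

A result of zero means that nothing has fired since the previous read, which is exactly the case in which waiting on the futex makes sense.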
If you want to wait for multiple event sources, you can use the multiwaiter (see Section 6.11. Waiting for multiple events) API. This allows sleeping on multiple kinds of event source so you can, for example, have a single thread that blocks waiting for a message to send from another thread or a message to receive from the device.
9.7. Acknowledging interrupts
If you copy the last example into a real device driver then you might be surprised that the loop runs twice and then stops. It will run once on start, once when the first interrupt is delivered, and then never again. This is because external interrupts are not delivered on a particular channel unless the preceding one has been acknowledged. A more complete version of the loop above looks like this:
// Declared outside the loop body so that it is in scope in the condition.
uint32_t last;
do
{
    last = *ethernetFutex;
    // Handle interrupt here
    interrupt_complete(STATIC_SEALED_VALUE(ethernetInterruptCapability));
} while ((last != *ethernetFutex) || futex_wait(ethernetFutex, last) == 0);
This includes two changes. The first is the call to interrupt_complete once the interrupt has been handled. This tells the scheduler to mark the interrupt as completed in the interrupt controller. It is possible that the interrupt will then fire again immediately, in which case there's no point trying to sleep. The second change checks whether the value of the futex word has changed: if it has, then we skip the futex_wait call and handle the next interrupt immediately.
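To see how the loop condition behaves, the following host-side simulation replaces the scheduler calls with stand-ins. Everything prefixed with sim_ or simulated is an assumption made for illustration and is not the RTOS API: completing an interrupt immediately delivers the next pending one, and the stand-in wait returns a non-zero value when nothing would ever wake it so that the simulation terminates.

```c
#include <stdint.h>

static uint32_t simulatedFutexWord; /* stands in for the interrupt futex */
static int      pendingInterrupts;  /* interrupts the device will still raise */

/* Stand-in for interrupt_complete: re-arms and delivers any pending one. */
static void sim_interrupt_complete(void)
{
    if (pendingInterrupts > 0)
    {
        pendingInterrupts--;
        simulatedFutexWord++;
    }
}

/* Stand-in for futex_wait: 0 if the word changed, -1 if it never will. */
static int sim_futex_wait(const uint32_t *word, uint32_t expected)
{
    return (*word != expected) ? 0 : -1;
}

/* Runs the driver loop and returns how many times the body executed. */
static int run_driver_loop(int interruptsToRaise)
{
    int      handled = 0;
    uint32_t last;
    pendingInterrupts = interruptsToRaise;
    do
    {
        last = simulatedFutexWord;
        handled++; /* 'Handle interrupt here' */
        sim_interrupt_complete();
    } while ((last != simulatedFutexWord) ||
             sim_futex_wait(&simulatedFutexWord, last) == 0);
    return handled;
}
```

In this simulation the body runs once on entry and once more for each interrupt, and the first half of the condition lets it move straight to the next interrupt without calling the wait function.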
9.8. Exposing device interfaces
CHERIoT device drivers often have two levels of abstraction. The lower level provides an abstraction across different devices that offer similar functionality. The higher level provides a security model atop this.
In most cases, the lower-level abstractions are provided as header-only libraries that can be included in whichever compartments need them. This allows drivers to be incorporated into another compartment that has full access to the device. For example, the scheduler is the only component that has direct access to the interrupt controller, and the memory allocator is the only component that has full access to the revoker. In both cases, separating the driver into a compartment would not provide any security benefit because the component that uses the device is allowed to do anything that it wants to the device and does not need to be protected from the device.
If a device has multiple consumers then it may need a compartment to handle multiplexing. For example, our debug APIs use the UART directly, but safe use of the UART would involve locking to avoid interleaved messages. Implementing this model would involve using the header-only UART driver from a compartment and writing a simple interface for reading and writing (possibly with an authorising capability).
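The locking idea can be sketched on a host with a C11 atomic flag protecting a buffer that stands in for the UART. All names here are hypothetical. A real CHERIoT compartment would more likely use one of the RTOS's futex-based locks so that waiting threads sleep rather than spin.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

static atomic_flag uartLock = ATOMIC_FLAG_INIT;

/* Assumption: a zero-initialised buffer stands in for the UART output. */
static char   simulatedUartOutput[128];
static size_t simulatedUartLength;

/* Stand-in for the low-level driver's single-character write. */
static void uart_put_char(char c)
{
    /* Leave the final byte as a terminator. */
    if (simulatedUartLength < sizeof(simulatedUartOutput) - 1)
    {
        simulatedUartOutput[simulatedUartLength++] = c;
    }
}

/* Writes a whole message while holding the lock, so that messages from
 * different callers cannot be interleaved. */
static void uart_write_message(const char *message)
{
    size_t length = strlen(message);
    while (atomic_flag_test_and_set(&uartLock))
    {
        /* Spin until the current writer has finished. */
    }
    for (size_t i = 0; i < length; i++)
    {
        uart_put_char(message[i]);
    }
    atomic_flag_clear(&uartLock);
}
```

Exposing only uart_write_message from the compartment means that callers can never emit a partial message.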
9.9. Using layered platform includes
Each board description contains a set of include paths. For example, our Ibex simulator has this:
"driver_includes" : [ "../include/platform/ibex", "../include/platform/generic-riscv" ],
These are added in this order, which makes it possible for code in the more specialised directories to #include_next versions of the files in the more generic directories or to add files that are found in preference to the generic versions.
For example, the UART device in the generic-riscv directory defines a basic 16550 interface. This is templated with the size of the register because the original 16550 used 8-bit registers, whereas newer versions typically use the low 8 bits of a 32-bit register. This implementation is sufficient for simulated environments but real UARTs with higher-speed cores often require more control over their frequency to get the right baud rate. We can support the Synopsys extended 16550 by creating a platform/synopsis directory containing a platform-uart.hh that uses #include_next <platform-uart.hh> to get the generic version. This can be inserted in the include path before platform/generic-riscv. A specific configuration can use this by not providing anything at a higher level, replace it entirely by providing a custom platform-uart.hh, or provide a modified version of it by using #include_next.
9.10. Conditionally compiling driver code
The DEVICE_EXISTS macro can be used with #if to conditionally compile code depending on whether the current board provides a definition of the device. This is keyed on the existence of an MMIO region in the board description file with the specified name. For example, the ethernet device that we've been using as an example could be protected with:
#if DEVICE_EXISTS(ethernet) // Driver for the ethernet device here. #endif
This highlights why "ethernet" is not a great name for the device: ideally the name should be specific to the hardware interface, not the high-level functionality, so that you can conditionally compile specific drivers. We have used a generic name in this tutorial to avoid introducing device-specific complications.
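A common use of this kind of check is to provide a fallback when a device is absent. The sketch below shows that pattern on a host. The stand-in definition of DEVICE_EXISTS is an assumption that exists only so that the example compiles outside the RTOS build; it is not how the RTOS defines the macro. The hypothetical_radio device and both function names are also invented for illustration.

```c
#include <stdbool.h>

/* Stand-in used only when building outside the RTOS build system. */
#ifndef DEVICE_EXISTS
#define DEVICE_EXISTS_ethernet 1
#define DEVICE_EXISTS(name) DEVICE_EXISTS_##name
#endif

#if DEVICE_EXISTS(ethernet)
/* Driver-backed implementation would go here. */
static bool network_available(void)
{
    return true;
}
#else
/* Fallback for boards without the device. */
static bool network_available(void)
{
    return false;
}
#endif

#if DEVICE_EXISTS(hypothetical_radio)
static bool radio_available(void)
{
    return true;
}
#else
static bool radio_available(void)
{
    return false;
}
#endif
```

Code that calls network_available or radio_available then compiles unchanged on every board, and only the implementation behind each function differs.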