CHERIoT RTOS provides threads as a core abstraction. Threads run until they either yield or are preempted via an interrupt, and then later resume from the same point. There are a small number of scheduler APIs that allow threads to block; higher-level APIs from other compartments may block by indirectly invoking them.

Remember that, in most respects, the scheduler is just another compartment. It doesn’t run with elevated privileges; it decides which thread to run next, but it cannot see the stacks or register states associated with threads.

Defining threads

Threads in CHERIoT RTOS cannot be dynamically created. Creating threads at run time would require allocating stacks at run time. The no-capture guarantees that CHERIoT RTOS enforces depend on no code (after the loader) having access to the memory of two stacks at the same time; this is what allows the switcher to zero stacks and avoid leaks. The only store-local capabilities that a thread ever has access to are derived from its current stack. Allowing stack creation would violate that invariant: at least the memory allocator would have access to multiple stacks at once. It would be possible to allocate stacks from a separate pool, but that’s not really different from allocating stacks up front and having a small compartment that switches them from one use to another or implements a thread pool. There is an example thread pool in lib/thread_pool and thread_pool.h that you can either use directly or use as inspiration for your own design, if you want to create new thread-like contexts.

Threads in CHERIoT RTOS are constructed with four properties:

  • The size of their stack.

  • The size of their trusted stack.

  • Their priority.

  • The entry point where they start executing.

The stack size means the same as on any other platform. Specifically, on CHERIoT, the stack-pointer capability will be bounded to this size (rounded up if necessary for alignment) and any overflow of the stack, even by a single byte, will trap. The trusted stack size is the maximum number of nested cross-compartment calls that this thread can make. Each cross-compartment call invokes the switcher, which pushes a new frame onto the trusted stack describing where to return.

Note
In the current version, each trusted stack frame is three capabilities (24 bytes). A larger trusted stack does not make much difference to total memory consumption.

The priority of threads matters only in relative terms. Like FreeRTOS (and unlike UNIX), higher numbers mean higher priorities. The scheduler has some data structures whose size depends on the number of priorities, so compiling with fewer priorities can make the scheduler smaller.

The entry point is a compartment entry point. It must be exposed as described in [compartments]. Thread entry points take no arguments and return no value.

On most other systems, thread creation functions take a pointer that is passed to the new thread's entry point. This does not make sense for threads that are not dynamically created, because there is no creation context to pass.
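
For example, a minimal entry point might look like the following sketch. The compartment name "example" and the function name are hypothetical; the attribute macro comes from compartment.h (see [compartments]), and the build-time thread definition would name this function as the thread's entry point.

#include <compartment.h>

// Thread entry point: takes no arguments and returns no value.
// Execution of the thread starts here and finishes if this returns.
void __cheri_compartment("example") example_entry(void)
{
    // The thread runs in the "example" compartment until it makes a
    // cross-compartment call.
}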

Identifying the current thread

You will sometimes need to know which thread is currently running. This can be for something as simple as debugging, but may also be needed for maintaining per-thread data structures. You can call the scheduler's thread_id_get function to get the ID of the current thread.

uint16_t thread_id_get(void)

Return the thread ID of the currently running thread.

This is mostly useful where one compartment can run under different threads and it matters which thread entered this compartment.

It is quite expensive to do a cross-compartment call just to read a single 16-bit value. To avoid this repeated overhead, you can instead call thread_id_get_pointer. This returns a (read-only) pointer to the global variable in the scheduler that contains the thread ID of the currently scheduled thread. When a new thread is scheduled, this variable is updated to contain its thread ID, so reading it will always give the current thread. From the perspective of any given thread, the value read here never changes.

uint16_t * thread_id_get_pointer(void)

Returns a cacheable (read-only) pointer to a global owned by the scheduler that contains the current thread ID.

Reading this pointer will return the thread ID of the currently running thread.

If you’re writing C++, there is a helper available, thread_id_get_fast, that caches this pointer automatically and simply dereferences it on calls after the first one. There is very little reason to prefer thread_id_get over thread_id_get_fast if you are writing C++ rather than C.

uint16_t thread_id_get_fast()

Helper function that returns the current thread ID.

This will do a cross-compartment call the first time it is called but not on any subsequent calls.
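
For example, a compartment might keep a small per-thread data structure indexed by thread ID. This is a sketch: the array bound is a hypothetical build-specific constant, and it assumes that thread IDs are small integers starting at one and that these functions are declared in thread.h.

#include <stdint.h>
#include <thread.h>

// Hypothetical number of threads in this firmware image.
constexpr uint16_t MaxThreads = 4;

// One counter per thread, indexed by thread ID (assumed to start at 1).
int callsPerThread[MaxThreads];

void note_call()
{
    // Cross-compartment call the first time; a cached pointer
    // dereference on every subsequent call.
    uint16_t threadId = thread_id_get_fast();
    callsPerThread[threadId - 1]++;
}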

Using the Timeout structure

Several RTOS APIs have timeouts. These are expressed as a pointer to a Timeout structure. This design is intended to allow a single timeout to be passed down a chain of operations.

Note
Timeouts represent time spent blocking (yielding waiting to be runnable), not time spent running (doing useful work).

Timeouts measure time in scheduler Ticks. A tick is a single scheduling quantum, which depends on the board configuration. This is the minimum amount of time for which it is plausible for a thread to sleep: if a thread sleeps, another thread becomes runnable and is then allowed to run (unless it also yields) for one tick.

At the end of each tick, the scheduler receives the timer interrupt and chooses the next thread to run. Threads may only run at all if no higher-priority thread is runnable. Threads at the same priority level are round-robin scheduled.

The timeout structure captures the amount of time that a call is allowed to block and the number of ticks for which it has blocked so far. Each subsequent call that is passed the same timeout structure may increase the elapsed time and decrease the remaining time.

Note
A thread may block for more than the permitted limit if it is sleeping while a higher-priority thread runs. Only the highest-priority thread can make strong realtime guarantees in the presence of other runnable threads.

Functions that take a timeout should always expect it as the first argument. This allows it to be forwarded to subsequent calls trivially.
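
For example (a sketch: first_wait and second_wait are hypothetical blocking operations, and the C++ Timeout constructor taking a number of remaining ticks is assumed from timeout.h):

#include <timeout.h>

// Hypothetical blocking operations, following the convention of taking
// the timeout as their first argument.
int first_wait(Timeout *timeout);
int second_wait(Timeout *timeout);

int acquire_both()
{
    // Allow up to ten ticks of blocking, shared across both calls.
    Timeout t{10};
    int ret = first_wait(&t);
    if (ret != 0)
    {
        return ret;
    }
    // Any ticks consumed by first_wait reduce the budget available to
    // second_wait; t.elapsed accumulates across both calls.
    return second_wait(&t);
}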

Sleeping

Sleeping for a bounded number of ticks is the simplest form of blocking available. The thread_sleep call causes the caller to yield until the specified number of ticks have elapsed.

int thread_sleep(struct Timeout * timeout)

Sleep for at most the specified timeout (see timeout.h).

The thread becomes runnable once the timeout has expired but a higher-priority thread may prevent it from actually being scheduled. The return value is a saturating count of the number of ticks that have elapsed.

A call of thread_sleep with a timeout of zero is equivalent to yield, but reports the time spent sleeping. This requires a cross-domain call and return in addition to the overheads of yield, so yield should be preferred in contexts where the elapsed time is not required.

As with other calls that take a Timeout, the number of ticks that have elapsed during the call can be checked by reading the elapsed field of the timeout structure.
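
For example (a sketch, again assuming the C++ Timeout constructor from timeout.h):

#include <thread.h>
#include <timeout.h>

void pause_for_five_ticks()
{
    Timeout t{5};
    thread_sleep(&t);
    // t.elapsed holds the number of ticks spent blocked.  It may exceed
    // five if a higher-priority thread delayed rescheduling.
}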

Building locks with futexes

The scheduler exposes a set of futex APIs as a building block for various notification and locking mechanisms. Futex is a contraction of 'fast userspace mutex'. The name does not quite fit a CHERIoT system, where there is no userspace, but the core concept of avoiding a privilege transition on fast paths still applies.

A CHERIoT RTOS futex is a 32-bit word where the scheduler provides compare-and-sleep (futex_timed_wait) and notify (futex_wake) operations.

int futex_timed_wait(Timeout * ticks,
                     const uint32_t * address,
                     uint32_t expected,
                     enum FutexWaitFlags flags)

Compare the value at address to expected and, if they match, sleep the thread until a wake event is sent with futex_wake or until the thread has slept for the number of ticks given by ticks.

The value of ticks specifies the permitted timeout. See timeout.h for details.

The address argument must permit loading four bytes of data after the address.

The flags argument contains flags that may control the behaviour of the call.

This returns:

  • 0 on success.

  • -EINVAL if the arguments are invalid.

  • -ETIMEDOUT if the timeout expires.

int futex_wake(uint32_t * address,
               uint32_t count)

Wakes up to count threads that are sleeping with futex[_timed]_wait on address.

The address argument must permit storing four bytes of data after the address. This call does not store to the address but requiring store permission prevents a thread from waking up a futex that it cannot possibly have moved to a different state.

The return value for a successful call is the number of threads that were woken. -EINVAL is returned for invalid arguments.

A futex allows you to use atomic operations on a 32-bit word for fast paths but then sleep and wake threads when they are blocked, rather than spinning. Anything that can be implemented with a spin-wait loop can usually be made more efficient with a futex.

For example, consider the simplest possible spinlock, which uses a single word containing a one to indicate locked and a zero to indicate unlocked. When you encounter a one, you sit in a loop doing an atomic compare-and-swap trying to replace a zero with a one. When this succeeds, you’ve acquired the lock.

On most operating systems, on a single-core system you will sit in this loop until you exhaust your quantum; then a timer fires and another thread runs. Your thread may be scheduled again before the thread that owns the lock finishes, so you’ll then spin for another quantum.

The first simple optimisation on this design is to yield in the spin loop. This allows other threads to run, but the waiting thread remains runnable and so may be rescheduled early. With an RTOS priority scheduler, if the waiting thread has a higher priority than the thread that owns the lock, the thread that owns the lock may never be scheduled.

A futex lets the waiting thread sleep. The futex_timed_wait call will compare the value in the futex word to the expected value (one, indicating locked, in this case) and, if they match, will send the thread to sleep. The thread remains asleep until the thread owning the lock calls futex_wake when unlocking.

A more complete futex-based lock uses three values in the lock word to differentiate between the unlocked state and the locked states with and without waiters. This allows the uncontended case to avoid any cross-compartment calls, as the sketch below shows.
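
The following sketch shows the shape of such a lock. It is illustrative rather than the locks.hh implementation: it uses zero for unlocked, one for locked, and two for locked with waiters, compiler __atomic builtins on the lock word, and the untimed futex_wait form of futex_timed_wait.

#include <futex.h>

void lock(uint32_t *word)
{
    uint32_t expected = 0;
    // Fast path: uncontended acquisition is a single atomic operation,
    // with no cross-compartment call.
    if (__atomic_compare_exchange_n(
          word, &expected, 1, false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    {
        return;
    }
    // Slow path: mark the lock as contended, then sleep.  The wait
    // returns immediately if the word no longer contains two.
    while (__atomic_exchange_n(word, 2, __ATOMIC_ACQUIRE) != 0)
    {
        futex_wait(word, 2);
    }
}

void unlock(uint32_t *word)
{
    // A cross-compartment call to wake waiters is needed only if the
    // word was in the locked-with-waiters state.
    if (__atomic_exchange_n(word, 0, __ATOMIC_RELEASE) == 2)
    {
        futex_wake(word, 1);
    }
}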

The locks.hh header contains some C++ wrappers around the futex primitives that provide several different kinds of lock.

  • TicketLock provides guaranteed FIFO semantics. Each thread waits until the thread in front of it in the queue has released the lock.

  • FlagLock is a simple lock that wakes waiters in the order of their thread priorities.

  • FlagLockPriorityInherited is a variant of FlagLock that supports priority inheritance (see Inheriting priorities).

These implement a uniform interface and so it’s easy to switch between them. This header also defines a NoLock class that provides the same interface but does not do any locking so generic data structures can be implemented with and without locking.

Futexes can be used to build other waiting mechanisms beyond locks. For example, a ring buffer with producer and consumer counters can have the sender wait while the ring is full by using a futex wait on the consumer counter and the receiver can do likewise with the producer counter. This allows a ring buffer design that is mostly lockless, yet allows the producer to sleep when the ring is full or the consumer to sleep when it is empty.
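
A sketch of the consumer side of such a design follows. The counters are hypothetical: the producer increments producerCounter after writing and wakes it, and the consumer does the converse with consumerCounter.

#include <futex.h>

uint32_t producerCounter;
uint32_t consumerCounter;

void wait_for_data()
{
    uint32_t produced =
      __atomic_load_n(&producerCounter, __ATOMIC_ACQUIRE);
    // The ring is empty while the two counters are equal: sleep until
    // the producer advances its counter and calls futex_wake on it.
    while (produced == consumerCounter)
    {
        futex_wait(&producerCounter, produced);
        produced = __atomic_load_n(&producerCounter, __ATOMIC_ACQUIRE);
    }
    // ...read the message, advance consumerCounter, and wake a producer
    // that may be waiting for space with futex_wake(&consumerCounter, 1).
}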

Inheriting priorities

Simple futex-based locks are vulnerable to priority inversion. Consider a case with three threads. The first is a low-priority thread that acquires a lock. The second is a medium-priority thread that preempts the first. The third is a high-priority thread that waits for the lock.

The high-priority thread in this example cannot make progress until the low-priority thread releases the lock. The low-priority thread cannot make progress until the medium-priority thread blocks. This means that the medium-priority thread is preventing the high-priority thread from making progress, which is the opposite of the desired situation.

Priority inheritance is the solution to this kind of problem. The blocking high-priority thread loans its priority to the low-priority thread, allowing it to (temporarily) be scheduled in preference to the medium-priority thread.

The futex APIs implement this by storing the thread ID of the owning thread in the bottom 16 bits of the futex word and passing FutexPriorityInheritance to the flags argument in the wait call. The specified thread will have its priority set to the highest priority of any of the waiting threads. The priority boost lasts until the waiters time out or the boosted thread releases the lock, whichever happens first.

A single thread can hold multiple priority-inheriting locks and receive priority boosts from all of them.
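
A sketch of the acquire path for such a lock is shown below. It is simplified: a real implementation would also track waiters so that unlocking does not always require a futex_wake call, and the fifty-tick budget is an arbitrary choice.

#include <futex.h>
#include <thread.h>
#include <timeout.h>

void lock_with_inheritance(uint32_t *word)
{
    uint32_t self = thread_id_get();
    for (;;)
    {
        uint32_t expected = 0;
        // On acquisition, store our thread ID in the (otherwise zero)
        // lock word so that the scheduler can identify the owner.
        if (__atomic_compare_exchange_n(
              word, &expected, self, false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        {
            return;
        }
        // The scheduler reads the owner's thread ID from the bottom 16
        // bits of the word and boosts that thread's priority to ours.
        Timeout t{50}; // Arbitrary budget; the wait may time out.
        futex_timed_wait(&t, word, expected, FutexPriorityInheritance);
    }
}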

Securing futexes

Most of the time, you will want to use futexes to synchronise operations within a single compartment. Futex-based locks rely on the contents of the lock word being valid. For example, if a FlagLock is directly accessible by two mutually distrusting compartments, one can write an invalid value to the word and either prevent the other from waking waiters or cause it to spuriously believe that it has acquired the lock.

This is not normally a limitation because locks typically protect some data structure or other resource that should not be concurrently mutated by multiple threads. Providing mutable views of such a structure to multiple compartments is almost certainly a security vulnerability, even without attacks on the futex.

There is one situation where futexes are safe to share across compartment boundaries. If you have a component that others trust for availability, it can share read-only views of a futex to allow waiting for an out-of-band event. The scheduler does this for interrupts (see [drivers]), allowing threads to use the futex wait operation to block until an interrupt is ready.

Using event groups

The scheduler provides an event group API that is primarily intended for porting code written against FreeRTOS’s event groups APIs. An event group is a set of up to 24 event flags that can be set or cleared independently. Waiters can wait for any or all of an arbitrary subset of these flags.

Caution
A future version of the scheduler may remove event groups and replace them with bitset operations on futexes. At that point, the APIs will likely change. In particular, event_create will no longer be required.

Event groups are created with the event_create function. This returns an opaque handle to the event group, which can be used for setting, clearing, or waiting on events.

int event_create(Timeout * timeout,
                 struct SObjStruct * heapCapability,
                 void ** ret)

Create a new event group.

Note that, because this allocates memory, it requires an allocation capability. See [shared_heap] for more information about what this means.

You can then use event_bits_set and event_bits_clear to set and clear some or all of the event flags in this group. Both of these calls return the old values of the bits.

int event_bits_set(void * evt,
                   uint32_t * retBits,
                   uint32_t bitsToSet)

Set the bits in an event group.

int event_bits_clear(void * evt,
                     uint32_t * retBits,
                     uint32_t bitsToClear)

Manually clear the bits in an event group.

You can subsequently wait for some of the events to be set with the event_bits_wait function. This takes a set of events to wait for and can wait until either any or all of them are set.

int event_bits_wait(Timeout * timeout,
                    void * evt,
                    uint32_t * retBits,
                    uint32_t bitsToWait,
                    bool clearOnExit,
                    bool waitAll)

Wait on this event group for a particular set of bits.

This call can also atomically clear the bits that you’ve waited on, giving them edge-triggered behaviour.
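
Putting these together, a sketch of typical usage follows. Error handling is elided, MALLOC_CAPABILITY is the default allocation capability, and the header name event.h is an assumption.

#include <event.h>

void *group;

void setup()
{
    Timeout t{100};
    event_create(&t, MALLOC_CAPABILITY, &group);
}

void producer()
{
    uint32_t oldBits;
    // Set event flag three; oldBits receives the previous flag values.
    event_bits_set(group, &oldBits, 1 << 3);
}

void consumer()
{
    Timeout t{100};
    uint32_t bits;
    // Wait for flag three (waitAll is irrelevant for a single flag) and
    // clear it on exit, giving edge-triggered behaviour.
    event_bits_wait(&t, group, &bits, 1 << 3, true, false);
}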

Sending messages

Caution
The message queue APIs will probably be moved out of the scheduler at some point. At that point, the exact APIs are likely to change.

The scheduler provides a message queue API. A message queue is a FIFO capable of storing a fixed number of fixed-sized entries. Queues are created with queue_create, which allocates the buffer and returns an opaque handle.

int queue_create(Timeout * timeout,
                 struct SObjStruct * heapCapability,
                 void ** outQue,
                 size_t itemSize,
                 size_t maxNItems)

Create a new message queue.

Messages are then sent with queue_send and received with queue_recv. These are blocking calls (if allowed to block by a non-zero timeout) that send or receive a single message.

int queue_send(Timeout * timeout,
               void * que,
               const void * src)

Send a message to the queue, blocking for at most the number of ticks permitted by the timeout.

The message size does not need to be provided, because it is fixed when the queue is created.

int queue_recv(Timeout * timeout,
               void * que,
               void * dst)

Same as queue_send, just in the other direction.
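
For example (a sketch: error handling is elided, MALLOC_CAPABILITY is again the default allocation capability, and the header name queue.h is an assumption):

#include <queue.h>

void *queue;

void setup()
{
    Timeout t{100};
    // A queue of up to sixteen messages, each eight bytes.
    queue_create(&t, MALLOC_CAPABILITY, &queue, sizeof(uint64_t), 16);
}

void producer()
{
    uint64_t message = 42;
    Timeout t{10}; // Block for at most ten ticks if the queue is full.
    queue_send(&t, queue, &message);
}

void consumer()
{
    uint64_t message;
    Timeout t{10}; // Block for at most ten ticks if the queue is empty.
    queue_recv(&t, queue, &message);
}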

Waiting for multiple events

The multiwaiter API allows waiting for any of a set of independent events. It is conceptually similar to select, poll, epoll, and kqueue in *NIX operating systems or WaitForMultipleObjects in Windows. It is designed to bound the amount of time that the scheduler must spend checking multiwaiters and to minimise the amount of memory that multiwaiters consume. Memory is allocated only when a multiwaiter is created, with multiwaiter_create. This creates a multiwaiter with space for a fixed number of events.

int multiwaiter_create(Timeout * timeout,
                       struct SObjStruct * heapCapability,
                       struct MultiWaiter ** ret,
                       size_t maxItems)

Create a multiwaiter object.

This is a stateful object that can wait on at most maxItems event sources.

Each multiwaiter_wait call is a one-shot operation. The call is passed a set of things to wait for, along with the associated conditions, via the events array, and returns the resulting status via the same array. This is typically an on-stack array.

int multiwaiter_wait(Timeout * timeout,
                     struct MultiWaiter * waiter,
                     struct EventWaiterSource * events,
                     size_t newEventsCount)

Wait for events.

The first argument is the multiwaiter to wait on. New events can optionally be added by providing an array of newEventsCount elements as the events argument.

Return values:

  • On success, this function returns 0.

  • If the arguments are invalid, this function returns -EINVAL.

  • If the timeout is reached without any events being triggered, then this returns -ETIMEDOUT.
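
The usual shape of a caller is sketched below. Filling in each EventWaiterSource is elided because its contents depend on the kind of event being awaited; the header name multiwaiter.h is an assumption.

#include <multiwaiter.h>

struct MultiWaiter *mw;

void setup()
{
    Timeout t{100};
    // Allocate a multiwaiter that can wait on up to two event sources.
    multiwaiter_create(&t, MALLOC_CAPABILITY, &mw, 2);
}

void wait_for_either()
{
    struct EventWaiterSource events[2];
    // ...fill in events[0] and events[1] with the sources to watch...
    Timeout t{100};
    if (multiwaiter_wait(&t, mw, events, 2) == 0)
    {
        // On success, each element of events reports whether its
        // source fired.
    }
}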
