5. Communicating between threads

CHERIoT RTOS provides threads as a core abstraction. Threads run until they either yield or are preempted via an interrupt, and then later resume from the same point. There are a small number of scheduler APIs that allow threads to block; higher-level APIs from other compartments may block by indirectly invoking them.

Remember that, in most respects, the scheduler is just another compartment. It doesn’t run with elevated privileges, it makes a decision about which thread to run next but it is not able to see the stacks or register states associated with threads.

5.1. Defining threads

Threads in CHERIoT RTOS cannot be dynamically created. Creating threads at run time would require allocating stacks at run time. The no-capture guarantees that CHERIoT RTOS enforces are based on the guarantee that no code (after the loader) has access to two stacks' memory at a time and so the switcher can zero stacks and avoid leaks. The only store-local capabilities that a thread ever has access to are derived from its current stack. Allowing stack creation would violate that: at least the memory allocator would have access to multiple stacks at once. It would be possible to allocate stacks from a separate pool, but that’s not really different from allocating stacks up front and having a small compartment that switches them from one use to another or implements a thread pool. There is an example thread pool in lib/thread_pool and thread_pool.h that you can either use directly or use as inspiration for your own design, if you want to create new thread-like contexts.

Threads in CHERIoT RTOS are constructed with four properties:

  • The size of their stack.

  • The size of their trusted stack.

  • Their priority.

  • The entry point where they start executing.

The stack size means the same as on any other platform. Specifically on CHEIRoT, the stack-pointer capability will be bounded to this size (rounded up if necessary for alignment) and any overflow of the stack, even by a single byte, will trap. The trusted stack size is the maximum number of cross-compartment calls that this thread can do. Each cross-compartment call invokes the switcher, which pushes a new frame on the trusted stack describing where to return.

In the current version, each trusted stack frame is three capabilities (24 bytes). A larger trusted stack does not make much difference to total memory consumption.

The priority of threads matters only in relative terms. Like FreeRTOS (and unlike UNIX), higher numbers mean higher priorities. The scheduler has some data structures whose size depends on the number of priorities, so compiling with fewer priorities can make the scheduler smaller.

The entry point is a compartment entry point. It must be exposed as described in Chapter 4. Thread entry points take no arguments and return no arguments.

On most other systems, thread creation functions take a pointer. This does not make sense for threads that are not dynamically created because there is no context for their creation.

5.2. Identifying the current thread

You will sometimes need to know which thread is currently running. This can be for something as simple as debugging but may also be needed for maintaining per-thread data structures. The ID of each thread is stored in the register save area for that thread and the switcher exposes a library call (thread_id_get) to read it.

uint16_t thread_id_get(void)

Return the thread ID of the current running thread.

This is mostly useful where one compartment can run under different threads and it matters which thread entered this compartment.

This is implemented in the switcher.

Thread IDs start at one (not zero!) because zero is used to indicate the idle thread and so is never visible. The thread_count function returns the number of threads that have been created in the system. This is not decremented when threads exit and so provides the upper bound on the number of threads that may exist. This can be used to size data structures that are indexed by thread ID.

uint16_t thread_count()

Returns the number of threads, including threads that have exited.

This API never fails, but if the trusted stack is exhausted and it cannot be called then it will return -1. Callers that have not probed the trusted stack should check for this value.

The result of this is safe to cache: it will never change over time.

5.3. Using the Timeout structure

Several RTOS APIs have timeouts. These are expressed as a pointer to a Timeout structure. This design is intended to allow a single timeout to be passed down a chain of operations.

Timeouts represent time spent blocking (yielding waiting to be runnable), not time spent running (doing useful work).

Timeouts measure time in scheduler Ticks. A tick is a single scheduling quantum, which depends on the board configuration. This is the minimum amount of time for which it is plausible for a thread to sleep. If a thread sleeps then another thread becomes runnable and is then allowed to run (unless it also yields) for one tick.

At the end of each tick, the scheduler receives the timer interrupt and chooses the next thread to run. Threads may only run at all if no higher-priority thread is runnable. Threads at the same priority level are round-robin scheduled.

The timeout structure captures the amount of time that is allowed to block and the number of ticks for which it has blocked. Each subsequent call that is passed the same timeout structure may increase the amount of slept time and decrease the remaining time.

A thread may block for more than the permitted limit if it is sleeping while a higher-priority thread runs. Only the highest-priority thread can make strong realtime guarantees in the presence of other runnable threads.

Functions that take a timeout should always expect it as the first argument. This allows it to be forwarded to subsequent calls trivially.

Timeouts may not be stored on the heap. Any function checking timeouts may refuse to accept a heap-allocated timeout. It is difficult to work with heap-allocated timeouts because they may be deallocated while the thread is sleeping, which would then cause it to crash on updating the timeout structure.

5.4. Sleeping

Sleeping for a bounded number of ticks is the simplest form of blocking available. The thread_sleep call causes the caller to yield until a certain number of ticks have run.

int thread_sleep(struct Timeout * timeout)

Sleep for at most the specified timeout (see timeout.h).

The thread becomes runnable once the timeout has expired but a higher-priority thread may prevent it from actually being scheduled. The return value is a saturating count of the number of ticks that have elapsed.

A call of thread_sleep is with a timeout of zero is equivalent to yield, but reports the time spent sleeping. This requires a cross-domain call and return in addition to the overheads of yield and so yield should be preferred in contexts where the elapsed time is not required.

As with other calls that take a Timeout, the number of ticks that have elapsed during the call can be checked by reading the elapsed field of the timeout structure.

Sleeping in a system with an RTOS scheduler conflates two concepts:

  • Waiting for some time to elapse.

  • Allowing lower-priority threads to run.

The thread_sleep call supports both of these but understanding how they differ requires understanding a little of the scheduler’s behaviour.

Traditional OS schedulers from the earlies preemptive multitasking systems used a fixed scheduling quantum. The scheduler would configure a periodic timer interrupt and would make a new scheduling decision at each interrupt or at explicit yields. This fixed quantum is the origin of the tick abstraction in CHERIoT RTOS.

The CHERIoT RTOS scheduler is tickless. This means that, although it uses ticks as an abstraction for defining scheduling quanta, it does not schedule a regular timer interrupt. When two threads at the same priority level are runnable, the scheduler will request a timer interrupt to preempt the current one and switch to the other. If the running thread has no peers, the scheduler will allow it to run until either it yields or another higher or equal-priority thread’s timeout expires. The tick abstraction remains as a convenient way of expressing time to the scheduler, but internally the scheduler tracks only elapsed cycles.

By default, if 0 is passed as the flags argument to thread_sleep, the sleep operation is treated as a yield. This is a way for the running thread to communicate to the scheduler that it is happy for other (lower or equal-priority) threads to run for up to the specified number of ticks. The scheduler may wake the yielding thread if no other thread is going to be runnable within that number of ticks. This allows a high-priority threads to allow other threads to run, but continue using the CPU is no other thread is runnable.

In some cases, you really want to sleep. For example, if you’re updating a clock display, you will want to run once a second or once a minute to update a display. The same applies if you’re sending keep-alive packets or periodically monitoring some other component. Even if no other threads are runnable, you have no useful work to do for a bit. You can pass ThreadSleepNoEarlyWake as the flags argument to thread_sleep to indicate that you really want to sleep.

5.5. Building locks with futexes

The scheduler exposes a set of futex APIs as a building block for various notification and locking mechanisms. Futex is a contraction of 'fast userspace mutex'. This does not quite apply on a CHERIoT system, where there is no userspace, but the core concept of avoiding a privilege transition on fast paths still applies.

A CHERIoT RTOS futex is a 32-bit word where the scheduler provides compare-and-sleep (futex_timed_wait) and notify (futex_wake) operations.

int futex_timed_wait(Timeout * ticks,
                     const uint32_t * address,
                     uint32_t expected,
                     enum FutexWaitFlags flags)

Compare the value at address to expected and, if they match, sleep the thread until a wake event is sent with futex_wake or until this the thread has slept for ticks ticks.

The value of ticks specifies the permitted timeout. See timeout.h for details.

The address argument must permit loading four bytes of data after the address.

The flags argument contains flags that may control the behaviour of the call.

This returns:

  • 0 on success: either *address and expected differ or a wake is received.

  • -EINVAL if the arguments are invalid.

  • -ETIMEOUT if the timeout expires.

int futex_wake(uint32_t * address,
               uint32_t count)

Wakes up to count threads that are sleeping with futex[_timed]_wait on address.

The address argument must permit storing four bytes of data after the address. This call does not store to the address but requiring store permission prevents a thread from waking up a futex that it cannot possibly have moved to a different state.

The return value for a successful call is the number of threads that were woken. -EINVAL is returned for invalid arguments.

In C++, std::atomic<uint32_t> provides wait, notify_all, and notify_one methods that expose futex functionality and may be more convenient to call than the raw futex APIs. These include some additional (non-standard) overloads that expose more of the underlying futex functionality.

A futex allows you to use atomic operations on a 32-bit word for fast paths but then sleep and wake threads when they are blocked, rather than spinning. Anything that can be implemented with a spin-wait loop can usually be made more efficient with a futex.

For example, consider the simplest possible spinlock, which uses a single word containing a one to indicate locked and a zero to indicate unlocked. When you encounter a one, you sit in a loop doing an atomic compare-and-swap trying to replace a zero with a one. When this succeeds, you’ve acquired the lock.

On most operating systems with single-core systems, you will sit in this loop until you exhaust your quantum, then a timer will fire and another thread will run. Your thread may be scheduled before the thread that owns the lock finishes, so you’ll then spin for another quantum.

The first simple optimisation on this design is to yield in the spin loop. This will allow other threads to run but the waiting thread remains runnable and so may be rescheduled early. With an RTOS priority scheduler, if the thread that’s waiting is a higher priority than thread that owns the lock then the thread that owns the lock may never be scheduled.

A futex lets the waiting thread sleep. The futex_timed_wait call will compare the value in the futex word to the expected value (one, indicating locked, in this case) and, if they match, will send the thread to sleep and remain asleep until the thread owning the lock will then do a futex_wake call when unlocking.

A more complete futex-based lock uses three values in the lock word to differentiate between locked states with and without waiters. This allows the uncontended case to avoid any cross-compartment calls.

The locks library provides a set of futex-based locks. The locks.h header exposes the interface to this library.

Ticket locks

provide guaranteed FIFO semantics for waiters.

Flag locks

are simple locks that wake waiters in the order of their thread priorities. These can optionally provide priority inheritance (see Section 5.6).

Recursive mutexes

wrap a priority-inheriting flag lock and allow the same thread to acquire a lock multiple times.

Semaphores

provide a counting semaphore abstraction.

C++ users may prefer to use the wrappers provided in locks.hh, which implement a uniform interface for different lock types. This header also defines a NoLock class that provides the same interface but does not do any locking so generic data structures can be implemented with and without locking.

Futexes can be used to build other waiting mechanisms beyond locks. For example, a ring buffer with producer and consumer counters can have the sender wait while the ring is full by using a futex wait on the consumer counter and the receiver can do likewise with the producer counter. This allows a ring buffer design that is mostly lockless, yet allows the producer to sleep when the ring is full or the consumer to sleep when it is empty.

5.6. Inheriting priorities

Simple futex-based locks are vulnerable to priority inversion. Consider a case with three threads. The first is a low-priority thread that acquires a lock. The second is a medium-priority thread that preempts the first. The third is a high-priority thread that waits for the lock.

The high-priority thread in this example cannot make progress until the low-priority thread releases the lock. The low-priority thread cannot make progress until the medium-priority thread blocks. This means that the medium-priority thread is preventing the high-priority thread from making progress, which is the opposite of the desired situation.

Priority inheritance is the solution to this kind of problem. The blocking high-priority thread loans its priority to the low-priority thread, allowing it to (temporarily) be scheduled in preference to the medium-priority thread.

The futex APIs implement this by storing the thread ID of the owning thread in the bottom 16 bits of the futex word and passing FutexPriorityInheritance to the flags argument in the wait call. The specified thread will have its priority set to the highest priority of any of the waiting threads. The priority boost lasts until the waiters time out or the boosted thread releases the lock, whichever happens first.

A single thread can hold multiple priority-inheriting locks and receive priority boosts from all of them.

The priority inheritance mechanism can also be used to build asymmetric locks. These have a fast path that doesn’t do any cross-compartment calls and a slow path that does. You can find one example of this in the hazard pointer mechanism for short-lived claims. This must detect when a thread has tried to add a hazard pointer while the allocator is scanning the list, without slowing down the allocator. Before reading the list, the allocator increments the top 16 bits of the futex word and sets the low 16 to the thread ID performing the operation. Threads updating the hazard set check the futex word before and after updating the list. If the top 16 bits have changed, they know that the allocator has scanned the list and they must retry. If the top 16 bits contain an odd value, the allocator is currently scanning the list and they must wait. They can do a priority-inheriting wait with a one-tick timeout even though the allocator will not ever call `futex_wake`. They will yield for one tick, boosting the priority of the thread that’s currently in the allocator, but then resume at the end of the tick.

5.7. Securing futexes

Most of the time you will want to use futexes (and the locks that wrap them) to synchronise operations within a single compartment. Futex-based locks rely on the contents of the lock word to be valid. For example, if a flag lock is directly accessible by two mutually distrusting compartments, one can write an invalid value to the word and either prevent the other from waking waiters or cause it to spuriously believe that it has acquired the lock.

This is not normally a limitation because locks typically protect some data structure or other resource that should not be concurrently mutated by multiple threads. Providing mutable views of such a structure to multiple compartments is almost certainly a security vulnerability, even without attacks on the futex.

There is one situation where futexes are safe to share across compartment boundaries. If you have a component that others trust for availability, it can share read-only views of a futex to allow waiting for an out-of-band event. The scheduler does this for interrupts (see [drivers]), allowing threads to use the futex wait operation to block until an interrupt is ready.

5.8. Using event groups

The event_group library provides an event group API that is primarily intended for porting code written against FreeRTOS’s event groups APIs. The event.h header exposes the interface to this library. These APIs do not have a clear trust model and so should be avoided in new code that is not ported from FreeRTOS. You can build more convenient interfaces atop futexes for most synchronisation operations. You may also simply use multiple futexes and the multiwaiter API (see Section 5.10) to wait for multiple events.

An event group is a set of up to 24 values that can be set or cleared independently. Waiters can wait for any or all of an arbitrary subset of these.

Event groups are created with the eventgroup_create function. This returns an opaque handle to the event group, which can be used for setting, clearing, or waiting on events.

int eventgroup_create(struct Timeout * timeout,
                      struct SObjStruct * heapCapability,
                      struct EventGroup ** outGroup)

Create a new event group, allocated using heapCapability.

The event group is returned via outGroup.

This returns zero on success. Otherwise it returns a negative error code. If the timeout expires then this returns -ETIMEDOUT, if memory cannot be allocated it returns -ENOMEM.

Note that, because this allocates memory, it requires an allocation capability. See Chapter 6 for more information about what this means.

You can then use eventgroup_set and eventgroup_clear to set and clear some or all of the event flags in this group. Both of these calls return the old values of the bits.

int eventgroup_set(Timeout * timeout,
                   struct EventGroup * group,
                   uint32_t * outBits,
                   uint32_t bitsToSet)

Set one or more bits in an event group.

The bitsToSet argument contains the bits to set. Any thread waiting with eventgroup_wait will be woken if the bits that it is waiting for are set.

This returns zero on success. If the timeout expires before this returns then it returns -ETIMEDOUT.

Independent of success or failure, outBits will be used to return the set of currently set bits in this event group.

int eventgroup_clear(Timeout * timeout,
                     struct EventGroup * group,
                     uint32_t * outBits,
                     uint32_t bitsToClear)

Clear one or more bits in an event group.

The bitsToClear argument contains the set of bits to clear. This does not wake any threads.

This returns zero on success. If the timeout expires before this returns then it returns -ETIMEDOUT.

Independent of success or failure, outBits will be used to return the set of currently set bits in this event group.

You can then subsequently wait for some of the events to be set with the eventgroup_wait function. This takes a set of events to wait for and can wait until either any or all of them are set.

int eventgroup_wait(Timeout * timeout,
                    struct EventGroup * group,
                    uint32_t * outBits,
                    uint32_t bitsWanted,
                    _Bool waitForAll,
                    _Bool clearOnExit)

Wait for events in an event group.

The bitsWanted argument must contain at least one bit set in the low 24 bits (and none in the high bits). This indicates the specific events to wait for. If waitForAll is true then all of the bits in bitsWanted must be set in the event group before this returns. If waitForAll is false then any of the bits in bitsWanted being set in the event group will cause this to return.

If this returns zero then outBits will contain the bits that were set at the time that the condition became true. If this returns -ETIMEDOUT then outBits will contain the bits that were set at the time that the timeout expired.

Note: waitForAll requires all bits to be set at the same time. This makes it trivial to introduce race conditions if used with multiple waiters and clearOnExit, or if different threads clear bits different bits in the waited set.

If clearOnExit is true and this returns successfully then the bits in bitsWanted will be cleared in the event group before this returns.

This call can also atomically clear the bits that you’ve waited on, giving them edge-triggered behaviour.

5.9. Sending messages

A message queue is a FIFO capable of storing a fixed number of fixed-sized entries. There are two distinct use cases for message queues:

  • Communicating between two threads in the same compartment.

  • Communicating between different compartments.

In the first case, the endpoints are in the same trust domain. The message_queue_library library provides a simple message queue API that is intended for this use case. When the endpoints are in different trust domains, the endpoints must be protected from tampering. The message_queue compartment wraps the library in a compartment that exposes an almost identical interface to the library but with the endpoints exposed as (tamper-proof) sealed capabilities.

Queues for use within a single compartment are created with queue_create, which allocates the buffer and returns a handle that can be used for sending and receiving messages. There is no explicit queue_destroy function. The memory allocated can simply be freed when the queue is no longer needed. The pointer returned via the outAllocation parameter refers to the entire allocation used for the queue and so can be passed to heap_free, along with the heap capability used to allocate the queue.

int queue_create(Timeout * timeout,
                 struct SObjStruct * heapCapability,
                 struct QueueHandle * outQueue,
                 void ** outAllocation,
                 size_t elementSize,
                 size_t elementCount)

Allocates space for a queue using heapCapability and stores a handle to it via outQueue.

The underlying allocation (which is necessary to free the queue) is returned via outAllocation.

The queue is has space for elementCount entries. Each entry is a fixed size, elementSize bytes.

Messages are then sent with queue_send and received with queue_receive. These are blocking (if allowed to by with a non-zero timeout) calls that send or receive a single message.

int queue_send(Timeout * timeout,
               struct QueueHandle * handle,
               const void * src)

Send a message to the queue specified by handle.

This expects to be able to copy the number of bytes specified by elementSize when the queue was created from src.

Returns 0 on success. On failure, returns -ETIMEOUT if the timeout was exhausted, -EINVAL on invalid arguments.

This expected to be called with a valid queue handle. It does not validate that this is correct. It uses safe_memcpy and so will check the buffer.

int queue_receive(Timeout * timeout,
                  struct QueueHandle * handle,
                  void * dst)

Receive a message over a queue specified by handle.

This expects to be able to copy the number of bytes specified by elementSize. The message is copied to dst, which must have sufficient permissions and space to hold the message.

Returns 0 on success, -ETIMEOUT if the timeout was exhausted, -EINVAL on invalid arguments.

For defence in depth, you can use queue_make_receive_handle or queue_make_send_handle to create a handle that can only be used for receiving or sending messages, respectively.

struct QueueHandle queue_make_receive_handle(struct QueueHandle handle)

Convert a queue handle returned from queue_create into one that can be used only for receiving.

Note: This is primarily defence in depth. A malicious holder of this queue handle can still set the consumer counter to invalid values.

struct QueueHandle queue_make_send_handle(struct QueueHandle handle)

Convert a queue handle returned from queue_create into one that can be used only for sending.

Note: This is primarily defence in depth. A malicious holder of this queue handle can still set the producer counter to invalid values and overwrite arbitrary queue locations.

The library interfaces to queues are not intended to be robust in the presence of malicious callers. They run in the same security context as the caller and so a caller may abuse them to corrupt its own state. They do aim to be robust with respect to the source or destination buffer for sending and receiving messages being invalid or concurrently deallocated.

You can probe the number of messages in a queue with queue_items_remaining.

int queue_items_remaining(struct QueueHandle * handle,
                          size_t * items)

Returns the number of items in the queue specified by handle via items.

Returns 0 on success. This has no failure mechanisms, but is intended to have the same interface as the version that operates on a sealed queue handle.

Note: This interface is inherently racy. The number of items in the queue may change in between the return of this function and the caller acting on the result.

If you are passing messages between compartments, you should use the versions of these functions with the _sealed suffix. The queue_create_sealed function creates a queue and returns separate send and receive handles, which can be passed to separate compartments. This queue can be destroyed by calling queue_destroy_sealed with the send and receive handles. The queue is not destroyed until both handles have been passed to this function.

int queue_create_sealed(Timeout * timeout,
                        struct SObjStruct * heapCapability,
                        struct SObjStruct ** outQueueSend,
                        struct SObjStruct ** outQueReceive,
                        size_t elementSize,
                        size_t elementCount)

Allocate a new message queue that is managed by the message queue compartment.

This is returned as two sealed pointers to send and receive ends of the queue.

int queue_destroy_sealed(Timeout * timeout,
                         struct SObjStruct * heapCapability,
                         struct SObjStruct * queueHandle)

Destroy a queue using a sealed queue endpoint handle.

The queue is not actually freed until both endpoints are destroyed, which means that you can safely call this from the sending end without the receiving end losing access to messages held in the queue.

The corresponding send and receive functions are identical to their library counterparts, but take one of the queue handles returned from queue_destroy_sealed.

5.10. Waiting for multiple events

The multiwaiter API allows waiting for any of a set of independent events. It is conceptually similar to select, poll, epoll, and kqueue in *NIX operating systems or WaitForMultipleObjects in Windows. It is designed to bound the amount of time that the scheduler must spend checking multiwaiters and to minimise the amount of memory that multiwaiters consume. Memory is allocated only when a multiwaiter is created, with multiwaiter_create. This creates a multiwaiter with space for a fixed number of events.

int multiwaiter_create(Timeout * timeout,
                       struct SObjStruct * heapCapability,
                       struct MultiWaiter ** ret,
                       size_t maxItems)

Create a multiwaiter object.

This is a stateful object that can wait on at most maxItems event sources.

Each multiwaiter_wait call is a one-shot operation. The call is passed a set of things to wait for and the associated condition via the events array and returns the waited status via the same array. This is typically an on-stack array.

int multiwaiter_wait(Timeout * timeout,
                     struct MultiWaiter * waiter,
                     struct EventWaiterSource * events,
                     size_t newEventsCount)

Wait for events.

The first argument is the multiwaiter to wait on. New events can optionally be added by providing an array of newEventsCount elements as the newEvents argument.

Return values:

  • On success, this function returns 0.

  • If the arguments are invalid, this function returns -EINVAL.

  • If the timeout is reached without any events being triggered then this returns -ETIMEOUT.

The multiwaiter can natively wait only for futex notifications but higher-level mechanisms are built out of futexes. For example, if you wish to wait for a message queue (see Section 5.9) to be ready to send, you can call multiwaiter_queue_receive_init to initialise a multiwaiter event with the queue’s receive counter and expected value. This event will then fire if the queue becomes non-full. The normal caveats about race conditions apply: the queue may become full again if another thread sends messages in between your receiving the notification and sending a message.