CHERIoT Programmers' Guide

David Chisnall

Table of contents

6. Communicating between threads

CHERIoT RTOS provides threads as a core abstraction. Threads run until they either yield or are preempted via an interrupt, and then later resume from the same point. There are a small number of scheduler APIs that allow threads to block; higher-level APIs from other compartments may block by indirectly invoking them.

Remember that, in most respects, the scheduler is just another compartment. It doesn't run with elevated privileges; it decides which thread to run next, but it is not able to see the stacks or register states associated with threads.

6.1. Defining threads

Threads in CHERIoT RTOS cannot be dynamically created, because creating threads at run time would require allocating stacks at run time. The no-capture guarantees that CHERIoT RTOS enforces depend on no code (after the loader) having access to two stacks' memory at the same time; this is what allows the switcher to zero stacks and avoid leaks. The only store-local capabilities that a thread ever has access to are derived from its current stack. Allowing dynamic stack allocation would violate that property: at least the memory allocator would have access to multiple stacks at once. It would be possible to allocate stacks from a separate pool, but that's not really different from allocating stacks up front and having a small compartment that switches them from one use to another or implements a thread pool. There is an example thread pool in lib/thread_pool and thread_pool.h that you can either use directly or use as inspiration for your own design, if you want to create new thread-like contexts.

Threads in CHERIoT RTOS are constructed with four properties: a stack size, a trusted stack size, a priority, and an entry point.

The stack size means the same as on any other platform. Specifically, on CHERIoT, the stack-pointer capability will be bounded to this size (rounded up if necessary for alignment) and any overflow of the stack, even by a single byte, will trap. The trusted stack size is the maximum depth of cross-compartment calls that this thread can make. Each cross-compartment call invokes the switcher, which pushes a new frame onto the trusted stack describing where to return.

In the current version, each trusted stack frame is three capabilities (24 bytes). A larger trusted stack does not make much difference to total memory consumption.

The priority of threads matters only in relative terms. Like FreeRTOS (and unlike UNIX), higher numbers mean higher priorities. The scheduler has some data structures whose size depends on the number of priorities, so compiling with fewer priorities can make the scheduler smaller.

The entry point is a compartment entry point. It must be exposed as described in Chapter 5. Compartments and libraries. Thread entry points take no arguments and return no value.

On most other systems, thread creation functions take a pointer that is passed to the new thread's entry point. This does not make sense for threads that are not dynamically created, because there is no creation context from which such an argument would come.

6.2. Identifying the current thread

You will sometimes need to know which thread is currently running. This can be for something as simple as debugging but may also be needed for maintaining per-thread data structures. The ID of each thread is stored in the register save area for that thread and the switcher exposes a library call (thread_id_get) to read it.

Documentation for the thread_id_get function
uint16_t thread_id_get()

Return the thread ID of the currently running thread. This is mostly useful where one compartment can run under different threads and it matters which thread entered this compartment.

This is implemented in the switcher.

Thread IDs start at one (not zero!) because zero is used to indicate the idle thread and so is never visible. The thread_count function returns the number of threads that have been created in the system. This count is not decremented when threads exit and so provides an upper bound on the number of threads that may exist. It can be used to size data structures that are indexed by thread ID.

Documentation for the thread_count function
uint16_t thread_count()

Returns the number of threads, including threads that have exited.

This API never fails but, if the trusted stack is exhausted, the call cannot happen and will return -1. Callers that have not probed the trusted stack should check for this value.

The result of this is safe to cache: it will never change over time.
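For example, you can combine these two functions to maintain per-thread data. The following is a minimal sketch, assuming that thread.h declares thread_id_get and thread_count; the fixed array size is an illustrative choice, not part of any API:

#include <stdint.h>
#include <thread.h>

// One slot per thread. Thread IDs start at one, so slot N-1
// belongs to thread N. Sized generously for this sketch.
static uint32_t counters[16];

void counter_increment()
{
	uint16_t id = thread_id_get();
	// thread_count() never changes, so a real version could
	// check it once and size the array exactly.
	if ((id >= 1) && (id <= thread_count()) && (id <= 16))
	{
		counters[id - 1]++;
	}
}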

The current_thread example shows calling these functions. The entry point function for this is shown in Listing 23 and the thread definitions from the xmake.lua file in Listing 24.

/// Thread entry point.
void __cheri_compartment("current") entry()
{
	for (int i = 0; i < 2; i++)
	{
		printf("Current thread: %d of %d\n",
		       thread_id_get(),
		       thread_count());
	}
}

Listing 23. A simple example that prints the current thread. examples/current_thread/current.cc

-- Threads to select
		target:values_set("threads", {
			{
				compartment = "current",
				priority = 1,
				entry_point = "entry",
				stack_size = 0x400,
				trusted_stack_frames = 2
			},
			{
				compartment = "current",
				priority = 2,
				entry_point = "entry",
				stack_size = 0x400,
				trusted_stack_frames = 2
			}
		}, {expand = false})

Listing 24. The thread definitions for the current-thread example. examples/current_thread/xmake.lua

Note that thread two has a higher priority than thread one. When you run this example, you should see output like this:

Current thread: 2 of 2
Current thread: 2 of 2
Current thread: 1 of 2
Current thread: 1 of 2

The higher-priority thread runs until it exits. Normally, a higher-priority thread would yield to allow another thread to run, as we'll see later in this chapter.

6.3. Using the Timeout structure

Several RTOS APIs have timeouts. These are expressed as a pointer to a Timeout structure. This design is intended to allow a single timeout to be passed down a chain of operations.

Timeouts represent time spent blocking (yielding waiting to be runnable), not time spent running (doing useful work).

Timeouts measure time in scheduler ticks. A tick is a single scheduling quantum, which depends on the board configuration. This is the minimum amount of time for which it is plausible for a thread to sleep. If a thread sleeps then another thread becomes runnable and is then allowed to run (unless it also yields).

Although ticks exist as a unit of accounting, the CHERIoT RTOS scheduler is a tickless scheduler. Traditional schedulers schedule a timer interrupt at a fixed quantum and make a scheduling choice at each interrupt. This can be inefficient because a high-priority thread will be routinely interrupted and then rescheduled (because it remains the highest-priority thread). A tickless scheduler avoids this and instead, before scheduling a thread, sets a timer interrupt to fire at the next point at which another thread may need to be woken.

For example, consider the case where a high-priority thread sleeps for three ticks and a lower-priority thread runs. With a traditional scheduler, a timer interrupt will fire three times. Each time, the scheduler will do some accounting and then reschedule the lower-priority thread. In contrast, a tickless scheduler will configure the timer to fire once, after three ticks have elapsed. At that point, the high-priority thread is runnable and so will be scheduled.

The timeout structure captures the amount of time that a call is allowed to block and the number of ticks for which it has blocked so far. Each subsequent call that is passed the same timeout structure may increase the elapsed time and decrease the remaining time.

A thread may block for more than the permitted limit if it is sleeping while a higher-priority thread runs. Only the highest-priority thread can make strong realtime guarantees in the presence of other runnable threads.

Functions that take a timeout should always expect it as the first argument. This allows it to be forwarded to subsequent calls trivially.

Timeouts may not be stored on the heap. Any function checking timeouts may refuse to accept a heap-allocated timeout. Heap-allocated timeouts are difficult to work with because they may be deallocated while the thread is sleeping, which would then cause it to crash when updating the timeout structure.
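As a sketch of this pattern, a hypothetical function that acquires two resources within a single time budget simply forwards its own timeout; each blocking call updates the elapsed and remaining fields, so the pair cannot block for longer than the caller allowed. The helper names here are illustrative:

#include <timeout.h>

// Hypothetical blocking operations; each blocks for at most
// the remaining time in the timeout and updates it.
int acquire_first(Timeout *timeout);
int acquire_second(Timeout *timeout);

int acquire_both(Timeout *timeout)
{
	int ret = acquire_first(timeout);
	if (ret != 0)
	{
		return ret;
	}
	// The second call sees only the time that remains.
	return acquire_second(timeout);
}

A caller typically constructs the timeout on the stack, for example Timeout t{MS_TO_TICKS(100)}, and passes &t down the whole chain.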

6.4. Sleeping

Sleeping for a bounded number of ticks is the simplest form of blocking available. The thread_sleep call causes the caller to yield until a certain number of ticks have elapsed.

Documentation for the thread_sleep function
int thread_sleep(struct Timeout * timeout, uint32_t flags)

Sleep for at most the specified timeout (see timeout.h).

The thread becomes runnable once the timeout has expired but a higher-priority thread may prevent it from actually being scheduled. The return value is a saturating count of the number of ticks that have elapsed.

A call to thread_sleep with a timeout of zero is equivalent to yield, but reports the time spent sleeping. This requires a cross-domain call and return in addition to the overheads of yield, so yield should be preferred in contexts where the elapsed time is not required.

The flags parameter is a bitwise OR of ThreadSleepFlags.

A sleeping thread may be woken early if no other threads are runnable or have earlier timeouts. The thread with the earliest timeout will be woken first. This can cause a yielding thread to sleep when no other thread is runnable, but avoids a potential problem where a high-priority thread yields to allow a low-priority thread to make progress, but then the low-priority thread does a short sleep. In this case, the desired behaviour is not to wake the high-priority thread early, but to allow the low-priority thread to run for the full duration of the high-priority thread's yield.

If you are using thread_sleep to wait for a period of real time to elapse, pass ThreadSleepNoEarlyWake as the flags argument to prevent early wakeups.

As with other calls that take a Timeout, the number of ticks that have elapsed during the call can be checked by reading the elapsed field of the timeout structure.

Sleeping in a system with an RTOS scheduler conflates two concepts: yielding the processor so that other threads can run, and waiting until some amount of real time has elapsed.

The thread_sleep call supports both of these but understanding how they differ requires understanding a little of the scheduler's behaviour. Recall that CHERIoT RTOS has a tickless scheduler.

This means that, although it uses ticks as an abstraction for defining scheduling quanta, it does not schedule a regular timer interrupt. When two threads at the same priority level are runnable, the scheduler will request a timer interrupt to preempt the current one and switch to the other. If the running thread has no peers, the scheduler will allow it to run until either it yields or another higher or equal-priority thread's timeout expires. The tick abstraction remains as a convenient way of expressing time to the scheduler, but internally the scheduler tracks only elapsed cycles.

By default, if 0 is passed as the flags argument to thread_sleep, the sleep operation is treated as a yield. This is a way for the running thread to communicate to the scheduler that it is happy for other (lower or equal-priority) threads to run for up to the specified number of ticks. The scheduler may wake the yielding thread if no other thread is going to be runnable within that number of ticks. This allows a high-priority thread to let other threads run, but to continue using the CPU if no other thread is runnable.

In some cases, you really want to sleep. For example, if you're updating a clock, you will want to run once a second or once a minute to update the display. The same applies if you're sending keep-alive packets or periodically monitoring some other component. Even if no other threads are runnable, you have no useful work to do for a while. You can pass ThreadSleepNoEarlyWake as the flags argument to thread_sleep to indicate that you really want to sleep.

You can see the effect of sleeping in the thread_sleep example, as shown in Listing 25. This is a modified version of the current_thread example from earlier, now sleeping in each loop iteration.

/// Thread entry point.
void __cheri_compartment("current") entry()
{
	for (int i = 0; i < 2; i++)
	{
		printf("Current thread: %d of %d\n",
		       thread_id_get(),
		       thread_count());
		Timeout t{1};
		thread_sleep(&t);
	}
	printf("Cycles elapsed: %lld\n", rdcycle64());
}

Listing 25. A simple example of thread sleeping. examples/thread_sleep/current.cc

If you run this, you should see output that looks somewhat like this:

Current thread: 2 of 2
Current thread: 1 of 2
Current thread: 2 of 2
Current thread: 1 of 2
Cycles elapsed: 262193
Cycles elapsed: 265806

As before, thread two runs first, but then it yields and allows thread one to run. Thread one then yields and allows thread two to run, and so on. If thread one did not yield then it would be preempted after one tick.

If you run this with the Sail simulator, do not be surprised if the cycle counts look very small. Sail is not a cycle-accurate model and so the cycle count is guaranteed to be monotonic, but not to represent a real system in any way. The snippets in this section are using the Ibex simulator.

Try modifying this example, adding ThreadSleepNoEarlyWake as a second argument to the thread_sleep call. You should now see output that looks very similar, but shows lower cycle counts at the end:

Current thread: 2 of 2
Current thread: 1 of 2
Current thread: 2 of 2
Current thread: 1 of 2
Cycles elapsed: 249233
Cycles elapsed: 252273

Here you see that the total execution time has gone from 265,806 cycles to 252,273. In the original version, when thread one slept (after doing far less than one tick's worth of work), there were no runnable threads and so the scheduler did nothing for a while. Eventually, thread two (the high-priority thread) became runnable again and resumed. In the version with ThreadSleepNoEarlyWake, thread two can resume as soon as thread one sleeps. Similarly, when thread two yields for the second time, thread one will resume.

6.5. Waiting for events with futexes

The scheduler exposes a set of futex APIs as a building block for various notification and locking mechanisms. Futex is a contraction of 'fast userspace mutex'. This does not quite apply on a CHERIoT system, where there is no userspace, but the core concept of avoiding a privilege transition on fast paths still applies.

A CHERIoT RTOS futex is a 32-bit word where the scheduler provides compare-and-sleep (futex_timed_wait) and notify (futex_wake) operations.

Documentation for the futex_timed_wait function
int futex_timed_wait(Timeout * ticks, const uint32_t * address, uint32_t expected, uint32_t flags)

Compare the value at address to expected and, if they match, sleep the thread until a wake event is sent with futex_wake or until the thread has slept for ticks ticks.

The value of ticks specifies the permitted timeout. See timeout.h for details.

The address argument must permit loading four bytes of data after the address.

The flags argument contains flags that may control the behaviour of the call.

This returns:

  • 0 on success: either *address and expected differ or a wake is received.
  • -EINVAL if the arguments are invalid.
  • -ETIMEDOUT if the timeout expires.
Documentation for the futex_wake function
int futex_wake(uint32_t * address, uint32_t count)

Wakes up to count threads that are sleeping with futex[_timed]_wait on address.

The address argument must permit storing four bytes of data after the address. This call does not store to the address but requiring store permission prevents a thread from waking up a futex that it cannot possibly have moved to a different state.

The return value for a successful call is the number of threads that were woken. -EINVAL is returned for invalid arguments.

In C++, std::atomic<uint32_t> provides wait, notify_all, and notify_one methods that expose futex functionality and may be more convenient to call than the raw futex APIs. These include some additional (non-standard) overloads that expose more of the underlying futex functionality.

A futex allows you to use atomic operations on a 32-bit word for fast paths but then sleep and wake threads when they are blocked, rather than spinning. Anything that can be implemented with a spin-wait loop can usually be made more efficient with a futex.

For example, consider the simplest possible spinlock, which uses a single word containing a one to indicate locked and a zero to indicate unlocked. When you encounter a one, you sit in a loop doing an atomic compare-and-swap trying to replace a zero with a one. When this succeeds, you've acquired the lock.

On most operating systems running on single-core processors, you will sit in this loop until you exhaust your quantum, then a timer will fire and another thread will run. Your thread may be scheduled before the thread that owns the lock finishes, so you'll then spin for another quantum.

The first simple optimisation on this design is to yield in the spin loop. This allows other threads to run, but the waiting thread remains runnable and so may be rescheduled early. With an RTOS priority scheduler, if the waiting thread has a higher priority than the thread that owns the lock, the thread that owns the lock may never be scheduled.

A futex lets the waiting thread sleep. The futex_timed_wait call will compare the value in the futex word to the expected value (one, indicating locked, in this case) and, if they match, will send the thread to sleep. It remains asleep until the thread owning the lock does a futex_wake call when unlocking.

A more complete futex-based lock uses three values in the lock word to differentiate between locked states with and without waiters. This allows the uncontended case to avoid any cross-compartment calls.
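The following is a minimal sketch of such a lock, in the style of Ulrich Drepper's 'Futexes Are Tricky'. It assumes a futex_wait convenience wrapper (an unbounded futex_timed_wait); the FlagLock provided by the locks library (see the next section) is a more complete version of the same idea:

#include <futex.h>
#include <stdint.h>

// Lock-word states: 0 = unlocked, 1 = locked with no waiters,
// 2 = locked with (possible) waiters.
static uint32_t lockWord = 0;

void toy_lock()
{
	uint32_t expected = 0;
	// Fast path: uncontended acquire, no cross-compartment call.
	if (__atomic_compare_exchange_n(
	      &lockWord, &expected, 1, false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
	{
		return;
	}
	while (true)
	{
		// Mark the lock contended so that unlock knows to wake us.
		if (__atomic_exchange_n(&lockWord, 2, __ATOMIC_ACQUIRE) == 0)
		{
			// The lock was released between our two operations; we
			// now own it (a spurious wake later is harmless).
			return;
		}
		// Sleep while the word is still 2. If it has already
		// changed, this returns immediately.
		futex_wait(&lockWord, 2);
	}
}

void toy_unlock()
{
	// Only make the (cross-compartment) wake call if a waiter may
	// have marked the lock as contended.
	if (__atomic_exchange_n(&lockWord, 0, __ATOMIC_RELEASE) == 2)
	{
		futex_wake(&lockWord, 1);
	}
}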

Futexes can be used to build other waiting mechanisms beyond locks. A barrier is perhaps the simplest: it blocks every thread until all threads have reached the same point in the code. Listing 26 shows how a barrier might be implemented, using the std::atomic wrapper around futexes.

/// Thread entry point.
__cheri_compartment("barrier") void entry()
{
	static std::atomic<uint32_t> barrier = 2;
	printf("Thread: %d arrived at barrier\n",
	       thread_id_get());
	uint32_t value = --barrier;
	if (value == 0)
	{
		barrier.notify_all();
	}
	else
	{
		while (value != 0)
		{
			barrier.wait(value);
			value = barrier;
		}
	}
	printf("Thread: %d passed barrier\n", thread_id_get());
}

Listing 26. Implementing a barrier with a futex. examples/barrier/barrier.cc

The counter is set to the number of threads (two in this case). When a thread arrives at the barrier, it decrements the counter. Note that the decrement operator on std::atomic is an atomic decrement, so exactly one thread will take the counter to zero. The thread that takes the counter to zero then notifies any waiting threads to wake. Other threads, as they arrive, will block in the wait method, which is a thin wrapper around futex_wait.

The retry loop in Listing 26 is important. Imagine that one thread decrements the counter to one and is then preempted. Then another thread decrements it to zero and does the notify_all (futex_wake) call. The first thread, when it resumes, will try to wait on the futex, but the value has now changed. Fortunately, futex_wait takes an expected value and so, if the futex word does not hold this value, the call returns immediately.

If this example had three threads, a similar race could occur in the opposite direction. If the first thread decremented the counter from three to two and was then preempted by a second thread, which decremented it to one, the first thread would then call futex_wait with an expected value of two, but the value would now be one. The call would return immediately, because the value does not match. The thread would then reload the value, note that it is not zero, and retry waiting. The futex APIs are designed to allow this combination of atomic operation and wait to work for any interleaving.

Try modifying this example by adding more threads (don't forget to increase the initial value of the counter!).

6.6. Building locks from futexes

The locks library provides a set of futex-based locks. The locks.h header exposes the interface to this library.

Ticket locks
provide guaranteed FIFO semantics for waiters.
Flag locks
are simple locks that wake waiters in the order of their thread priorities. These can optionally provide priority inheritance (see Section 6.7. Inheriting priorities).
Recursive mutexes
wrap a priority-inheriting flag lock and allow the same thread to acquire a lock multiple times.
Semaphores
provide a counting semaphore abstraction.

C++ users may prefer to use the wrappers provided in locks.hh, which implement a uniform interface for different lock types. This header also defines a NoLock class that provides the same interface but does not do any locking so generic data structures can be implemented with and without locking.

You can see the difference between the lock types by running the locking example. This uses a lock declared at the top of the file, as shown in Listing 27. The uncommented version uses a flag lock; you can change it to a ticket lock by commenting out the first declaration and uncommenting the second.

// Comment out this line and uncomment
// the next one to see how ticket locks
// behave.
FlagLock lock;
// TicketLock lock;

Listing 27. Declaring a lock in C++. examples/locking/locking.cc

The rest of this file contains three versions of the function shown in Listing 28. One of these is defined for each of three threads, which are created with three different priorities. The do_useful_work function in the example just calls thread_sleep with a timeout of one second, but in real code would be doing the work that must be done with the lock held.

__cheri_compartment("locking") void low()
{
	while (true)
	{
		lock.lock();
		printf("Low priority thread "
		       "acquired lock\n");
		do_useful_work();
		lock.unlock();
	}
}

Listing 28. The low-priority thread entry point for the locking example. examples/locking/locking.cc

When you run this example, you should see output like this:

High priority thread acquired lock
High priority thread acquired lock
High priority thread acquired lock
High priority thread acquired lock
High priority thread acquired lock

This is an example of starvation. The low- and medium-priority threads are both waiting for the lock, but the high-priority thread is not yielding between releasing and reacquiring the lock and so they are never woken.

If you change this to a ticket lock, the output will change to this:

High priority thread acquired lock
Medium priority thread acquired lock
Low priority thread acquired lock
High priority thread acquired lock
Medium priority thread acquired lock
Low priority thread acquired lock

Ticket locks are modelled after the ticket dispensers that are used to manage queues of people. When you arrive, you take the ticket with the next number. There is a display showing the current number and you wait until the display matches your number. A ticket lock is implemented with two counters. The first counter is used to assign tickets. When you try to acquire the lock, you atomically fetch-and-increment the counter. The result of the fetch is your ticket number. You then wait until the second counter is equal to your ticket number. When you release the lock, you increment the second counter. This is built from futexes by doing a futex-wait on the counter word to block acquiring the lock and a futex-wake when you release the lock.
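The following is a minimal sketch of this construction, again assuming the futex_wait wrapper; the TicketLock in the locks library is a more complete version:

#include <futex.h>
#include <stdint.h>

struct ToyTicketLock
{
	uint32_t next    = 0; // The next ticket to hand out.
	uint32_t current = 0; // The ticket now being served.

	void lock()
	{
		// Take a ticket.
		uint32_t ticket =
		  __atomic_fetch_add(&next, 1, __ATOMIC_ACQUIRE);
		uint32_t now;
		// Wait until our number comes up.
		while ((now = __atomic_load_n(&current, __ATOMIC_ACQUIRE)) !=
		       ticket)
		{
			// Sleep only if current still holds the value we just
			// read; otherwise recheck immediately.
			futex_wait(&current, now);
		}
	}

	void unlock()
	{
		__atomic_add_fetch(&current, 1, __ATOMIC_RELEASE);
		// Wake all waiters; each rechecks its ticket and all but
		// the next in line go back to sleep.
		futex_wake(&current, UINT32_MAX);
	}
};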

In this example, the high-priority thread runs first and acquires the lock. Next, while the high-priority thread sleeps inside do_useful_work, the medium-priority thread tries to acquire the lock and blocks. Before blocking, it takes a ticket and so is next in line for the lock. The low-priority thread then does the same. When the high-priority thread resumes and releases the lock, it takes a new ticket to reacquire it and is now queued behind the other two threads.

Each thread runs in this version, but a high-priority thread is blocked waiting for two lower-priority threads to proceed, which may not be desirable.

6.7. Inheriting priorities

Simple futex-based locks are vulnerable to priority inversion, where a high-priority thread is unable to make progress because a lock is held by a lower-priority thread. We saw an example of this with ticket locks but the same is also possible with flag locks, as we'll see in the priority_inheritance example.

This starts with the high-priority thread shown in Listing 29 running. It sits in a loop, first yielding for up to a second to allow other threads to run, then trying to acquire a lock. This example uses LockGuard, a simple RAII wrapper around the RTOS's locks. The constructor variant here with two arguments takes a timeout in addition to the lock. Lock guards are convertible to bool and convert to true if and only if the lock is acquired. This allows the thread to print whether the lock was acquired or the wait timed out.

__cheri_compartment("priority_"
                    "inheritance") void high()
{
	// Let the low and
	// medium-priority threads start
	Timeout t(MS_TO_TICKS(1000));
	thread_sleep(&t);
	while (true)
	{
		t = Timeout(MS_TO_TICKS(1000));
		if (LockGuard g{lock, &t})
		{
			printf("High-priority thread acquired the lock!\n");
		}
		else
		{
			printf("High-priority thread failed to acquire the "
			       "lock!\n");
		}
	}
}

Listing 29. A high-priority thread that will be starved. examples/priority_inheritance/priority_inheritance.cc

While this thread yields, the thread shown in Listing 30 runs. This first yields and then loops forever. The infinite loop in this example is a placeholder for anything that does long-running work and does not yield.

__cheri_compartment("priority_"
                    "inheritance") void medium()
{
	// Let the low-priority thread run
	// until it yields
	Timeout t(MS_TO_TICKS(1000));
	thread_sleep(&t);
	printf("Medium priority thread entering infinite loop "
	       "and not yielding\n");
	while (true)
	{
		x++;
	}
}

Listing 30. A medium-priority thread that will starve a high-priority thread. examples/priority_inheritance/priority_inheritance.cc

While the medium-priority thread yields, the thread shown in Listing 31 runs. This acquires the lock and also yields. This thread is the lowest priority and so, even without the explicit yield, a similar thread in a more complex program could easily be preempted with the lock held. The yield simply forces it to happen every time, to trigger the problem.

Prior to C++26, infinite loops are undefined behaviour in C++ if they do not have any side effects. The x++ in this loop is simply incrementing an atomic integer to make sure that this loop is not optimised away.

__cheri_compartment("priority_"
                    "inheritance") void low()
{
	while (true)
	{
		lock.lock();
		printf("Low-priority thread acquired the lock\n");
		Timeout t(MS_TO_TICKS(500));
		thread_sleep(&t, ThreadSleepNoEarlyWake);
		printf("Low-priority thread releasing the lock\n");
		lock.unlock();
	}
}

Listing 31. A low-priority thread that will be preempted with a lock held. examples/priority_inheritance/priority_inheritance.cc

After the low-priority thread yields (or is preempted), the medium-priority thread resumes and loops forever (or, at least, does something that doesn't yield). The low-priority thread cannot run, because a higher-priority thread is runnable. When the high-priority thread tries to acquire the lock, it sees that the lock is already held and blocks waiting for it to be released. Unfortunately, the thread that could release the lock never runs and so you see output like this:

Low-priority thread acquired the lock
Low-priority thread releasing the lock
Low-priority thread acquired the lock
Medium priority thread entering infinite loop and not yielding
High-priority thread failed to acquire the lock!
High-priority thread failed to acquire the lock!
High-priority thread failed to acquire the lock!

This example is quite easy to debug, because the starvation is total: the high-priority thread never makes progress. If the medium-priority thread yielded sometimes, this would be much worse because the high-priority thread would make progress but at a far slower rate than its priority would imply.

Priority inheritance is the solution to this kind of problem. With priority inheritance, a thread that blocks on a lock held by a lower-priority thread will temporarily loan its priority to the thread that owns the lock. This allows the lower-priority thread to run and release the lock. If you change the FlagLock in the example to a FlagLockPriorityInherited, you will see this output:

Low-priority thread acquired the lock
Low-priority thread releasing the lock
Low-priority thread acquired the lock
Medium priority thread entering infinite loop and not yielding
Low-priority thread releasing the lock
High-priority thread acquired the lock!
High-priority thread acquired the lock!
High-priority thread acquired the lock!

Now, while the medium-priority thread is running, the high-priority thread wakes and tries to acquire the lock. The lock is held, and so the high-priority thread loans its priority to the thread that owns it. This allows the low-priority thread to run in preference to the medium-priority one. The low-priority thread runs and releases the lock, allowing the high-priority thread to resume and acquire the lock.

The futex APIs implement this by storing the thread ID of the owning thread in the bottom 16 bits of the futex word and passing FutexPriorityInheritance to the flags argument in the wait call. The specified thread will have its priority set to the highest priority of any of the waiting threads. The priority boost lasts until the waiters time out or the boosted thread releases the lock, whichever happens first. A single thread can hold multiple priority-inheriting locks and receive priority boosts from all of them. Most of the time, you will not use this directly and will instead use the priority-inheriting lock APIs from C or their C++ wrappers.
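As a sketch of the underlying mechanism (not the API you would normally use), a priority-inheriting acquire might look like the following. Because this sketch has no contended flag, the matching unlock would have to store zero and unconditionally call futex_wake:

#include <futex.h>
#include <stdint.h>
#include <thread.h>
#include <timeout.h>

// The owner's thread ID lives in the low 16 bits of the lock
// word, so the scheduler knows which thread to boost.
static uint32_t priorityLockWord = 0;

bool toy_pi_lock(Timeout *timeout)
{
	uint32_t self     = thread_id_get();
	uint32_t expected = 0;
	while (!__atomic_compare_exchange_n(&priorityLockWord,
	                                    &expected,
	                                    self,
	                                    false,
	                                    __ATOMIC_ACQUIRE,
	                                    __ATOMIC_RELAXED))
	{
		// Sleep until the word changes, lending our priority to
		// the owner named in its low 16 bits.
		if (futex_timed_wait(timeout,
		                     &priorityLockWord,
		                     expected,
		                     FutexPriorityInheritance) != 0)
		{
			return false; // Timed out or invalid arguments.
		}
		expected = 0;
	}
	return true;
}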

The priority inheritance mechanism can also be used to build asymmetric locks. These have a fast path that doesn't do any cross-compartment calls and a slow path that does. You can find one example of this in the hazard pointer mechanism for short-lived claims. This must detect when a thread has tried to add a hazard pointer while the allocator is scanning the list, without slowing down the allocator. Before reading the list, the allocator increments the top 16 bits of the futex word and sets the low 16 to the thread ID performing the operation. Threads updating the hazard set check the futex word before and after updating the list. If the top 16 bits have changed, they know that the allocator has scanned the list and they must retry. If the top 16 bits contain an odd value, the allocator is currently scanning the list and they must wait. They can do a priority-inheriting wait with a one-tick timeout even though the allocator will not ever call futex_wake. They will yield for one tick, boosting the priority of the thread that's currently in the allocator, but then resume at the end of the tick.

6.8. Securing futexes

Most of the time you will want to use futexes (and the locks that wrap them) to synchronise operations within a single compartment. Futex-based locks rely on the contents of the lock word being valid. For example, if a flag lock is directly accessible by two mutually distrusting compartments, one can write an invalid value to the word and either prevent the other from waking waiters or cause it to spuriously believe that it has acquired the lock.

This is not normally a limitation because locks typically protect some data structure or other resource that should not be concurrently mutated by multiple threads. Providing mutable views of such a structure to multiple compartments is almost certainly a security vulnerability, even without attacks on the futex.

There is one situation where futexes are safe to share across compartment boundaries. If you have a component that others trust for availability, it can share read-only views of a futex to allow waiting for an out-of-band event. The scheduler does this for interrupts (see Chapter 9. Writing a device driver), allowing threads to use the futex wait operation to block until an interrupt is ready.

6.9. Using event groups

The event_group library provides an event group API that is primarily intended for porting code written against FreeRTOS's event groups APIs. The event.h header exposes the interface to this library. These APIs do not have a clear trust model and so should be avoided in new code that is not ported from FreeRTOS. You can build more convenient interfaces atop futexes for most synchronisation operations. You may also simply use multiple futexes and the multiwaiter API (see Section 6.11. Waiting for multiple events) to wait for multiple events.

An event group is a set of up to 24 event bits that can be set or cleared independently. Waiters can wait for any or all of an arbitrary subset of these bits.

Event groups are created with the eventgroup_create function. This returns an opaque handle to the event group, which can be used for setting, clearing, or waiting on events.

Documentation for the eventgroup_create function
int eventgroup_create(struct Timeout * timeout, struct SObjStruct * heapCapability, struct EventGroup * * outGroup)

Create a new event group, allocated using heapCapability. The event group is returned via outGroup.

This returns zero on success. Otherwise, it returns a negative error code: -ETIMEDOUT if the timeout expires, or -ENOMEM if memory cannot be allocated.

Note that, because this allocates memory, it requires an allocation capability. See Chapter 7. Memory management in CHERIoT RTOS for more information about what this means.

You can then use eventgroup_set and eventgroup_clear to set and clear some or all of the event flags in this group. Both of these calls return the old values of the bits.

Documentation for the eventgroup_set function
int eventgroup_set(Timeout * timeout, struct EventGroup * group, uint32_t * outBits, uint32_t bitsToSet)

Set one or more bits in an event group. The bitsToSet argument contains the bits to set. Any thread waiting with eventgroup_wait will be woken if the bits that it is waiting for are set.

This returns zero on success. If the timeout expires before this returns then it returns -ETIMEDOUT.

Independent of success or failure, outBits will be used to return the set of currently set bits in this event group.

Documentation for the eventgroup_clear function
int eventgroup_clear(Timeout * timeout, struct EventGroup * group, uint32_t * outBits, uint32_t bitsToClear)

Clear one or more bits in an event group. The bitsToClear argument contains the set of bits to clear. This does not wake any threads.

This returns zero on success. If the timeout expires before this returns then it returns -ETIMEDOUT.

Independent of success or failure, outBits will be used to return the set of currently set bits in this event group.

You can then subsequently wait for some of the events to be set with the eventgroup_wait function. This takes a set of events to wait for and can wait until either any or all of them are set.

Documentation for the eventgroup_wait function
int eventgroup_wait(Timeout * timeout, struct EventGroup * group, uint32_t * outBits, uint32_t bitsWanted, _Bool waitForAll, _Bool clearOnExit)

Wait for events in an event group. The bitsWanted argument must contain at least one bit set in the low 24 bits (and none in the high bits). This indicates the specific events to wait for. If waitForAll is true then all of the bits in bitsWanted must be set in the event group before this returns. If waitForAll is false then any of the bits in bitsWanted being set in the event group will cause this to return.

If this returns zero then outBits will contain the bits that were set at the time that the condition became true. If this returns -ETIMEDOUT then outBits will contain the bits that were set at the time that the timeout expired.

Note: waitForAll requires all bits to be set *at the same time*. This makes it trivial to introduce race conditions if used with multiple waiters and clearOnExit, or if different threads clear different bits in the waited set.

If clearOnExit is true and this returns successfully then the bits in bitsWanted will be cleared in the event group before this returns.

This call can also atomically clear the bits that you've waited on, giving them edge-triggered behaviour.
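The following is a minimal sketch of one thread signalling another with an event group, assuming the default allocation capability is available as MALLOC_CAPABILITY (see Chapter 7. Memory management in CHERIoT RTOS):

#include <event.h>

static struct EventGroup         *group;
static constexpr uint32_t DataReadyBit = 1 << 0;

void events_setup()
{
	Timeout t{MS_TO_TICKS(100)};
	eventgroup_create(&t, MALLOC_CAPABILITY, &group);
}

void producer()
{
	uint32_t bits;
	Timeout  t{MS_TO_TICKS(100)};
	// Signal that data are ready; wakes any waiter on this bit.
	eventgroup_set(&t, group, &bits, DataReadyBit);
}

void consumer()
{
	uint32_t bits;
	Timeout  t{MS_TO_TICKS(1000)};
	// Wait for the bit and clear it on the way out, giving the
	// event edge-triggered behaviour.
	if (eventgroup_wait(
	      &t, group, &bits, DataReadyBit, false, true) == 0)
	{
		// bits holds the flags that were set when the wait
		// succeeded.
	}
}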

6.10. Sending messages

A message queue is a FIFO capable of storing a fixed number of fixed-sized entries. There are two distinct use cases for message queues: communicating between endpoints in the same trust domain, and communicating between endpoints that do not trust each other.

In the first case, the endpoints are in the same trust domain. The message_queue_library library provides a simple message queue API that is intended for this use case. When the endpoints are in different trust domains, the endpoints must be protected from tampering. The message_queue compartment wraps the library in a compartment that exposes an almost identical interface to the library but with the endpoints exposed as (tamper-proof) sealed capabilities.

Queues for use within a single compartment are created with queue_create, which allocates the ring buffer and returns a pointer to the structure. This is a struct MessageQueue and callers are at liberty to look inside it directly; there is no expectation that it is protected from the caller. The functions exposed by the library are (by their nature as shared-library functions) shared between any compartments that use the library, but this is a code-size reduction exercise, not a security boundary.

Documentation for the queue_create function
int queue_create(Timeout * timeout, struct SObjStruct * heapCapability, struct MessageQueue * * outQueue, size_t elementSize, size_t elementCount)

Allocates space for a queue using heapCapability and stores a handle to it via outQueue.

The queue has space for elementCount entries. Each entry is a fixed size, elementSize bytes.

Returns 0 on success, -ENOMEM on allocation failure, and -EINVAL if the arguments are invalid (for example, if the requested number of elements multiplied by the element size would overflow).

Queues can be freed simply with heap_free, but doing so may result in deadlocks. If a thread is blocked trying to send to or receive from a queue then it will remain blocked if the queue is freed out from underneath it. The queue_destroy function avoids this by waking all threads. Other threads may then trap immediately after they return and try to read from the queue's counters, but at least this is recoverable (see Section 5.9. Handling errors).

Documentation for the queue_destroy function
int queue_destroy(struct SObjStruct * heapCapability, struct MessageQueue * handle)

Destroys a queue. This wakes up all threads waiting to produce or consume, and makes them fail to acquire the lock, before deallocating the underlying allocation.

Returns 0 on success. This can fail only if deallocation would fail and will, in these cases, return the same error codes as heap_free.

This function will check the heap capability first and so will avoid upgrading the locks if freeing the queue would fail.

Messages are then sent with queue_send and received with queue_receive. These are blocking calls (if permitted to block by a non-zero timeout) that send or receive a single message.

Documentation for the queue_send function
int queue_send(Timeout * timeout, struct MessageQueue * handle, const void * src)

Send a message to the queue specified by handle. This expects to be able to copy the number of bytes specified by elementSize when the queue was created from src.

Returns 0 on success. On failure, returns -ETIMEDOUT if the timeout was exhausted, or -EINVAL on invalid arguments.

This is expected to be called with a valid queue handle; it does not validate that the handle is correct. It uses safe_memcpy and so will check the buffer.

Documentation for the queue_receive function
int queue_receive(Timeout * timeout, struct MessageQueue * handle, void * dst)

Receive a message over a queue specified by handle. This expects to be able to copy the number of bytes specified by elementSize. The message is copied to dst, which must have sufficient permissions and space to hold the message.

Returns 0 on success, -ETIMEDOUT if the timeout was exhausted, or -EINVAL on invalid arguments.

The library interfaces to queues are not intended to be robust in the presence of malicious callers. They run in the same security context as the caller and so a caller may abuse them to corrupt its own state. They do aim to be robust with respect to the source or destination buffer for sending and receiving messages being invalid or concurrently deallocated. This robustness is implemented using the scoped error handling and so requires calling compartments to link the relevant error handler, as documented in Section 5.11. Using scoped error handling.

You can probe the number of messages in a queue with queue_items_remaining.

Documentation for the queue_items_remaining function
int queue_items_remaining(struct MessageQueue * handle, size_t * items)

Returns the number of items in the queue specified by handle via items.

Returns 0 on success. This has no failure mechanisms, but is intended to have the same interface as the version that operates on a sealed queue handle.

Note: This interface is inherently racy. The number of items in the queue may change in between the return of this function and the caller acting on the result.
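Putting the library calls together, the following is a minimal sketch of a producer and consumer within one trust domain, assuming that the queue functions are declared in queue.h and that MALLOC_CAPABILITY is available. The Message type is purely illustrative:

#include <queue.h>

struct Message
{
	uint32_t kind;
	uint32_t payload;
};

static struct MessageQueue *queue;

void queue_setup()
{
	Timeout t{MS_TO_TICKS(100)};
	// A queue with space for 16 fixed-size entries.
	queue_create(&t, MALLOC_CAPABILITY, &queue, sizeof(Message), 16);
}

void producer()
{
	Message m{1, 42};
	Timeout t{MS_TO_TICKS(100)};
	// Blocks for up to the timeout if the queue is full.
	queue_send(&t, queue, &m);
}

void consumer()
{
	Message m;
	Timeout t{MS_TO_TICKS(100)};
	// Blocks for up to the timeout if the queue is empty.
	if (queue_receive(&t, queue, &m) == 0)
	{
		// Use m here.
	}
}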

If you are passing messages between compartments, you should use the versions of these functions with the _sealed suffix. These are provided by the compartmentalised version and so ensure that the queue's internal state is not mutated by its callers.

The queue_create_sealed function creates a queue in exactly the same way as queue_create (and has the same argument structure) but returns a sealed pointer to it. Nothing outside of the queue compartment can unseal this, so the queue is protected against tampering. The queue may still contain malicious or malformed data, but you have guarantees that messages will arrive in order and that they won't be tampered with in flight.

Documentation for the queue_create_sealed function
int queue_create_sealed(Timeout * timeout, struct SObjStruct * heapCapability, struct SObjStruct * * outQueue, size_t elementSize, size_t elementCount)

Allocate a new message queue that is managed by the message queue compartment. The resulting queue handle (returned in outQueue) is a sealed capability to a queue that can be used for both sending and receiving.

This queue can be destroyed by calling queue_destroy_sealed. You cannot free an object pointed to by a sealed capability unless you also have the capability that authorises unsealing. This means that, unlike the unsealed version, you cannot destroy a queue created with queue_create_sealed except by calling the correct destroy function.

Documentation for the queue_destroy_sealed function
int queue_destroy_sealed(Timeout * timeout, struct SObjStruct * heapCapability, struct SObjStruct * queueHandle)

Destroy a queue handle. If this is called on a restricted endpoint (returned from queue_receive_handle_create_sealed or queue_send_handle_create_sealed), this frees only the handle. If called with the queue handle returned from queue_create_sealed, this will destroy the queue.

The corresponding send and receive functions are identical to their library counterparts, but take sealed queue handles. Sometimes, it's useful to be able to give one compartment the ability to write to a queue and another the ability to read. The queue compartment provides two APIs that let you allocate a handle that is authorised for only sending or receiving, queue_receive_handle_create_sealed and queue_send_handle_create_sealed.

Documentation for the queue_receive_handle_create_sealed function
int queue_receive_handle_create_sealed(struct Timeout * timeout, struct SObjStruct * heapCapability, struct SObjStruct * handle, struct SObjStruct * * outHandle)

Convert a queue handle returned from queue_create_sealed into one that can be used *only* for receiving.

Returns 0 on success and writes the resulting restricted handle via outHandle. Returns -ENOMEM on allocation failure or -EINVAL if the handle is not valid.

Documentation for the queue_send_handle_create_sealed function
int queue_send_handle_create_sealed(struct Timeout * timeout, struct SObjStruct * heapCapability, struct SObjStruct * handle, struct SObjStruct * * outHandle)

Convert a queue handle returned from queue_create_sealed into one that can be used *only* for sending.

Returns 0 on success and writes the resulting restricted handle via outHandle. Returns -ENOMEM on allocation failure or -EINVAL if the handle is not valid.

With these APIs, you can have an initial setup compartment that creates a queue and then hands the send and receive endpoints to two others. This supports mutual distrust: neither holder of a send or receive handle can free the queue, the handles cannot be used for the opposite operation, and the queue compartment protects the integrity of the queue itself.

In many cases, the trust relationship may be asymmetric. For example, a compartment may provide a queue that other untrusted compartments can send messages to, but the senders trust the receiving compartment. The APIs also support this asymmetric use case, where the trusted compartment keeps the original handle but uses queue_send_handle_create_sealed to create handles to pass to other compartments.
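A minimal sketch of such a setup step, reusing the illustrative Message type from the earlier sketch and again assuming MALLOC_CAPABILITY (error handling for the handle-creation calls elided):

// Run in the setup compartment: create the queue, then derive
// restricted endpoints to hand to two other compartments.
void queue_setup_endpoints()
{
	Timeout            t{MS_TO_TICKS(100)};
	struct SObjStruct *queue;
	if (queue_create_sealed(
	      &t, MALLOC_CAPABILITY, &queue, sizeof(Message), 16) != 0)
	{
		return;
	}
	struct SObjStruct *sendHandle;
	struct SObjStruct *receiveHandle;
	queue_send_handle_create_sealed(
	  &t, MALLOC_CAPABILITY, queue, &sendHandle);
	queue_receive_handle_create_sealed(
	  &t, MALLOC_CAPABILITY, queue, &receiveHandle);
	// Pass sendHandle to the producer compartment and
	// receiveHandle to the consumer. Keep the original queue
	// handle if this compartment should retain the right to
	// destroy the queue.
}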

6.11. Waiting for multiple events

The multiwaiter API allows waiting for any of a set of independent events. It is conceptually similar to select, poll, epoll, and kqueue in *NIX operating systems or WaitForMultipleObjects in Windows. It is designed to bound the amount of time that the scheduler must spend checking multiwaiters and to minimise the amount of memory that multiwaiters consume. Memory is allocated only when a multiwaiter is created, with multiwaiter_create. This creates a multiwaiter with space for a fixed number of events.

Documentation for the multiwaiter_create function
int multiwaiter_create(Timeout * timeout, struct SObjStruct * heapCapability, struct MultiWaiter * * ret, size_t maxItems)

Create a multiwaiter object. This is a stateful object that can wait on at most maxItems event sources.

Each multiwaiter_wait call is a one-shot operation. The call is passed a set of things to wait for, with an associated condition for each, via the events array, and returns the waited status via the same array. This is typically an on-stack array.

Documentation for the multiwaiter_wait function
int multiwaiter_wait(Timeout * timeout, struct MultiWaiter * waiter, struct EventWaiterSource * events, size_t newEventsCount)

Wait for events. The first argument is the multiwaiter to wait on. New events can optionally be added by providing an array of newEventsCount elements as the events argument.

Return values:

  • On success, this function returns 0.
  • If the arguments are invalid, this function returns -EINVAL.
  • If the timeout is reached without any events being triggered then this returns -ETIMEDOUT.

The multiwaiter can natively wait only for futex notifications but higher-level mechanisms are built out of futexes. For example, if you wish to wait for a message queue (see Section 6.10. Sending messages) to be ready to send, you can call multiwaiter_queue_receive_init to initialise a multiwaiter event with the queue's receive counter and expected value. This event will then fire if the queue becomes non-full. The normal caveats about race conditions apply: the queue may become full again if another thread sends messages in between your receiving the notification and sending a message.
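As a sketch, waiting for either of two futex words to change might look like the following. The EventWaiterSource field layout shown here (an event-source pointer followed by a value) is an assumption based on current CHERIoT RTOS headers rather than something specified above, as is the multiwaiter.h header name:

#include <multiwaiter.h>

static uint32_t futexA;
static uint32_t futexB;

void wait_for_either()
{
	struct MultiWaiter *mw;
	Timeout             t{MS_TO_TICKS(100)};
	if (multiwaiter_create(&t, MALLOC_CAPABILITY, &mw, 2) != 0)
	{
		return;
	}
	// Each source pairs a futex address with the value that
	// means 'keep sleeping', as with futex_timed_wait.
	struct EventWaiterSource events[2] = {
	  {&futexA, 0},
	  {&futexB, 0},
	};
	Timeout waitTimeout{MS_TO_TICKS(1000)};
	if (multiwaiter_wait(&waitTimeout, mw, events, 2) == 0)
	{
		// At least one of the futex words now differs from the
		// value given above; the array reports which events fired.
	}
}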