6. Communicating between threads

CHERIoT RTOS provides threads as a core abstraction. Threads run until they either yield or are preempted via an interrupt, and then later resume from the same point. There are a small number of scheduler APIs that allow threads to block; higher-level APIs from other compartments may block by indirectly invoking them.

Remember that, in most respects, the scheduler is just another compartment. It doesn't run with elevated privileges. It makes a decision about which thread to run next, but it is not able to see the stacks or register states associated with threads. When another thread calls into the scheduler (for example, to sleep or to wake other threads), the call is just like any other cross-compartment call and uses the same mechanism.

6.1. Defining threads

Threads in CHERIoT RTOS cannot be dynamically created. Creating threads at run time would require allocating stacks at run time. The no-capture guarantees that CHERIoT RTOS enforces are based on the guarantee that no code other than the loader has access to two stacks' memory at a time and so the switcher can zero stacks and avoid leaks. The only store-local capabilities that a thread ever has access to are derived from its current stack. Allowing stack creation would violate that: at least the memory allocator would have access to multiple stacks at once. It would be possible to allocate stacks from a separate pool, but that's not really different from allocating stacks up front and having a small compartment that switches them from one use to another or implements a thread pool. There is an example thread pool in lib/thread_pool in the RTOS SDK, with its interface defined in the thread_pool.h header, that you can either use directly or use as inspiration for your own design, if you want to create new thread-like contexts.

If a thread entry point returns then the thread exits but, importantly, this does not free the thread's stack or trusted stack. The memory owned by a thread is assigned to that thread as long as the system runs.

Threads in CHERIoT RTOS are constructed with four properties:

The size of their stack.
The size of their trusted stack.
Their priority.
The entry point where they start executing.

The stack size means the same as on any other platform. Specifically on CHERIoT, the stack-pointer capability will be bounded to this size (rounded up if necessary for alignment) and any overflow of the stack, even by a single byte, will trap. The trusted stack size is the maximum number of cross-compartment calls that this thread can do. Each cross-compartment call invokes the switcher, which pushes a new frame onto the trusted stack describing where to return.

In the current version, each trusted stack frame is three capabilities (24 bytes). A larger trusted stack does not make much difference to total memory consumption.

The priority of threads matters only in relative terms. Like FreeRTOS (and unlike UNIX), higher numbers mean higher priorities. The scheduler has some data structures whose size depends on the number of priorities, so compiling with fewer priorities can make the scheduler smaller.

The entry point property is a compartment's entry point. It must be exposed as described in Chapter 5. Compartments and libraries. Thread entry points take no arguments and return no arguments.

On most other systems, thread creation functions take a pointer. This does not make sense for threads that are not dynamically created because there is no context for their creation.

6.2. Identifying the current thread

You will sometimes need to know which thread is currently running. This can be for something as simple as debugging but may also be needed for maintaining per-thread data structures. The ID of each thread is stored in the register save area for that thread and the switcher exposes a library call (thread_id_get) to read it.

Documentation for the thread_id_get function

uint16_t thread_id_get()

Return the thread ID of the current running thread. This is mostly useful where one compartment can run under different threads and it matters which thread entered this compartment.

User threads (that is, those defined in the xmake firmware configuration) are 1-indexed, with 0 indicating primordial idle and scheduling contexts. User code never runs in these contexts and so anything using this result to index into a per-thread array may wish to subtract one and avoid allocating an array element for the idle thread.

This is implemented in the switcher.

Thread IDs start at one (not zero!) because zero is used to indicate the idle thread and so is never visible. The thread_count function returns the number of threads that have been created in the system. This is not decremented when threads exit, so it provides the upper bound on the number of threads that may exist. This can be used to size data structures that are indexed by thread ID.

Documentation for the thread_count function

uint16_t thread_count()

Returns the number of user threads (that is, those defined in the xmake firmware configuration), including threads that have exited.

This API never fails, but if the trusted stack is exhausted and it cannot be called then it will return -1. Callers that have not probed the trusted stack should check for this value.

The result of this is safe to cache because it will never change over time.

Somewhat counter intuitively, thread_id_get is faster than thread_count to call. The former is a libcall that that switcher implements, the latter is a cross-compartment call into the scheduler. This is not normally a problem because the result of thread_count does not change and can be cached in a compartment, whereas the result of thread_id_get depends on the current thread and cannot be safely cached.

The current_thread example shows calling these functions. The entry point function for this is shown in Listing 44 and the thread definitions from the xmake.lua file in Listing 45.

/// Thread entry point.
void __cheriot_compartment("current") entry()
{
	for (int i = 0; i < 2; i++)
	{
		printf("Current thread: %d of %d (iteration %d)\n",
		       thread_id_get(),
		       thread_count(),
		       i);
	}
}

Listing 44. A simple example that prints the current threadexamples/current_thread/current.cc

-- Threads to select
		target:values_set("threads", {
			{
				compartment = "current",
				priority = 1,
				entry_point = "entry",
				stack_size = 0x400,
				trusted_stack_frames = 2
			},
			{
				compartment = "current",
				priority = 2,
				entry_point = "entry",
				stack_size = 0x400,
				trusted_stack_frames = 2
			}
		}, {expand = false})

Listing 45. The thread definitions for the current-thread exampleexamples/current_thread/xmake.lua

Note that the second thread has a higher priority than thread one. When you run this example, you should see output like this:

Current thread: 2 of 2 (iteration 0)
Current thread: 2 of 2 (iteration 1)
Current thread: 1 of 2 (iteration 0)
Current thread: 1 of 2 (iteration 1)

The higher-priority thread is running until it exits. Normally, a higher-priority thread would yield to allow another thread to run, as we'll see later in this chapter.

6.3. Limiting blocking with timeouts

Several RTOS APIs have timeouts. These are expressed as a pointer to a Timeout structure. This design is intended to allow a single timeout to be passed down a chain of operations.

Timeouts represent time spent blocking (yielding waiting to be runnable), not time spent running (doing useful work).

Timeouts measure time in scheduler ticks. A tick is a single scheduling quantum, which depends on the board configuration. This is the minimum amount of time for which it is plausible for a thread to sleep. If a thread sleeps, another thread becomes runnable and is then allowed to run (unless it also yields).

Although ticks exist as a unit of accounting, the CHERIoT RTOS scheduler is a tickless scheduler. Traditional schedulers schedule a timer interrupt at a fixed quantum and make a scheduling choice at each call. This can be inefficient because a high-priority thread will be routinely interrupted and then rescheduled (because it remains the highest-priority thread). A tickless scheduler avoids this and instead, before scheduling a thread, sets a timer interrupt to fire at the next point when another thread may be woken.

For example, consider the case where a high-priority thread sleeps for three ticks and a lower-priority thread runs. With a traditional scheduler, a timer interrupt will fire three times. Each time, the scheduler will do some accounting and then reschedule the lower-priority thread. In contrast, a tickless scheduler will configure the timer to fire once, after three ticks have elapsed. At that point, the high-priority thread is runnable and so will be scheduled.

The timeout structure captures the amount of time that is allowed to block and the number of ticks for which it has blocked. Each subsequent call that is passed the same timeout structure may increase the amount of slept time and decrease the remaining time. This means that a timeout is stateful. If you pass a pointer to the same timeout to multiple APIs, they will all consume the ticks that were set when the timeout structure was initialised. This is normally useful because it means that a function that calls multiple functions that block can have a timeout parameter and forward it to each of the functions that it calls, but it means that you can't just create one timeout at the start of your thread and keep using it (unless you initialise it to the unlimited timeout value).

A thread may block for more than the permitted limit if it is sleeping while a higher-priority thread runs. Only the highest-priority thread can make strong realtime guarantees in the presence of other runnable threads.

Functions that take a timeout should always expect it as the first argument. This allows it to be forwarded to subsequent calls trivially.

Timeouts may not be stored on the heap. Any function checking timeouts may refuse to accept a heap-allocated timeout. It is difficult to work with heap-allocated timeouts because they may be deallocated while the thread is sleeping, which would then cause it to crash upon updating the timeout structure.

6.4. Sleeping

Sleeping for a bounded number of ticks is the simplest form of blocking available. The thread_sleep call causes the caller to yield until a certain number of ticks have run.

Documentation for the thread_sleep function

int thread_sleep(struct Timeout * timeout, uint32_t flags)

Sleep for at most the specified timeout (see timeout.h).

The thread becomes runnable once the timeout has expired but a higher-priority thread may prevent it from actually being scheduled. The return value is a saturating count of the number of ticks that have elapsed.

A call of thread_sleep with a timeout of zero is equivalent to yield, but reports the time spent sleeping. This requires a cross-compartment call and return in addition to the overheads of yield and so yield should be preferred in contexts where the elapsed time is not required.

The flags parameter is a bitwise OR of ThreadSleepFlags.

A sleeping thread may be woken early if no other threads are runnable or have earlier timeouts. The thread with the earliest timeout will be woken first. This can cause a yielding thread to sleep when no other thread is runnable, but avoids a potential problem where a high-priority thread yields to allow a low-priority thread to make progress, but then the low-priority thread does a short sleep. In this case, the desired behaviour is not to wake the high-priority thread early, but to allow the low-priority thread to run for the full duration of the high-priority thread's yield.

If you are using thread_sleep to elapse real time, pass ThreadSleepNoEarlyWake as the flags argument to prevent early wakeups.

As with other calls that take a Timeout, the number of ticks that have elapsed during the call can be checked by reading the elapsed field of the timeout structure.

Sleeping in a system with an RTOS scheduler conflates two concepts:

Waiting for some time to elapse.
Allowing lower-priority threads to run.

The thread_sleep call supports both of these but understanding how they differ requires understanding a little of the scheduler's behaviour. Recall that CHERIoT RTOS has a tickless scheduler.

This means that, although it uses ticks as an abstraction for defining scheduling quanta, it does not schedule a regular timer interrupt. When two threads at the same priority level are runnable, the scheduler will request a timer interrupt to preempt the current one and switch to the other. If the running thread has no peers, the scheduler will allow it to run until either it yields or another higher or equal-priority thread's timeout expires. The tick abstraction remains as a convenient way of expressing time to the scheduler, but internally the scheduler tracks only elapsed cycles.

By default, if 0 is passed as the flags argument to thread_sleep, the sleep operation is treated as a yield. This is a way for the running thread to communicate to the scheduler that it is happy for other (lower or equal-priority) threads to run for up to the specified number of ticks. The scheduler may wake the yielding thread if no other thread is going to be runnable within that number of ticks. This allows high-priority threads to allow other threads to run, but continue using the CPU if no other thread is runnable.

In some cases, you really want to sleep. For example, if you're updating a clock display, you will want to run once a second or once a minute to update the display. The same applies if you're sending keep-alive packets or periodically monitoring some other component. In these cases you definitely have no useful work to do, irrespective of the state of any other threads that can or cannot run. You can pass ThreadSleepNoEarlyWake as the flags argument to thread_sleep to indicate that you really want to sleep.

You can see the effect of sleeping in the thread_sleep example, as shown in Listing 46. This is a modified version of the current_thread example from earlier, now sleeping in each loop iteration.

/// Thread entry point.
void __cheriot_compartment("current") entry()
{
	for (int i = 0; i < 2; i++)
	{
		printf("Current thread: %d of %d\n",
		       thread_id_get(),
		       thread_count());
		Timeout t{1};
		thread_sleep(&t);
	}
	printf("Cycles elapsed: %lld\n", rdcycle64());
}

Listing 46. A simple example of thread sleepingexamples/thread_sleep/current.cc

If you run this, you should see output that looks somewhat like this:

Current thread: 2 of 2
Current thread: 1 of 2
Current thread: 2 of 2
Current thread: 1 of 2
Cycles elapsed: 262193
Cycles elapsed: 265806

As before, thread two runs first, but then it yields and allows thread one to run. Thread one then yields and allows thread two to run, and so on. If thread one did not yield then it would be preempted after one tick.

Try modifying this example, adding ThreadSleepNoEarlyWake as a second argument to the thread_sleep call. You should now see output that looks very similar, but shows lower cycle counts at the end:

Current thread: 2 of 2
Current thread: 1 of 2
Current thread: 2 of 2
Current thread: 1 of 2
Cycles elapsed: 249233
Cycles elapsed: 252273

If you run this with the Sail simulator, do not be surprised if the cycle counts look very small. Sail is not a cycle-accurate model. The cycle count is guaranteed to be monotonic, but not to represent a real system in any way. The snippets in this section are using the Ibex simulator.

Here you see that the total execution time has gone from 265,806 cycles to 252,273. In the original version, when thread one slept (after doing far less than one tick's worth of work), there were no runnable threads so the scheduler does nothing for a while. Eventually, thread two (the high-priority thread) is runnable again and it resumes. In the version with ThreadSleepNoEarlyWake, thread two can resume as soon as thread one sleeps. Similarly, when thread two yields for the second time, thread one will resume.

6.5. Waiting for events with futexes

The scheduler exposes a set of futex APIs as a building block for various notification and locking mechanisms. Futex is a contraction of 'fast userspace mutex'. This does not quite apply on a CHERIoT system, where there is no userspace, but the core concept of avoiding a privilege transition on fast paths still applies.

A CHERIoT RTOS futex is a 32-bit word where the scheduler provides compare-and-sleep (futex_timed_wait) and notify (futex_wake) operations.

Documentation for the futex_timed_wait function

int futex_timed_wait(Timeout * ticks, const uint32_t * address, uint32_t expected, uint32_t flags)

Compare the value at address to expected and, if they match, sleep the thread until a wake event is sent with futex_wake or until the thread has slept for ticks ticks.

The value of ticks specifies the permitted timeout. See timeout.h for details.

The address argument must permit loading four bytes of data after the address.

The flags argument contains flags that may control the behaviour of the call. This is either FutexNone (zero) for the default behaviour or FutexPriorityInheritance if the low 16 bits should be treated as a thread ID for priority inheritance.

This returns:

0 on success: either *address and expected differ or a wake is received.
-EINVAL if the arguments are invalid.
-ETIMEOUT if the timeout expires.

Documentation for the futex_wake function

int futex_wake(uint32_t * address, uint32_t count)

Wakes up to count threads that are sleeping with futex_timed_wait on address.

The address argument must be a valid unsealed pointer with a length of at least four after the address but the scheduler does not require any explicit permissions. The scheduler never needs store access to the futex word. Removing store permission means that a compromised scheduler can cause spurious wakes but cannot tamper with the futex word. If, for example, the futex word is a lock then the scheduler can wake threads that are blocked on the lock but cannot release the lock and so cannot make two threads believe that they have simultaneously acquired the same lock.

The return value for a successful call is the number of threads that were woken. -EINVAL is returned for invalid arguments.

In C++, std::atomic<uint32_t> provides wait, notify_all, and notify_one methods that expose futex functionality and may be more convenient to call than the raw futex APIs. These include some additional (non-standard) overloads that expose more of the underlying futex functionality.

A futex allows you to use atomic operations on a 32-bit word for fast paths but then sleep and wake threads when they are blocked, rather than spinning. Anything that can be implemented with a spin-wait loop can usually be made more efficient with a futex.

For example, consider the simplest possible spinlock, which uses a single word containing a one to indicate locked and a zero to indicate unlocked. When you encounter a one, you sit in a loop doing an atomic compare-and-swap trying to replace a zero with a one. When this succeeds, you've acquired the lock.

On most operating systems running on single-core processors, you will sit in this loop until you exhaust your quantum, then a timer will fire and another thread will run. Your thread may be scheduled before the thread that owns the lock finishes, so you'll then spin for another quantum.

The first simple optimisation on this design is to yield in the spin loop. This will allow other threads to run but the waiting thread remains runnable and so may be rescheduled early. With an RTOS priority scheduler, if the thread that's waiting is a higher priority than the thread that owns the lock, the thread that owns the lock may never be scheduled.

A futex lets the waiting thread sleep. The futex_timed_wait call will compare the value in the futex word to the expected value (one, indicating locked, in this case) and, if they match, will send the thread to sleep and remain asleep until the thread owning the lock does a futex_wake call when unlocking.

A more complete futex-based lock uses three values in the lock word to differentiate between locked states with and without waiters. This allows the uncontended case to avoid any cross-compartment calls.

Futexes can be used to build other waiting mechanisms beyond locks. A barrier is the simplest primitive that you can build with a futex. This is a very simple primitive that blocks every thread until all threads have reached the same point in the code. Listing 47 shows how a barrier might be implemented, using the std::atomic wrapper around futexes.

/// Thread entry point.
__cheriot_compartment("barrier") void entry()
{
	static std::atomic<uint32_t> barrier = 2;
	printf("Thread: %d arrived at barrier\n",
	       thread_id_get());
	uint32_t value = --barrier;
	if (value == 0)
	{
		barrier.notify_all();
	}
	else
	{
		while (value != 0)
		{
			barrier.wait(value);
			value = barrier;
		}
	}
	printf("Thread: %d passed barrier\n", thread_id_get());
}

Listing 47. Implementing a barrier with a futexexamples/barrier/barrier.cc

The counter is set to the number of threads (two in this case). When a thread arrives at the barrier, it decrements the counter. Note that the decrement operator on std::atomic is an atomic decrement, so exactly one thread should take the counter to zero. The thread that set the counter to zero then notifies any waiting threads to wake. Other threads, as they arrive, will sit using the wait method, which is a thin wrapper around futex_wait.

The loop on line 23 is important. Imagine that one thread decrements the counter to one and is then preempted. Then another thread decrements it to zero and does the notify_all (futex_wake) call. The first thread, when it resumes will try to wait on the futex, but the value has now changed. Fortunately, futex_wait takes an expected value and so, if the futex word does not have this value then the call will resume immediately.

If this example had three threads, a similar race could occur in the opposite direction. If the first thread decremented from three to two and was then preempted by a second thread, it would call futex_wake with an expected value of two, but the value would now be one. It would then wake, because the value does not match. If this happens, it would reload the value, note that it is not zero, and retry waiting. The futex APIs are designed to allow this combination of wait and atomic operation to work for any interleaving.

Try modifying this example by adding more threads (don't forget to increase the initial value of the counter!).

6.6. Building locks from futexes

The locks library provides a set of futex-based locks. The locks.h header exposes the interface to this library.

Ticket locks: provide guaranteed FIFO semantics for waiters.
Flag locks: are simple locks that wake waiters in the order of their thread priorities. These can optionally provide priority inheritance (see Section 6.7. Inheriting priorities).
Recursive mutexes: wrap a priority-inheriting flag lock and allow the same thread to acquire a lock multiple times.
Semaphores: provide a counting semaphore abstraction.

C++ users may prefer to use the wrappers provided in locks.hh, which implement a uniform interface for different lock types. This header also defines a NoLock class that provides the same interface but does not do any locking so generic data structures can be implemented with and without locking.

You can see the difference between the lock types by running the locking example. This uses a lock declared at the top of the file, as shown in Listing 48. This is using a flag lock in the uncommented version. You can change it to using a ticket lock by commenting out the first declaration and uncommenting the second.

// Comment out this line and uncomment
// the next one to see how ticket locks
// behave.
FlagLock lock;
// TicketLock lock;

Listing 48. Declaring a lock in C++examples/locking/locking.cc

The rest of this file contains three versions of the function shown in Listing 49. One of these is defined for each of three threads, which are created with three different priorities. The do_useful_work function in the example just calls thread_sleep with a timeout of one second, but in real code would be doing the work that must be done with the lock held.

__cheriot_compartment("locking") void low()
{
	while (true)
	{
		lock.lock();
		printf("Low priority thread "
		       "acquired lock\n");
		do_useful_work();
		lock.unlock();
	}
}

Listing 49. The low-priority thread entry point for the locking exampleexamples/locking/locking.cc

When you run this example, you should see output like this:

High priority thread acquired lock
High priority thread acquired lock
High priority thread acquired lock
High priority thread acquired lock
High priority thread acquired lock

This is an example of starvation. The low- and medium-priority threads are both waiting for the lock, but the high-priority thread is not yielding between releasing and reacquiring the lock and so they are never woken.

If you change this to a ticket lock, the output will change to this:

High priority thread acquired lock
Medium priority thread acquired lock
Low priority thread acquired lock
High priority thread acquired lock
Medium priority thread acquired lock
Low priority thread acquired lock

Ticket locks are modelled after the ticket dispensers that are used to manage queues of people. When you arrive, you take the ticket with the next number. There is a display showing the current number and you wait until the display matches your number. A ticket lock is implemented with two counters. The first counter is used to assign tickets. When you try to acquire the lock, you atomically fetch-and-increment the counter. The result of the fetch is your ticket number. You then wait until the second counter is equal to your ticket number. When you release the lock, you increment the second counter. This is built from futexes by doing a futex_wait on the counter word to block acquiring the lock and a futex_wake when you release the lock.

In this example, the high-priority thread runs first and acquires the lock. Next, while it's yielding in the middle, the medium-priority thread tries to acquire the lock and blocks. Before blocking, it successfully receives a ticket and so is next in line for the lock. The low-priority thread then does the same. When the high-priority thread resumes, it also acquires a ticket and is now in the queue to run after the other two threads.

Each thread runs in this version, but a high-priority thread is blocked waiting for two lower-priority threads to proceed, which may not be desirable.

6.7. Inheriting priorities

Simple futex-based locks are vulnerable to priority inversion, where a high-priority thread is unable to make progress because a lock is held by a lower-priority thread. We saw an example of this with ticket locks but the same is also possible with flag locks, as we'll see in the priority_inheritance example.

This starts with the high-priority thread shown in Listing 50 running. This will sit in a loop, first yielding for up to a second to allow other threads to run, then trying to acquire a lock. This example is using LockGuard, a simple RAII wrapper around the RTOS' locks. The constructor variant here with two arguments takes a timeout in addition to the lock. Lock guards are convertible to bool and convert as true if and only if the lock is acquired. This allows the thread to print whether the lock was acquired, or it timed out waiting for the lock.

__cheriot_compartment("priority_"
                      "inheritance") void high()
{
	// Let the low and
	// medium-priority threads start
	Timeout t(MS_TO_TICKS(1000));
	thread_sleep(&t);
	while (true)
	{
		t = Timeout(MS_TO_TICKS(1000));
		if (LockGuard g{lock, &t})
		{
			printf("High-priority thread acquired the lock!\n");
		}
		else
		{
			printf("High-priority thread failed to acquire the "
			       "lock!\n");
		}
	}
}

Listing 50. A high-priority thread that will be starvedexamples/priority_inheritance/priority_inheritance.cc

While this thread yields, the thread shown in Listing 51 runs. This first yields and then infinite loops. The infinite loop in this example is a placeholder for anything that does long-running work and does not yield.

__cheriot_compartment("priority_"
                      "inheritance") void medium()
{
	// Let the low-priority thread run
	// until it yields
	Timeout t(MS_TO_TICKS(1000));
	thread_sleep(&t);
	printf("Medium priority thread entering infinite loop "
	       "and not yielding\n");
	while (true)
	{
		x++;
	}
}

Listing 51. A medium-priority thread that will starve a high-priority threadexamples/priority_inheritance/priority_inheritance.cc

While the medium-priority thread yields, the thread shown in Listing 52 runs. This acquires the lock and also yields. This thread is the lowest priority and so, even without the explicit yield, a similar thread in a more complex program could easily be preempted with the lock held. The yield simply forces it to happen every time, to trigger the problem.

Prior to C++26, infinite loops are undefined behaviour in C++ if they do not have any side effects. The x++ in this loop is simply incrementing an atomic integer to make sure that this loop is not optimised away.

__cheriot_compartment("priority_"
                      "inheritance") void low()
{
	while (true)
	{
		lock.lock();
		printf("Low-priority thread acquired the lock\n");
		Timeout t(MS_TO_TICKS(500));
		thread_sleep(&t, ThreadSleepNoEarlyWake);
		printf("Low-priority thread releasing the lock\n");
		lock.unlock();
	}
}

Listing 52. A low-priority thread that will be preempted with a lock heldexamples/priority_inheritance/priority_inheritance.cc

After the low-priority thread yields (or is preempted) the medium-priority thread will resume and infinite loop (or, at least, does something that doesn't yield). The low-priority thread cannot run, because a higher-priority thread is runnable. When the high-priority thread tries to run, it sees that the lock is already acquired and blocks, waiting for it to be finished. Unfortunately, the thread that could release the lock never runs and so you see output like this:

Low-priority thread acquired the lock
Low-priority thread releasing the lock
Low-priority thread acquired the lock
Medium priority thread entering infinite loop and not yielding
High-priority thread failed to acquire the lock!
High-priority thread failed to acquire the lock!
High-priority thread failed to acquire the lock!

This example is quite easy to debug, because the starvation is total: the high-priority thread never makes progress. If the medium-priority thread yielded sometimes, this would be much worse because the high-priority thread would make progress but at a far slower rate than its priority would imply.

Priority inheritance is the solution to this kind of problem. With priority inheritance, a thread that blocks on a lock that is held by a lower-priority thread will temporarily loan its priority to the thread the owns the lock. This allows the lower-priority thread run and release the lock. If you change the FlagLock in the example to a FlagLockPriorityInherited, you will see this output:

Low-priority thread acquired the lock
Low-priority thread releasing the lock
Low-priority thread acquired the lock
Medium priority thread entering infinite loop and not yielding
Low-priority thread releasing the lock
High-priority thread acquired the lock!
High-priority thread acquired the lock!
High-priority thread acquired the lock!

Now, while the medium-priority thread runs, the high-priority thread waits and tries to acquire the lock. The lock is held and so it will try to loan its priority to the thread that owns it. This allows the low-priority thread to run in preference to the medium-priority one. The low-priority thread runs and releases the lock, allowing the high-priority thread to resume and acquire the lock.

The futex APIs implement this by storing the thread ID of the owning thread in the bottom 16 bits of the futex word and passing FutexPriorityInheritance to the flags argument in the wait call. The specified thread will have its priority set to the highest priority of any of the waiting threads. The priority boost lasts until the waiters time out or the boosted thread releases the lock, whichever happens first. A single thread can hold multiple priority-inheriting locks and receive priority boosts from all of them. Most of the time, you will not use this directly and will instead use the priority-inheriting lock APIs from their C or C++ wrappers.

The priority inheritance mechanism can also be used to build asymmetric locks. These have a fast path that doesn't do any cross-compartment calls and a slow path that does. You can find one example of this in the hazard pointer mechanism for short-lived claims. This must detect when a thread has tried to add a hazard pointer while the allocator is scanning the list, without slowing down the allocator. Before reading the list, the allocator increments the top 16 bits of the futex word and sets the low 16 to the thread ID performing the operation. Threads updating the hazard set check the futex word before and after updating the list. If the top 16 bits have changed, they know that the allocator has scanned the list and they must retry. If the top 16 bits contain an odd value, the allocator is currently scanning the list and they must wait. They can do a priority-inheriting wait with a one-tick timeout even though the allocator will not ever call futex_wake. They will yield for one tick, boosting the priority of the thread that's currently in the allocator, but then resume at the end of the tick.

6.8. Securing futexes

Most of the time you will want to use futexes (and the locks that wrap them) to synchronise operations within a single compartment. Futex-based locks rely on the contents of the lock word to be valid. For example, if a flag lock is directly accessible by two mutually distrusting compartments, one can write an invalid value to the word and either prevent the other from waking waiters or cause it to spuriously believe that it has acquired the lock.

This is not normally a limitation because locks typically protect some data structure or other resource that should not be concurrently mutated by multiple threads. Providing mutable views of such a structure to multiple compartments is almost certainly a security vulnerability, even without attacks on the futex.

There is one situation where futexes are safe to share across compartment boundaries. If you have a component that others trust for availability, it can share read-only views of a futex to allow waiting for an out-of-band event. The scheduler does this for interrupts (see Chapter 9. Writing a device driver), allowing threads to use the futex wait operation to block until an interrupt is ready.

6.9. Using event groups

The event_group library provides an event-group API that is primarily intended for porting code written against FreeRTOS's event-group APIs. The event.h header exposes the interface to this library. These APIs do not have a clear trust model and so should be avoided in new code that is not ported from FreeRTOS. You can build more convenient interfaces atop futexes for most synchronisation operations. You may also simply use multiple futexes and the multiwaiter API (see Section 6.12. Waiting for multiple events) to wait for multiple events.

An event group is a set of up to 24 values that can be set or cleared independently. Waiters can wait for any or all of an arbitrary subset of these.

Event groups are created with the eventgroup_create function. This returns an opaque handle to the event group, which can be used for setting, clearing, or waiting on events.

Documentation for the eventgroup_create function

int eventgroup_create(struct Timeout * timeout, AllocatorCapability heapCapability, struct EventGroup * * outGroup)

Create a new event group, allocated using heapCapability. The event group is returned via outGroup.

This returns zero on success. Otherwise it returns a negative error code. If the timeout expires then this returns -ETIMEDOUT, if memory cannot be allocated it returns -ENOMEM.

Note that, because this allocates memory, it requires an allocation capability. See Chapter 7. Memory management in CHERIoT RTOS for more information about what this means.

You can then use eventgroup_set and eventgroup_clear to set and clear some or all of the event flags in this group. Both of these calls return the old values of the bits.

Documentation for the eventgroup_set function

int eventgroup_set(Timeout * timeout, struct EventGroup * group, uint32_t * outBits, uint32_t bitsToSet)

Set one or more bits in an event group. The bitsToSet argument contains the bits to set. Any thread waiting with eventgroup_wait will be woken if the bits that it is waiting for are set.

This returns zero on success. If the timeout expires before this returns then it returns -ETIMEDOUT.

Independent of success or failure, outBits will be used to return the set of currently set bits in this event group.

Documentation for the eventgroup_clear function

int eventgroup_clear(Timeout * timeout, struct EventGroup * group, uint32_t * outBits, uint32_t bitsToClear)

Clear one or more bits in an event group. The bitsToClear argument contains the set of bits to clear. This does not wake any threads.

This returns zero on success. If the timeout expires before this returns then it returns -ETIMEDOUT.

Independent of success or failure, outBits will be used to return the set of currently set bits in this event group.

You can then subsequently wait for some of the events to be set with the eventgroup_wait function. This takes a set of events to wait for and can wait until either some or all of them are set.

Documentation for the eventgroup_wait function

int eventgroup_wait(Timeout * timeout, struct EventGroup * group, uint32_t * outBits, uint32_t bitsWanted, _Bool waitForAll, _Bool clearOnExit)

Wait for events in an event group. The bitsWanted argument must contain at least one bit set in the low 24 bits (and none in the high bits). This indicates the specific events to wait for. If waitForAll is true then all of the bits in bitsWanted must be set in the event group before this returns. If waitForAll is false then any of the bits in bitsWanted being set in the event group will cause this to return.

If this returns zero then outBits will contain the bits that were set at the time that the condition became true. If this returns -ETIMEDOUT then outBits will contain the bits that were set at the time that the timeout expired.

Note: waitForAll requires all bits to be set at the same time. This makes it trivial to introduce race conditions if used with multiple waiters and clearOnExit, or if different threads clear different bits in the waited set.

If clearOnExit is true and this returns successfully then the bits in bitsWanted will be cleared in the event group before this returns.

This call can also atomically clear the bits that you've waited on, giving them edge-triggered behaviour.

6.10. Sending messages

A message queue is a FIFO capable of storing a fixed number of fixed-sized entries. There are two distinct use cases for message queues:

Communicating between two threads in the same compartment.
Communicating between different compartments.

In the first case, the endpoints are in the same trust domain. The message_queue_library library provides a simple message-queue API that is intended for this use case. When the endpoints are in different trust domains, the endpoints must be protected from tampering. The message_queue compartment wraps the library in a compartment that exposes an almost identical interface to the library but with the endpoints exposed as (tamper-proof) sealed capabilities.

Queues for use within a single compartment are created with queue_create, which allocates the ring buffer and returns a pointer to the structure. This is a struct MessageQueue and callers are at liberty to look inside it directly. There is no expectation that it is protected from the caller. The functions exposed by the library are (by their nature as shared-library functions) shared between any compartments that use the library, but this is a code-size reduction exercise not a security boundary.

Documentation for the queue_create function

int queue_create(Timeout * timeout, AllocatorCapability heapCapability, struct MessageQueue * * outQueue, size_t elementSize, size_t elementCount)

Allocates space for a queue using heapCapability and stores a handle to it via outQueue.

The queue has space for elementCount entries. Each entry is a fixed size, elementSize bytes.

Returns 0 on success, -ENOMEM on allocation failure, and -EINVAL if the arguments are invalid (for example, if the requested number of elements multiplied by the element size would overflow).

Queues can be freed simply with heap_free but doing so may result in deadlocks. If a thread is blocked trying to send or receive from a queue then it will remain blocking if the queue is freed out from underneath it. The queue_destroy function avoids this by waking all threads. Other threads may then trap immediately after they return and try to read from the queue's counters, but at least this is recoverable (see Section 5.11. Handling errors).

Documentation for the queue_destroy function

int queue_destroy(AllocatorCapability heapCapability, struct MessageQueue * handle)

Destroys a queue. This wakes up all threads waiting to produce or consume, and makes them fail to acquire the lock, before deallocating the underlying allocation.

Returns 0 on success. This can fail only if deallocation would fail and will, in these cases, return the same error codes as heap_free.

This function will check the heap capability first and will avoid upgrading the locks if freeing the queue would fail.

Messages are sent with queue_send and received with queue_receive. These are blocking (if allowed to by a non-zero timeout) calls that send or receive a single message.

Documentation for the queue_send function

int queue_send(Timeout * timeout, struct MessageQueue * handle, const void * src)

Send a message to the queue specified by handle. This expects to be able to copy the number of bytes specified by elementSize when the queue was created from src.

Returns 0 on success. On failure, returns -ETIMEOUT if the timeout was exhausted, -EINVAL on invalid arguments.

This expects to be called with a valid queue handle. It does not validate that this is correct.

Documentation for the queue_receive function

int queue_receive(Timeout * timeout, struct MessageQueue * handle, void * dst)

Receive a message over a queue specified by handle. This expects to be able to copy the number of bytes specified by elementSize. The message is copied to dst, which must have sufficient permissions and space to hold the message.

Returns 0 on success, -ETIMEOUT if the timeout was exhausted, -EINVAL on invalid arguments.

The library interfaces to queues are not intended to be robust in the presence of malicious callers. They run in the same security context as the caller, so a caller may abuse them to corrupt its own state. They do aim to be robust with respect to the source or destination buffer for sending and receiving messages being invalid or concurrently deallocated. This robustness is implemented using the scoped error handling and so requires calling compartments to link the relevant error handler, as documented in Section 5.13. Using scoped error handling.

You can probe the number of messages in a queue with queue_items_remaining.

Documentation for the queue_items_remaining function

int queue_items_remaining(struct MessageQueue * handle, size_t * items)

Returns the number of items in the queue specified by handle via items.

Returns 0 on success. This has no failure mechanisms, but is intended to have the same interface as the version that operates on a sealed queue handle.

Note: This interface is inherently racy. The number of items in the queue may change in between the return of this function and the caller acting on the result.

Listing 53 shows how to create a message queue. This uses the non_blocking template function. This and blocking_forever take advantage of the regularity of CHERIoT APIs (timeouts are always the first parameter) to pass a zero-length and unlimited timeout to a single function. The producer thread in this example is creating the queue and is then doing a futex_wake operation on the global variable where the queue pointer is stored.

	// Allocate the queue
	non_blocking<queue_create>(
	  MALLOC_CAPABILITY, &queue, sizeof(int), 16);
	// Wake the consumer thread
	futex_wake(reinterpret_cast<uint32_t *>(&queue), 1);

Listing 53. Allocating a message queue for use in a single compartmentexamples/producer_consumer/queue.cc

The corresponding futex_wait call is in the consumer, in Listing 55. This takes advantage of the fact that the global is initialised to zero and the resulting capability after allocation will be non-zero (in both halves, so the choice to use the first address is arbitrary here). If the consumer thread runs first, the wait call will block then return after the wake call from Listing 53. If the producer thread runs first then the wait call will return immediately.

The producer in this simple example then just sends 200 integers, one at a time, as shown in Listing 54. These will enter the queue until either the scheduler quantum expires or the queue fills up, then the consumer thread will run.

	// Loop, sending some numbers to the other thread.
	for (int i = 1; i < 200; i++)
	{
		Debug::log("Producer sending {} to queue", i);
		int ret = blocking_forever<queue_send>(queue, &i);
		// Abort if the queue send errors.
		Debug::Invariant(ret == 0, "Queue send failed {}", ret);
	}

Listing 54. Sending messages to a message queue.examples/producer_consumer/queue.cc

The consumer thread (Listing 55) simply reads each element and prints it. Once every expected message has been received, the consumer destroys the queue.

This is a good general pattern. This example detects the end of the queue because it knows that it will receive 200 messages but a more realistic system would send an end-of-messages marker through the queue. Once this is in the queue, the producer will no longer use the queue and as soon as the receiver reads it the consumer is free to deallocate the queue.

	// Use the queue pointer as a futex.  It is initialised to
	// 0, if the other thread has stored a valid pointer here
	// then it will not be zero and so futex_wait will return
	// immediately.
	futex_wait(reinterpret_cast<uint32_t *>(&queue), 0);
	Debug::log("Waiting for messages");
	// Get a message from the queue and print it.
	int value = 0;
	while ((value != 199) && (blocking_forever<queue_receive>(
	                            queue, &value) == 0))
	{
		Debug::log("Read {} from queue", value);
	}
	Debug::log("Destroying the queue");
	queue_destroy(MALLOC_CAPABILITY, queue);

Listing 55. Receiving messages from a message queue.examples/producer_consumer/queue.cc

This example prints the number of cycles that elapsed at the end. Try adjusting the queue size and the priorities of the two threads to see how this affects performance.

6.11. Sending messages between compartments

If you are passing messages between compartments, you should use the versions of these functions with the _sealed suffix. These are provided by the compartmentalised version and ensure that the queue's internal state is not mutated by its callers.

The queue_create_sealed function creates a queue in exactly the same way as queue_create (and has the same argument structure) but returns a sealed pointer to it. Nowhere outside of the queue compartment can unseal this so the queue is protected against tampering. The queue may still contain malicious or malformed data, but you have guarantees that messages will arrive in order and that they won't be tampered with in flight.

Documentation for the queue_create_sealed function

int queue_create_sealed(Timeout * timeout, AllocatorCapability heapCapability, struct MessageQueue * __sealed_capability * outQueue, size_t elementSize, size_t elementCount)

Allocate a new message queue that is managed by the message queue compartment. The resulting queue handle (returned in outQueue) is a sealed capability to a queue that can be used for both sending and receiving.

Listing 56 shows how to allocate a queue for sharing between compartments. This is equivalent to the example from Listing 53, but with the producer and consumer separated into different compartments. The queue is protected from both by being managed in the queue compartment, which is provided by the RTOS. In the original, the queue was shared between the producer and consumer by placing it in a (compartment-local) global variable. In this version, it needs to be passed to the consumer explicitly by calling the set_queue function that the consumer exposes.

	// Allocate the queue
	CHERI_SEALED(MessageQueue *) queue;
	non_blocking<queue_create_sealed>(
	  MALLOC_CAPABILITY, &queue, sizeof(int), 16);
	// Pass the queue handle to the consumer.
	set_queue(queue);

Listing 56. Allocating a message queue for use in a between compartments.examples/producer_consumer_compartment/producer.cc

Listing 57 shows the implementation of this function. This explicitly checks that the queue is valid before keeping it by calling queue_items_remaining_sealed. This function returns zero on success or an error code on failure and needs to unseal the queue to succeed, so this works to check that the sealed pointer really is a queue handle. After it has checked that this is a valid queue handle, this function looks much like the code that woke the other thread in the single-compartment case in Listing 53.

	// Check that this is a valid queue
	size_t items;
	if (queue_items_remaining_sealed(newQueue, &items) != 0)
	{
		return;
	}
	// Set it in the global and allow the thread to start.
	queue = newQueue;
	Debug::log("Queue set to {}", queue);
	futex_wake(reinterpret_cast<uint32_t *>(&queue), 1);

Listing 57. Receiving a queue endpoint in the consumer compartment.examples/producer_consumer_compartment/consumer.cc

This queue can be destroyed by calling queue_destroy_sealed. You cannot free an object pointed to by a sealed capability unless you also have the capability that authorises unsealing. This means that, unlike the unsealed version, you cannot destroy a queue created with queue_create_sealed except by calling the correct destroy function.

Similarly, you cannot free a sealed object unless you hold the allocation capability that allocated it. In the example, the queue is allocated from the producer's allocation capability. This means that the consumer cannot free it. Instead, the producer waits for the consumer to finish draining the queue and then frees it, in Listing 58.

	size_t itemsRemaining;
	while ((queue_items_remaining_sealed(
	          queue, &itemsRemaining) == 0) &&
	       (itemsRemaining > 0))
	{
		Timeout t{1};
		thread_sleep(&t);
	}
	int ret = blocking_forever<queue_destroy_sealed>(
	  MALLOC_CAPABILITY, queue);
	Debug::Assert(
	  ret == 0, "Failed to destroy queue: {}", ret);
	Debug::log("Destroyed queue");

Listing 58. Freeing a message queue once it is empty.examples/producer_consumer_compartment/producer.cc

Documentation for the queue_destroy_sealed function

int queue_destroy_sealed(Timeout * timeout, AllocatorCapability heapCapability, struct MessageQueue * __sealed_capability queueHandle)

Destroy a queue handle. If this is called on a restricted endpoint (returned either from queue_receive_handle_create_sealed or from queue_send_handle_create_sealed), this frees only the handle. If called with the queue handle returned from queue_create_sealed, this will destroy the queue.

The corresponding send and receive functions are identical to their library counterparts, but take sealed queue handles. Sometimes, it's useful to be able to give one compartment the ability to write to a queue and another the ability to read. The queue compartment provides two APIs that let you allocate a handle that is authorised for only sending or receiving, queue_receive_handle_create_sealed and queue_send_handle_create_sealed.

Documentation for the queue_receive_handle_create_sealed function

int queue_receive_handle_create_sealed(struct Timeout * timeout, AllocatorCapability heapCapability, struct MessageQueue * __sealed_capability handle, struct MessageQueue * __sealed_capability * outHandle)

Convert a queue handle returned from queue_create_sealed into one that can be used only for receiving.

Returns 0 on success and writes the resulting restricted handle via outHandle. Returns -ENOMEM on allocation failure or -EINVAL if the handle is not valid.

Documentation for the queue_send_handle_create_sealed function

int queue_send_handle_create_sealed(struct Timeout * timeout, AllocatorCapability heapCapability, struct MessageQueue * __sealed_capability handle, struct MessageQueue * __sealed_capability * outHandle)

Convert a queue handle returned from queue_create_sealed into one that can be used only for sending.

Returns 0 on success and writes the resulting restricted handle via outHandle. Returns -ENOMEM on allocation failure or -EINVAL if the handle is not valid.

With these APIs, you can have an initial setup compartment that creates a queue and then hands the send and receive endpoints to two others. This provides mutual distrust because neither compartment that holds a send or receive handle can free the queue, nor can they be used for the opposite operation, and the queue compartment protects the integrity of the queue itself.

In many cases, the trust relationship may be asymmetric. For example, a compartment may provide a queue that other untrusted compartments can send messages to, but the senders trust the receiving compartment. The APIs also support this asymmetric use case, where the trusted compartment keeps the original handle but uses queue_send_handle_create_sealed to create handles to pass to other compartments.

Try modifying the example to remove the ability from the consumer to send messages. This change should not require restructuring the code, only calling queue_receive_handle_create_sealed. Next, try moving queue creation to a separate compartment that sets the endpoint in the consumer and then passes it into the consumer and deallocates it once the producer returns and the queue is empty. The producer should receive a restricted handle that can send but not receive. Try modifying both the producer and consumer to do the operation that they should not be allowed to do and make sure that it fails.

6.12. Waiting for multiple events

The multiwaiter API allows waiting for any of a set of independent events. It is conceptually similar to select, poll, epoll, and kqueue in *NIX operating systems or WaitForMultipleObjects in Windows. It is designed to bound the amount of time that the scheduler must spend checking multiwaiters and to minimise the amount of memory that multiwaiters consume. Memory is allocated only when a multiwaiter is created, with multiwaiter_create. This creates a multiwaiter with space for a fixed number of events.

Documentation for the multiwaiter_create function

int multiwaiter_create(Timeout * timeout, AllocatorCapability heapCapability, MultiWaiter * ret, size_t maxItems)

Create a multiwaiter object. This is a stateful object that can wait on at most maxItems event sources.

Listing 59 shows how to create a multiwaiter that can wait for two futexes. This example generalises the producer-consumer example from earlier in this chapter to have two producer threads, each of which writes to a different message queue, but a single consumer that reads both. The consumer will use this multiwaiter to block until one or both queues has a message available.

	// Create the multiwaiter object in the scheduler with
	// space for two event sources.
	MultiWaiter multiwaiter;
	blocking_forever<multiwaiter_create>(
	  MALLOC_CAPABILITY, &multiwaiter, 2);

Listing 59. Creating a multiwaiter object.examples/multiwaiter/queue.cc

Each multiwaiter_wait call is a one-shot operation. The call is passed a set of things to wait for and the associated condition via the events array and returns the waited status via the same array. This is typically an on-stack array.

Documentation for the multiwaiter_wait function

int multiwaiter_wait(Timeout * timeout, MultiWaiter waiter, struct EventWaiterSource * events, size_t newEventsCount)

Wait for events. The first argument is the multiwaiter to wait on. New events can optionally be added by providing an array of newEventsCount elements as the newEvents argument.

Return values:

On success, this function returns 0.
If the arguments are invalid, this function returns -EINVAL.
If the timeout is reached without any events being triggered then this returns -ETIMEOUT.

The multiwaiter can natively wait only for futex notifications but higher-level mechanisms are built out of futexes. For example, if you wish to wait for a message queue (see Section 6.10. Sending messages) to be ready to send, you can call multiwaiter_queue_receive_init to initialise a multiwaiter event with the queue's receive counter and expected value. This event will then fire if the queue becomes non-full. The normal caveats about race conditions apply: the queue may become full again if another thread sends messages in between your receiving the notification and sending a message. For example, consider the following sequence:

You try to send buy cannot because the queue is full.
You use the multiwaiter to wait for the queue to become non-full.
Another thread receives a message from the queue, which makes the queue non-full.
Your thread becomes runnable because the futex wake triggered your futex wait to return but does not yet run.
Another thread sends a message to the queue, which makes the queue full again.
You try to send but the send fails because the queue is now full.

Listing 60 initialises two events, one for each of the queues, and then calls multiwaiter_wait. This will return immediately if an equivalent futex_wait call on either futex would have done so, or will block until one of the queues becomes non-empty. The loop then tries to handle one event from each message queue that has events available.

		// Initialise the events for this wait.
		std::array<EventWaiterSource, queues.size()> events;
		multiwaiter_queue_receive_init(&events[0], queues[0]);
		multiwaiter_queue_receive_init(&events[1], queues[1]);
		// Block until at least one event fires.
		int ret = blocking_forever<multiwaiter_wait>(
		  multiwaiter, events.data(), events.size());
		Debug::Assert(ret == 0, "Multiwaiter failed: {}", ret);
		// For each message queue, fetch a message if the
		// multiwaiter indicated that one was available.
		for (int i = 0; i < queues.size(); i++)
		{
			if (events[i].value)
			{
				ret = non_blocking<queue_receive>(queues[i],
				                                  &values[i]);
				Debug::Assert(
				  ret == 0,
				  "Failed to receive message from queue: {}",
				  ret);
				Debug::log("Received {} on queue {}", values[i], i);
			}
		}

Listing 60. Using a multiwaiter to wait for either of two queues.examples/multiwaiter/queue.cc

Try modifying this to avoid the multiwaiter_wait call if either of the queues is already non-empty. You can do this either by repeating the non-blocking receive and using the multiwaiter only if it fails, or by checking whether the value field of the event source has been initialised to -1 by multiwaiter_queue_receive_init. Make sure that you don't starve one of the producers by biasing your consumption towards one of the queues.

Remember that the CHERIoT RTOS scheduler also exposes interrupts as futexes. This means that the multiwaiter API can wait for multiple hardware and software events in a single call.

CHERIoT Programmers' Guide

Table of contents