11. Networking

The CHERIoT network stack is intended to serve three purposes:

An example of a compartmentalized structure incorporating large amounts of existing code.
An off-the-shelf solution for common IoT device networking needs.
An example for building more specialised networking systems.

The current stack contains code from several third-party projects: the FreeRTOS TCP/IP stack, along with their SNTP and MQTT libraries, and the BearSSL TLS implementation. These are wrapped in rich capability interfaces and deployed in several compartments.

Currently, none of the simulators provide a network connection. The examples in this chapter will default to using Sonata, but should also work on the Arty A7 and future hardware.

11.1. Understanding the structure of the network stack

The core compartments in the network stack are shown in Figure 6. The core compartments in the network stack. These do not include the SNTP and MQTT compartments, which we'll see later.

An illustration of the compartments in the network stack

Figure 6. The core compartments in the network stack.

The TCP/IP and TLS stacks are largely existing code, from the FreeRTOS+TCP and BearSSL projects, respectively. The BearSSL code has no platform dependencies and is simply recompiled. The FreeRTOS+TCP code, unsurprisingly, assumes that it is running on FreeRTOS and is ported using the compatibility layer described in Chapter 14. Porting from FreeRTOS.

In the initial port, the FreeRTOS+TCP code required only one change. It normally expects to create threads during early initialisation. The file that did this was wrapped in something that instead triggered a barrier to allow the statically created threads to start running. Later changes for network-stack reset required some additional steps, though none of these modified any of the FreeRTOS+TCP code.

Each box in the diagram is a compartment (the User Code box is a placeholder for at least one compartment). The compartments have different goals and requirements.

The firewall does both ingress and egress filtering and is the only component in the system that has access to the memory-mapped I/O range for the Ethernet device. Ingress filtering reduces the attack surface of the TCP/IP layer. If there are no listening TCP sockets or unrestricted UDP ones, the firewall will drop all packets that do not come from an approved peer. Typically, an attacker on the local network segment can forge origin addresses but that gets harder across the Internet. Egress filtering is less common on embedded devices, which is unfortunate. The Mirai botnet launched large distributed denial of service (DDoS) attacks by compromising large numbers of embedded systems and using them to each generate relatively small amounts of traffic. With the CHERIoT network stack, this is much harder because the firewall compartment will not usually allow other compartments to send packets to arbitrary targets.

The Network API compartment is new code and implements the control plane. When you want to create a socket or authorise a remote endpoint, you must call this compartment. It uses a software capability model to determine whether callers are allowed to talk to remote endpoints and then opens holes in the firewall to authorise this. When you want to create a connected socket, you present this compartment with a software capability that authorises you to talk to a remote host on a specific port. It briefly opens a firewall hole for DNS requests and instructs the DNS compartment to perform the lookup, then it closes that firewall hole and opens one for the connection. The socket that it returns is created by the TCP/IP compartment so you can then send and receive data by calling the TCP/IP compartment directly.

11.2. Synchronising time with SNTP

The Network Time Protocol (NTP) is a complex protocol for synchronising time with a remote server. It is designed to build a tree of clock sources where each stratum is synchronised with a more authoritative one. Clients send messages to an NTP server and receive the current time back. The full protocol uses some complex statistical techniques to dynamically calculate the time taken for the response to arrive across the network and minimise clock drift. The Simple Network Time Protocol (SNTP) is a subset of NTP intended for simple embedded devices. It will not give the same level of accuracy but can run on very resource-constrained devices.

Using SNTP doesn't require writing any code that talks directly to the network but it does require building and linking the network stack, so that is a good place to start. First, you need to find the network-stack code. Listing 77 shows one way to do this, which is similar to how we find the SDK. This provides a hard-coded relative location and allows it to be overridden with an environment variable.

networkstackdir = os.getenv("CHERIOT_NETWORK") or
    "../../network-stack/"
includes(path.join(networkstackdir,"lib"))

Listing 77. Build system code for including the network stack.examples/sntp/xmake.lua

Next, you need to make sure that code using the network stack finds the headers by adding the include directory (Listing 78). You must also explicitly add the SNTP compartment as a dependency in the compartment target, though this is somewhat redundant because we'll also add it globally later. Finally, the network stack provides an option to users to decide whether they want IPv6 support. This affects some of the definitions in headers so you must define the same flag in your compartment to avoid linker errors.

compartment("sntp_example")
  add_includedirs(path.join(networkstackdir,"include"))
  add_deps("freestanding", "SNTP")
  add_files("sntp.cc")
  on_load(function(target)
    target:add('options', "IPv6")
    local IPv6 = get_config("IPv6")
    target:add("defines",
        "CHERIOT_RTOS_OPTION_IPv6=" .. tostring(IPv6))
  end)

Listing 78. Build system code for building a compartment that uses the network stack.examples/sntp/xmake.lua

Next, the firmware definition needs to contain two things. First, it must add dependencies on the components of the network stack, as shown in Listing 79. The first four are ones that we've already discussed. The SNTP compartment is (hopefully) obvious. The time helpers library is not something that we've looked at so far and you'll see what it does when we start using the SNTP APIs.

add_deps("DNS", "TCPIP", "Firewall", "NetAPI",
           "SNTP", "time_helpers")

Listing 79. Build system code for adding dependencies on the network stack.examples/sntp/xmake.lua

Finally, you need to create the threads that the network stack uses. The thread that starts in the Firewall compartment handles incoming packets. This calls into the TCP/IP compartment for each packet, to enqueue it for handling. The other thread handles TCP retransmissions, keep-alive packets, and so on. TCP provides a reliable transport over an unreliable network and so has to buffer each outgoing packet until the receiver acknowledges receipt. Dropped packets are retransmitted until the acknowledgement arrives.

{
        -- TCP/IP stack thread.
        compartment = "TCPIP",
        priority = 1,
        entry_point = "ip_thread_entry",
        stack_size = 0xe00,
        trusted_stack_frames = 5
      },
      {
        -- Firewall thread, handles incoming packets as they
        -- arrive.
        compartment = "Firewall",
        -- Higher priority, this will be back-pressured by
        -- the message queue if the network stack can't keep
        -- up, but we want packets to arrive immediately.
        priority = 2,
        entry_point = "ethernet_run_driver",
        stack_size = 0x1000,
        trusted_stack_frames = 5
      }

Listing 80. Build system code for defining the network stack's threads.examples/sntp/xmake.lua

With the build system logic done, you can start using the network stack. Anything that uses the network stack will need to call network_start early on, as shown in Listing 81. This brings up the network stack, gets the DHCP lease, and so on. This is a blocking call and will return once the network is initialised.

	Debug::log("Starting network stack");
	network_start();

Listing 81. Initialisation for the network stack.examples/sntp/sntp.cc

Next, you must ask the SNTP compartment to update the time. The sntp_update function, shown in Listing 82, is a blocking call that will attempt to update the time and return failure if it does not manage within the timeout. In this example, we simply keep trying in a loop. In a real system, you would probably want to handle the case where the network is unavailable more gracefully.

	Timeout t{MS_TO_TICKS(1000)};
	Debug::log("Trying to fetch SNTP time");
	while (sntp_update(&t) != 0)
	{
		Debug::log("Failed to update NTP time");
		t = Timeout{MS_TO_TICKS(1000)};
	}

Listing 82. Updating the time from the SNTP server.examples/sntp/sntp.cc

A lot of things happen behind the scenes for this to work. The SNTP compartment holds a capability that authorises it to talk to the remote NTP server. It presents this capability to the network API compartment, which opens the firewall hole for DNS lookups and then instructs the DNS compartment to perform the lookup. The DNS compartment then sends a DNS lookup to the firewall, which forwards it to the Ethernet device and forwards the response back. Next, the network API compartment opens a firewall hole for the local UDP port to the remote host on port 123 (the NTP port) and returns the socket to the SNTP compartment. The SNTP compartment then passes this socket to the TCP/IP compartment to send and receive NTP packets. Finally, it asks the TCP/IP compartment to close the socket and the TCP/IP compartment asks the firewall compartment to close the firewall hole. When the SNTP compartment receives the response and knows the time, it sets some state in a pre-shared object for detecting the time.

Once the current time has been fetched, you can get the current time of day. Listing 83 shows a loop that runs roughly every 50 ms and prints the time (as a UNIX epoch timestamp) if the number of seconds has changed since last time. The gettimeofday function called here is from the time helpers library that was mentioned earlier.

	time_t lastTime = 0;
	while (true)
	{
		timeval tv;
		int     ret = gettimeofday(&tv, nullptr);
		if (ret != 0)
		{
			Debug::log("Failed to get time of day: {}", ret);
		}
		else if (lastTime != tv.tv_sec)
		{
			lastTime = tv.tv_sec;
			// Truncate the epoch time to 32 bits for printing.
			Debug::log("Current UNIX epoch time: {}", tv.tv_sec);
		}
		Timeout shortSleep{MS_TO_TICKS(50)};
		thread_sleep(&shortSleep);
	}

Listing 83. Printing the current UNIX epoch time.examples/sntp/sntp.cc

The SNTP compartment and the time helpers library share a pre-shared object (see Section 5.7. Sharing globals between compartments) which contains the UNIX timestamp at the time of the last NTP update, the cycle time of the last update, and the current epoch. The SNTP compartment has a read-write view of this, the time helpers library a read-only view. When the SNTP compartment updates this, it increments the epoch once, writes the new value, and then increments the epoch again. The time library can therefore get a consistent snapshot of the values by reading the epoch, reading the other values, and then reading the epoch again to make sure that it hasn't changed. If the epoch value is odd, the time helpers library does a futex wait operation to block until the value has changed. The SNTP compartment does a futex-wake operation after the update to wake any waiters.

This means that, most of the time, calling gettimeofday does not require any cross-compartment calls.

When you run this example, you should see the time printed once per second, something like this:

Network test: Starting network stack
Network test: Trying to fetch SNTP time
Network test: Current UNIX epoch time: 1735563080
Network test: Current UNIX epoch time: 1735563081
Network test: Current UNIX epoch time: 1735563082
Network test: Current UNIX epoch time: 1735563083
Network test: Current UNIX epoch time: 1735563084

At the time of writing, there is a problem with the Sonata network interface's ability to receive IPv6 packets. If you try this example on Sonata and it does not work, try adding --IPv6=n to the end of your xmake line during the config stage.

If you leave this running for a while, the clock will eventually drift. Try modifying this example to update the time from the NTP server once per minute.

11.3. Creating a connected socket

In the traditional Berkeley Sockets model, creating a connected socket is a multi-step operation. First, you must create the socket. Next, you may (optionally) bind it to a specific local port, though this step is usually omitted. Finally, you connect it. The CHERIoT network stack combines these into a single network_socket_connect_tcp call.

Documentation for the network_socket_connect_tcp function

Socket network_socket_connect_tcp(Timeout * timeout, AllocatorCapability mallocCapability, ConnectionCapability hostCapability)

Create a connected TCP socket.

This function will block until the connection is established or the timeout is reached.

The mallocCapability argument is used to allocate memory for the socket and must have sufficient quota remaining for the socket.

The hostCapability argument is a capability authorising the connection to a specific host.

This returns a valid sealed capability to a socket on success, or an untagged value on failure.

As you might expect from CHERIoT, this is a capability-based API. It requires a capability to authorise connecting to a specific host, along with a capability to allocate memory for the socket state. The latter ensures that all memory used for a network connection is accounted to the compartment that created it.

You need to define a connection capability before you can use one. Listing 84 shows an example that allows connecting with TCP to the towel.blinkenlights.nl host, on port 23, the well-known telnet port. This capability will show up in the auditing report for the firmware image (as discussed in Chapter 10. Auditing firmware images), so you can ensure that specific compartments in your firmware image are permitted to connect only to remote hosts that you authorised.

DECLARE_AND_DEFINE_CONNECTION_CAPABILITY(
  Server,
  "towel.blinkenlights.nl",
  23,
  ConnectionTypeTCP);

Listing 84. A static capability that authorises connecting to a remote server.examples/tcp/tcp.cc

The connect call is shown in Listing 85. This passes the capability for the server along with this compartment's default malloc capability. You can separate the quota that your compartment uses for network-related things and provide a different capability. This is useful if, for example, you wish to call heap_free_all on your default malloc capability but not affect any network state

	Timeout unlimited{UnlimitedTimeout};
	auto    socket =
	  network_socket_connect_tcp(&unlimited,
	                             MALLOC_CAPABILITY,
	                             STATIC_SEALED_VALUE(Server));
	if (!CHERI::Capability{socket}.is_valid())
	{
		Debug::log("Failed to connect");
		return;
	}

Listing 85. Connecting to a remote server.examples/tcp/tcp.cc

The result of this call is a valid sealed capability to the socket. All of the state required for the socket will be allocated with the allocator capability that you passed (and counted against your quota), but is not directly accessible to you. On a POSIX system, the result of a socket call is a file descriptor. On Windows, it is a HANDLE. These are both opaque types that reference some internal data structure that the kernel associates with your process. In contrast, a sealed capability is just a pointer, but a type-safe tamper-proof one. You can pass it between compartments (allowing multiple compartments to use the same socket) but only the TCP/IP compartment can unseal it to access the internal state. If the connection fails, you will get back an untagged capability.

Currently, network_socket_connect_tcp does not report the reason for a failure. A future version will likely use negative error codes in the address of untagged capabilities, so it's important to check whether the returned value is a valid capability, rather than comparing it against NULL or nullptr.

Assuming that the connection succeeded, you are now ready to start trying to receive data, as shown in Listing 86. The network_socket_receive call is quite different from a conventional socket receive. On most operating systems, a system call cannot allocate userspace memory and must take a buffer for the kernel to write into. This is unfortunate because the kernel knows the amount of data available, but the caller does not. If the caller provides too small a buffer, they must then do another call to get the rest of the data. If they provide too large a buffer, they have wasted memory. In contrast, the network_socket_receive API allows the TCP/IP compartment to allocate a buffer large enough for the available data.

	while (true)
	{
		auto [received, buffer] = network_socket_receive(
		  &unlimited, MALLOC_CAPABILITY, socket);
		if (received < 0)
		{
			Debug::log("Error: {}", received);
			return;
		}
		for (size_t i = 0; i < received; i++)
		{
			MMIO_CAPABILITY(Uart, uart)
			  ->blocking_write(buffer[i]);
		}
		free(buffer);
	}

Listing 86. Receiving data from a remote server.examples/tcp/tcp.cc

The network_socket_receive interface is convenient but it does not guarantee that the TCP/IP stack has not kept a pointer to the returned buffer. The TCP/IP compartment will not do this in normal operation but if an attacker manages to gain arbitrary-code execution in the TCP/IP compartment then they may be able to exploit time-of-check-to-time-of-use (TOCTOU) bugs in your code. This is not a problem for this example, which reads each byte in the returned buffer exactly once.

The result of the network_socket_receive is a struct NetworkReceiveResult, which contains two fields. The first field, bytesReceived, is the number of bytes received, or a negative error code. The second, buffer is the buffer (which will be null in error cases). This example uses C++ structured binding to decompose the structure and make it appear as if the function returned two values.

In this example, we are assuming that the TCP/IP stack is trusted. The TCP/IP compartment could attack this example by providing a received size that is greater than the claimed size, or one that lacks read permission. This example has no secrets and, if the network stack is compromised, can do nothing, and so does not worry about these potential problems. If you have such concerns, then you should put the code that uses the result in an error-handling block, or use network_socket_receive_preallocated instead.

This example is simply writing the result to the UART directly. The server that it connects to will provide you with an ASCII-art rendering of Star Wars: A New Hope. After the initial banner and the scrolling text, you should see something like this:

                               /~\                   
                              |oo )       What plans? 
                              _\=/_                  
              ___            /  _  \                 
             / ()\          //|/.\|\\                
           _|_____|_       ||  \_/  ||               
          | | === | |      || |\ /| ||               
          |_|  O  |_|      #  \_ _/ #                
           ||  O  ||          | | |                   
           ||__*__||          | | |                   
          |~ \___/ ~|         []|[]                  
          /=\ /=\ /=\         | | |                  
__________[_]_[_]_[_]________/_]_[_\_________________

Documentation for the network_socket_receive_preallocated function

int network_socket_receive_preallocated(Timeout * timeout, Socket socket, void * buffer, size_t length)

Receive data from a socket into a preallocated buffer. This will block until data are received or the timeout expires. If data are received, they will be stored in the provided buffer.

NOTE: Callers should remove global and load permissions from buffer before passing it to this function if they are worried about a potentially compromised network stack.

The return value is either the number of bytes received, or a negative error code.

The negative values will be errno values:

-EPERM: buffer and/or length are invalid.
-EINVAL: The socket is not valid.
-ETIMEDOUT: The timeout was reached before data could be received.
-ENOTCONN: The socket is not connected.

11.4. Creating a listening socket

Listening sockets, like connected ones, require an authorising capability. This is shown in Listing 87 and includes the local port number that you can bind to along with the number of pending connections that are allowed. The second is important for limiting the amount of the TCP/IP compartment's memory you can consume. Each unaccepted socket requires some state in the TCP/IP stack so allowing an unbounded number would consume an unlimited amount of memory. For most embedded uses, one or two is adequate.

DECLARE_AND_DEFINE_BIND_CAPABILITY(
  /* Name */ ServerPort,
  /* Bind on IPv6? */ UseIPv6,
  /* Port number */ 1234,
  /* Concurrent connection limit */ 1);

Listing 87. A static capability that authorises binding to a local port.examples/tcp_echo_server/tcp.cc

As with the connect operation, the authorising capability is not the only place that the CHERIoT network stack's APIs differ from the traditional Berkeley Sockets APIs. As shown in Listing 88, the socket, bind and listen operations are combined. The network_socket_listen_tcp call creates the socket, binds it to the local port associated with the authorising capability, and makes it ready to accept.

	Timeout unlimited{UnlimitedTimeout};
	auto    socket = network_socket_listen_tcp(
    &unlimited,
    MALLOC_CAPABILITY,
    STATIC_SEALED_VALUE(ServerPort));
	if (!CHERI::Capability{socket}.is_valid())
	{
		Debug::log("Failed to bind to local port");
		return;
	}

Listing 88. Listening for TCP connections to a local port.examples/tcp_echo_server/tcp.cc

A listening socket is simply a placeholder for a local endpoint. You cannot send or receive with it, all that you can do is accept new connections. The network_socket_accept_tcp call, shown in Listing 89, creates a new socket for the accepted connection and, optionally, returns the remote IP address and port. If you do not care about the address of the connecting host, you can pass null to the last two arguments.

	while (true)
	{
		Debug::log("Listening for connections...");
		NetworkAddress address;
		uint16_t       port;
		auto           accepted =
		  network_socket_accept_tcp(&unlimited,
		                            MALLOC_CAPABILITY,
		                            socket,
		                            &address,
		                            &port);
		if (!CHERI::Capability{accepted}.is_valid())
		{
			continue;
		}
		Debug::log("Received connection from {} on port {}",
		           address,
		           int32_t(port));
		char byte;
		while (network_socket_receive_preallocated(
		         &unlimited, accepted, &byte, 1) == 1)
		{
			network_socket_send(&unlimited, accepted, &byte, 1);
			MMIO_CAPABILITY(Uart, uart)->blocking_write(byte);
		}
		network_socket_close(
		  &unlimited, MALLOC_CAPABILITY, accepted);
	}

Listing 89. Accepting TCP connections and running a simple echo-server loop.examples/tcp_echo_server/tcp.cc

After accepting a connection, this example simply sits in a loop reading one byte at a time and sending it back. It also writes the received byte to the UART. The send function is very similar to the receive. It takes a pointer to a buffer and a length. The network stack's interface is written defensively. If the length is smaller than the bounds of the buffer, or if the buffer has the wrong permissions, this call will fail.

Documentation for the network_socket_send function

ssize_t network_socket_send(Timeout * timeout, Socket socket, void * buffer, size_t length)

Send data over a TCP socket. This will block until the data have been sent or the timeout expires.

Note here that the on-stack buffer (the single byte local variable) is derived from our stack pointer and so is automatically local. This ensures that the TCP/IP compartment cannot capture it.

The inner loop is waiting for the receive call to return a value other than 1, indicating that it has failed to receive. This should happen when the connection is dropped.

The inner loop uses an unlimited timeout, so that the demo doesn't fail if you get distracted in the middle of running it. A more realistic example would use a shorter timeout on the receive call. Short timeouts are useful to prevent denial of service issues. This simple example, like many embedded network servers, is single threaded and handles one connection at a time. Without the timeout, a single client failing to gracefully disconnect could prevent any future access until the device is restarted.

If you connect to this example with netcat, you can try sending it some text, which it should echo back. Here, my Sonata board has joined my local network with a DHCP-assigned address of 192.168.1.154:

$ nc 192.168.1.154 1234
Hello world!
Hello world!

On the UART console, you can see the debugging messages, along with the echoed text:

TCP Server Example: Starting network stack
TCP Server Example: Creating listening socket
TCP Server Example: Listening for connections...
TCP Server Example: Received connection from 192.168.1.86 on port 62599
Hello world!

11.5. Securing connections with TLS

In general, the kind of unencrypted communication that we've seen so far is inappropriate for the modern Internet. Anyone who has control of any node on the network between the device and the remote server can tamper with messages. Such malicious messages may attack software on the device, attempting to exploit vulnerabilities.

This is the threat model for a lot of the network stack work on CHERIoT: a remote attacker is trying to compromise the device. The firewall makes this class of attack somewhat harder, by ensuring that an attacker must spoof packets for a valid connection. This defence is weakened if your device uses a server socket because, by design, these must allow packets from unknown remote hosts.

An attacker who sneaks a packet past the firewall can attack the TCP/IP compartment. This is a fairly complex piece of code, which does dangerous things like packet parsing. It is written in MISRA C and is more likely to be correct and secure than most C code, but it may still contain bugs. The simple act of compiling it for a CHERIoT target mitigates a large number of possible bugs, as does the memory management strategy. Every incoming packet (and every outgoing packet) is a fresh heap allocation, which ensures that dangling references to processed packets will trap, as will bounds errors. Any such bugs will cause the network stack to gracefully reset, as described in Section 11.8. Understanding TCP/IP-stack reset.

Without encryption, the TCP/IP stack is not the limit of the attack surface. An attacker can push data through the network stack and into the next compartment. Using authenticated encryption, such as TLS, mitigates this.

With authenticated encryption, you can ensure that only messages from a trusted endpoint, such as your cloud server, reach your code. The TLS stack checks each incoming message for cryptographic integrity and forwards the plaintext to you only after it has been decrypted.

Using the TLS stack makes it a critical part of the attack surface. Fortunately, it has a very narrow interface with the TCP/IP stack. Internally, BearSSL uses a ring buffer for messages that are ready to be sent and those awaiting decryption. Before calling the send or receive functions in the TCP/IP stack, the TLS compartment removes all permissions except load or store (for send and receive, respectively) and sets the bounds to exactly the required amount. Removing the global permission protects the TLS stack from time-of-check-to-time-of-use (TOCTOU) attacks by guaranteeing that the TCP/IP compartment cannot capture the buffer for longer than the duration of the call. Similarly, removing permissions and bounding the pointers to the buffers ensures that no data can leak to the TCP/IP compartment and it cannot overwrite anything.

Beyond this, the TLS compartment has no global state. All state associated with a TLS connection is stored in the connection object, exposed as a sealed capability. This means that two concurrent calls into the TLS compartment for different TLS connections have no shared state, giving flow isolation. An attacker who compromises one TLS connection cannot use this to attack another.

When you communicate with a remote server via TLS, you have to identify the server in two ways. As with unencrypted connections, you must provide a host name that can be mapped to a network address. Additionally, you need to provide a TLS certificate to identify the remote host.

A TLS certificate is public key along with some metadata describing what it can be used for and when it is valid. Each TLS certificate also has an associated private key, which is (or, at least, should be) kept secret. If you sign something with the private key, someone else can use the certificate to validate that it really was signed by you.

In the simplest case, TLS can use a single certificate. You generate the pair of this certificate and its private key and embed the certificate on your device. This is a dangerous practice because there is no possible way of revoking the certificate if the key is compromised. The key must be in memory on the server that the device connects to and so is vulnerable to attack.

TLS certificates can also be arranged in certificate chains, where each certificate is signed by the private key associated with the next certificate in the chain. The root of a certificate chain is usually signed by a certificate authority (CA).

With a certificate chain, you can store a certificate on the device that does not correspond to the private key on the server, but which can still be used to verify that key. It is quite common for the server to have a very short-lived certificate, generated every week, so that if the key is compromised the associated certificate expires after a short amount of time and an attacker has a narrow window to use it. This requires your device to hold a certificate that it trusts will appear somewhere further up the chain. The set of trusted certificates is referred to as your trust anchors. Any certificate signed with the key corresponding to one of your trust anchors is considered valid. This property is transitive, so any number of certificates can exist between the one corresponding to the server's private key and the one that you hold. This provides a lot of flexibility, at the cost of computational power. Verifying a certificate chain is very fast on a multi-gigahertz machine with wide vector units but can be slow (a second or longer of CPU time) on an embedded device.

Most of the network stack APIs are intended to hide the exact implementations that we use. For example, we may wish to replace the FreeRTOS TCP/IP compartment's code with something designed for CHERIoT, perhaps written in a safe language. The TLS compartment currently leaks the fact that it uses BearSSL at the API level, by exposing trust anchors in BearSSL's internal format. This will be addressed in a future version.

If you control the remote server then you already have the .pem file that contains the certificate. If you are connecting to a server that someone else controls then you need to extract it first, or include a large set of trusted anchors. Modern web browsers do the latter, but the certificate bundle is larger than most embedded platforms would like. Fortunately, you can use the openssl command to connect to a server and report the certificate chain. Try this for example.com on the HTTPS port:

$ openssl s_client -connect example.com:443 -showcerts </dev/null 
Connecting to 2606:2800:21f:cb07:6820:80da:af6b:8b2c
CONNECTED(00000005)
depth=2 C=US, O=DigiCert Inc, OU=www.digicert.com, CN=DigiCert Global Root G2
verify return:1
depth=1 C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
verify return:1
depth=0 C=US, ST=California, L=Los Angeles, O=Internet Corporation for
   Assigned Names and Numbers, CN=www.example.org
verify return:1

The first bit of the output shows the certificate chain. The first certificate is the DigitCert Global Root G2, a certificate that the DigitCert CA uses to sign their own signing certificates. This certificate is the root that you are expected to deliver out of band. Typically, your openssl install will have some system-provided root certificates that include this one. This certificate is valid from August 2013 to January 2038. It is probably safe to use with your device.

The lifetime is longer than most embedded devices last. The CA claims (and their auditors support the claim) that this certificate is stored securely and is used only to sign the intermediate certificates that are used to sign keys for clients. Information about the intermediate certificate, DigiCert Global G2 TLS RSA SHA256 2020 CA1, shows up later in the output:

 1 s:C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
   i:C=US, O=DigiCert Inc, OU=www.digicert.com, CN=DigiCert Global Root G2
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 30 00:00:00 2021 GMT; NotAfter: Mar 29 23:59:59 2031 GMT

This expires in six years at the time of writing this book, so it might seem safe to use as a trust anchor (for now). Unfortunately, this is not the case. Although this certificate is valid for another six years, there's no guarantee that this intermediate certificate will be the one used to sign the certificate for example.com next time. The certificate that the site operator created is the first to be displayed in the output:

Certificate chain
 0 s:C=US, ST=California, L=Los Angeles, O=Internet Corporation for
        Assigned Names and Numbers, CN=www.example.org
   i:C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jan 30 00:00:00 2024 GMT; NotAfter: Mar  1 23:59:59 2025 GMT

This is valid for one year and will probably have expired by the time that you read this. Note that the lifetime of the certificate is not the same as the lifetime of the key pair. You can easily generate a new certificate signing request for the same key and have a newly signed certificate valid for another year (or just for a week) using the same key.

If we wanted to use either of the certificates that are directly sent by the server then we could simply copy the bit between BEGIN CERTIFICATE and END CERTIFICATE lines into a file. Unfortunately, we don't so we have to go to the DigiCert web site and download the correct certificate.

Once you have the certificate, BearSSL's command-line tool can convert it into a form that the library expects. The command-line tools are not built by the CHERIoT network stack, so you will need to either build them from the copy of BearSSL in network-stack/third_party/BearSSL or install them from your operating system's package manager. You can then convert the certificate file into a header that contains the trust anchor that you need:

$ brssl  ta DigiCertGlobalRootG2.crt.pem  > DigiCertGlobalRootG2.h
Reading file 'DigiCertGlobalRootG2.crt.pem': 1 trust anchor

Once you have the trust anchors and the hostname and port, you have everything that you need to be able to create a TLS connection. The current implementation of TLS in the CHERIoT network stack uses BearSSL, which avoids heap allocation. Unfortunately, this includes all of the big-number arithmetic, which causes it to require very large stacks. Listing 90 shows the stack for this example: it is just under 8 KiB.

{
        compartment = "https_example",
        priority = 1,
        entry_point = "example",
        -- TLS requires *huge* stacks!
        stack_size = 6144,
        trusted_stack_frames = 6
      },

Listing 90. Build system code for a thread that will use TLS.examples/tls/xmake.lua

This thread is now able to connect to a TLS server without running out of stack space. Recall from the certificates earlier that each has a period when it is valid. The TLS stack will check that the certificate is currently valid, which requires that the TLS stack has access to the current time. This means that you need some code that is similar to the SNTP example at the start. After initialising the network stack, you need to synchronise the clock, as shown in Listing 91.

	network_start();
	Timeout t{MS_TO_TICKS(1000)};
	// SNTP must be run for the TLS stack to be able to check
	// certificate dates.
	while (sntp_update(&t) != 0)
	{
		Debug::log("Failed to update NTP time");
		t = Timeout{MS_TO_TICKS(1000)};
	}

Listing 91. Setup required before a TLS connection is possible.examples/tls/https.cc

Connecting to a TLS server is very much like connecting to a TCP server. Compare Listing 85, which established an unencrypted connection, to Listing 92, which creates an encrypted connection. Aside from the connect function name, the only difference is that the TLS connect function requires the trust anchors. This is an intentional API choice: CHERIoT aims to be secure by default, so it should be as easy to create secured connections as it is to create insecure ones.

	Timeout unlimited{UnlimitedTimeout};
	auto    tlsSocket = tls_connection_create(
    &unlimited,
    TEST_MALLOC,
    STATIC_SEALED_VALUE(ExampleComTLS),
    TAs,
    TAs_NUM);
	if (!CHERI::Capability{tlsSocket}.is_valid())
	{
		Debug::log("Failed to connect.  Error: {}",
		           -static_cast<int32_t>(
		             reinterpret_cast<intptr_t>(tlsSocket)));
	}

Listing 92. Connecting to a remote server with TLS.examples/tls/https.cc

The return value from the tls_connection_create call is either a valid sealed capability, or null. In the future, it will use a negated error code in the untagged capability to report failure. Currently, the only failure that will be reported as a non-null capability is -ECOMPARTMENTFAIL, which will occur if there is a crash in the TLS compartment. If you did not provide a large stack (as in Listing 90) then you may see this result. Try reducing the stack size to 4 KiB and you will see failure like this in the output:

HTTPS Client: Failed to connect.  Error: 1
HTTPS Client: TLS socket: 0xffffffff (v:0 0xfffffe00-0xfffffe00 l:0x0 o:0x0 p: - ------ -- ---)

Note that the returned capability for this does not expose the socket. The TLS compartment owns the socket on behalf of the caller. This demonstrates the value of capability delegation. The TLS compartment takes the caller's malloc capability as an argument and can subsequently forward it to the TCP/IP compartment to allocate the socket. This encapsulation means that it is impossible for the caller to accidentally send data over the socket unencrypted.

This example implements a minimal HTTP client to demonstrate sending and receiving data over TLS. Do not use this HTTP client in production, it does no error checking and ignores most HTTP headers. The send code is shown in Listing 93. As with the underlying TCP send call, the TLS send call may send less than the requested amount of data. This code is therefore called in a loop, which will try to send more of the buffer if only some is sent successfully. The echo server did not need to handle this case because it only ever sent individual bytes and provided an unlimited timeout, so either the byte entered the TCP socket's send queue or the caller blocked.

This is unnecessary for the example because the amount of sent data is smaller than the TLS socket's internal buffer size so the loop will never execute, but you can force it to by adding additional headers. Internally, the TLS stack needs to assemble a complete message and then send it. The message may need to contain padding if it is too small, so the underlying APIs provide an explicit flush. CHERIoT's wrapper aims to be easy for the common case and so automatically flushes after a send. If this is not what you want, you are free to extend the source code.

	while (sent < toSend)
	{
		size_t remaining = toSend - sent;

		ssize_t sentThisCall =
		  tls_connection_send(&unlimited,
		                      tlsSocket,
		                      &(message[sent]),
		                      remaining,
		                      0);
		Debug::log("Sent {} bytes", sentThisCall);

		if (sentThisCall >= 0)
		{
			sent += sentThisCall;
		}
		else
		{
			Debug::log("Send failed: {}", sentThisCall);
			break;
		}
	}

Listing 93. Sending data over a TLS connection.examples/tls/https.cc

Receiving the response data is almost identical to receiving unencrypted data. The call shown in Listing 94 directly mirrors the TCP API from Listing 86. The only difference is that you can trust that the data has not been tampered with in-flight (unless the TLS compartment is compromised).

		auto [received, buffer] =
		  tls_connection_receive(&unlimited, tlsSocket);

Listing 94. Receiving data over a TLS connection.examples/tls/https.cc

The threat model for the TLS compartment is directed towards the TCP/IP stack as the main adversary. It implicitly trusts the caller for availability. You can almost certainly crash the TLS compartment if you call it with insufficient stack, insufficient trusted stack, and so on. If you do, you will not impact other TLS flows and the flow that you will impact is allocated from your own heap quota, so you can only attack yourself doing this.

11.6. Communicating with an MQTT server

A lot of IoT applications use MQTT (which doesn't stand for anything and isn't a message queue) as a publish-subscribe protocol for messaging. MQTT exposes an abstraction of a tree of topics, where clients can subscribe to topics and publish new values to that topic. When a client publishes a message on a topic, a copy is sent to every client that has subscribed to that topic. The protocol supports multiple levels of quality of service (QoS):

At most once: The server will attempt to deliver the message. If delivery fails, neither the client nor the server will do any additional steps.
At least once: The server will attempt to deliver the message and wait for an acknowledgement. If delivery fails, the server will try again until the message is acknowledged.
Exactly once: The server will attempt to deliver the message and use a two-way handshake to ensure that the message arrives exactly once.

The QoS levels are intended to work even if the network breaks. Clients connect with a unique 23-character identifier. If a client is already connected with the same identifier, new clients may not connect with the same identifier but they may reconnect. A reconnecting client will disconnect the original and take ownership of the ID and any messages with higher QoS levels that were destined for the original will be sent to the new owner. The CHERIoT MQTT library contains a helper for creating random client IDs, shown in Listing 95.

	Debug::log("Generating client ID...");
	constexpr std::string_view clientIDPrefix{"cheriotMQTT"};
	// Prefix with something recognizable, for convenience.
	memcpy(clientID.data(),
	       clientIDPrefix.data(),
	       clientIDPrefix.size());
	// Suffix with random character chain.
	mqtt_generate_client_id(
	  clientID.data() + clientIDPrefix.size(),
	  clientID.size() - clientIDPrefix.size());

Listing 95. Creating a client ID to use with MQTT.examples/mqtt/mqtt.cc

The CHERIoT MQTT interface doesn't support unencrypted connections. Connecting to a server requires everything that you needed for a TLS connection. This example is using the Mosquitto public MQTT test server. This server is intended for demos and is not always reliable. If the demo doesn't work, check their web interface to see if it is down.

The mqtt_connect call to connect to the server is shown in Listing 96. This API takes quite a lot of arguments. The first few are familiar from previous connection APIs: they provide the timeout, the allocation capability, and the connection capability. The next two are callbacks for publish messages (someone has published to a node that you subscribed to) and acknowledgement messages (a message that you sent has been acknowledged by the server). Next come the trust anchors, as we saw in Listing 92. The function then takes the sizes for some internal buffers and finally the client ID.

This example omits the last parameter, which has a default value of false in C++. Setting this to true will cause the library to reconnect, rather than connecting, to the MQTT server.

	Debug::log("Connecting to MQTT broker...");
	auto handle =
	  mqtt_connect(&t,
	               STATIC_SEALED_VALUE(mqttTestMalloc),
	               CONNECTION_CAPABILITY(MosquittoOrgMQTT),
	               publishCallback,
	               ackCallback,
	               TAs,
	               TAs_NUM,
	               networkBufferSize,
	               incomingPublishCount,
	               outgoingPublishCount,
	               clientID.data(),
	               clientID.size());

	if (!Capability{handle}.is_valid())
	{
		Debug::log("Failed to connect.");
		return;
	}

Listing 96. Connecting to an MQTT broker.examples/mqtt/mqtt.cc

As with other networking APIs, all of the state associated with this connection is allocated from the caller's quota. This includes the TLS and TCP/IP state that is allocated indirectly. Similarly, the result is a sealed capability that encapsulates the state of the connection. This includes a sealed capability to the TLS state, which includes a sealed capability to the TCP socket state. The MQTT, TLS, and TCP states are visible only to the compartment that owns each of them.

Once you have connected to an MQTT broker, you can send publish and subscribe messages and invoke the run loop to process incoming messages. This example first subscribes to a topic, in Listing 97.

	ret = mqtt_subscribe(&t,
	                     handle,
	                     1, // QoS 1 = delivered at least once
	                     testTopic.data(),
	                     testTopic.size());
	Debug::Assert(
	  ret >= 0, "Failed to subscribe, error {}.", ret);

Listing 97. Subscribing to an MQTT topic.examples/mqtt/mqtt.cc

The return value will be either a negative error code or a non-negative packet ID. We don't care about the packet ID in this example, so simply assert that we didn't see an error.

Documentation for the mqtt_run function

int mqtt_run(Timeout * t, MQTTConnection mqttHandle)

Fetch ACK and PUBLISH notifications on a given MQTT connection, and keep the connection alive.

This function will invoke the callbacks passed to mqtt_connect. The connection object is protected by a recursive mutex, so these callbacks can call additional publish and subscribe functions. If doing so, care must be taken to ensure that the buffer is not exhausted. Calling mqtt_run from a callback is not supported.

The return value is zero if notifications were successfully fetched, or a negative error code.

The negative values will be errno values:

-EINVAL: A parameter is not valid.
-ETIMEDOUT: The timeout was reached before notifications could be fetched.
-ECONNABORTED: The connection to the broker was lost. The client should now call mqtt_disconnect to free resources associated with this handle.
-EAGAIN: An unspecified error happened in the underlying coreMQTT library. Try again.

The server will send a reply message to acknowledge the subscription. When you call mqtt_run, it will process incoming messages and invoke the relevant callbacks. The call is shown in Listing 98.

	while (ackReceived == 0)
	{
		t   = Timeout{MS_TO_TICKS(1000)};
		ret = mqtt_run(&t, handle);
		Debug::Assert(
		  ret >= 0,
		  "Failed to wait for the SUBACK, error {}.",
		  ret);
	}

Listing 98. Waiting for acknowledgement after subscribing to an MQTT topic.examples/mqtt/mqtt.cc

The callbacks that mqtt_run invokes are the ones that were passed in Listing 96. These are CHERIoT cross-compartment callbacks. The one for acknowledgements is shown in Listing 99. This will run in the compartment that defined it, invisible to the MQTT compartment, and on a new trusted stack activation record. This example callback is not written defensively. A buggy (or malicious) MQTT compartment could pass invalid pointers that would cause a trap. If this happens, the switcher will unwind the trusted stack out of the callback, as if the callback simply returned early.

void __cheriot_callback ackCallback(uint16_t packetID,
                                    bool     isReject)
{
	Debug::log("Got an ACK for packet {}", packetID);
	if (isReject)
	{
		Debug::log(
		  "However the ACK is a SUBSCRIBE REJECT notification");
	}
	ackReceived++;
}

Listing 99. Callback for acknowledging MQTT messages.examples/mqtt/mqtt.cc

Running the example to this point should give output like this:

MQTT example: Generating client ID...
MQTT example: Connecting to MQTT broker...
MQTT example: Connected to MQTT broker!
MQTT example: Subscribing to test topic 'cheriot-book-example'.
MQTT example: Now fetching the SUBACK.
MQTT example: Got an ACK for packet 0x1

Next, the example will publish a message on the same topic and make sure that it is received. The publish part is shown in Listing 100. As with the subscribe call, this returns a negative error code or a non-negative packet number.

	ret = mqtt_publish(
	  &t,
	  handle,
	  1, // QoS 1 = delivered at least once
	  testTopic.data(),
	  testTopic.size(),
	  static_cast<const void *>(testPayload.data()),
	  testPayload.size());
	Debug::Assert(
	  ret >= 0, "Failed to publish, error {}.", ret);

Listing 100. Publishing to an MQTT topic.examples/mqtt/mqtt.cc

Publishing the message will trigger two messages from the server. There will be an acknowledgement of the publish and, because the example is subscribed to this topic, it will also receive the publish notification. The latter will be sent to the callback in Listing 101, which logs the received message.

void __cheriot_callback
publishCallback(const char *topicName,
                size_t      topicNameLength,
                const void *payload,
                size_t      payloadLength)
{
	Debug::log(
	  "Got a PUBLISH for topic {}: {}",
	  std::string_view{topicName, topicNameLength},
	  std::string_view{static_cast<const char *>(payload),
	                   payloadLength});
	publishReceived++;
}

Listing 101. Callback for receiving published MQTT messages.examples/mqtt/mqtt.cc

Running to this point should give you output like the following:

MQTT example: Publishing a value to test topic 'cheriot-book-example'.
MQTT example: Now fetching the PUBACK and waiting for the publish notification.
MQTT example: Got a PUBLISH for topic cheriot-book-example: Cheriots of fire!
MQTT example: Got an ACK for packet 0x2

The demo will then wait for four more messages on the same topic. If you happen to run this demo at the same time as other people, you might see them. Alternatively, if you install the command-line tools that come with Mosquitto, you can send a message from the command line:

$ mosquitto_pub -h test.mosquitto.org 
	-t cheriot-book-example 
	-m 'My name is David'

This will then show up as:

MQTT example: Got a PUBLISH for topic cheriot-book-example: My name is David

Don't put anything secret in the message, it will go to anyone running this demo or anyone observing the public test server.

Finally, the demo disconnects. This is often unnecessary. Most IoT devices will simply remain connected for their entire operation. They will explicitly reconnect if the connection drops but never disconnect explicitly.

	ret = mqtt_disconnect(
	  &t, STATIC_SEALED_VALUE(mqttTestMalloc), handle);
	Debug::Assert(
	  ret == 0, "Failed to disconnect, error {}.", ret);

Listing 102. Gracefully disconnecting from an MQTT server.examples/mqtt/mqtt.cc

This function gracefully disconnects, allowing the server to clean up all state associated with the current connection. It can fail, for example by running out of memory to hold the disconnection messages.

11.7. Enforcing network access policies

The network stack comes with a network_stack.rego file that provides helpers for inspecting the state of the network stack. You pass this as an argument to the --module (or -m) flag for cheriot-audit. For the rest of this section, we'll use cheriot-audit to inspect and audit the Section 11.6. Communicating with an MQTT server example. From the examples/mqtt directory, you will need to run a command like this:

$ cheriot-audit -m path/to/network-stack/network_stack.rego \
	-b path/to/sdk/boards/sonata.json \
	-j build/cheriot/cheriot/release/mqtt.json \
	-q {query}

This assumes that cheriot-audit is in your path. If it is not, provide the full path, for example /cheriot-tools/bin/cheriot-audit in the dev container. The first two arguments need to be paths to wherever the network stack and CHERIoT RTOS sources are located. The -j flag should be copied as-is, this finds the JSON file that the linker created with the audit report for the firmware image. Finally, you will provide a query for the -q, which will be different as you work through the examples.

If you want to actually read the JSON output, you will find that piping it to jq is helpful, which will pretty-print (and colour) the output.

If you're copying the Rego queries to the command line, make sure that you quote them. Placing the query text in single quotes should work for all of the examples in this section.

Let's start with a query that invokes one of the more complex rules. This will find every software-defined capability in the firmware image that is sealed with the type for connection capabilities, and then decodes them into JSON objects. Try this query:

data.network_stack.all_connection_capabilities

You should see the following JSON as the result:

[
  {
    "capability": {
      "connection_type": "UDP",
      "host": "pool.ntp.org",
      "port": 123
    },
    "owner": "SNTP"
  },
  {
    "capability": {
      "connection_type": "TCP",
      "host": "test.mosquitto.org",
      "port": 8883
    },
    "owner": "mqtt_example"
  }
]

This tells you that there are two compartments that can make sockets. The MQTT example compartment can make a TCP connection to the Mosquitto test server on port 8883. The SNTP compartment can create a UDP socket and open a firewall rule that allows it to communicate with the public NTP pool on the well-known NTP port. TCP is connection-oriented so the network stack implicitly opens firewall rules on connection. UDP is connectionless so there are explicit APIs for opening firewall rules that allow a UDP host to communicate with explicit peers. The connection capabilities are similar in both cases but their use is different.

Remember that capabilities can be delegated. The MQTT example compartment does not open a socket directly, it is passing this capability to the MQTT compartment, which passes it to the TLS compartment, which then passes it to the network API compartment to access the socket. You can validate this with another query:

data.compartment.compartments_calling_export_matching("NetAPI",
`network_socket_connect_tcp(.*`)

Rego uses double quotes for normal strings. These follow similar escaping rules to C so, for example, "\t" is a string containing a tab character. You can avoid this processing by using backticks to designate a raw string. The Rego raw `\t` is a string containing a backslash and the letter 't'. It's common to use raw strings when constructing regular expressions to avoid needing to escape backslashes.

The report contains the mangled name of the export, which includes the types. This query uses a regular expression to match anything with the function name followed by an open bracket, so will catch any overload of the function (this function has no overloads but specifying all of the arguments is tedious). The output should look like this:

[
  "TLS"
]

The only compartment that creates TCP connections is the TLS compartment. This is interesting but not very useful.

The policy that we actually want is that no unencrypted data leaves the device. The way to express that is that nothing sends data over a socket except via the TLS compartment. This query is very similar to the last one:

data.compartment.compartments_calling_export_matching(
	"TCPIP",
	`network_socket_send(.*`)

And, again, tells you that only the TLS compartment is sending data:

[
  "TLS"
]

If you remember the result of the first query, this might be a surprise. Didn't the SNTP compartment also have a capability that allows it to connect to the network? SNTP doesn't run over TLS, so what's happening here?

You don't send UDP data with network_socket_send, you send it with network_socket_send_to. This requires another variant of the same query:

data.compartment.compartments_calling_export_matching(
	"TCPIP",
	"network_socket_send_to.*")

And now that we see that the only compartment sending data over UDP is the SNTP compartment:

[
  "SNTP"
]

Now we can think about ways that a compartment might be able to exfiltrate data with this. First, let's see what this compartment exports:

input.compartments.SNTP.exports

This compartment exports a single symbol, which takes a single Timeout argument:

[
  {
    "export_symbol": "__export_SNTP__Z11sntp_updateP7Timeout",
    "exported": false,
    "interrupt_status": "enabled",
    "kind": "Function",
    "register_arguments": 1,
    "start_offset": 208
  }
]

This could potentially leak data via the timeout. If you are concerned about this, you can wrap the calls to this function in another compartment and audit the source of that.

There's another way that you might leak data to the SNTP compartment: via pre-shared objects. You can ask if the SNTP compartment has access to any pre-shared objects with the following query:

data.compartment.shared_object_imports_for_compartment(
	input.compartments.SNTP)

This tells you that, yes, it does:

[
  {
    "kind": "SharedObject",
    "length": 24,
    "permits_load": true,
    "permits_load_mutable": false,
    "permits_load_store_capabilities": false,
    "permits_store": true,
    "shared_object": "sntp_time_at_last_sync",
    "start": 1237648
  }
]

This can't contain capabilities, but it is readable so if another compartment has write access to this object then it could communicate data to the SNTP compartment. We can check that with an allow-list query:

data.compartment.shared_object_writeable_allow_list(
	"sntp_time_at_last_sync",
	{"SNTP"})

This takes the name of a shared object as the first argument and a set of compartments that may hold writeable capabilities to it as the second. Unlike the prior queries, this does not expand to a complex JSON response, it is a single JSON value: true.

This is one of the checks performed by the valid rule in the network_stack package. This takes the network interface as its argument. On Sonata, the Ethernet device is accessed via the second SPI channel. You can check the integrity of the network stack with the following query:

data.network_stack.valid(spi2)

Again, this should simply evaluate to true. You can use this, along with the other things that you've seen in this section, to build a policy for this example. The start is shown in Listing 103. This is the head of a Rego rule that is parameterised on the device name and forwards to the network stack's validity rule. The network stack checks access for the shared object.

# Rule for defining 
valid(ethernetDevice) {
	# Check the integrity of the network stack
	data.network_stack.valid(ethernetDevice)

Listing 103. The start of the Rego policy for the MQTT example.examples/mqtt/mqtt.rego

Next, the policy checks that there are exactly two connection capabilities and that they are the two that we expect. This is shown in Listing 104. The first check uses the count operator to ensure that the length of the array containing all capabilities is two. The next two checks are more interesting because they use the fact that Rego expressions include JSON. Each of these starts with a JSON object literal for the capability that we expect to find (the one that we saw earlier using cheriot-audit for introspection) and then uses the in operator to check that this object is part of the array.

JSON is tree-structured data with a small number of primitive types so it is easy to do exact equality comparisons on arbitrary JSON data. The in operator uses this to operate over a collection (set, array, or object) and return whether the collection contains the requested value. This is not string comparison. The indentation in this example is purely for readability.


	# Check that only the authorised set of remote hosts are
	# allowed
	count(data.network_stack.all_connection_capabilities) == 2

	{
		"capability": {
			"connection_type": "UDP",
			"host": "pool.ntp.org",
			"port": 123
		},
		"owner": "SNTP"
	} in data.network_stack.all_connection_capabilities

	{
		"capability": {
			"connection_type": "TCP",
			"host": "test.mosquitto.org",
			"port": 8883
		},
		"owner": "mqtt_example"
	} in data.network_stack.all_connection_capabilities

Listing 104. Rego rules for restricting output in the MQTT example.examples/mqtt/mqtt.rego

Finally, in Listing 105 the rule contains checks for the property that you saw earlier: no unencrypted data can leave the device. This is implemented with two allow-list rules, which pass only if the set of allowed compartments contains every compartment that can call the specified set of entry points.


	# Restrict which compartments can send data
	data.compartment.compartment_call_allow_list(
		"TCPIP",
		`network_socket_send\(.*`,
		{ "TLS" })
	data.compartment.compartment_call_allow_list(
		"TCPIP",
		`network_socket_send_to\(.*`,
		{ "SNTP" })

Listing 105. Rego rules to ensure that no data leaves the device unencrypted for the MQTT example.examples/mqtt/mqtt.rego

These are all in the mqtt.rego file in the example so you can add -m mqtt.rego to your cheriot-audit command line to use them. Now, you can simply run data.mqtt.valid(spi2) (or data.mqtt.valid(kunyan_ethernet), if you're using the Arty A7 builds) to check that the firmware image that you've built from this example complies with the policy.

If you write a similar policy for your real firmware and incorporate it into your code-signing flow then you can ensure that everything running on your device has the properties that we've described. If a developer accidentally leaves an unencrypted debug channel enabled in a release build, for example, then the policy check will fail. Similarly, if someone adds integration with another cloud service, you will see the checks fail and need to update the policy to make sure that it matches your new security goals.

11.8. Understanding TCP/IP-stack reset

CHERIoT provides a lot of out-of-the-box security guarantees simply by recompiling code. The FreeRTOS+TCP codebase was audited in 2019 and the auditors found ten vulnerabilities. Of these, eight were memory-safety bugs that could either allow arbitrary-code execution or information disclosure. One was a division by zero, which could cause a trap. The remaining one was a failure to properly implement DNS, which could allow DNS cache poisoning.

All of these are mitigated by the compartmentalisation model in the CHERIoT network stack. The DNS attack may still be possible, but very hard to exploit. The vulnerability was that DNS responses were processed even if they did not accompany a query so sending a DNS response to the device would cause it to add the entry to its cache and then not do the DNS query when it was requested. The CHERIoT firewall drops in-bound DNS packets except when a DNS request is known to be in flight, so attempting to send the response to the device early would simply be ignored. An attacker would have needed to time the attack for when a DNS response was in flight. An attacker who can observe DNS requests leave the device and send packets in response can simply lie in the DNS response (unless DNSSEC is being used) and could achieve the same result on any system even without the bug. Alternatively, an attacker could flood the device with responses and hope that theirs arrived first. This would be likely to succeed but would show up as unusual traffic on any network with some monitoring.

The memory-safety bugs would all have the same impact as the division-by-zero error. They would cause the hardware to raise a trap, which would then crash the TCP/IP compartment.

Crashing is usually better than allowing an attacker to gain control of a device, but it's far from ideal. Crashing a compartment is somewhat better because it allows other functionality to keep working. For an IoT device, the Internet bit may be a core part of the functionality. Fortunately, CHERIoT compartmentalisation provides two benefits:

The fault happens before anything can corrupt memory outside objects that it has access to.
The blast radius is limited to the compartment boundary and things that are explicitly shared.

This combination means that it's possible to handle the error and gracefully recover. Recovery is complicated in a TCP/IP stack because it is multithreaded. A crash may happen in the thread where the firewall provides the network stack with new packets. It may happen in the thread that handles TCP/IP retransmissions. It may also happen in any thread that another compartment uses to call network-stack functions. When a crash occurs, the first thing that the error handler needs to do is ensure that all of the threads rendezvous.

The socket structures that the TCP/IP compartment allocates and exposes via sealed capabilities are added to a linked list when they're created. When a crash occurs, the error handler walks this list and places the locks in destruction mode. In destruction mode, all threads waiting on a lock will wake and fail to acquire the lock. This forces any threads that were waiting for the socket lock to return failure.

Next, the error handler does the same to global locks and begins freeing memory. This can cause other threads to crash. That's fine because they will just enter their error handlers as well. The error handlers will check a global variable that tracks the reset state machine to determine whether they need to do anything or just exit.

When a user calls into the TCP/IP compartment, the API functions increment a counter of the number of threads that are present. This is then decremented in the error handler, or if they gracefully exit. When it reaches zero, the error handler knows that reset is finished.

Other threads may allocate memory during the shutdown process, so the error handler will call heap_free_all several times during the shutdown process.

Once everything is deallocated, the error handler increments an epoch counter. This is a 64-bit counter (so it will never overflow in the plausible lifetime of the device).

Every socket structure contains a copy of the epoch counter from when it was created. If a socket is not currently being used, it will have been removed from the list, but the memory won't have been freed because memory is allocated with the caller's quota and not the network stack's. The next time the socket is used, the send or receive function will compare the epoch of the socket to the current epoch of the TCP/IP stack. If they differ then the socket belonged to a previous incarnation of the TCP/IP stack. The function will simply report that the connection dropped. This can happen asynchronously, after reset.

Shutting down the TCP/IP stack is the difficult part, but not the part that is useful to users. The next step is to restart it. First, the error handler resets all of the global variables to their initial states (except the epoch). Next, it resumes the IP thread from its initial state and reruns initialisation. Most of the time is spent waiting for a DHCP lease, the rest of the reset happens very quickly.

If you want to test this, you can use the network_inject_fault function. This is not compiled in by default, you must add --network-inject-faults=y to your xmake config line. When you call this function, it sets a flag so that the next incoming packet will have incorrect bounds applied. This will cause the TCP/IP stack to crash somewhere.

Documentation for the network_inject_fault function

void network_inject_fault()

Inject a memory-safety bug into the network stack.

This is disabled unless compiled with the network-inject-faults option.

From your perspective, you should simply see a connection-dropped error. If you've written robust networking code, you're handling this anyway. Networks are intrinsically unreliable and will sometimes fail for reasons beyond your control. When this happens, you need to reconnect.

The TCP/IP compartment crashing is no different; it will appear as if the connection dropped. If DHCP is taking its usual amount of time, attempting to reconnect may fail for a second or two, and will then succeed.

The failure will be propagated through any of the other compartments that you're using from the network stack. For example, if you're using MQTT, the TLS compartment will have a send or receive fail. It will then report that the TLS session has been disconnected to the MQTT compartment. This, in turn, will report to you that the MQTT connection has dropped the next time you call publish, subscribe, or run functions.

Try modifying the MQTT example to handle reconnection if any of the later functions report disconnection. Remember that MQTT supports reconnection (as opposed to connection) to resume an existing connection if the network went away. Change the timeout for one of the mqtt_run calls and read a switch or UART to determine when to call network_inject_fault.

You should be able to make the network stack crash repeatedly without more than intermittent disconnection.

For a more complete example, look at the Hugh the Lightbulb demo. This is a demo that runs on Sonata and uses an Android app to control the multi-colour LED on the Sonata board via MQTT. It also uses the monochrome LEDs to show the network connection state, so you can see each of the stages in the system:

The system has started.
The network stack is initialised.
The clock is synchronised with NTP time.
The connection to the MQTT server is established.
The MQTT subscription to the topic for the controller is registered.

If you flip the rightmost DIP switch, it will trigger a crash. The LCD shows a CPU usage graph at the top and a heap-memory usage graph at the bottom, as you can see in Figure 7. The Sonata LEDs and LCD display running the Hugh the Lightbuld demo. You'll see a sharp drop in heap usage as all of the TCP/IP state is freed (and then TLS and MQTT state is freed as their respective compartments see the failure). Then you'll see a short pause as the TCP/IP stack recovers its DHCP lease. Next, you'll see a burst of 100% CPU usage as the TLS session is reestablished.

Figure 7. The Sonata LEDs and LCD display running the Hugh the Lightbuld demo.

The whole reset process takes a few seconds, most of which is either waiting for DHCP or reestablishing the TLS connection. During this time, all of the other demo functionality (updating the LCD display and the other LEDs) works fine. The failure is contained to the compartment with the bug and the reset means that other code can continue to be oblivious to this failure.

CHERIoT Programmers' Guide

Table of contents