Over the past few months, the Cloud Foundry Garden and GrootFS teams have been working to bring support for rootless containers to the platform. Now that we’re nearing the end of this piece of work, it feels like a good opportunity to talk a little about what we’ve done and what we’ve got planned for the future of Cloud Foundry’s container runtime.
A Walk through the Garden
Let’s begin with a brief overview of Garden, containers and Cloud Foundry in general. Cloud Foundry is a multi-tenant PaaS, which means a single Cloud Foundry deployment supports the ability to run applications from a range of disparate and unrelated users. In order to support this use case, it’s vital that the platform guarantees strict isolation between each running application. At no point should an application pushed by some random user on the Internet be able to see my application, or have any unintentional impact on it at all.
This is where containers come into play. Containers enable us to run applications from any user on the same host machine while providing the required level of isolation and resource limiting for each application instance. They’re also extremely quick to create and destroy, making them the ideal tool for the job.
In Garden, container creation and management is handled through the Garden API, which is exposed by a server process and accessed via a Unix socket. The API is designed to be platform-agnostic but is implemented by platform-specific backends. The primary Garden backend today is garden-runc, which, in its default configuration, uses a CLI tool named runC to create and run containers according to the OCI (Open Container Initiative) specification.
Side note: Somewhat confusingly, garden-runc can now be configured to use runtimes other than runC, namely a Windows implementation called winC. Naming things is hard…
Let’s take a look at how containers are created with runC.
The “bundle” is a directory containing a config.json and a rootfs directory. Config.json specifies the container’s configuration values and the rootfs directory is used as the container’s root filesystem. Every container has a corresponding bundle, and a large part of what Garden does is to prepare these bundles for runC to consume and create containers from. When asked to create a container, Garden invokes the “runc create” command on a bundle and runC performs the low-level kernel operations needed to cause that container to exist.
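To make the bundle layout concrete, here is a minimal sketch in Python that writes one to a temporary directory. It is an illustration only, not how Garden prepares bundles: the `write_bundle` helper, the directory name and the chosen values are assumptions, the rootfs is left empty, and a real config.json carries far more configuration. The field names (`ociVersion`, `root.path`, `process.args`, `process.cwd`, `process.user`) come from the OCI runtime specification.

```python
import json
import tempfile
from pathlib import Path


def write_bundle(bundle_dir, args):
    """Create a minimal bundle: a rootfs directory plus a config.json.

    Hypothetical helper for illustration; a usable bundle needs a populated
    rootfs and a much fuller configuration.
    """
    rootfs = bundle_dir / "rootfs"
    rootfs.mkdir(parents=True, exist_ok=True)  # left empty in this sketch

    config = {
        "ociVersion": "1.0.0",
        # Path to the root filesystem, relative to the bundle directory.
        "root": {"path": "rootfs"},
        # The process to run inside the container.
        "process": {
            "args": args,
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
        },
    }
    config_path = bundle_dir / "config.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


bundle = Path(tempfile.mkdtemp()) / "my-container"
config_path = write_bundle(bundle, ["/bin/sh"])
print(sorted(p.name for p in bundle.iterdir()))  # ['config.json', 'rootfs']
```

A directory shaped like this is what “runc create” is pointed at when it is asked to create a container.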
RunC has to do a lot of things once the “runc create” command has been called. It has to create new Linux namespaces, set up cgroups, pivot root filesystems, set users and groups, drop capabilities and enforce seccomp/AppArmor profiles, among other things. Some of these operations require root-level permissions, which, until recently, meant that runC had to be run as the root user. This meant we also had to run Garden as the root user. This isn’t a great situation to be in because it means that an exploit in either Garden or runC opens up the possibility for a malicious user to gain root access to the host. This is especially bad when considering the multi-tenant nature of the platform. Fortunately, there is a wonderful solution to this problem: rootless containers.
Rootless Containers
A rootless container can be thought of as one that is created, run and managed entirely by an unprivileged user. A few months ago, runC added initial support for rootless containers, which means it can now be run as a non-root user, which in turn means that we can run Garden as a non-root user as well. This is a big win for the overall security of the Cloud Foundry Runtime.
In order to better illustrate this point, let’s compare a sample section of Garden’s process tree when running with rootful containers vs when running with rootless containers (using a CF Diego cell VM as an example).
Here each box represents a Linux process and the colors indicate which user each process runs as. Red indicates a process running as the root user and blue/green indicate a process running as a non-root user. The vcap user has been highlighted because it is the user that applications run as inside Cloud Foundry (vcap is itself a non-root user).
Although somewhat simplified, it’s clear that a much higher percentage of processes are able to run as a non-root user when using rootless containers. In fact, the only process left running as the root user in rootless mode is the “gdn setup” process. While not ideal, the security implications of running this particular process as the root user are fairly minimal. It is a one-time, short-lived process that runs and exits before any container can be created on the host system. As such, there’s zero risk of a user application being able to hijack the setup process for malicious intent.
Now, let’s answer what I’m sure is probably your first question about the diagram: what on earth is dadoo? Dadoo has a few responsibilities but its primary purpose is to run containerized processes via runC, to stream process IO, to wait for the processes to exit and to then report the exit codes back to the server. In short, dadoo run runC dadoo run run … (sorry).
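The run, stream, wait and report pattern described above can be sketched in a few lines of Python. This is not dadoo’s implementation and it does not involve runC at all: `run_and_report` is a hypothetical helper, and the child is just a small Python one-liner standing in for a containerized process.

```python
import subprocess
import sys


def run_and_report(argv):
    """Start a process, stream its output, wait for it, return the exit code.

    Hypothetical stand-in for the pattern only; lines are collected in a list
    here where a real supervisor would forward them to a client.
    """
    lines = []
    with subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    ) as proc:
        # Stream output line by line as the child produces it.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
        # Block until the child exits and collect its exit code.
        exit_code = proc.wait()
    return exit_code, lines


child = [sys.executable, "-c", "print('hello from the child'); raise SystemExit(3)"]
exit_code, output = run_and_report(child)
print(exit_code, output)  # 3 ['hello from the child']
```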
There is a subtlety in the above diagram worth pointing out: in rootless mode, the dadoo process (running as some non-root user) is able to start a process that’s running as a different non-root user, vcap. Typically speaking, a non-root user would require root-level privileges to be able to do this, so what’s going on here?
The answer lies with the user namespace, which incidentally is the same core feature that allows for rootless containers to exist in the first place.
Who Am I?
The user namespace is a huge topic in itself and is mostly out of scope for this particular article, but we can’t talk about rootless containers without at least mentioning it. So for now, here are the need-to-know basics accompanied by a nice diagram.
- The user namespace provides isolation of UIDs and GIDs
- Every Linux process runs inside a user namespace
- There can be multiple, distinct user namespaces at any one time on a single host system
- User namespaces have parent/child relationships
- The UID of a process may change depending on which user namespace it’s being read from
- UID/GID mapping provides a mechanism for mapping IDs between parent and child user namespaces
- An unprivileged user can create a new user namespace
Pictured are two user namespaces, A and B, along with their corresponding UID/GID tables. Note that process 3, running as a “non-root” user, is able to create process 4, running as the “root” user. While that may sound horribly insecure, the key implementation detail that prevents it from being so is the mapping between the two user namespaces, shown above by the dashed arrow. Process 4 is only running as the root user within the context of user namespace B. If you were to read the UID/GID of process 4 from the perspective of user namespace A, you would instead see a value of 1000, non-root. Importantly, kernel permission checks are namespace aware: being UID 0 in a container does not give you elevated privileges on resources outside of the namespace because the kernel still knows – and performs access checks based on – your real UID.
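The translation between the two views can be sketched as a small lookup. The Python below is an illustration of the idea, not kernel code: `to_parent_id` and `uid_map_b` are hypothetical names, and each map entry uses the same three columns found in /proc/<pid>/uid_map (first ID inside the namespace, first ID outside it, length of the range). The single entry mirrors the example above, where root in namespace B is UID 1000 in namespace A.

```python
def to_parent_id(child_id, id_map):
    """Translate an ID seen inside a child user namespace to the parent's view.

    id_map is a list of (inside_start, outside_start, length) tuples.
    Returns None when the ID has no mapping in the parent namespace.
    """
    for inside_start, outside_start, length in id_map:
        if inside_start <= child_id < inside_start + length:
            return outside_start + (child_id - inside_start)
    return None


# Namespace B maps exactly one ID: root inside is UID 1000 outside.
uid_map_b = [(0, 1000, 1)]

print(to_parent_id(0, uid_map_b))  # 1000
print(to_parent_id(1, uid_map_b))  # None
```

With only one entry in the map, every ID other than 0 is unmapped, which is why a single mapping limits a container to a single user.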
Replace user namespace A with the host/parent user namespace and user namespace B with container/child user namespace and you can start to form a picture of what makes rootless containers possible.
Let’s Get Ready to Rootless!
Now that we’ve covered the “why” and “how” of rootless, let’s talk about the “when”: when will rootless containers be ready for use in Cloud Foundry? The short answer is “Hopefully very soon!” We’ve just finished a spike to run the CATs (Cloud Foundry Acceptance Tests) against a Cloud Foundry deployment with rootless mode enabled and, for the most part, the tests are passing! This gives us the confidence we need to start down the path of recommending and enabling rootless containers for production use.
Getting the CATs to go green was one of the end goals for the rootless track of work, but it wasn’t quite as simple as just running runC as a non-root user. There were a few key features missing from the initial support for rootless in runC we knew would be required to get the CATs passing.
The first was the ability to map multiple user and group IDs into a rootless container. Without this feature, all files in the container’s root filesystem would have to be owned by the same user, which would make it impossible for us to run user applications as the vcap user. The second was rootless support for cgroups, without which we would not be able to apply resource limits to rootless containers. And the third, while not technically the responsibility of runC, was still something we’d need to address before even considering rootless for Cloud Foundry, and that was support for networking.
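As a rough illustration of what mapping multiple IDs looks like, the Python sketch below builds the `linux` section of a config.json with two ranges per mapping. The field names (`namespaces`, `uidMappings`, `gidMappings`, `containerID`, `hostID`, `size`) are taken from the OCI runtime specification; the `id_mappings` helper and every numeric value are assumptions chosen for the example, not Garden’s actual configuration.

```python
import json


def id_mappings(host_id, sub_start, sub_count):
    """Build two mapping ranges: container root plus a block of other IDs.

    Hypothetical helper; all values passed in below are illustrative.
    """
    return [
        # Container ID 0 (root) maps to one unprivileged host ID.
        {"containerID": 0, "hostID": host_id, "size": 1},
        # Container IDs 1..sub_count map to a separate block of host IDs.
        {"containerID": 1, "hostID": sub_start, "size": sub_count},
    ]


linux_section = {
    "namespaces": [{"type": "user"}, {"type": "mount"}, {"type": "pid"}],
    "uidMappings": id_mappings(1000, 100000, 65535),
    "gidMappings": id_mappings(1000, 100000, 65535),
}

mapped_uids = sum(m["size"] for m in linux_section["uidMappings"])
print(mapped_uids)  # 65536
print(json.dumps(linux_section["uidMappings"][1]))
```

With a second range in place, files in the root filesystem can belong to more than one user, so a process inside the container can run as a non-root user distinct from the container’s root.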
Unfortunately networking is hard, and there are some strict limitations in place today that mean the root user pretty much has to be involved at some point during container creation. In order to work around these limitations, we apply the setuid bit to the part of Garden responsible for setting up networking — the network plugin. Fortunately Garden’s plugin architecture allows us to minimize the surface area of the setuid to only the parts of the codebase that actually require it (i.e. the networking part, configured via the network plugin).
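For readers unfamiliar with the setuid bit, the Python sketch below sets and inspects it on a throwaway file. It is only an illustration of the permission bit itself: `has_setuid` and `plugin_path` are hypothetical names, the file is an empty temporary file and not a network plugin, and the bit only confers root privileges when the executable is owned by root.

```python
import os
import stat
import tempfile


def has_setuid(path):
    """Return True if the setuid bit is set on the file at path."""
    return bool(os.stat(path).st_mode & stat.S_ISUID)


# An empty temporary file stands in for a plugin binary.
with tempfile.NamedTemporaryFile(delete=False) as f:
    plugin_path = f.name

os.chmod(plugin_path, 0o755)
before = has_setuid(plugin_path)

# Equivalent to `chmod u+s`: processes started from this file run with the
# file owner's user ID, whichever user launches them.
os.chmod(plugin_path, 0o755 | stat.S_ISUID)
after = has_setuid(plugin_path)

os.remove(plugin_path)
print(before, after)  # False True
```

Keeping that bit on a single small executable, with everything else unprivileged, is the design choice the plugin architecture makes possible.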
As for the remaining missing features, multiple users and cgroups support, we were now presented with a good opportunity to get more involved with the OCI community. A couple of years ago when we made the decision to move to runC, there was some uncertainty about whether or not it was worth the investment. At the time, we had our own homegrown container runtime — garden-linux — that was adequately fulfilling Cloud Foundry’s container needs — so why invest the time in switching to something else?
There are a plethora of reasons: community-backed standards, security, and new features (such as rootless), to name a few. Time and time again, switching to runC has proven to be the correct decision, so it’s always nice to be in a position to contribute something back. With this in mind, we set about contributing the missing functionality and, with the help of the OCI community (particularly Aleksa Sarai), this work is now tagged for release in runC 1.0.0.
Kick the Tires
If you’d like to try rootless containers for yourself, there are two ways to do so: the BOSH way and the non-BOSH way. If using BOSH, rootless containers can be enabled by setting the “garden.experimental_rootless_mode” property to “true” (although please note that this is not quite enough to get rootless container support in a full Cloud Foundry deployment just yet). Alternatively, we provide a step-by-step guide for installing a standalone rootless Garden here.
Special Thanks
As briefly touched on above, support for rootless containers in Cloud Foundry would not be possible without the support of runC, the OCI community and particularly Aleksa Sarai, so a huge thanks to everyone involved.