
Anatomy of Containers, Part I: They Are Just Fancy Processes!
A while ago, I stumbled upon some LinkedIn copy-pasta about Go 1.25 finally making GOMAXPROCS ‘container-aware’. For some reason, if you try to get CPU or memory info from within a container, you get the host’s total resources. This is probably one of the few instances where containers behave differently, and I didn’t really understand why.
Despite relying on k8s almost daily for quite a while now, my understanding of how containers work had been fairly limited. Partly because these abstractions work so well and there is so much to learn about simply using k8s itself that I never thought of looking beneath the abstraction layers. The broad statements I came across while learning Docker like “containers don’t do virtualization,” “they don’t have a hypervisor layer like VMs,” and “they sit directly on the host OS kernel” didn’t offer much real insight either.
So for that reason, the statement “containers are just fancy processes” really clicked for me. At the risk of over-simplifying, they essentially package the whole application along with all its user-space dependencies and run them as isolated processes on the host’s kernel.
In this series of posts, I hope to share an anatomy of containers that I pieced together by tinkering around over the past few weeks. Rather than diving deep into a specific topic, it’s meant to provide a bird’s-eye view of containers and the surrounding ecosystem. I’ll include resources to explore particular areas in more detail. A lot, if not all, of this was new to me, so feel free to correct me if I goof something up.
PS: Much of this first blog is borrowed from this highly recommended series of videos by Liz Rice.
The Kernel Features That Allow Containerisation
At the core of containers are two Linux kernel features: “namespaces” and “cgroups”.
- Namespaces isolate system resources, so processes inside a container see only their own view of resources like files, processes, and network interfaces.
- Cgroups (control groups) limit the amount of resources (CPU, memory, disk I/O, network, etc.) that a group of processes can use.
Both namespaces & cgroups are inherited by child processes, meaning that when a process spawns another, the child processes remain in the same namespace and cgroup as the parent unless explicitly changed. As you’ve probably gathered by now, if we use these two features together, we should be able to run and manage a set of isolated processes on the same host system. And that’s what containers are all about! Here’s a closer look at both of these concepts.
Namespaces
Namespaces allow processes inside them to have their own private view of certain resources of the system such as process IDs, filesystems, network interfaces, and hostnames. This is what makes it possible for multiple containers to run on the same host without interfering with each other, even though they share the same kernel. The way to enter/create a namespace is to pass appropriate flags during process creation using the clone syscall or via the setns syscall to join an existing namespace. The namespaces of a process are visible in the proc fs as symlinks under /proc/<pid>/ns. A few important namespaces for isolation are:
- PID (pid): Isolates the set of processes & PIDs. Processes can only see other processes in the same or child namespaces. The first process in a new PID namespace gets PID 1 and is considered the init process of that namespace. All the signals (SIGTERM, SIGINT, SIGQUIT, etc.) sent to the container are received by this init process. So it’s important that this process handles these signals properly to manage the lifecycle of the container.
- UTS (uts): Isolates the hostname and NIS domain name, so that processes in different namespaces can have different hostnames. Fun fact: in k8s, a pod’s hostname (which defaults to the pod name) is set inside the pod’s UTS namespace!
- Mount (mnt): Isolates the set of filesystem mount points. You start off with a copy of the calling process’s mount points, but can add, remove, or change mount points without affecting the host or other namespaces. We’ll explore filesystems further in part two.
- Network (net): Isolates the whole network stack, including interfaces, routing tables, firewall rules, etc. Networking is covered in detail in part two.
- User (user): Isolates user and group IDs. Inside a user namespace, a process can have a different mapped UID/GID than outside. This also allows unprivileged users on the host to become UID 0 (root) inside this namespace. We will see how the user namespace works in part three.
There are a few other namespaces, such as IPC (ipc), cgroup (cgroup), and time (time, which covers CLOCK_MONOTONIC and CLOCK_BOOTTIME).
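To make the PID 1 point above concrete, here is a minimal sketch of an init-style wrapper that forwards termination signals to the workload it started. The function name runAsInit is made up for illustration, and real init processes (tini, for example) also reap orphaned children, which is left out here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runAsInit (hypothetical helper) starts the workload as a child process,
// forwards termination signals to it, and returns the workload's exit code.
func runAsInit(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

	// Subscribe before starting the child so no signal is missed.
	sigs := make(chan os.Signal, 8)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)
	defer signal.Stop(sigs)

	if err := cmd.Start(); err != nil {
		return -1, err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	for {
		select {
		case s := <-sigs:
			// Pass the signal on; the child decides how to shut down.
			_ = cmd.Process.Signal(s)
		case err := <-done:
			var exitErr *exec.ExitError
			if errors.As(err, &exitErr) {
				return exitErr.ExitCode(), nil
			}
			if err != nil {
				return -1, err
			}
			return 0, nil
		}
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: init <cmd> [args...]")
		return
	}
	code, err := runAsInit(os.Args[1], os.Args[2:]...)
	if err != nil {
		fmt.Fprintln(os.Stderr, "init:", err)
		os.Exit(1)
	}
	os.Exit(code)
}
```

If the process running as PID 1 ignores SIGTERM instead, stopping the container typically ends with a forced kill after a timeout.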
Cgroups
As mentioned earlier, cgroups allow us to restrict the resource usage (CPU, memory, max processes, etc.) of a set of processes. This is what allows us to set resource limits in k8s & Docker. It works via a pseudo-filesystem usually mounted at /sys/fs/cgroup. Similar to namespaces, the file /proc/<pid>/cgroup in the proc fs shows which cgroup a process belongs to. There are two versions of cgroups; we will use cgroups v2. To create a cgroup, we create a directory with its name in this filesystem, then write the PIDs of its member processes to the cgroup.procs file inside it. Files like memory.max and cpu.max in this directory set the maximum amount of each resource that the processes in the cgroup can use. Similarly, cpuset.cpus restricts which CPU cores the processes are allowed to run on.
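As a rough illustration of how these files fit together, the sketch below reads /proc/self/cgroup, resolves the matching directory under /sys/fs/cgroup, and prints a few limit files. It assumes cgroups v2 mounted at the usual location; the helper name cgroupV2Dir is made up for this example. Inside a container with its own cgroup namespace the entry is usually just "0::/", so the directory resolves to /sys/fs/cgroup itself.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Assumption: cgroup v2 is mounted at its usual location.
const cgroupRoot = "/sys/fs/cgroup"

// cgroupV2Dir (hypothetical helper) finds the cgroup v2 entry ("0::<path>")
// in the contents of /proc/<pid>/cgroup and returns the corresponding
// directory under cgroupRoot.
func cgroupV2Dir(procCgroup string) (string, error) {
	for _, line := range strings.Split(procCgroup, "\n") {
		if strings.HasPrefix(line, "0::") {
			return filepath.Join(cgroupRoot, strings.TrimPrefix(line, "0::")), nil
		}
	}
	return "", fmt.Errorf("no cgroup v2 entry found")
}

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("cannot read /proc/self/cgroup:", err)
		return
	}
	dir, err := cgroupV2Dir(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("cgroup directory:", dir)

	// The root cgroup has no limit files, so missing files are expected there.
	for _, name := range []string{"memory.max", "cpu.max", "pids.max"} {
		value, err := os.ReadFile(filepath.Join(dir, name))
		if err != nil {
			fmt.Printf("%s: not available (%v)\n", name, err)
			continue
		}
		fmt.Printf("%s: %s", name, value)
	}
}
```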
One interesting realization I had from this is that a CPU limit is a budget of CPU time, not a reservation of specific cores. Setting a container’s CPU limit to 1000m in k8s doesn’t mean it gets a whole core to itself; it simply means that it gets one core’s worth of CPU time per scheduling period. So your container might actually be running parallel threads on multiple cores even with a very low CPU limit; it would simply exhaust its budget sooner and get throttled for the rest of the period. The fact that resource limits are set via cgroups is also one of the biggest reasons why programs need to be container-aware. Traditionally, programs discover the available resources through the proc filesystem (e.g., /proc/meminfo or /proc/cpuinfo), which reflects the host’s resources rather than the container’s actual limits.
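Here is a small sketch of what being "container-aware" can look like: compare the CPU count the Go runtime reports with the budget in cpu.max. This is only an illustration, not how Go 1.25 implements it. The function name parseCPUMax is made up, and the path assumes the process sees its own cgroup at /sys/fs/cgroup, as is typical inside a container.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseCPUMax (hypothetical helper) interprets the contents of a cgroup v2
// cpu.max file: "<quota> <period>" in microseconds, or "max <period>" when
// there is no limit. It returns the limit expressed in CPUs' worth of time.
func parseCPUMax(content string) (float64, bool, error) {
	fields := strings.Fields(content)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max content %q", content)
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil || period <= 0 {
		return 0, false, fmt.Errorf("invalid period %q", fields[1])
	}
	if fields[0] == "max" {
		return 0, false, nil // unlimited
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, fmt.Errorf("invalid quota %q", fields[0])
	}
	return quota / period, true, nil
}

func main() {
	fmt.Println("CPUs visible to the process:", runtime.NumCPU())

	// Assumption: this path points at the process's own cgroup.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		fmt.Println("no cpu.max here (root cgroup or cgroups v1?):", err)
		return
	}
	cpus, limited, err := parseCPUMax(string(data))
	switch {
	case err != nil:
		fmt.Println(err)
	case !limited:
		fmt.Println("no CPU limit set")
	default:
		fmt.Printf("CPU budget: %.2f cores' worth of time\n", cpus)
	}
}
```

A limit of 1000m corresponds to a quota equal to the period, so the function would return 1.0 no matter how many cores the host has.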
Building Our Own Simple Container
Now we have a rough idea of how to create a simple container ourselves. We need to start a process with its own namespaces, set up cgroups for resource limits, mount a few special filesystems like proc, tmp, etc. and finally chroot (change the root directory of the process) to the new root filesystem. This simple sketch obviously leaves out tons of details, some of which I plan to discuss in part two.
Making a Root Filesystem
First we need a root filesystem. We can get this by extracting it from any existing image. Let’s extract it from the BusyBox image:
- Make the mount directory to hold the root filesystem.
    mkdir -p "mount"
- Create a temporary container from the BusyBox image, export its filesystem and extract it to the mount directory.
    docker create --name "busybox-temp" "busybox:latest"
    docker export "busybox-temp" | tar -C "mount" -xvf -
- Remove the temporary container.
    docker rm "busybox-temp"
The Step-by-Step Process
Here’s what the code below does:
- Starts the program, parses the command to run in the container, its arguments, etc.
- Sets up a new cgroup by writing to /sys/fs/cgroup/.
- Clones a new child container process from the current process with new namespaces for mount, PID, UTS, etc.
- Adds the container process to the cgroup created earlier.
- Inside the cloned process, it mounts a few special filesystems like proc and tmp, and creates device nodes such as /dev/null and /dev/zero in the container’s /dev.
- Finally, it does a chroot to the new root filesystem, sets the new hostname, and runs the command passed from the parent process.
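The device nodes mentioned in the steps above are identified by a major/minor number pair packed into a single device number. The sketch below shows that packing, using the same bit layout as glibc’s makedev(). The function name mkdev is my own. For small numbers it reduces to major<<8 | minor.

```go
package main

import "fmt"

// mkdev (hypothetical helper) packs a major/minor pair into a Linux device
// number, following the bit layout used by glibc's makedev().
func mkdev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= (uint64(minor) & 0x000000ff) << 0
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

func main() {
	// Memory devices all share major number 1; the minor selects the device.
	devices := []struct {
		name         string
		major, minor uint32
	}{
		{"null", 1, 3},
		{"zero", 1, 5},
		{"random", 1, 8},
		{"urandom", 1, 9},
	}
	for _, d := range devices {
		fmt.Printf("/dev/%-8s major=%d minor=%d dev=%d\n",
			d.name, d.major, d.minor, mkdev(d.major, d.minor))
	}
}
```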
The Code
    package main

    import (
        "errors"
        "fmt"
        "os"
        "os/exec"
        "path/filepath"
        "strconv"
        "syscall"
    )

    var (
        rootfs          string
        cgroupPath      = "/sys/fs/cgroup/container"
        defaultHostname = "container"
    )

    func main() {
        // Both modes need a command to run: run <cmd> or child <cmd>
        if len(os.Args) < 3 {
            panic("usage: run <cmd> or child <cmd>")
        }

        wd, err := os.Getwd()
        check(err)
        rootfs = filepath.Join(wd, "mount")

        // Print current pid & command
        cmd := os.Args[1]
        pid := os.Getpid()
        fmt.Printf("PID: %d, CMD: %s\n", pid, cmd)

        // Handles setup of parent & child processes based on first program argument
        switch cmd {
        case "run": // Parent process
            run()
        case "child": // Child process
            child()
        default:
            panic("unknown command")
        }
    }

    // run sets up the namespaces and cgroups, then re-executes the same binary with a
    // "child" argument, thus creating a child process in the new namespaces
    func run() {
        // Find the current executable
        bin, err := os.Executable()
        check(err)

        // Re-run it with different args
        cmd := exec.Command(bin, append([]string{"child"}, os.Args[2:]...)...)

        // Connect std streams from child to parent process
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        // Flags passed to clone syscall to create new namespaces for the process
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWNS |
                syscall.CLONE_NEWUTS |
                syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNET |
                syscall.CLONE_NEWIPC |
                syscall.CLONE_NEWCGROUP,
        }

        // Setup cgroups before starting child process
        setupCgroups()
        defer cleanupCgroups()

        check(cmd.Start())

        // Add child process to cgroup
        addToCgroup(cmd.Process.Pid)

        // Wait for child to complete; its exit status is deliberately ignored
        _ = cmd.Wait()
    }

    // child executes inside the containerized namespaces
    func child() {
        fmt.Println("Inside container, PID:", os.Getpid())
        fmt.Printf("uid: %d gid: %d\n", os.Getuid(), os.Getgid())

        // Setup mounts and dev character devices before chroot
        setupMounts()
        createDevices()

        // Now chroot into the rootfs
        check(syscall.Chroot(rootfs))
        check(os.Chdir("/"))

        // Setup host name for this process
        check(syscall.Sethostname([]byte(defaultHostname)))

        // Create the command from passed arguments
        cmd := exec.Command(os.Args[2], os.Args[3:]...)

        // Connect std streams from the command to child / container process
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        _ = cmd.Run()
    }

    // setupMounts mounts special filesystems like /proc, /tmp
    func setupMounts() {
        // Make mount propagation private to prevent mounts from leaking to host
        check(syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""))

        procPath := filepath.Join(rootfs, "proc")
        tmpPath := filepath.Join(rootfs, "tmp")

        check(os.MkdirAll(procPath, 0755))
        check(os.MkdirAll(tmpPath, 0755))

        check(syscall.Mount("proc", procPath, "proc", 0, ""))
        check(syscall.Mount("tmpfs", tmpPath, "tmpfs", 0, ""))
    }

    // createDevices creates essential character device nodes in /dev using mknod syscall.
    // Reference: https://man7.org/linux/man-pages/man2/mknod.2.html
    func createDevices() {
        devPath := filepath.Join(rootfs, "dev")
        check(os.MkdirAll(devPath, 0755))

        // All four devices have major number 1; the device number is major<<8 | minor
        mknod(filepath.Join(devPath, "null"), 1<<8|3)
        mknod(filepath.Join(devPath, "zero"), 1<<8|5)
        mknod(filepath.Join(devPath, "random"), 1<<8|8)
        mknod(filepath.Join(devPath, "urandom"), 1<<8|9)
    }

    // mknod creates a character device node, tolerating nodes left over from a previous run
    func mknod(path string, dev int) {
        err := syscall.Mknod(path, syscall.S_IFCHR|0666, dev)
        if err != nil && !errors.Is(err, os.ErrExist) {
            panic(err)
        }
    }

    // setupCgroups creates a simple v2 cgroup with limits
    func setupCgroups() {
        fmt.Printf("Setting up cgroup at: %s\n", cgroupPath)

        // Create the cgroup directory
        check(os.MkdirAll(cgroupPath, 0755))

        // Enable controllers by writing to cgroup.subtree_control
        check(os.WriteFile(filepath.Join("/sys/fs/cgroup", "cgroup.subtree_control"), []byte("+cpu +memory +pids"), 0644))

        // Set resource limits (these may fail without proper permissions)
        check(os.WriteFile(filepath.Join(cgroupPath, "memory.max"), []byte("50M"), 0644))
        check(os.WriteFile(filepath.Join(cgroupPath, "pids.max"), []byte("20"), 0644))
        check(os.WriteFile(filepath.Join(cgroupPath, "cpu.weight"), []byte("100"), 0644))
    }

    // addToCgroup adds a process to the container cgroup
    func addToCgroup(pid int) {
        check(os.WriteFile(filepath.Join(cgroupPath, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0644))
    }

    // cleanupCgroups removes the cgroup directory
    func cleanupCgroups() {
        check(os.RemoveAll(cgroupPath))
    }

    func check(err error) {
        if err != nil {
            panic(err)
        }
    }
Steps to Compile & Run
- Save this as main.go. Note that the working directory should also contain the mount directory created earlier with the root filesystem.
- Compile it using go build main.go.
- Run the binary with sudo, passing the command to run in the container, for example the shell:
    sudo ./main run /bin/sh
And voilà! You should now be inside a shell running in the container. Try exploring using commands like ps, ip, ls to verify the isolation from the host system. Remove a few namespaces to see how isolation is affected. On the host, check the cgroup & namespaces using the proc filesystem (/proc/<pid>/cgroup and /proc/<pid>/ns). You can also try creating a fork bomb inside the container to verify it’s limited to the number of processes we specified in cgroups.
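One detail worth knowing about the program above: the child is started first and only then written to cgroup.procs, so for a brief moment it runs outside the cgroup. A possible way to close that window is sketched below. It assumes Go 1.20 or newer and a kernel with CLONE_INTO_CGROUP support (Linux 5.7+), and uses the UseCgroupFD and CgroupFD fields of syscall.SysProcAttr. The helper name cgroupAttr is made up, and the cgroup directory is assumed to exist already.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// cgroupAttr (hypothetical helper) returns process attributes that ask the
// kernel to create the new process directly inside the cgroup referred to by
// fd, an open file descriptor of the cgroup directory.
func cgroupAttr(fd int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		UseCgroupFD: true,
		CgroupFD:    fd,
	}
}

func main() {
	// Assumption: the cgroup directory from the program above already exists.
	dir, err := os.Open("/sys/fs/cgroup/container")
	if err != nil {
		fmt.Println("cgroup not available:", err)
		return
	}
	defer dir.Close()

	// The child prints its own cgroup membership as a quick check.
	cmd := exec.Command("/bin/sh", "-c", "cat /proc/self/cgroup")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	cmd.SysProcAttr = cgroupAttr(int(dir.Fd()))
	if err := cmd.Run(); err != nil {
		fmt.Println("run failed:", err)
	}
}
```

In the container program itself, these two fields could sit next to Cloneflags in the same SysProcAttr, which would make the separate addToCgroup step unnecessary.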
Seeing It in Action with Docker
To verify that Docker relies on the same underlying Linux primitives, run a container with memory and CPU limits:
    docker run -it --rm --memory="128m" --cpus="0.5" busybox sh
Looking at the output of ps -e -o pid,user,cmd --forest, we can find the container process (at the bottom of this tree):
       705 root     /usr/bin/python3 -u /usr/sbin/waagent -daemon
       927 root      \_ /usr/bin/python3 -u bin/WALinuxAgent-2.15.0.1-py3.12.egg -run-exthandlers
       712 root     /usr/sbin/cron -f -P
       744 root     /sbin/agetty -o -p -- \u --keep-baud 115200,57600,38400,9600 - vt220
       750 root     /usr/bin/containerd
       755 root     /sbin/agetty -o -p -- \u --noclear - linux
       788 root     /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
       806 _chrony  /usr/sbin/chronyd -F 1
       816 _chrony   \_ /usr/sbin/chronyd -F 1
       808 root     /usr/sbin/ModemManager
       838 syslog   /usr/sbin/rsyslogd -n -iNONE
       897 root     /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
     97639 silverb+ /usr/lib/systemd/systemd --user
     97641 silverb+  \_ (sd-pam)
     98362 silverb+  \_ /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
    159253 root     /usr/bin/containerd-shim-runc-v2 -namespace moby -id 265e0a69b69a65aa2e9c036433810fb1693efbb45908abe957307ea1648ecc6e -address /run/containerd/containerd.sock
    159276 root      \_ sh
The pid 159276 is the actual Docker container process. Each Docker container also has a shim process (pid 159253) which we will discuss later.
Now using the proc filesystem, we can confirm that this process has a different cgroup & namespace from other processes.
Namespaces
The namespaces of a process are visible in the proc fs as symlinks under /proc/<pid>/ns. Doing sudo ls -al /proc/<pid>/ns gives us something like the following:
    total 0
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 cgroup -> 'cgroup:[4026532163]'
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 ipc -> 'ipc:[4026532161]'
    lrwxrwxrwx 1 root root 0 Oct 21 07:50 mnt -> 'mnt:[4026532158]'
    lrwxrwxrwx 1 root root 0 Oct 21 07:50 net -> 'net:[4026532164]'
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 pid -> 'pid:[4026532162]'
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 pid_for_children -> 'pid:[4026532162]'
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 time -> 'time:[4026531834]'
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 time_for_children -> 'time:[4026531834]'
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 user -> 'user:[4026531837]'
    lrwxrwxrwx 1 root root 0 Oct 21 08:00 uts -> 'uts:[4026532159]'
You can do the same for a normal process to confirm that the container process has different namespaces from the default ones used by other processes. We can enter a particular namespace of the process by using the nsenter utility. To enter all its namespaces at once, do the following, using the container process’s PID (159276 in the listing above):
    sudo nsenter -t <pid> -a
Exploring around using ls, ip, etc., you’ll see that you are really inside the Docker container!
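The comparison against a normal process can also be scripted. Here is a small sketch that reads the links under /proc/<pid>/ns for a given PID and reports which namespaces differ from those of the current process. The names readNamespaces and differing are made up for this example, and reading another user’s process needs sudo.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// readNamespaces (hypothetical helper) returns the namespace links of a
// process keyed by name, e.g. "pid" -> "pid:[4026532162]".
// pid may be "self" or a numeric PID.
func readNamespaces(pid string) (map[string]string, error) {
	nsDir := filepath.Join("/proc", pid, "ns")
	entries, err := os.ReadDir(nsDir)
	if err != nil {
		return nil, err
	}
	links := make(map[string]string)
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(nsDir, e.Name()))
		if err != nil {
			return nil, err
		}
		links[e.Name()] = target
	}
	return links, nil
}

// differing (hypothetical helper) lists, in sorted order, the namespace
// names whose links are not identical in a and b.
func differing(a, b map[string]string) []string {
	var names []string
	for name, link := range a {
		if b[name] != link {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: nsdiff <pid>")
		return
	}
	mine, err := readNamespaces("self")
	if err != nil {
		fmt.Println("cannot read own namespaces:", err)
		return
	}
	theirs, err := readNamespaces(os.Args[1])
	if err != nil {
		fmt.Println("cannot read target namespaces:", err)
		return
	}
	fmt.Println("namespaces that differ:", differing(mine, theirs))
}
```

Run from the host against the container process, this should list every namespace Docker created for the container, while ones it shares with the host (user and time in the listing above) would be absent.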
Cgroups
Similar to namespaces, cat /proc/<pid>/cgroup gives the name of the cgroup. It typically looks like this:
    0::/system.slice/docker-<container-id>.scope
The corresponding cgroups folder can be found under:
    /sys/fs/cgroup/system.slice/docker-<container-id>.scope
Opening files such as memory.max and cpu.max in that directory confirms that this is where the process’s resource limits are enforced using cgroups.
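To know what to expect in those files, here is a rough sketch of how the flags from the docker run command above map to cgroup values. It assumes the default CPU period of 100000 microseconds. The function names cpuMax and memoryMax are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// Assumption: the default CPU period of 100ms, expressed in microseconds.
const cfsPeriodUs = 100000

// cpuMax (hypothetical helper) renders a fractional CPU count, as passed to
// --cpus, in the "<quota> <period>" format used by cpu.max.
func cpuMax(cpus float64) string {
	quota := int64(math.Round(cpus * cfsPeriodUs))
	return strconv.FormatInt(quota, 10) + " " + strconv.Itoa(cfsPeriodUs)
}

// memoryMax (hypothetical helper) converts a size in mebibytes, as in
// --memory="128m", to the byte count stored in memory.max.
func memoryMax(mib int64) string {
	return strconv.FormatInt(mib*1024*1024, 10)
}

func main() {
	fmt.Println("cpu.max   :", cpuMax(0.5))
	fmt.Println("memory.max:", memoryMax(128))
}
```

For the container started earlier, cpu.max should therefore read 50000 100000 (half a core’s worth of time per period) and memory.max should read 134217728.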
What’s Next?
So far, we have implemented a basic container ourselves and seen Docker containers in action using the same Linux primitives. Even Kubernetes pods are built on these very same concepts. By default, containers in a pod share the ipc, network, and uts namespaces. It is also possible to share the PID namespace between containers in a pod. Meanwhile, cgroups are applied at the individual container level.
This should be enough to digest the fact that containers are, at their core, just processes. But calling them “fancy” feels like an understatement. They are much more sophisticated than a typical process. The next part dives into even fancier aspects like networking, filesystems & security.