
Anatomy of Containers, Part II: The Fancy Stuff
In part one, we built a bare-bones container. To make it truly useful, we must handle a few essential components like networking, storage, and security. In this post, I’ll explore how some of these are implemented in practice, attempt to recreate them ourselves, and finally compare our approach with Docker, like we did in part one.
Before we get started, take a look at the output of docker inspect <container_id> to see all the configurations Docker does for a container. Pay attention to HostConfig, Mounts, NetworkSettings, GraphDriver, etc.
Networking
Networking in containers is built on a virtual network interface called veth (virtual Ethernet). A veth pair behaves much like an Ethernet cable joining two devices: packets sent into one end come out the other. What makes it special is that the two ends of the pair can live in different network namespaces, letting us connect the container's namespace to the host's namespace as if they were joined by a physical cable.
Setting Up a Network Interface
Using the veth network interface, we can now set up a small subnet containing our host & container, allowing them to communicate with each other. Then, we can configure NAT using iptables, allowing the container to access the internet through the host’s network interface.
Here’s the setup process:
- First, find the PID of the container process (from part one) and set it as a variable CONTAINER_PID.
- Define some other variables for the interface names, host & container IPs, subnet, etc.:

  HOST_IF=veth-host
  CONT_IF=veth-container
  SUBNET=10.200.0.0/24
  HOST_IP=10.200.0.1
  CONT_IP=10.200.0.2

- Create a veth pair:

  sudo ip link add $HOST_IF type veth peer name $CONT_IF

- Move one end of the veth pair to the container's network namespace:

  sudo ip link set $CONT_IF netns $CONTAINER_PID

- On the host, assign an IP to the host end of the veth pair and bring up the interface:

  sudo ip addr add $HOST_IP/24 dev $HOST_IF
  sudo ip link set $HOST_IF up

- Do the same inside the container, using nsenter to enter its network namespace. Also, bring up the loopback interface:

  sudo nsenter -t $CONTAINER_PID -n ip addr add $CONT_IP/24 dev $CONT_IF
  sudo nsenter -t $CONTAINER_PID -n ip link set $CONT_IF up
  sudo nsenter -t $CONTAINER_PID -n ip link set lo up

- Add a default route in the container via the host IP. A default route is used to send packets to destinations outside the local subnet, for example to the internet:

  sudo nsenter -t $CONTAINER_PID -n ip route add default via $HOST_IP

- Enable IP forwarding on the host. This allows the host to forward packets between interfaces, in our case between the container's veth interface and the host's main network interface:

  sudo sysctl -w net.ipv4.ip_forward=1

- In case iptables is being used for firewall rules, we need to explicitly add iptables FORWARD rules to allow forwarding packets between the container and the host's network interface. So, we first find the host's default/main network interface:

  HOST_NET_IF=$(ip route | grep default | awk '{print $5}')

- Then, add the FORWARD rules to allow traffic to flow between the container and the host's network interface:

  sudo iptables -A FORWARD -i $HOST_IF -o $HOST_NET_IF -j ACCEPT
  sudo iptables -A FORWARD -i $HOST_NET_IF -o $HOST_IF -m state --state RELATED,ESTABLISHED -j ACCEPT

- Finally, we need to set up NAT, so that packets from the container can be routed to the internet via the host's IP:

  sudo iptables -t nat -A POSTROUTING -s $SUBNET -o $HOST_NET_IF -j MASQUERADE

We now have a fully functional network interface for our container! You can verify this by pinging the container IP from the host and vice versa, and by trying to access the internet from within the container using curl or wget, as sketched below.
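As a quick sanity check (assuming the variables defined above, and that ping is available on the host), something like this should work:

  # from the host: reach the container end of the veth pair
  ping -c 2 $CONT_IP

  # from inside the container's network namespace: reach the host, then the internet
  # (pinging a public IP directly, since we haven't set up DNS for the container yet)
  sudo nsenter -t $CONTAINER_PID -n ping -c 2 $HOST_IP
  sudo nsenter -t $CONTAINER_PID -n ping -c 2 1.1.1.1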
The Bridge Interface
Container networking also uses another virtual interface called a bridge. A bridge acts like a virtual switch that connects multiple network interfaces together, enabling communication between multiple containers. Docker, for example, creates a default bridge called docker0 on the host machine. When a container is started, Docker creates a veth pair, attaches one end to the container's network namespace, and the other end to the docker0 bridge (by default) on the host. The bridge can be further connected to the host's main network interface, like we did above.
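To make this concrete, here's a rough, hand-rolled sketch of the same idea: a bridge with the host ends of two containers' veth pairs attached to it. The names (br0, veth-host1, veth-host2) are made up for illustration; this mirrors the steps above rather than what Docker literally does for docker0.

  # create a bridge and give it the gateway IP for our subnet
  sudo ip link add name br0 type bridge
  sudo ip addr add 10.200.0.1/24 dev br0
  sudo ip link set br0 up

  # attach the host end of each container's veth pair to the bridge
  sudo ip link set veth-host1 master br0
  sudo ip link set veth-host1 up
  sudo ip link set veth-host2 master br0
  sudo ip link set veth-host2 up

With both veth ends enslaved to the bridge, the two containers can reach each other through it, and the same IP-forwarding and NAT ideas from the previous section (with br0 in place of the single veth-host interface) give them a path out to the internet.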
Mapping a Port
Once we have a network interface set up for the container, we can simply add a DNAT rule in the host’s iptables to forward traffic from a specific port on the host to the container’s IP and port. Extending our previous example, to map port 8080 of the container to port 80 on the host, add the following iptables rule on the host:
  sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination $CONT_IP:8080

Now, any incoming traffic to port 80 on the host will be forwarded to port 8080 on the container.
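A quick way to confirm the rule is in place and exercise it (note that the PREROUTING chain is not traversed by locally generated packets, so test from a different machine on the network; <host-ip> is a placeholder for the host's address):

  # list the NAT rules and confirm the DNAT entry exists
  sudo iptables -t nat -L PREROUTING -n --line-numbers

  # from another machine on the same network:
  curl http://<host-ip>/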
DNS Resolution
DNS resolution is another piece of the puzzle, though not required for our simple container. Docker, Kubernetes, and other platforms usually have their own DNS servers to handle name resolution for containers. In Docker’s case, it’s an embedded DNS server that runs on the host. Kubernetes uses a dedicated DNS service (like CoreDNS) running within the cluster. This enables features like service discovery when orchestrating multiple containers.
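Our bare-bones container from part one has no name resolution at all. A crude way to give it some, nowhere near Docker's embedded DNS, is to drop a static resolver configuration into its root filesystem (the rootfs path below is a placeholder for wherever you extracted it in part one):

  # write a static resolver into the container's root filesystem
  echo "nameserver 1.1.1.1" | sudo tee /path/to/rootfs/etc/resolv.conf

Programs inside the container read /etc/resolv.conf relative to their own root, so name-based tools like wget should start resolving hosts after this.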
Docker’s Networking in Action
To see Docker’s implementation of these concepts in action, start a simple HTTP server container and access it from another container using Docker’s default bridge network:
- Start a simple HTTP server container in detached mode:

  docker run -d --name server -p 80:8080 python:3-slim python -m http.server 8080

- Start a busybox container to access the server:

  docker run -it --rm busybox sh

- Get the server container's IP using:

  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' server

- Finally, inside the busybox container, use wget to access the server:

  wget -qO- http://<ip>:8080

- At the same time, run ip link show on the host to see the docker0 bridge & the two veth interfaces created for the server & busybox containers, as shown below:

  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
      link/ether 00:22:48:6e:89:8f brd ff:ff:ff:ff:ff:ff
  3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
      link/ether b6:cf:f2:e4:2f:63 brd ff:ff:ff:ff:ff:ff
  53: vethd615595@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
      link/ether 36:a6:48:12:2c:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
  64: vethf0b5df8@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
      link/ether 1a:c9:a7:ef:a0:95 brd ff:ff:ff:ff:ff:ff link-netnsid 1

- Run brctl show docker0 on the host to verify the veth interfaces are attached to the bridge:

  bridge name     bridge id               STP enabled     interfaces
  docker0         8000.b6cff2e42f63       no              vethd615595
                                                          vethf0b5df8

- Also run ip addr show on both containers & the host to confirm they're part of the same subnet.
- Running sudo iptables -t nat -L on the host will show all the NAT rules Docker has set up. The port we mapped earlier shows up as a DNAT rule:
  Chain DOCKER (2 references)
  target     prot opt source               destination
  RETURN     all  --  anywhere             anywhere
  DNAT       tcp  --  anywhere             anywhere             tcp dpt:http to:172.17.0.2:8080

- Finally, inside the server container, check out /etc/resolv.conf to see how Docker handles name resolution. Containers attached to user-defined networks use Docker's embedded DNS server (it shows up as nameserver 127.0.0.11), while containers on the default bridge network, like the ones here, instead get a copy of the host's DNS configuration (see the quick check below).
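To actually see the embedded DNS server, you could attach a container to a user-defined network (mynet is an arbitrary name here) and look at its resolv.conf:

  # create a user-defined bridge network and run a container on it
  docker network create mynet
  docker run --rm --network mynet busybox cat /etc/resolv.conf

On the user-defined network this should show nameserver 127.0.0.11, Docker's embedded DNS server; running the same command without --network mynet (i.e. on the default bridge) shows a copy of the host's DNS configuration instead.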
Filesystems & Storage
Earlier, we saw how to isolate mounts inside a container using the mount namespace. The mount namespace, however, has quite a few quirks about how mounts are shared & propagated between different namespaces. See the LWN article on mount namespaces and shared subtrees (linked in the references) and the mount_namespaces(7) man page for more details. For now, you can list all the mounts inside a container using the mount command. Observe that we created a few of these mounts ourselves in our implementation as well. A few special mounts you will see are:
- overlay on / type overlay (rw...): The overlay root filesystem, which we will discuss shortly.
- proc on /proc type proc (rw...): The proc filesystem mounted on /proc to provide process & kernel information.
- /dev/root on /etc/resolv.conf, and others on /etc/hostname, /etc/hosts: These are bind mounts from the host to provide DNS resolution, the hostname, and the hosts file inside the container.
- sysfs on /sys type sysfs (ro...): The sysfs filesystem mounted on /sys to expose kernel & device information. The ro flag indicates it's mounted read-only to prevent modifications from within the container.
Bind Mounts And Persistent Storage
Persistent storage used by containers, be it in Docker or Kubernetes, relies on bind mounts. A bind mount is essentially a re-mapping of a directory or file from one location to another, achieved with the mount syscall using the MS_BIND flag (mount --bind on the command line). In containers, this provides persistent storage by bind-mounting a directory on the host into a location inside the container's filesystem. Docker volumes work the same way internally: they're managed bind mounts created by Docker, stored under /var/lib/docker/volumes/ on the host machine.
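For our bare-bones container, persistent storage could be wired up the same way. A minimal sketch, with hypothetical paths (adjust them to the rootfs you extracted in part one):

  # create a directory on the host to hold persistent data
  sudo mkdir -p /srv/mydata

  # bind-mount it into the container's root filesystem
  sudo mkdir -p /path/to/rootfs/data
  sudo mount --bind /srv/mydata /path/to/rootfs/data

  # anything the container writes under /data now lives on the host and
  # survives container restarts; detach it with: sudo umount /path/to/rootfs/data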
The Overlay Filesystem
In our container implementation from the previous post, we simply extracted the root filesystem from an existing base image and chrooted into it. If we keep doing the same for each and every container, we'll end up with multiple redundant copies of the same base image, consuming space and increasing container startup time. To avoid this, a container's root filesystem is built from multiple read-only layers stacked on top of each other, with a final writable layer on top. These are the same layers you see when building or pulling a Docker image. Each layer records a set of diffs/changes relative to the layer below it. The layers are merged into a single view using a union mount filesystem like OverlayFS.
This enables sharing the common read-only layers between multiple containers, saving disk space and improving startup time. When something is written to the container's filesystem, the change is recorded in the top writable layer, leaving the underlying read-only layers unchanged. If you modify a file that lives in one of the lower read-only layers, it is first "copied up" to the writable layer and modified there; this is the "copy-on-write" strategy.
The overlay filesystem can be set up using the mount syscall as well:
  mount -t overlay overlay -o lowerdir=/lower1:/lower2:/lower3,upperdir=/upper,workdir=/work /merged

Here, /lower1, /lower2, /lower3 are the read-only layers (topmost on the left to bottom on the right), /upper is the writable top layer, /work is a working directory for OverlayFS, and /merged is the final merged view. You can try modifying the container code from part one to mount an overlay filesystem for the container's rootfs using the same mount syscall.
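To see copy-on-write in isolation, here's a small experiment you could run on the host (directory names are arbitrary; add sudo where your setup requires it):

  # set up the layer directories
  mkdir -p /tmp/ovl/{lower,upper,work,merged}
  echo "from the lower layer" > /tmp/ovl/lower/hello.txt

  # mount the overlay
  sudo mount -t overlay overlay \
      -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
      /tmp/ovl/merged

  # the merged view shows the lower layer's file
  cat /tmp/ovl/merged/hello.txt

  # modifying it triggers a copy-up: the change lands in upper, lower stays untouched
  echo "modified" > /tmp/ovl/merged/hello.txt
  cat /tmp/ovl/upper/hello.txt    # modified
  cat /tmp/ovl/lower/hello.txt    # from the lower layer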
Docker’s OverlayFS in Action
The entries listed by the mount command earlier had a line like this:
  overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/....

This indicates that the container's root filesystem itself is an overlay filesystem, as expected. You can also check out the lower and upper dirs mentioned in the output; these layers live under /var/lib/docker/overlay2/ on the host. If you run two containers from the same image, they share the lower dirs but have different upper dirs. Try modifying files inside these containers: the changes will appear only in their respective upper layers, leaving the shared lower layers untouched.
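If you want to poke at this yourself, something along these lines should work, assuming the default overlay2 storage driver (the GraphDriver fields are the same ones visible in the docker inspect output mentioned at the top of the post; the container names c1/c2 are arbitrary):

  # start two containers from the same image
  docker run -d --name c1 python:3-slim sleep 300
  docker run -d --name c2 python:3-slim sleep 300

  # each container gets its own writable upper layer...
  docker inspect -f '{{ .GraphDriver.Data.UpperDir }}' c1
  docker inspect -f '{{ .GraphDriver.Data.UpperDir }}' c2

  # ...while the lower (image) layers are shared
  docker inspect -f '{{ .GraphDriver.Data.LowerDir }}' c1
  docker inspect -f '{{ .GraphDriver.Data.LowerDir }}' c2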
What’s Next?
The next post explores one final big piece of the puzzle: security. It covers how root privileges work in Linux, concepts like Linux capabilities, syscall filtering, user namespaces, and rootless containers, and how containers leverage these mechanisms to enhance security and isolation.
References
- Introduction to Linux interfaces for virtual networking
- Container Networking From Scratch - Kristen Jacobs, Oracle
- Understanding Kubernetes Networking: Pods
- Deep Dive into Docker Internals - Union Filesystem
- Overlay Filesystem - The Linux Kernel documentation
- Overlay filesystem - ArchWiki
- Mount namespaces and shared subtrees