
Anatomy of Containers, Part II: The Fancy Stuff
In part one, we built a bare-bones container. To make it truly useful, we must handle a few essential components like networking, storage, and security. In this post, I’ll explore how some of these are implemented in practice, attempt to recreate them ourselves, and finally compare our approach with Docker, like we did in part one.
Before we get started, take a look at the output of docker inspect <container_id> to see all the configurations Docker does for a container. Pay attention to HostConfig, Mounts, NetworkSettings, GraphDriver, etc.
Networking
Networking in containers is built on a virtual network interface called veth (virtual Ethernet). A veth pair behaves much like an Ethernet cable joining two devices: packets sent into one end come out the other. What makes it special is that the two ends of the pair can live in different network namespaces, letting us connect the container's namespace to the host's namespace as if they were joined by a physical cable.
Setting Up a Network Interface
Using the veth network interface, we can now set up a small subnet containing our host & container, allowing them to communicate with each other. Then, we can configure NAT using iptables, allowing the container to access the internet through the host’s network interface.
Here’s the setup process:
- First, find the PID of the container process (from part one) and set it as a variable CONTAINER_PID.
- Define some other variables for the interface names, host & container IPs, subnet, etc.:

  HOST_IF=veth-host
  CONT_IF=veth-container
  SUBNET=10.200.0.0/24
  HOST_IP=10.200.0.1
  CONT_IP=10.200.0.2

- Create a veth pair:

  sudo ip link add $HOST_IF type veth peer name $CONT_IF

- Move one end of the veth pair to the container's network namespace:

  sudo ip link set $CONT_IF netns $CONTAINER_PID

- On the host, assign an IP to the host end of the veth pair and bring up the interface:

  sudo ip addr add $HOST_IP/24 dev $HOST_IF
  sudo ip link set $HOST_IF up

- Do the same inside the container, using nsenter to enter its network namespace. Also, bring up the loopback interface:

  sudo nsenter -t $CONTAINER_PID -n ip addr add $CONT_IP/24 dev $CONT_IF
  sudo nsenter -t $CONTAINER_PID -n ip link set $CONT_IF up
  sudo nsenter -t $CONTAINER_PID -n ip link set lo up

- Add a default route in the container via the host IP. A default route is used to send packets to destinations outside the local subnet, for example to the internet:

  sudo nsenter -t $CONTAINER_PID -n ip route add default via $HOST_IP

- Enable IP forwarding on the host. This allows the host to forward packets between interfaces, in our case between the container's veth interface and the host's main network interface:

  sudo sysctl -w net.ipv4.ip_forward=1

- In case iptables is being used for firewall rules, we need to explicitly add iptables FORWARD rules to allow forwarding packets between the container and the host's network interface. So, we first find the host's default/main network interface:

  HOST_NET_IF=$(ip route | grep default | awk '{print $5}')

- Then, add the FORWARD rules to allow traffic to flow between the container and the host's network interface:

  sudo iptables -A FORWARD -i $HOST_IF -o $HOST_NET_IF -j ACCEPT
  sudo iptables -A FORWARD -i $HOST_NET_IF -o $HOST_IF -m state --state RELATED,ESTABLISHED -j ACCEPT

- Finally, we need to set up NAT, so that packets from the container can be routed to the internet via the host's IP:

  sudo iptables -t nat -A POSTROUTING -s $SUBNET -o $HOST_NET_IF -j MASQUERADE

We now have a fully functional network interface for our container! You can verify this by pinging the container IP from the host and vice versa, and by trying to access the internet from within the container using curl or wget, as sketched below.
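As a quick sanity check (assuming the variables defined above, and that ping is available on the host), something like this should work:

  # from the host: reach the container end of the veth pair
  ping -c 2 $CONT_IP

  # from inside the container's network namespace: reach the host, then the internet
  # (pinging a public IP directly, since we haven't set up DNS for the container yet)
  sudo nsenter -t $CONTAINER_PID -n ping -c 2 $HOST_IP
  sudo nsenter -t $CONTAINER_PID -n ping -c 2 1.1.1.1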
The Bridge Interface
Container networking also uses another virtual interface called a bridge. A bridge acts like a virtual switch that connects multiple network interfaces together, enabling communication between multiple containers. Docker, for example, creates a default bridge called docker0 on the host machine. When a container is started, Docker creates a veth pair, attaches one end to the container's network namespace, and the other end to the docker0 bridge (by default) on the host. The bridge can be further connected to the host's main network interface, like we did above.
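To make this concrete, here's a rough, hand-rolled sketch of the same idea: a bridge with the host ends of two containers' veth pairs attached to it. The names (br0, veth-host1, veth-host2) are made up for illustration; this mirrors the steps above rather than what Docker literally does for docker0.

  # create a bridge and give it the gateway IP for our subnet
  sudo ip link add name br0 type bridge
  sudo ip addr add 10.200.0.1/24 dev br0
  sudo ip link set br0 up

  # attach the host end of each container's veth pair to the bridge
  sudo ip link set veth-host1 master br0
  sudo ip link set veth-host1 up
  sudo ip link set veth-host2 master br0
  sudo ip link set veth-host2 up

With both veth ends enslaved to the bridge, the two containers can reach each other through it, and the same IP-forwarding and NAT ideas from the previous section (with br0 in place of the single veth-host interface) give them a path out to the internet.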
Mapping a Port
Once we have a network interface set up for the container, we can simply add a DNAT rule in the host’s iptables to forward traffic from a specific port on the host to the container’s IP and port. Extending our previous example, to map port 8080 of the container to port 80 on the host, add the following iptables rule on the host:
  sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination $CONT_IP:8080

Now, any incoming traffic to port 80 on the host will be forwarded to port 8080 on the container.
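A quick way to confirm the rule is in place and exercise it (note that the PREROUTING chain is not traversed by locally generated packets, so test from a different machine on the network; <host-ip> is a placeholder for the host's address):

  # list the NAT rules and confirm the DNAT entry exists
  sudo iptables -t nat -L PREROUTING -n --line-numbers

  # from another machine on the same network:
  curl http://<host-ip>/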
DNS Resolution
DNS resolution is another piece of the puzzle, though not required for our simple container. Docker, Kubernetes, and other platforms usually have their own DNS servers to handle name resolution for containers. In Docker’s case, it’s an embedded DNS server that runs on the host. Kubernetes uses a dedicated DNS service (like CoreDNS) running within the cluster. This enables features like service discovery when orchestrating multiple containers.
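Our bare-bones container from part one has no name resolution at all. A crude way to give it some, nowhere near Docker's embedded DNS, is to drop a static resolver configuration into its root filesystem (the rootfs path below is a placeholder for wherever you extracted it in part one):

  # write a static resolver into the container's root filesystem
  echo "nameserver 1.1.1.1" | sudo tee /path/to/rootfs/etc/resolv.conf

Programs inside the container read /etc/resolv.conf relative to their own root, so name-based tools like wget should start resolving hosts after this.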
Docker’s Networking in Action
To see Docker’s implementation of these concepts in action, start a simple HTTP server container and access it from another container using Docker’s default bridge network:
- Start a simple HTTP server container in detached mode:

  docker run -d --name server -p 80:8080 python:3-slim python -m http.server 8080

- Start a busybox container to access the server:

  docker run -it --rm busybox sh

- Get the server container's IP using:

  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' server

- Finally, inside the busybox container, use wget to access the server:

  wget -qO- http://<ip>:8080

- At the same time, run ip link show on the host to see the docker0 bridge & the two veth interfaces created for the server & busybox containers, as shown below:

  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
      link/ether 00:22:48:6e:89:8f brd ff:ff:ff:ff:ff:ff
  3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
      link/ether b6:cf:f2:e4:2f:63 brd ff:ff:ff:ff:ff:ff
  53: vethd615595@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
      link/ether 36:a6:48:12:2c:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
  64: vethf0b5df8@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
      link/ether 1a:c9:a7:ef:a0:95 brd ff:ff:ff:ff:ff:ff link-netnsid 1

- Run brctl show docker0 on the host to verify the veth interfaces are attached to the bridge:

  bridge name     bridge id               STP enabled     interfaces
  docker0         8000.b6cff2e42f63       no              vethd615595
                                                          vethf0b5df8

- Also run ip addr show on both containers & the host to confirm they're part of the same subnet.
- Running sudo iptables -t nat -L on the host will show all the NAT rules Docker has set up. The port we mapped earlier shows up as a DNAT rule:
  Chain DOCKER (2 references)
  target     prot opt source               destination
  RETURN     all  --  anywhere             anywhere
  DNAT       tcp  --  anywhere             anywhere             tcp dpt:http to:172.17.0.2:8080

- Finally, inside the server container, check out /etc/resolv.conf to see how Docker handles name resolution. Containers attached to user-defined networks use Docker's embedded DNS server (it shows up as nameserver 127.0.0.11), while containers on the default bridge network, like the ones here, instead get a copy of the host's DNS configuration (see the quick check below).
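To actually see the embedded DNS server, you could attach a container to a user-defined network (mynet is an arbitrary name here) and look at its resolv.conf:

  # create a user-defined bridge network and run a container on it
  docker network create mynet
  docker run --rm --network mynet busybox cat /etc/resolv.conf

On the user-defined network this should show nameserver 127.0.0.11, Docker's embedded DNS server; running the same command without --network mynet (i.e. on the default bridge) shows a copy of the host's DNS configuration instead.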
Filesystems & Storage
Earlier, we saw how to isolate mounts inside a container using the mount namespace. The mount namespace, however, has quite a few quirks about how mounts are shared & propagated between different namespaces. See the LWN article on mount namespaces and shared subtrees (linked in the references) and the mount_namespaces(7) man page for more details. For now, you can list all the mounts inside a container using the mount command. Observe that we created a few of these mounts ourselves in our implementation as well. A few special mounts you will see are:
- overlay on / type overlay (rw...): The overlay root filesystem, which we will discuss shortly.
- proc on /proc type proc (rw...): The proc filesystem mounted on /proc to provide process & kernel information.
- /dev/root on /etc/resolv.conf, and others on /etc/hostname, /etc/hosts: These are bind mounts from the host to provide DNS resolution, the hostname, and the hosts file inside the container.
- sysfs on /sys type sysfs (ro...): The sysfs filesystem mounted on /sys to expose kernel & device information. The ro flag indicates it's mounted read-only to prevent modifications from within the container.
Bind Mounts And Persistent Storage
Persistent storage used by containers, be it in Docker or Kubernetes, relies on bind mounts. A bind mount is essentially a re-mapping of a directory or file from one location to another, achieved with the mount syscall using the MS_BIND flag (mount --bind on the command line). In containers, this provides persistent storage by bind-mounting a directory on the host into a location inside the container's filesystem. Docker volumes work the same way internally: they're managed bind mounts created by Docker, stored under /var/lib/docker/volumes/ on the host machine.
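For our bare-bones container, persistent storage could be wired up the same way. A minimal sketch, with hypothetical paths (adjust them to the rootfs you extracted in part one):

  # create a directory on the host to hold persistent data
  sudo mkdir -p /srv/mydata

  # bind-mount it into the container's root filesystem
  sudo mkdir -p /path/to/rootfs/data
  sudo mount --bind /srv/mydata /path/to/rootfs/data

  # anything the container writes under /data now lives on the host and
  # survives container restarts; detach it with: sudo umount /path/to/rootfs/data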
The Overlay Filesystem
In our container implementation from the previous post, we simply extracted the root filesystem from an existing base image and chrooted into it. If we keep doing the same for each and every container, we'll end up with multiple redundant copies of the same base image, consuming space and increasing container startup time. To avoid this, a container's root filesystem is built from multiple read-only layers stacked on top of each other, with a final writable layer on top. These are the same layers you see when building or pulling a Docker image. Each layer records a set of diffs/changes relative to the layer below it. The layers are merged into a single view using a union mount filesystem like OverlayFS.
This enables sharing the common read-only layers between multiple containers, saving disk space and improving startup time. When something is written to the container's filesystem, the change is recorded in the top writable layer, leaving the underlying read-only layers unchanged. If you modify a file that lives in one of the lower read-only layers, it is first "copied up" to the writable layer and modified there; this is the "copy-on-write" strategy.
The overlay filesystem can be set up using the mount syscall as well:
  mount -t overlay overlay -o lowerdir=/lower1:/lower2:/lower3,upperdir=/upper,workdir=/work /merged

Here, /lower1, /lower2, /lower3 are the read-only layers (topmost on the left to bottom on the right), /upper is the writable top layer, /work is a working directory for OverlayFS, and /merged is the final merged view. You can try modifying the container code from part one to mount an overlay filesystem for the container's rootfs using the same mount syscall.
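To see copy-on-write in isolation, here's a small experiment you could run on the host (directory names are arbitrary; add sudo where your setup requires it):

  # set up the layer directories
  mkdir -p /tmp/ovl/{lower,upper,work,merged}
  echo "from the lower layer" > /tmp/ovl/lower/hello.txt

  # mount the overlay
  sudo mount -t overlay overlay \
      -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
      /tmp/ovl/merged

  # the merged view shows the lower layer's file
  cat /tmp/ovl/merged/hello.txt

  # modifying it triggers a copy-up: the change lands in upper, lower stays untouched
  echo "modified" > /tmp/ovl/merged/hello.txt
  cat /tmp/ovl/upper/hello.txt    # modified
  cat /tmp/ovl/lower/hello.txt    # from the lower layer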
Docker’s OverlayFS in Action
The entries listed by the mount command earlier had a line like this:
  overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/....

This indicates that the container's root filesystem itself is an overlay filesystem, as expected. You can also check out the lower and upper dirs mentioned in the output; these layers live under /var/lib/docker/overlay2/ on the host. If you run two containers from the same image, they share the lower dirs but have different upper dirs. Try modifying files inside these containers: the changes will appear only in their respective upper layers, leaving the shared lower layers untouched.
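If you want to poke at this yourself, something along these lines should work, assuming the default overlay2 storage driver (the GraphDriver fields are the same ones visible in the docker inspect output mentioned at the top of the post; the container names c1/c2 are arbitrary):

  # start two containers from the same image
  docker run -d --name c1 python:3-slim sleep 300
  docker run -d --name c2 python:3-slim sleep 300

  # each container gets its own writable upper layer...
  docker inspect -f '{{ .GraphDriver.Data.UpperDir }}' c1
  docker inspect -f '{{ .GraphDriver.Data.UpperDir }}' c2

  # ...while the lower (image) layers are shared
  docker inspect -f '{{ .GraphDriver.Data.LowerDir }}' c1
  docker inspect -f '{{ .GraphDriver.Data.LowerDir }}' c2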
What’s Next?
The next post explores one final big piece of the puzzle: security. It covers how root privileges work in Linux, concepts like Linux capabilities, syscall filtering, user namespaces, and rootless containers, and how containers leverage these mechanisms to enhance security and isolation.
References
- Introduction to Linux interfaces for virtual networking
- Container Networking From Scratch - Kristen Jacobs, Oracle
- Understanding Kubernetes Networking: Pods
- Deep Dive into Docker Internals - Union Filesystem
- Overlay Filesystem - The Linux Kernel documentation
- Overlay filesystem - ArchWiki
- Mount namespaces and shared subtrees