To isolate the processes running inside a container from its host system, a container engine uses the following four features:
- Namespaces
- Control Groups
- Secure Computing
- Security-Enhanced Linux
Namespaces
Namespaces limit how far a container can reach into its host’s resources. They improve security by restricting which of the host’s resources are visible and available to the container.
The Linux command lsns can be used to list details of namespaces.
The namespaces essential for containers are User, Mount, Unix Timesharing System, Process ID, Network, and Inter-Process Communication.
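As a quick check, every process exposes the namespaces it belongs to under /proc/<PID>/ns. The sketch below prints them for the current shell, assuming a Linux host. The ns_id helper is a name made up for this example, and lsns may need root privileges to show namespaces owned by other users.

```shell
# ns_id is a hypothetical helper: print the namespace identifier of the given
# type (user, mnt, uts, pid, net, ipc) for a PID (defaults to this shell).
ns_id() {
    readlink "/proc/${2:-$$}/ns/$1"
}

# Print every namespace the current shell belongs to, in the form type:[inode].
for ns_type in user mnt uts pid net ipc; do
    printf '%-5s %s\n' "$ns_type" "$(ns_id "$ns_type")"
done

# lsns shows the same information for all processes you are allowed to see.
if command -v lsns >/dev/null 2>&1; then
    lsns || echo "lsns could not list namespaces" >&2
fi
```

Two processes that print the same identifier for a namespace type share that namespace.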
User
The users and groups created inside a container are different from those on its host. Processes running inside the container as the root user can be mapped to a non-root user on the host.
Using the id command, you can verify that a container runs in a different user namespace than other processes on your host: run id on the host, then run it inside a container created from the docker.io/library/httpd:latest container image, and compare the output.
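The mapping between container users and host users can also be inspected directly. Here is a minimal sketch, assuming a Linux host with Docker: each line of /proc/<PID>/uid_map lists the first ID inside the namespace, the ID it maps to outside, and the length of the range. The root_maps_to helper and the container name my-httpd are placeholders for this example. Whether a container actually gets its own user namespace depends on how the engine is configured (for instance, rootless mode or user namespace remapping).

```shell
# root_maps_to is a hypothetical helper: read uid_map lines on stdin
# ("<inside-start> <outside-start> <length>") and print the outside ID
# that UID 0 inside the namespace maps to.
root_maps_to() {
    awk '$1 == 0 { print $2 }'
}

# On the host: your identity and the UID mapping of the current shell.
id
if [ -r /proc/self/uid_map ]; then
    root_maps_to < /proc/self/uid_map
fi

# Inside a container (placeholder name; adjust to one that exists).
CONTAINER="${CONTAINER:-my-httpd}"
if command -v docker >/dev/null 2>&1 && docker inspect "$CONTAINER" >/dev/null 2>&1; then
    docker exec "$CONTAINER" id
    docker exec "$CONTAINER" cat /proc/self/uid_map | root_maps_to
fi
```

As an illustration, a mapping line such as 0 100000 65536 would mean that root inside the container corresponds to UID 100000 on the host.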
Process ID (PID)
Each process running on the host has a unique Process ID (PID) assigned to it. The PIDs of processes running inside a container are separate from the PIDs assigned by the host. Due to process ID isolation, a container can’t access the details of processes running on its host.
To fetch the list of PID namespaces, you can filter the output of lsns by namespace type.
Just like other processes, containers also have PIDs assigned to them by the host. You can fetch the PID of a running container by inspecting it with your container engine.
The ps aux command can be used to list the running processes on the system along with their details.
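The commands for this section could look like the following sketch. It assumes Docker as the container engine and uses my-httpd as a placeholder container name. lsns -t filters by namespace type, and the .State.Pid field of docker inspect holds the host PID of the container’s main process.

```shell
# PID namespace of the current shell, in the form pid:[inode].
self_pid_ns="$(readlink "/proc/$$/ns/pid")"
echo "this shell runs in $self_pid_ns"

# List PID namespaces only (root may be needed to see all of them).
if command -v lsns >/dev/null 2>&1; then
    lsns -t pid || echo "lsns could not list PID namespaces" >&2
fi

# Placeholder container name; adjust to one that exists on your system.
CONTAINER="${CONTAINER:-my-httpd}"
if command -v docker >/dev/null 2>&1 && docker inspect "$CONTAINER" >/dev/null 2>&1; then
    # Host PID of the container's main process.
    container_pid="$(docker inspect --format '{{.State.Pid}}' "$CONTAINER")"
    echo "container PID on the host: $container_pid"
    # The same process as seen from the host's process list.
    ps -o pid,user,args -p "$container_pid"
fi
```

Running ps aux inside the container should list only the container’s own processes, typically with its main process as PID 1.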
Mount (mnt)
By using different mount namespaces for different processes we can ensure that they won’t be able to access each other’s files.
You can use the df command to view the filesystems mounted on your system.
A container has its own file system hierarchy, which can be viewed by running the df command in its shell.
It can also be viewed from the host in the file /proc/<CONTAINER_PID>/mounts.
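A rough sketch of these three views, assuming Docker and a placeholder container named my-httpd. The mount_points helper is made up for this example and simply extracts the second column of the mounts file.

```shell
# mount_points is a hypothetical helper: read lines in /proc/<PID>/mounts format
# ("<device> <mount point> <type> <options> <dump> <pass>") and print the mount points.
mount_points() {
    awk '{ print $2 }'
}

# Filesystems mounted in the current mount namespace.
df -h
if [ -r /proc/self/mounts ]; then
    mount_points < /proc/self/mounts
fi

# Placeholder container name; adjust to one that exists on your system.
CONTAINER="${CONTAINER:-my-httpd}"
if command -v docker >/dev/null 2>&1 && docker inspect "$CONTAINER" >/dev/null 2>&1; then
    # The container's own view of its filesystems.
    docker exec "$CONTAINER" df -h
    # The same mounts seen from the host; reading this file usually requires root.
    container_pid="$(docker inspect --format '{{.State.Pid}}' "$CONTAINER")"
    mount_points < "/proc/$container_pid/mounts"
fi
```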
Unix Timesharing System (UTS)
The Unix Timesharing System (UTS) namespace allows containers to have their own hostnames. We can verify this by running the hostname command on the host and then inside a container, and comparing the output.
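The comparison could be scripted as follows. This is a sketch that assumes Docker and a placeholder container named my-httpd. Unless a hostname is set explicitly, Docker uses the short container ID as the container’s hostname.

```shell
# Hostname as seen in the current UTS namespace.
# uname -n is used as a fallback where the hostname command is not installed.
host_name="$(hostname 2>/dev/null || uname -n)"
echo "host: $host_name"

# Placeholder container name; adjust to one that exists on your system.
CONTAINER="${CONTAINER:-my-httpd}"
if command -v docker >/dev/null 2>&1 && docker inspect "$CONTAINER" >/dev/null 2>&1; then
    container_name="$(docker exec "$CONTAINER" hostname)"
    echo "container: $container_name"
fi
```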
Network
Each container has an IP address and network ports assigned to it by its network namespace. This allows the developer to run multiple processes inside the container and expose them over different network ports.
To access or communicate with a process inside the container, port forwarding should be established from the host.
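To see this isolation in practice, you can compare the network interfaces visible on the host with those visible inside a container, and list the forwarded ports. The sketch below assumes Docker and a placeholder container named my-httpd. The iface_names helper is made up for this example and parses /proc/net/dev, which lists the interfaces of the current network namespace.

```shell
# iface_names is a hypothetical helper: read /proc/net/dev content on stdin,
# skip its two header lines, and print the interface names.
iface_names() {
    awk -F: 'NR > 2 { gsub(/ /, "", $1); print $1 }'
}

# Interfaces in the host's network namespace.
if [ -r /proc/net/dev ]; then
    iface_names < /proc/net/dev
fi

# The placeholder container could have been started with port forwarding like this:
#   docker run -d --name my-httpd -p 8080:80 docker.io/library/httpd:latest
CONTAINER="${CONTAINER:-my-httpd}"
if command -v docker >/dev/null 2>&1 && docker inspect "$CONTAINER" >/dev/null 2>&1; then
    # Port mappings between the host and the container.
    docker port "$CONTAINER"
    # Interfaces in the container's network namespace.
    docker exec "$CONTAINER" cat /proc/net/dev | iface_names
fi
```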
Inter-Process Communication (IPC)
Processes in the same IPC namespace can share resources such as shared memory, semaphores, and message queues. Keeping separate IPC namespaces ensures that the processes inside a container cannot access the resources used by the host’s processes.
Time
Time namespaces have been available since the release of Linux kernel 5.6. In the future, containers may be able to have a different time than their host.
Control Groups (Cgroups)
A control group is created to allocate the host’s resources effectively among its processes. Cgroups are hierarchical, i.e. a child Cgroup can be spawned from a parent and inherits certain attributes from it.
Processes placed in a Cgroup can be prioritized, paused, resumed, or removed based on the resources allocated to them. Cgroups also help in monitoring the resources used by particular processes.
If you are using an OS with the systemd init system (to verify this, you can use the command ps -p 1 -o comm=), you can use the command systemctl list-units to list the systemd units through which Cgroups are organized. It prints a table containing each unit’s name, state, and description. Unit names are in the form <name>.<type>, like sys-devices-platform-serial8250-tty-ttyS0.device.
To view the hierarchy of Cgroups, you can use the command systemd-cgls. It presents Cgroups as a tree structure.
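Independently of systemd, the kernel reports the Cgroup membership of every process in /proc/<PID>/cgroup. The following sketch, assuming a Linux host, prints the Cgroup of the current shell and then the tree view if systemd-cgls is installed. The cgroup_path helper is a name made up for this example.

```shell
# cgroup_path is a hypothetical helper: read lines in /proc/<PID>/cgroup format
# ("<hierarchy-ID>:<controller-list>:<cgroup-path>") and print the cgroup path.
cgroup_path() {
    cut -d: -f3-
}

# Cgroup membership of the current shell.
if [ -r /proc/self/cgroup ]; then
    cgroup_path < /proc/self/cgroup
fi

# Tree view of all Cgroups, if the systemd tooling is available.
if command -v systemd-cgls >/dev/null 2>&1; then
    systemd-cgls --no-pager || echo "systemd-cgls could not read the cgroup tree" >&2
fi
```

Container engines expose Cgroup limits through their own options. With Docker, for example, the --memory and --cpus flags of docker run set limits for a single container.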
Secure Computing (Seccomp)
Using Secure Computing (or seccomp), you can restrict the system calls a process can make to the host’s kernel.
A seccomp profile is a definition, stored in a file, of the system calls that are restricted and allowed. The default seccomp profile used by Docker is default.json.
Docker allows you to define your own seccomp profile for a container in JSON format.
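As an illustration, a small custom profile might allow everything except changing file permissions. This is a minimal sketch rather than a recommended policy: the choice of blocked system calls is an assumption made for the example, and the fields a container engine accepts may vary between versions.

```shell
# Write an example profile to a temporary file.
PROFILE="$(mktemp)"
cat > "$PROFILE" <<'EOF'
{
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {
            "names": ["chmod", "fchmod", "fchmodat"],
            "action": "SCMP_ACT_ERRNO"
        }
    ]
}
EOF
echo "profile written to $PROFILE"

# The profile could then be applied to a container like this (left as a comment
# so that no container is started unintentionally):
#   docker run --rm --security-opt seccomp="$PROFILE" \
#       docker.io/library/httpd:latest chmod 600 /etc/hostname
```

With such a profile applied, the chmod call inside the container is expected to fail with a permission error, while other system calls remain allowed.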
Security-Enhanced Linux (SELinux)
SELinux is a security architecture for GNU/Linux-based operating systems that defines access controls for files and processes. It is enforced on users and processes to restrict their access to resources.
SELinux checks the SELinux context of a file or process to make access control decisions. To view the SELinux context of a file, use the command ls -Z <FILENAME>, and to view it for a process, use the command ps -eZ | grep <PROCESS_NAME>.
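An SELinux context has the form user:role:type:level, and the type field is the part most policy rules are written against. The sketch below extracts it from a context string and, on a system with SELinux tooling installed, prints some real contexts. The selinux_type helper and the sample context are illustrative, not output captured from a real system.

```shell
# selinux_type is a hypothetical helper: read SELinux contexts on stdin
# ("<user>:<role>:<type>:<level>") and print the type field.
selinux_type() {
    cut -d: -f3
}

# Illustrative context string, similar to what a container process may carry.
echo 'system_u:system_r:container_t:s0:c123,c456' | selinux_type   # prints container_t

# Real contexts, only where the SELinux tooling is available.
if command -v getenforce >/dev/null 2>&1; then
    getenforce                 # current SELinux mode
    ls -Zd /etc                # context of a directory
    ps -eZ | grep httpd || echo "no httpd process found"
fi
```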