Kubernetes namespaces isolation - what it is, what it isn't, life, universe and everything

Written by Gaetan Ferry - 26/03/2021 - in Pentest
When speaking about Cloud, containers, orchestration and that kind of thing, Kubernetes is the name that comes to mind. We meet it in a lot of situations, ranging from microservices implementations to user-oriented self-service hosting. But developers do not always understand the limits of the system and the mechanisms it implements. In particular, we commonly encounter misunderstandings about namespace isolation. Time to bring some light into this darkness.

What the heck is a Kubernetes namespace?

If you spent the last six years lost in the interstellar void between Jaglan and the Axel nebula, maybe you have never heard of Kubernetes. Otherwise, chances are you have at least basic knowledge of this container orchestration system. It offers a lot of features to deploy services on pods, hosted on nodes which, in turn, are managed by the control plane and the kube API server. If any of those names is unknown to you, I encourage you to have a look at the project's documentation before continuing this read.

One of the most important features of the Kubernetes system is namespaces. Namespaces, namespaces, everywhere namespaces. When hearing "namespace", Linux people will think about kernel namespaces, a feature used to isolate resources from each other and used to implement containers (amusingly, containers in Kubernetes pods are thus built on Linux kernel namespaces while also living inside a Kubernetes namespace, which is not the same kind of namespace at all). XML experts, on the contrary, will probably think about a mechanism to avoid name clashes in documents. Two domains, two visions, and a divergence that can greatly impact how security is perceived.
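To make the Linux flavor concrete: every process already runs inside one kernel namespace of each kind, and the kernel exposes them under /proc. These have nothing to do with Kubernetes namespaces:

```shell
# Each symlink identifies a kernel namespace the current shell belongs to.
# These are the namespaces containers are built from, not Kubernetes ones.
readlink /proc/$$/ns/net /proc/$$/ns/pid /proc/$$/ns/mnt
# prints identifiers of the form net:[4026531992], one per line
```

Two processes in the same kernel namespace see the same identifier; containers get fresh ones.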

On the Kubernetes side, namespaces are defined as follows (https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/):

Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces.

Which does not really help us decide between isolation and name disambiguation. The documentation also states:

Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces.

One point for name disambiguation. However, if you have already worked with Kubernetes, you probably know that limitations are also set up between namespaces. To figure out what is really going on, let's run some tests.

Testing environment

To investigate namespace isolation, we are going to use a very simple Kubernetes cluster with two user-defined namespaces: one and two. In each of them, we create a simple pod based on the busybox system image. The following configuration is used for their deployment:

apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  namespace: one
spec:
  containers:
    - name: busybox
      image: busybox
      resources:
        limits:
          memory: "128Mi"
          cpu: "500m"
      stdin: true

The cluster will hold three nodes, including the control plane. All configurations are left on "sane" defaults for the Kubernetes version used: 1.20.2.

# kadmin get nodes
NAME        STATUS   ROLES                  AGE   VERSION
debian      Ready    control-plane,master   29d   v1.20.2
k8s-node1   Ready    <none>                 28d   v1.20.2
k8s-node2   Ready    <none>                 26d   v1.20.2

In the rest of this article, we will use the alias kadmin to indicate the kubectl command being executed as admin on the cluster.
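In shell terms, this is nothing more than an alias of the following shape (the kubeconfig path is the kubeadm default and is an assumption about this setup):

```shell
# Hypothetical definition of the kadmin alias used throughout this article:
# plain kubectl, driven by the cluster-admin kubeconfig generated by kubeadm.
alias kadmin='kubectl --kubeconfig=/etc/kubernetes/admin.conf'
```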

Disclaimer:

All the statements of this post are bound to the Kubernetes version used. The project evolves quickly, in particular regarding default configuration for authorization.

API isolation

The most trivial thing you could expect namespaces to protect is API resources. Elements of a namespace should not be able to list, read or modify elements of another namespace. Of course, this is something that is indeed implemented. Or maybe it is a bit more complicated than that.

In fact, Kubernetes does not implement any kind of privilege separation itself. Instead, it delegates those controls to a dedicated authorization plugin. By default, RBAC (Role-Based Access Control) is used. This can be verified by looking for the --authorization-mode flag in the API server's command line options.

# kadmin --namespace kube-system get pods kube-apiserver-master -o yaml
[...]
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=10.55.56.137
    - --allow-privileged=true
    - --authorization-mode=Node,RBAC
[...]

It is therefore up to RBAC to decide whether or not cross-namespace API resource access is allowed. And guess who configures RBAC policies? You, as a cluster admin.

By default, each namespace only holds one service account, the default one, and RBAC does not grant it any permission. It is automatically attributed to all the pods and other components created in the namespace and is not allowed to perform any action, whatever the namespace. You cannot even get your own pod's configuration or the secret associated with your account.

root@busybox1:~ # ./kubectl get pods busybox1
Error from server (Forbidden): pods "busybox1" is forbidden: User "system:serviceaccount:one:default" cannot get resource "pods" in API group "" in the namespace "one"

If asked politely, the API server will even explain to you that you cannot do anything except listing your nonexistent permissions and accessing two or three uninteresting endpoints.

# ./kubectl --namespace=one auth can-i --list
Resources                                       Non-Resource URLs                     Resource Names   Verbs
selfsubjectaccessreviews.authorization.k8s.io   []                                    []               [create]
selfsubjectrulesreviews.authorization.k8s.io    []                                    []               [create]
                                                [/.well-known/openid-configuration]   []               [get]
                                                [/api/*]                              []               [get]
                                                [/api]                                []               [get]
                                                [/apis/*]                             []               [get]
                                                [/apis]                               []               [get]
                                                [/healthz]                            []               [get]
                                                [/healthz]                            []               [get]
                                                [/livez]                              []               [get]
                                                [/livez]                              []               [get]
                                                [/openapi/*]                          []               [get]
                                                [/openapi]                            []               [get]
                                                [/openid/v1/jwks]                     []               [get]
                                                [/readyz]                             []               [get]
                                                [/readyz]                             []               [get]
                                                [/version/]                           []               [get]
                                                [/version/]                           []               [get]
                                                [/version]                            []               [get]
                                                [/version]                            []               [get]

So we could say isolation is pretty well enforced by default. But this default account behavior might quickly become limiting, and you, as a developer, will probably want to extend it.

Roles are defined inside namespaces. For example, the following role can be created inside the one namespace.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: getSecrets
  namespace: one
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]

No option exists for a rule to specify the namespace it applies to: all rules only apply to the namespace the role belongs to. If we apply this role to the default service account, using an appropriate RoleBinding, the busybox1 pod can now query its namespace's secrets.

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: busybox1-getSecrets
  namespace: one
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: getSecrets
subjects:
  - apiGroup: ""
    kind: ServiceAccount
    name: default

root@busybox1:~ # ./kubectl auth can-i --list
Resources                                       Non-Resource URLs                     Resource Names   Verbs
selfsubjectaccessreviews.authorization.k8s.io   []                                    []               [create]
selfsubjectrulesreviews.authorization.k8s.io    []                                    []               [create]
                                                [/.well-known/openid-configuration]   []               [get]
                                                [/api/*]                              []               [get]
                                                [/api]                                []               [get]
[...]
secrets                                         []                                    []               [get]

However, it is still not possible to access the same resource in the two namespace.

root@busybox1:~ # ./kubectl --namespace=two get secrets default-token-szvk5
Error from server (Forbidden): secrets "default-token-szvk5" is forbidden: User "system:serviceaccount:one:default" cannot get resource "secrets" in API group "" in the namespace "two"

Even with a good improbability generator, one can hardly design a set of (account | role | role binding) that would grant access to another namespace. Indeed, roles, service accounts and role bindings are all bound to a namespace. Trying to bind a role from namespace two to a service account in namespace one is simply impossible, as a role binding is namespace-scoped and cannot see roles from outside its namespace.

But don't catch your breath too fast: Kubernetes also defines cluster roles and cluster role bindings. Those are similar to normal roles but without the namespace limitation. Of course, using them can break through the isolation we just saw; that is what they are meant to do.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: clusterGetSecrets
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: busybox1-getSecrets-cluster
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: clusterGetSecrets 
subjects:
  - apiGroup: ""
    kind: ServiceAccount
    name: default
    namespace: one

Deploying this configuration in our cluster, the busybox1 pod can now access any secret, including those from the two namespace.

root@busybox1:~ # ./kubectl --namespace=two get secrets default-token-szvk5
NAME                  TYPE                                  DATA   AGE
default-token-szvk5   kubernetes.io/service-account-token   3      28d

So, all in all, Kubernetes and RBAC do the job of isolating API resources between namespaces. But it is still possible to break the design with a badly configured cluster role.

Network isolation

Networking in Kubernetes really is a thing. Or, in fact, no, it is not. A bit like what RBAC does for authorization, everything is left at the discretion of a third-party component. This component is called a Container Network Interface (CNI) based Pod network add-on and is deployed during the cluster's installation.

There are many such add-ons available, which are all supposed to implement the same networking model. Our testing environment is deployed with the Calico add-on, but switching to a different one should not change most of the observations we will make.

It does not seem exceptionally unreasonable to hope that Kubernetes or its networking add-on sets up a network isolation policy based on pod namespaces. After all, if each namespace conceptually has to be seen as a different cluster, this even seems logical. Let's check the network configuration of our good old busybox1 pod.

root@busybox1:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1350 qdisc noqueue
    link/ether 5e:8a:42:b9:2f:ee brd ff:ff:ff:ff:ff:ff
    inet 10.42.36.67/32 brd 10.42.36.67 scope global eth0
       valid_lft forever preferred_lft forever
root@busybox1:~ # ip r
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0 scope link 

So each pod has a private IPv4 address with a /32 netmask and uses a link-local address as its default route. Calico plays the role of the default gateway and should therefore be responsible for the filtering. Without getting into the details of the internals of Kubernetes networking (this could be the topic of an upcoming post), let's see what happens in practice.

Having deployed two pods in two different namespaces, we will bind an arbitrary port on one and reach it from the other.

root@busybox2: # echo "Don't panic!" | nc -vv 10.42.36.67 42
10.42.36.67 (10.42.36.67:42) open
################
root@busybox1: # nc -s 10.42.36.67 -lvp 42
listening on 10.42.36.67:42 ...
connect to 10.42.36.67:42 from 10.42.36.68:33353 (10.42.36.68:33353)
Don't panic!

It obviously appears that absolutely no isolation exists between the namespaces from a network point of view. On our setup, pods busybox1 and busybox2 are hosted on the same node:

# kadmin get pods --all-namespaces -o=custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,NODE:.spec.nodeName
NAME                                       NAMESPACE         NODE
busybox1                                   one               k8s-node1
busybox2                                   two               k8s-node1

Could there be some dark routing magic going on in the nodes themselves? To remove any uncertainty, we create a new pod on the second node and check again.

# kadmin get pods --all-namespaces -o=custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,NODE:.spec.nodeName
NAME                                       NAMESPACE         NODE
busybox1                                   one               k8s-node1
busybox22                                  two               k8s-node2

root@busybox22: # echo "Don't panic!" | nc -v 10.42.36.67 42
10.42.36.67 (10.42.36.67:42) open

root@busybox1: # nc -s 10.42.36.67 -lvp 42
listening on 10.42.36.67:42 ...
connect to 10.42.36.67:42 from 10.42.169.129:39423 (10.42.169.129:39423)
Don't panic!

End of the matter: no network filtering happens between namespaces, whatever the deployment topology may be.

Cluster isolation

Let's suppose that, at some point in spacetime, someone, why not a whale or a flowerpot, gets to reach high privileges on a given namespace. What constitutes "high privileges" on a namespace is not straightforward. While definitions differ (see this article from CyberArk about risky permissions), in our case we will consider it to be any role that allows you to create pods, or any role from which such a role can be compromised in any number of hops.

For our testing purposes, we will give the default service account in the one namespace full access to it.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: admin
  namespace: one
rules:
  - apiGroups: [""]
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: busybox1-admin
  namespace: one
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: admin
subjects:
  - apiGroup: ""
    kind: ServiceAccount
    name: default

The admin role literally allows accessing any resource in the one namespace with any verb. In particular, the default service account will be able to create pods, with whatever configuration it chooses. A good question here would be: does any resource accessible within a namespace have the ability to reach out-of-namespace objects? And the answer would be yes, as there exist privileged pods.

Privileged pods are just like normal pods but with a few added permissions. Depending on their configuration, they might be able to access their host's namespaces (as in Linux namespaces; things are getting tricky, we might need a namespace for namespace kinds) of any flavor: filesystem, network, process, etc.

In concrete terms, a privileged pod can completely impersonate the node it is hosted on. For that purpose, the following configuration creates a privileged pod with full access to the filesystem, PID and network namespaces (Linux namespaces, that is) of its host node.

apiVersion: v1
kind: Pod
metadata:
  name: priv-pod
spec:
  containers:
  - name: shell
    image: busybox
    stdin: true
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-root-volume
      mountPath: /host
      readOnly: true
  volumes:
  - name: host-root-volume
    hostPath:
      path: /
  hostNetwork: true
  hostPID: true
  restartPolicy: Always

Applying this configuration from our busybox1 pod is possible thanks to our brand new admin privileges. And it is then possible to get a shell on the new pod's node with a single command.

root@busybox1:~ # ./kubectl apply -f priv_pod.yaml 
pod/priv-pod created
root@busybox1:~ # ./kubectl exec -it priv-pod -- chroot /host
root@k8s-node2:/# id
uid=0(root) gid=0(root) groups=0(root),10(uucp)
root@k8s-node2:/# hostname
k8s-node2

What does Kubernetes do to ensure the isolation between namespaces once you reach this position? Not much, to be honest. At least, not much in a default configuration. Indeed, if you remember a bit about Kubernetes concepts, you might recall that nodes are non-namespaced resources which host pods from all namespaces, depending on the overall process load of the cluster. Pods, in turn, are not much more than Docker containers.

Therefore, having access to the host node with high privileges, you also get access to the Docker instance it hosts and the pods in it, whatever their namespace may be.

root@k8s-node2:/# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
f4ae85dbf2b9        busybox                "sh"                     15 minutes ago      Up 15 minutes                           k8s_shell_priv-pod_one_c5b5e0f8-06c7-444f-8e6c-a10c0d334f4c_0
e2a540087af0        k8s.gcr.io/pause:3.2   "/pause"                 15 minutes ago      Up 15 minutes                           k8s_POD_priv-pod_one_c5b5e0f8-06c7-444f-8e6c-a10c0d334f4c_0
92b3e44fef56        busybox                "sh"                     3 hours ago         Up 3 hours                              k8s_busybox_busybox22_two_9dc68143-52b1-439c-9d28-2b234d54a0a0_0
57645876f8ff        k8s.gcr.io/pause:3.2   "/pause"                 3 hours ago         Up 3 hours                              k8s_POD_busybox22_two_9dc68143-52b1-439c-9d28-2b234d54a0a0_0
2565abe81bbb        919a16510f41           "/sbin/tini -- calic…"   46 hours ago        Up 46 hours                             k8s_calico-typha_calico-typha-5d5cd6d49c-mkqhz_calico-system_22c031f1-7536-4c5e-933b-23e2c4ccc87a_2
c6128db2d4c6        183b53858d7d           "start_runit"            46 hours ago        Up 46 hours                             k8s_calico-node_calico-node-27flb_calico-system_e9f1a618-a8d0-4d97-ad04-d4d1aa60dcf8_1
80f82e021df5        43154ddb57a8           "/usr/local/bin/kube…"   46 hours ago        Up 46 hours                             k8s_kube-proxy_kube-proxy-phcgj_kube-system_c807aeef-4297-4f49-bf43-ef9522c1a917_1
51d93b60bbf6        k8s.gcr.io/pause:3.2   "/pause"                 46 hours ago        Up 46 hours                             k8s_POD_calico-node-27flb_calico-system_e9f1a618-a8d0-4d97-ad04-d4d1aa60dcf8_1
6df276576f62        k8s.gcr.io/pause:3.2   "/pause"                 46 hours ago        Up 46 hours                             k8s_POD_kube-proxy-phcgj_kube-system_c807aeef-4297-4f49-bf43-ef9522c1a917_1
e87b3da21b6e        k8s.gcr.io/pause:3.2   "/pause"                 46 hours ago        Up 46 hours                             k8s_POD_calico-typha-5d5cd6d49c-mkqhz_calico-system_22c031f1-7536-4c5e-933b-23e2c4ccc87a_1

Reaching the pods of the two namespace only requires a docker exec call.

root@k8s-node2:/# docker exec -it 92b3e44fef56 /bin/sh
/ # hostname
busybox22
/ # cat /var/run/secrets/kubernetes.io/serviceaccount/namespace 
two

Of course, with access to a node, you get far more than just cross-namespace access. But that is not the point of our topic. All in all, what we see is that Kubernetes does not ensure any isolation past a certain point.

Putting it all together

Much like a lot of other software, including in other technical segments, Kubernetes does not do the job in your place. If you do not know what you are doing, you might end up shooting yourself in the foot during your deployment. Understanding what the software does and does not do is key to a safer environment.

To clearly summarize:

  • Kubernetes ensures a segregation between namespaces at the API level when not instructed differently.
  • Kubernetes does not ensure any network level isolation. It's all an open world.
  • Kubernetes does not isolate namespaces at the cluster level. If a namespace is compromised, your cluster is compromised, whatever the number of intrusion steps required may be.

To conclude on that part: unlike what some developers or administrators tend to think, Kubernetes is not a security software. It has been designed as an orchestration system and does that job great. If you need security, it is up to you to implement it.

Sir, please halp

If you are one of those people who thought their namespaces were highly isolated, you might ask yourself how to reach the security level you thought you had. The end of this article is for you.

API level isolation

As we said, API resource isolation is not really a concern by default. Only attaching overly powerful cluster roles to namespaced objects can threaten the default isolation. You can list the existing cluster roles and cluster role bindings using the kubectl tool.

# kadmin get clusterroles
# kadmin get clusterrolebindings

You can even list the associations between the roles and their subject in a single command.

# kadmin get clusterrolebindings --output=custom-columns='ROLE:.roleRef.name,SUBJECTS:.subjects[*].name'
ROLE                                                                   SUBJECTS
clusterGetSecrets                                                      default
calico-kube-controllers                                                calico-kube-controllers
calico-node                                                            calico-node
calico-typha                                                           calico-typha
cluster-admin                                                          system:masters
kubeadm:get-nodes                                                      system:bootstrappers:kubeadm:default-node-token
[...]

The only thing left is reviewing the permission sets and cleaning up all cluster-scoped privileges that appear superfluous. Easy job, isn't it?

Network level isolation

As Kubernetes delegates network management to a third-party component, the solution to set up namespace isolation from a network point of view depends on the plugin you use. For Calico, the add-on used in our example, network policies allow filtering access to cluster components.

Detailed information about network policies can be found in the official Calico documentation. In essence, if all you want is to prevent cross-namespace network access, a few simple steps can be followed.

First, apply a label to the namespace you want to isolate. Here we add the network=one label to the one namespace.

apiVersion: v1
kind: Namespace
metadata:
  name: one
  labels:
    network: one

Then apply the following network policy, which only allows communications from the namespace to itself, using the calico kubectl plugin.

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: isolate-one
spec:
  namespaceSelector: network == 'one'
  ingress:
  - action: Allow
    protocol: TCP
    source:
      namespaceSelector: network == 'one'

# DATASTORE_TYPE=kubernetes KUBECONFIG=/etc/kubernetes/admin.conf kubectl calico apply -f example_isolation.yaml
Successfully applied 1 'GlobalNetworkPolicy' resource(s)

That way, the one namespace is now unreachable from the two one. (Note to myself: choose better namespace names next time.)

root@busybox1: # nc -s 10.42.36.67 -lvp 42
listening on 10.42.36.67:42 ...

root@busybox22: # echo "Don't panic!" | nc -w 2 -v 10.42.36.67 42
nc: 10.42.36.67 (10.42.36.67:42): Connection timed out

This is the most minimal namespace network isolation possible. It does a great job if your namespaces host totally different projects that are not meant to interact with each other. This is the most common situation we encounter.

However, Calico and its network policies offer many more features than that. You can apply finer-grained filtering to allow partial interaction between namespaces or, on the contrary, restrict accesses between endpoints inside a namespace.
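For instance, here is a sketch of what such a finer-grained policy could look like, letting pods from the two namespace reach the one namespace on TCP port 443 only. The policy name and port are our own choices, and we assume a network=two label was applied to the two namespace the same way as above:

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: one-allow-https-from-two   # hypothetical name
spec:
  namespaceSelector: network == 'one'
  ingress:
  # Pods in namespace one still accept anything from their own namespace...
  - action: Allow
    source:
      namespaceSelector: network == 'one'
  # ...while namespace two may only reach them on TCP 443.
  - action: Allow
    protocol: TCP
    source:
      namespaceSelector: network == 'two'
    destination:
      ports: [443]
```

Any traffic not matched by an Allow rule of the policy is denied for the selected endpoints.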

Hosting isolation and post compromise robustness

As we demonstrated previously, once an attacker gets access to a highly privileged account, the isolation between namespaces falls short. There are therefore two things you can do to increase your cluster's security:

  • Limit the impact of a compromise by setting up isolation manually.
  • Prevent access to high privileges in the namespaces.

Compromise impact reduction

In terms of security, having a dedicated node for each namespace would ensure the best isolation. However, this would totally go against the core principles of Kubernetes. Therefore, a good trade-off must be found. Depending on your infrastructure, the ideal solution might differ a bit:

  • Isolate critical namespaces on their own node.
  • Reserve a node for critical pods of each namespace.
  • Deploy a completely independent cluster for critical namespaces.

Apart from the last, most radical one, all solutions leverage the same Kubernetes mechanisms. The general idea is to reserve a set of nodes for special namespaces, pods or other components. That way, a compromise of any other node won't directly impact those more critical components.

Kubernetes provides node tainting and node affinity features that can be leveraged to set up some sort of isolation. Unfortunately, the software currently lacks the ability to restrict which tolerations pods may define. As tolerations are what allow pods to conform to the restrictions set on a node, a problem arises.

Tainting a node has the immediate effect of repelling all pods that do not tolerate the taint. For example, on our test cluster, if we want to reserve node1 for a special set of pods, we start by applying a taint to it.

root@debian:/etc/kubernetes/manifests# kadmin taint nodes k8s-node1 integrity=high:NoExecute
node/k8s-node1 tainted

Once done, only pods that have a toleration for the integrity=high taint will be able to be scheduled on the node. As we said before, restricting the tolerations a pod can define is not an easy task. Currently, the best one can achieve is using the PodTolerationRestriction admission controller. It overrides the tolerations of all the pods defined in a namespace with values defined at the namespace level.
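For completeness, a pod meant to land on the reserved node would have to declare a matching toleration. A minimal sketch, with a pod name of our choosing:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trusted-pod   # hypothetical name
spec:
  containers:
  - name: busybox
    image: busybox
    stdin: true
  # Matches the integrity=high:NoExecute taint applied above; without it,
  # the pod is never scheduled on (or is evicted from) k8s-node1.
  tolerations:
  - key: "integrity"
    operator: "Equal"
    value: "high"
    effect: "NoExecute"
```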

For example, we could add the following settings to the two namespace in order to prevent all pods defined in it from declaring the integrity=high toleration, therefore preventing them from landing on our restricted node.

  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[]'
    scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[]'
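In context, those annotations sit directly on the namespace object. A minimal sketch of the full manifest, with the empty lists simply forbidding all tolerations:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: two
  annotations:
    # No toleration is injected into pods by default...
    scheduler.alpha.kubernetes.io/defaultTolerations: '[]'
    # ...and pods may not declare any toleration of their own.
    scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[]'
```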

There are a number of caveats though:

  • This feature is in alpha release. It might get you in trouble one way or another.
  • The exclusion works in blacklist mode: you need to explicitly deny the taint in each namespace. From a security point of view, this is far from ideal. It is also highly impractical.
  • If you use tolerations for other purposes, setting this up will get even more complicated.

Other solutions would require implementing a custom admission controller. This could work but is outside the scope of this article.

We end up with the only real solution: using different clusters for different severity levels. This is often overkill and requires significant overhead. Unfortunately, no satisfying solution exists at the moment, at least nothing that works out of the box. However, in most situations, you might find that preventing the privilege escalation in the first place is a sufficient security addition.

Escalation prevention

Regarding privilege escalation, remember that what we exploited was the pod creation privilege and the privileged pod mechanism. While it might be difficult to deny pod creation to all service accounts of your cluster, restricting access to privileged pods is possible using pod security policies.

Disclaimer:

Using pod security policies requires enabling the PodSecurityPolicy admission controller. Doing so without having any policy created and activated will prevent all users from creating pods.

After activating the PodSecurityPolicy admission controller, by adding PodSecurityPolicy to the --enable-admission-plugins command line option of the API server, it is possible to create so-called pod security policies that restrict the privileges of pods being created. A policy definition includes a whitelist of pod parameters that will be accepted upon creation. For example, the following policy prevents the use of the options we used to take over the hosting node.

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restrict-priv-pod
spec:
  privileged: false
  hostPID: false
  hostIPC: false
  hostNetwork: false
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'

This policy must then be applied to users or service accounts. This is done through RBAC roles and bindings, so that you can fine-tune the creation privileges. A role, or cluster role, must allow the use verb on our new policy to actually apply the restriction. For example, the following role allows using the restrictive policy in the one namespace.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: restricted-create
  namespace: one
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    verbs: ["use"]
    resourceNames:
      - restrict-priv-pod

To close the loop, the role must be applied to the entities we want to restrict. If your goal is to prevent all service accounts from creating privileged pods, using the system:serviceaccounts RBAC group is the way to go.

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prevent-priv-pod-all
  namespace: one
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: restricted-create
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts

With that in place, all service accounts from the one namespace are now unable to create privileged pods. Trying to do so will trigger an error listing the rejected parameters.

root@busybox1:~ # ./kubectl apply -f priv_pod.yaml 
Error from server (Forbidden): error when creating "priv_pod.yaml": pods "priv-pod" is forbidden: PodSecurityPolicy: unable to admit pod: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

Once again, DO NOT DO THIS AT HOME AS IS. The newly deployed configuration only targets the service accounts of the one namespace. As the admission controller works as a whitelist filter, all other users and accounts from all other namespaces won't be allowed to create pods anymore. We even ended up breaking our test cluster, as the control plane itself was unable to create pods, which might be a problem when you restart your API server to change its configuration ;).
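One common way out of that trap, sketched below with names of our own choosing, is to grant the use of the restrictive policy cluster-wide, and then bind more permissive policies only to the components that genuinely need them (those permissive policies and their bindings are not shown here):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted   # hypothetical name
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    verbs: ["use"]
    resourceNames: ["restrict-priv-pod"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restricted-all   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted
subjects:
  # Every authenticated user and every service account, cluster-wide,
  # can at least create pods matching the restrictive policy.
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:authenticated
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
```

This way, pod creation keeps working everywhere, but only under the restrictive policy unless a more specific binding says otherwise.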

Setting up pod security policies requires prior reflection on the real needs of the cluster and an element-by-element analysis of the required permissions. All configurations must then be deployed in a coordinated manner.

Summary and conclusions

We saw that Kubernetes is not a security software. It relies on third-party components for the management of critical mechanisms and does not offer fully secure defaults for them.

As a developer or system administrator, you need to keep in mind that you will only get the security properties you implement yourself. Some security measures will get you a big improvement for a minimal investment; that is the case for the network segmentation of the cluster. Others will require a bit of work or even turn out to be impractical.

In fact, Kubernetes can be compared to other software when it comes to security, in that you need to define what your real security needs are. If your cluster only hosts a single infrastructure, maybe you do not need to overtax yourself. On the contrary, if you host ultra-secret assets along with your website's pre-production, there are probably matters you should take care of.

In all cases, knowing what your software does, or does not do, is the first step on the way to security that fits your needs.

That's all folks