A Study of Docker Networking

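After yesterday's unconventional experiment, today I installed a CentOS 7 Linux virtual machine (kernel 3.10.0-693.el7.x86_64) in VMware Fusion on a Mac, so that I could run Docker in a real Linux environment and study its networking.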

Install Docker with the installation script, run directly from the command line: wget -qO- https://get.docker.com/ | sh

After installation, check the version:

[lihui@2018 ~]$ docker version
Client:
 Version:	18.02.0-ce
 API version:	1.36
 Go version:	go1.9.3
 Git commit:	fc4de44
 Built:	Wed Feb  7 21:14:12 2018
 OS/Arch:	linux/amd64
 Experimental:	false
 Orchestrator:	swarm

Starting Docker is simple:

[lihui@2018 ~]$ sudo systemctl start docker

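Once Docker is up, the first thing to notice is a new network interface, docker0. It reports NO-CARRIER and state DOWN because nothing is attached to it yet: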

[lihui@2018 ~]$ ip a show docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:a1:4c:bc:f9 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

It is actually a Linux bridge, which brctl confirms:

[lihui@2018 ~]$ brctl show docker0
bridge name	bridge id		STP enabled	interfaces
docker0		8000.0242a14cbcf9	no

Set that aside for now. The bridge simply connects containers to the host, so let's create a container first.

Check the images. There are none:

[lihui@2018 ~]$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

So pull one, for example ubuntu:17.10:

[lihui@2018 ~]$ sudo docker pull ubuntu:17.10
17.10: Pulling from library/ubuntu
c3b9c0688e3b: Pull complete
e9fb5affebb0: Pull complete
0f1378f511ad: Pull complete
96a961dc7843: Pull complete
16564141bc83: Pull complete
Digest: sha256:91680dba9ee085d9d4d33e907842dbecb8891e3cc9f81175ba991d2d27bd862f
Status: Downloaded newer image for ubuntu:17.10

Now the image is available locally:

[lihui@2018 ~]$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
ubuntu              17.10               1af812152d85        3 days ago          98.4MB

Start a container:

[lihui@2018 ~]$ sudo docker run -itd ubuntu:17.10 /bin/bash
d414e88d18b11ca3ce997617d895caa321a3ec2acfad6be43bb1507084ccbcf9

At this point, take a look at the system processes:

[lihui@2018 ~]$ ps aux | grep docker
root      1297  0.4  2.5 541312 52052 ?        Ssl  14:16   0:06 /usr/bin/dockerd
root      1303  0.0  1.1 226916 24296 ?        Ssl  14:16   0:01 docker-containerd --config /var/run/docker/containerd/containerd.toml
root      1650  0.0  0.1   8916  2932 ?        Sl   14:39   0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/d414e88d18b11ca3ce997617d895caa321a3ec2acfad6be43bb1507084ccbcf9 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
lihui     1711  0.0  0.0 112676   980 pts/0    R+   14:40   0:00 grep --color=auto docker

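The process with PID 1650 is new (the long container ID in its arguments gives it away). At first I took it for the container's own process, but it is not: it is a docker-containerd-shim process, one of which is started for every container. Ignore it for now.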

The host has also gained another interface:

[lihui@2018 ~]$ ip a show vethb5bcc47
5: vethb5bcc47@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
    link/ether ca:90:80:17:7b:85 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c890:80ff:fe17:7b85/64 scope link
       valid_lft forever preferred_lft forever

The name suggests that a veth pair was created, with one end attached to the Linux bridge docker0. I don't yet know the naming rule. Check the bridge information:

[lihui@2018 ~]$ brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.0242a14cbcf9	no		vethb5bcc47

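As expected, compared with the earlier bridge output, the interfaces column now has an entry, and it is exactly this veth. Next we need to find the peer it is directly connected to.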
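
The before/after comparison of the bridge can also be done programmatically. The following is only a minimal sketch in Python, assuming the two brctl show outputs have been captured as text; parse_brctl_show is a made-up helper name, not part of any tool, and it relies on the whitespace-separated layout seen in the output above.

```python
def parse_brctl_show(text):
    """Return {bridge_name: [attached interface names]} from `brctl show` output."""
    bridges = {}
    current = None
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        if len(fields) >= 3:
            # New bridge row: name, bridge id, STP flag, optional first interface.
            current = fields[0]
            bridges[current] = fields[3:]
        elif current is not None:
            # Continuation row: further interfaces of the current bridge.
            bridges[current].extend(fields)
    return bridges


# The two outputs shown earlier, before and after starting the container.
before = """bridge name\tbridge id\t\tSTP enabled\tinterfaces
docker0\t\t8000.0242a14cbcf9\tno
"""
after = """bridge name\tbridge id\t\tSTP enabled\tinterfaces
docker0\t\t8000.0242a14cbcf9\tno\t\tvethb5bcc47
"""

new_ports = set(parse_brctl_show(after)["docker0"]) - set(parse_brctl_show(before)["docker0"])
print(new_ports)
```

With the two outputs above, the only new port on docker0 is vethb5bcc47.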

To dig further, enter the newly created container with the attach command:

[lihui@2018 ~]$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
d414e88d18b1        ubuntu:17.10        "/bin/bash"         12 minutes ago      Up 12 minutes                           quirky_feynman
[lihui@2018 ~]$
[lihui@2018 ~]$
[lihui@2018 ~]$ sudo docker attach quirky_feynman
root@d414e88d18b1:/#
root@d414e88d18b1:/#

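Once inside, the awkward part is that many system tools are missing. Since this is an Ubuntu image, I tentatively tried apt-get, and it worked, which means the container can already reach the external network. That made me even more curious.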

root@d414e88d18b1:/# apt-get update
root@d414e88d18b1:/# apt-get install iproute iputils-ping

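What I mainly need are the ip and ping commands. Once they are installed, check the network inside the container: apart from lo there is a single interface, and it has been assigned an IP address.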

root@d414e88d18b1:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

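Look closely at the interface index: it is 4, and the name eth0@if5 says that its peer has index 5. On the host, vethb5bcc47@if4 has index 5 and points back at index 4, so this eth0 and the veth attached to the Linux bridge on the host should be the two ends of the same veth pair. Look at the host's interfaces again: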

[lihui@2018 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:0c:29:ff:bb:c7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.226.191/24 brd 192.168.226.255 scope global dynamic ens33
       valid_lft 1749sec preferred_lft 1749sec
    inet6 fe80::b6df:a4cc:9425:eeab/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:a1:4c:bc:f9 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:a1ff:fe4c:bcf9/64 scope link
       valid_lft forever preferred_lft forever
5: vethb5bcc47@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
    link/ether ca:90:80:17:7b:85 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c890:80ff:fe17:7b85/64 scope link
       valid_lft forever preferred_lft forever
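
The index matching between eth0@if5 and vethb5bcc47@if4 can be checked mechanically. This is a minimal sketch in Python, not a general tool: it assumes the two ip a outputs have been captured as text, and that the container's peer interface lives in the host's namespace (interface indexes are only meaningful per namespace). The names parse_links and find_veth_pairs are invented for illustration.

```python
import re

# Matches the first line of each interface in `ip a` output,
# e.g. "4: eth0@if5: <BROADCAST,...>" or "3: docker0: <...>".
IFACE_RE = re.compile(r"^(\d+): ([^:@\s]+)(?:@if(\d+))?:")


def parse_links(ip_a_output):
    """Return {ifindex: (name, peer_ifindex or None)}."""
    links = {}
    for line in ip_a_output.splitlines():
        m = IFACE_RE.match(line)
        if m:
            index, name, peer = m.groups()
            links[int(index)] = (name, int(peer) if peer else None)
    return links


def find_veth_pairs(host_links, container_links):
    """Pair interfaces whose index and peer index point at each other."""
    pairs = []
    for c_index, (c_name, c_peer) in container_links.items():
        host = host_links.get(c_peer)
        if host and host[1] == c_index:
            pairs.append((c_name, host[0]))
    return pairs


# First lines of each interface, copied from the outputs above.
container_output = """1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
"""
host_output = """1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
5: vethb5bcc47@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
"""

pairs = find_veth_pairs(parse_links(host_output), parse_links(container_output))
print(pairs)
```

For the outputs above this pairs the container's eth0 with the host's vethb5bcc47.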

Also look at the routing table inside the container:

root@d414e88d18b1:/# ip r
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2

And the host's routing table:

[lihui@2018 ~]$ ip r
default via 192.168.226.2 dev ens33 proto static metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.226.0/24 dev ens33 proto kernel scope link src 192.168.226.191 metric 100

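This makes things clear. The container's eth0 and the host's vethb5bcc47 are the two ends of one veth pair, created when the container starts. eth0 is placed inside the container and assigned an IP address in the same subnet as the host's bridge; the other end is attached to the host's Linux bridge. Because the two are a pair, every packet leaving the container's eth0 reaches the Linux bridge through the veth pair, where it is forwarded at layer 2 or layer 3. Specifically: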

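Between containers, traffic is forwarded at layer 2. Each container has an eth0 with an address in the same subnet (the subnet of docker0), each reaches the Linux bridge through its own veth pair, and the bridge forwards the frames directly.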

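When a container reaches the external network, traffic is routed at layer 3. Outbound: the container's routing table gives 172.17.0.1 as the next hop, which is reachable because the veth pair leads straight to the bridge. From the bridge the packet enters the host, follows the host's default route, and leaves for the external network (here VMware Fusion does NAT for the virtual machine, so the exact next hop can be ignored), just as the Linux virtual machine itself reaches the external network through NAT. Inbound: once a reply has entered the host (the Linux virtual machine) and the host has reversed its own NAT, it matches the second route, 172.17.0.0/16 on docker0. That network is directly connected (the route has no next hop; 172.17.0.1 is only the host's source address on it), so the packet is forwarded at layer 2 by MAC address to the container's eth0.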
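
Both cases come down to which route a destination matches. Here is a rough sketch in Python of that lookup, using longest-prefix matching over the two routing tables shown above and ignoring metrics and policy routing. The helper name lookup is hypothetical, and the destinations 172.17.0.3 (a second container) and 8.8.8.8 (an external host) are assumed examples that do not appear in the session above.

```python
import ipaddress


def lookup(routes, destination):
    """Pick the most specific route (longest prefix) that contains destination."""
    dst = ipaddress.ip_address(destination)
    matches = [r for r in routes if dst in ipaddress.ip_network(r["prefix"])]
    return max(matches, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)


# Routing tables copied from the `ip r` outputs; "via" is None for
# directly connected networks, and 0.0.0.0/0 stands for "default".
container_routes = [
    {"prefix": "0.0.0.0/0", "via": "172.17.0.1", "dev": "eth0"},
    {"prefix": "172.17.0.0/16", "via": None, "dev": "eth0"},
]
host_routes = [
    {"prefix": "0.0.0.0/0", "via": "192.168.226.2", "dev": "ens33"},
    {"prefix": "172.17.0.0/16", "via": None, "dev": "docker0"},
    {"prefix": "192.168.226.0/24", "via": None, "dev": "ens33"},
]

# Container to container: same subnet, no next hop, plain layer 2 forwarding.
print(lookup(container_routes, "172.17.0.3"))
# Container to the outside: default route, next hop is docker0's address.
print(lookup(container_routes, "8.8.8.8"))
# Host delivering a reply to the container: directly connected via docker0.
print(lookup(host_routes, "172.17.0.2"))
```

A route with no next hop means the destination is resolved by MAC address on that link, which is the layer 2 case; a route with a next hop is the layer 3 case.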

Drawing a simple diagram makes this easy to understand.

If you are interested in the address translation, look at the iptables NAT rules on the host. There should be a matching rule in the POSTROUTING chain:

[lihui@2018 ~]$ sudo iptables -t nat -vnL | grep 172.17 -C 3

Chain POSTROUTING (policy ACCEPT 171 packets, 15046 bytes)
 pkts bytes target     prot opt in     out     source               destination
   10   641 MASQUERADE  all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0
  229 19520 POSTROUTING_direct  all  --  *      *       0.0.0.0/0            0.0.0.0/0
  229 19520 POSTROUTING_ZONES_SOURCE  all  --  *      *       0.0.0.0/0            0.0.0.0/0
  229 19520 POSTROUTING_ZONES  all  --  *      *       0.0.0.0/0            0.0.0.0/0

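One rule in the middle stands out: it sits in the POSTROUTING chain, but its target is MASQUERADE rather than SNAT. A SNAT rule normally specifies the source address to translate to; this rule specifies none and instead uses the address of whichever interface the packet leaves through. MASQUERADE is mainly meant for cases such as dial-up connections, where the IP address can change at any time: because the address is picked up automatically, the rule is unaffected by the change.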
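
To make the rule's effect concrete, here is a toy model in Python of what that one MASQUERADE line decides. It is a simplified sketch, not how the kernel implements NAT: it ignores ports and connection tracking, the function name masquerade is invented, and the address 192.168.226.200 is an assumed example of a new DHCP lease.

```python
import ipaddress

DOCKER_SUBNET = ipaddress.ip_network("172.17.0.0/16")


def masquerade(src_ip, out_iface, iface_addrs):
    """Return the source address a packet leaves with under the rule above.

    The rule matches when the source is in 172.17.0.0/16 and the output
    interface is NOT docker0; the new source is the output interface's
    current address, looked up at the time the packet is sent.
    """
    if ipaddress.ip_address(src_ip) in DOCKER_SUBNET and out_iface != "docker0":
        return iface_addrs[out_iface]
    return src_ip


# Interface addresses taken from the host's `ip a` output.
iface_addrs = {"ens33": "192.168.226.191", "docker0": "172.17.0.1"}

# Container traffic leaving through ens33 is rewritten to the host's address.
print(masquerade("172.17.0.2", "ens33", iface_addrs))
# Container-to-container traffic stays on docker0 and is left alone.
print(masquerade("172.17.0.2", "docker0", iface_addrs))

# If ens33 gets a different address, the same rule keeps working unchanged.
renewed = dict(iface_addrs, ens33="192.168.226.200")
print(masquerade("172.17.0.2", "ens33", renewed))
```

The last call shows the point of MASQUERADE: nothing in the rule had to be edited when the outgoing address changed, whereas a SNAT rule with a fixed address would have to be rewritten.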

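But I still had not found what I cared about most: the network namespace. On the host, ip netns list shows no namespace at all. ip netns reads from the /var/run/netns directory, which does not exist at this point, so a few steps are needed.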

First find the container's PID. Either of the following two methods works:

[lihui@2018 ~]$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
d414e88d18b1        ubuntu:17.10        "/bin/bash"         2 hours ago         Up 2 hours                              quirky_feynman
[lihui@2018 ~]$
[lihui@2018 ~]$ sudo docker inspect -f '{{.State.Pid}}' quirky_feynman
1661

[lihui@2018 ~]$ cat /sys/fs/cgroup/memory/docker/d414e88d18b11ca3ce997617d895caa321a3ec2acfad6be43bb1507084ccbcf9/cgroup.procs
1661

Then create the /var/run/netns directory:

[lihui@2018 ~]$ sudo mkdir /var/run/netns
[lihui@2018 ~]$ ls -l /var/run/netns/

[lihui@2018 ~]$

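Create a symbolic link to the container process's net namespace entry. Note that the link is named after the container, quirky_feynman: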

[lihui@2018 ~]$ sudo ls -l /proc/1661/ns/
lrwxrwxrwx. 1 root root 0 3月  10 17:07 ipc -> ipc:[4026532628]
lrwxrwxrwx. 1 root root 0 3月  10 17:07 mnt -> mnt:[4026532626]
lrwxrwxrwx. 1 root root 0 3月  10 14:39 net -> net:[4026532631]
lrwxrwxrwx. 1 root root 0 3月  10 17:07 pid -> pid:[4026532629]
lrwxrwxrwx. 1 root root 0 3月  10 17:07 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 3月  10 17:07 uts -> uts:[4026532627]
[lihui@2018 ~]$ sudo ln -s /proc/1661/ns/net /var/run/netns/quirky_feynman
[lihui@2018 ~]$
[lihui@2018 ~]$

Query the namespaces again, and it finally shows up:

[lihui@2018 ~]$ ip netns list
quirky_feynman
[lihui@2018 ~]$ ip netns exec quirky_feynman ip a
Cannot open network namespace "quirky_feynman": Permission denied
[lihui@2018 ~]$ sudo ip netns exec quirky_feynman ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
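
The manual steps (create the directory, then symlink /proc/<pid>/ns/net under the container's name) can be wrapped in a small helper. This is a minimal sketch in Python under the same assumptions as the session above; expose_netns is a made-up name, and the demo at the bottom writes into a temporary directory so that it runs without root. Against the real /var/run/netns it would need root, just like the sudo commands above.

```python
import os
import tempfile


def expose_netns(pid, name, netns_dir="/var/run/netns"):
    """Symlink /proc/<pid>/ns/net into netns_dir so `ip netns` can list it."""
    os.makedirs(netns_dir, exist_ok=True)
    target = f"/proc/{pid}/ns/net"
    link = os.path.join(netns_dir, name)
    if os.path.islink(link):
        os.remove(link)  # replace a stale link left over from an earlier run
    os.symlink(target, link)
    return link


# Dry run in a scratch directory, using the PID and name from the session above.
with tempfile.TemporaryDirectory() as demo_dir:
    link = expose_netns(1661, "quirky_feynman", netns_dir=demo_dir)
    link_target = os.readlink(link)

print(link_target)
```

The PID would normally come from docker inspect -f '{{.State.Pid}}' as shown earlier. The link only stays valid while that process is alive, so it has to be recreated after the container restarts.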

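That clears up the whole flow. Once Docker is installed and started, it creates a Linux bridge, docker0. When a container is created, a veth pair is created with it: one end is placed in the container's network namespace and assigned an IP address in the same subnet as the host's docker0 bridge; the other end is attached to docker0, needs no IP address of its own, and serves only as a channel that carries packets to and from the eth0 directly connected to it inside the container.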
