After yesterday's rather unconventional approach, today I installed a CentOS 7 Linux VM directly in VMware Fusion on my Mac (kernel 3.10.0-693.el7.x86_64) and ran Docker inside that Linux environment to study its networking.
Install Docker with the convenience script, executed straight from the command line: wget -qO- https://get.docker.com/ | sh
After the installation, check the version
[lihui@2018 ~]$ docker version
Client:
 Version:       18.02.0-ce
 API version:   1.36
 Go version:    go1.9.3
 Git commit:    fc4de44
 Built:         Wed Feb  7 21:14:12 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm
Starting docker is simple
[lihui@2018 ~]$ sudo systemctl start docker
Once docker is up, the first thing to notice is a new network interface, docker0
[lihui@2018 ~]$ ip a show docker0
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:a1:4c:bc:f9 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
It is actually a Linux Bridge, which brctl confirms
[lihui@2018 ~]$ brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242a14cbcf9       no
Leave it alone for now; the bridge merely acts as the link between containers and the host. Let's create a container first.
Check the local images: empty
[lihui@2018 ~]$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
So just pull one, say ubuntu 17.10
[lihui@2018 ~]$ sudo docker pull ubuntu:17.10
17.10: Pulling from library/ubuntu
c3b9c0688e3b: Pull complete
e9fb5affebb0: Pull complete
0f1378f511ad: Pull complete
96a961dc7843: Pull complete
16564141bc83: Pull complete
Digest: sha256:91680dba9ee085d9d4d33e907842dbecb8891e3cc9f81175ba991d2d27bd862f
Status: Downloaded newer image for ubuntu:17.10
Now there is a local image
[lihui@2018 ~]$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
ubuntu              17.10               1af812152d85        3 days ago          98.4MB
Start a container
[lihui@2018 ~]$ sudo docker run -itd ubuntu:17.10 /bin/bash
d414e88d18b11ca3ce997617d895caa321a3ec2acfad6be43bb1507084ccbcf9
At this point, take a look at the system processes
[lihui@2018 ~]$ ps aux | grep docker
root      1297  0.4  2.5 541312 52052 ?   Ssl  14:16   0:06 /usr/bin/dockerd
root      1303  0.0  1.1 226916 24296 ?   Ssl  14:16   0:01 docker-containerd --config /var/run/docker/containerd/containerd.toml
root      1650  0.0  0.1   8916  2932 ?   Sl   14:39   0:00 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/d414e88d18b11ca3ce997617d895caa321a3ec2acfad6be43bb1507084ccbcf9 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
lihui     1711  0.0  0.0 112676   980 pts/0 R+  14:40   0:00 grep --color=auto docker
The process with PID 1650 is the new one (you can tell from the long container ID embedded in its arguments). I first assumed it was the container's own process, but it is not; it is a docker-containerd-shim, one of which is started for every container. We can ignore it for now.
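If you want to see the process that actually runs inside the container, as opposed to the shim, docker top lists it from the host's point of view (shown here against the container ID printed by docker run; the exact PIDs will of course differ on your machine):

[lihui@2018 ~]$ sudo docker top d414e88d18b1
# should list a single /bin/bash; the PID column is that process's PID as seen on the host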
Meanwhile another interface has appeared on the host
[lihui@2018 ~]$ ip a show vethb5bcc47
5: vethb5bcc47@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
    link/ether ca:90:80:17:7b:85 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c890:80ff:fe17:7b85/64 scope link
       valid_lft forever preferred_lft forever
The name suggests that a veth pair was created, with one end attached to the Linux Bridge docker0 (the naming rule itself is unknown for now). Check the bridge information to confirm
[lihui@2018 ~]$ brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242a14cbcf9       no              vethb5bcc47
As expected, compared with the earlier bridge output, the interfaces column now has one more entry, and it is exactly this veth. Next we need to find its directly connected peer.
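Incidentally, if ethtool happens to be installed on the host (it is not always present on a minimal CentOS 7 install), the veth driver will report its peer's interface index directly, which is a quick way to pair the two ends:

[lihui@2018 ~]$ sudo ethtool -S vethb5bcc47
# expect "peer_ifindex: 4" here, matching the "@if4" suffix that ip a shows for vethb5bcc47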
To dig further, first get into the newly created container, directly via the attach command
[lihui@2018 ~]$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
d414e88d18b1        ubuntu:17.10        "/bin/bash"         12 minutes ago      Up 12 minutes                           quirky_feynman
[lihui@2018 ~]$ sudo docker attach quirky_feynman
root@d414e88d18b1:/#
root@d414e88d18b1:/#
Once inside, the awkward part is that many system tools are missing, since this is a bare ubuntu image. I tentatively tried apt-get and, surprisingly, it worked, which means the container's network already reaches the outside world. That only made me more curious.
root@d414e88d18b1:/# apt-get update
root@d414e88d18b1:/# apt-get install iproute iputils-ping
We mainly need the ip and ping commands. Once they are installed, check the network inside the container: there is only one interface besides loopback, and it has been assigned an IP address.
root@d414e88d18b1:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
Look closely at the interface index: it is 4. Unsurprisingly, this eth0 should be the other half of the same veth pair whose host-side interface is attached to the Linux Bridge. Look at the host network again.
[lihui@2018 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:0c:29:ff:bb:c7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.226.191/24 brd 192.168.226.255 scope global dynamic ens33
       valid_lft 1749sec preferred_lft 1749sec
    inet6 fe80::b6df:a4cc:9425:eeab/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:a1:4c:bc:f9 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:a1ff:fe4c:bcf9/64 scope link
       valid_lft forever preferred_lft forever
5: vethb5bcc47@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
    link/ether ca:90:80:17:7b:85 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c890:80ff:fe17:7b85/64 scope link
       valid_lft forever preferred_lft forever
Also look at the routing table inside the container
root@d414e88d18b1:/# ip r
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2
And the host's routing table
[lihui@2018 ~]$ ip r
default via 192.168.226.2 dev ens33 proto static metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.226.0/24 dev ens33 proto kernel scope link src 192.168.226.191 metric 100
Now it is quite clear: eth0 inside the container and vethb5bcc47 on the host are the two ends of the same veth pair, created when the container was started. One end was moved into the container as eth0 and given an IP address in the same subnet as the host's bridge; the other end was attached to the host's Linux Bridge. Because they are a pair, every packet leaving eth0 in the container reaches the Linux Bridge through the veth pair, where it can be forwarded at layer 2 or layer 3. Specifically:
For layer-2 forwarding between containers: each container gets an eth0 in the same subnet as docker0, each reaches the Linux Bridge through its own veth pair, and the bridge switches the traffic directly at layer 2.
For a container reaching the outside world via layer-3 routing: in the outbound direction, the container's default route points at 172.17.0.1, which is directly reachable through the veth pair, so the packet lands on the bridge; from there the host forwards it out via its own default route (here the VMware Fusion VM NATs it out, so the exact next hop can be ignored), exactly as the Linux VM itself reaches the Internet through NAT. In the inbound direction, once a packet arrives on the host (the Linux VM), the second route matches: packets destined for 172.17.0.0/16 go out through docker0 as a directly connected route, and are then forwarded at layer 2 by MAC address to the container's eth0. A rough way to watch this path is sketched below.
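The following is illustrative only: 8.8.8.8 is just an arbitrary external address, and tcpdump may need to be installed on the host first.

# the host must be forwarding between docker0 and ens33; this should print "net.ipv4.ip_forward = 1"
[lihui@2018 ~]$ sysctl net.ipv4.ip_forward

# ping something outside from the container...
root@d414e88d18b1:/# ping -c 3 8.8.8.8

# ...while capturing on the host: on docker0 the source is still 172.17.0.2,
# whereas on ens33 it has already been rewritten to 192.168.226.191 by the NAT rule discussed next
[lihui@2018 ~]$ sudo tcpdump -ni docker0 icmp
[lihui@2018 ~]$ sudo tcpdump -ni ens33 icmp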
A simple diagram makes this easy to understand
If you are interested in the address translation, check the iptables NAT rules on the host; as expected, the POSTROUTING chain is where the relevant rule lives
[lihui@2018 ~]$ sudo iptables -t nat -vnL | grep 172.17 -C 3
Chain POSTROUTING (policy ACCEPT 171 packets, 15046 bytes)
 pkts bytes target                    prot opt in     out       source               destination
   10   641 MASQUERADE                all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0
  229 19520 POSTROUTING_direct        all  --  *      *         0.0.0.0/0            0.0.0.0/0
  229 19520 POSTROUTING_ZONES_SOURCE  all  --  *      *         0.0.0.0/0            0.0.0.0/0
  229 19520 POSTROUTING_ZONES         all  --  *      *         0.0.0.0/0            0.0.0.0/0
Notice the rule in the middle: it sits in the POSTROUTING chain, but its target is MASQUERADE (address masquerading) rather than a plain SNAT. An SNAT rule in POSTROUTING normally specifies the IP address that the source should be rewritten to; here none is specified, and the address of the current outgoing interface is picked up automatically for the NAT. MASQUERADE is aimed mainly at scenarios such as dial-up, where the public IP changes over time: because the address is obtained automatically, the translation rule does not have to change when the address does.
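For reference, the rule in that output is equivalent to something like the following, which the Docker daemon installs by itself (a sketch only; there is no need to add it again by hand on a working setup):

# masquerade everything coming from the container subnet, unless it is leaving
# through docker0 itself (container-to-container traffic stays untouched)
[lihui@2018 ~]$ sudo iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE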
But the thing I care about most is still missing: the network namespace. On the host, ip netns list shows no namespaces at all. ip netns reads from the /var/run/netns directory, which does not even exist at this point, so a few steps are needed.
First, find the container's PID; either of the two methods below works
[lihui@2018 ~]$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
d414e88d18b1        ubuntu:17.10        "/bin/bash"         2 hours ago         Up 2 hours                              quirky_feynman
[lihui@2018 ~]$ sudo docker inspect -f '{{.State.Pid}}' quirky_feynman
1661
and
[lihui@2018 ~]$ cat /sys/fs/cgroup/memory/docker/d414e88d18b11ca3ce997617d895caa321a3ec2acfad6be43bb1507084ccbcf9/cgroup.procs
1661
Then create the /var/run/netns directory
[lihui@2018 ~]$ sudo mkdir /var/run/netns
[lihui@2018 ~]$ ls -l /var/run/netns/
[lihui@2018 ~]$
Create a symlink to the container's net namespace; note that the link is named after the container, quirky_feynman
[lihui@2018 ~]$ sudo ls -l /proc/1661/ns/
lrwxrwxrwx. 1 root root 0 3月 10 17:07 ipc -> ipc:[4026532628]
lrwxrwxrwx. 1 root root 0 3月 10 17:07 mnt -> mnt:[4026532626]
lrwxrwxrwx. 1 root root 0 3月 10 14:39 net -> net:[4026532631]
lrwxrwxrwx. 1 root root 0 3月 10 17:07 pid -> pid:[4026532629]
lrwxrwxrwx. 1 root root 0 3月 10 17:07 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 3月 10 17:07 uts -> uts:[4026532627]
[lihui@2018 ~]$ sudo ln -s /proc/1661/ns/net /var/run/netns/quirky_feynman
Query the namespaces; at last it shows up
[lihui@2018 ~]$ ip netns list
quirky_feynman
[lihui@2018 ~]$ ip netns exec quirky_feynman ip a
Cannot open network namespace "quirky_feynman": Permission denied
[lihui@2018 ~]$ sudo ip netns exec quirky_feynman ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
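As an aside, util-linux's nsenter can reach the same network namespace directly by PID, without creating the symlink at all (a sketch, reusing the PID 1661 found above):

# -t picks the target PID, -n enters its network namespace;
# the output should match the eth0@if5 view above
[lihui@2018 ~]$ sudo nsenter -t 1661 -n ip a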
Now the whole flow is clear: after Docker is installed, it creates a Linux Bridge, docker0. When a container is created, a veth pair is created along with it: one end is moved into the container's network namespace and given an IP address in the same subnet as the host's Linux Bridge docker0; the other end is attached to docker0 and needs no IP address, serving purely as a conduit so that packets can flow between it and the directly connected eth0 inside the container.
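To convince yourself there is no magic involved, the same wiring can be reproduced by hand with a throwaway namespace (a sketch; the names testns, veth-host and veth-cont and the address 172.17.0.100 are made up for illustration, and the address must not collide with one Docker has already handed out):

# create a namespace and a veth pair
[lihui@2018 ~]$ sudo ip netns add testns
[lihui@2018 ~]$ sudo ip link add veth-host type veth peer name veth-cont

# move one end into the namespace, attach the other end to docker0
[lihui@2018 ~]$ sudo ip link set veth-cont netns testns
[lihui@2018 ~]$ sudo brctl addif docker0 veth-host
[lihui@2018 ~]$ sudo ip link set veth-host up

# give the namespace end an address in the docker0 subnet and a default route via docker0
[lihui@2018 ~]$ sudo ip netns exec testns ip link set lo up
[lihui@2018 ~]$ sudo ip netns exec testns ip link set veth-cont up
[lihui@2018 ~]$ sudo ip netns exec testns ip addr add 172.17.0.100/16 dev veth-cont
[lihui@2018 ~]$ sudo ip netns exec testns ip route add default via 172.17.0.1

# it now behaves just like a container's eth0
[lihui@2018 ~]$ sudo ip netns exec testns ping -c 2 172.17.0.1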