Neutron Port: binding_failed

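When a Neutron port is in use, or even while an instance is being created with a network attached, binding_failed sometimes shows up. If it happens during instance creation, the instance ends up in ERROR state; if only the port's binding:vif_type field turns into binding_failed, connectivity breaks. Either way something has gone wrong. Common causes include the Open vSwitch agent not being alive or not having been restarted, a required bridge device being missing, and so on.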

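Yesterday many nodes received a kernel update with a lot of parameter changes, so every one of those nodes had to be rebooted. Afterwards some upper-layer users reported that private networking worked for some instances and not for others. My first guess was that the port status had not become ACTIVE, for example that it had timed out and stayed in BUILD.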

Looking at the port, the details are as follows

$ neutron port-show 72904524-9814-4212-b714-0ff0763067a4
+-----------------------+------------------------------------------------------------------------------------+
| Field                 | Value                                                                              |
+-----------------------+------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                               |
| allowed_address_pairs |                                                                                    |
| binding:capabilities  | {"port_filter": true}                                                              |
| binding:host_id       | 10-10-10-158                                                                       |
| binding:profile       | {}                                                                                 |
| binding:vif_type      | binding_failed                                                                     |
| device_id             | 071c9aa0-eef9-4410-97b0-a3a79de790a7                                               |
| device_owner          | compute:lihui.openstack                                                            |
| extra_dhcp_opts       |                                                                                    |
| filtered_ports        |                                                                                    |
| fixed_ips             | {"subnet_id": "84aea34c-8e1c-4841-b392-3cbaca982685", "ip_address": "10.10.188.9"} |
| id                    | 72904524-9814-4212-b714-0ff0763067a4                                               |
| ingress_max_rate      |                                                                                    |
| mac_address           | fa:16:3e:1a:3d:87                                                                  |
| max_bm_pps            |                                                                                    |
| max_pps               |                                                                                    |
| max_rate              |                                                                                    |
| name                  |                                                                                    |
| network_id            | c108345e-494e-4829-959d-8a0a3a129aa8                                               |
| port_security_enabled |                                                                                    |
| prefer_rate           |                                                                                    |
| status                | ACTIVE                                                                             |
| support_azs           | lihui                                                                              |
| tenant_id             | 48bc786dc981495b94d254410b46ede8                                                   |
+-----------------------+------------------------------------------------------------------------------------+

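So the status was fine after all; the port had gone into binding_failed. Thinking back, the cause was most likely the node reboot: the devices of some instances hit an error when they were re-bound. To fix it, the binding has to be pushed back to ovs. A simple way to do that is to change the port's admin_state_up field: set it to false first, then set it back to true. After setting it to false: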

$ neutron port-show 72904524-9814-4212-b714-0ff0763067a4
+-----------------------+------------------------------------------------------------------------------------+
| Field                 | Value                                                                              |
+-----------------------+------------------------------------------------------------------------------------+
| admin_state_up        | False                                                                              |
| allowed_address_pairs |                                                                                    |
| binding:capabilities  | {"port_filter": true}                                                              |
| binding:host_id       | 10-10-10-158                                                                       |
| binding:profile       | {}                                                                                 |
| binding:vif_type      | ovs                                                                                |
| device_id             | 071c9aa0-eef9-4410-97b0-a3a79de790a7                                               |
| device_owner          | compute:lihui.openstack                                                            |
| extra_dhcp_opts       |                                                                                    |
| filtered_ports        |                                                                                    |
| fixed_ips             | {"subnet_id": "84aea34c-8e1c-4841-b392-3cbaca982685", "ip_address": "10.10.188.9"} |
| id                    | 72904524-9814-4212-b714-0ff0763067a4                                               |
| ingress_max_rate      |                                                                                    |
| mac_address           | fa:16:3e:1a:3d:87                                                                  |
| max_bm_pps            |                                                                                    |
| max_pps               |                                                                                    |
| max_rate              |                                                                                    |
| name                  |                                                                                    |
| network_id            | c108345e-494e-4829-959d-8a0a3a129aa8                                               |
| port_security_enabled |                                                                                    |
| prefer_rate           |                                                                                    |
| status                | DOWN                                                                               |
| support_azs           | lihui                                                                              |
| tenant_id             | 48bc786dc981495b94d254410b46ede8                                                   |
+-----------------------+------------------------------------------------------------------------------------+

While it is false the device stops forwarding packets. Capturing on the TAP device, for example, shows nothing at all

~$ sudo tcpdump -i  tap72904524-98 icmp -en
tcpdump: WARNING: tap72904524-98: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap72904524-98, link-type EN10MB (Ethernet), capture size 65535 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

Next, look at the device details in OVS (an excerpt in the format printed by ovs-vsctl show)

Port "tap72904524-98"
tag: 4095
Interface "tap72904524-98"

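Surprisingly, the VLAN tag of the OVS port has been set to 4095. Packets carrying this ID should be dropped, so check whether the flow table has a drop rule for the port. First find the ofport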

~$ sudo ovs-vsctl list interface tap72904524-98
_uuid               : f900b05d-d8c0-4c58-ad49-c0f64366c063
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : full
external_ids        : {attached-mac="fa:16:3e:1a:3d:87", iface-id="72904524-9814-4212-b714-0ff0763067a4", iface-status=active, vm-id="071c9aa0-eef9-4410-97b0-a3a79de790a7"}
ifindex             : 587
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 1
link_speed          : 10000000
link_state          : up
mac                 : []
mac_in_use          : "fe:16:3e:1a:3d:87"
mtu                 : 1500
name                : "tap72904524-98"
ofport              : 398
ofport_request      : []
options             : {}
other_config        : {enable_packet_hook="false"}
statistics          : {collisions=0, rx_bytes=10055624, rx_crc_err=0, rx_dropped=0, rx_errors=0, rx_frame_err=0, rx_over_err=0, rx_packets=59688, tx_bytes=11015008, tx_dropped=785, tx_errors=0, tx_packets=171204}
status              : {driver_name=tun, driver_version="1.6", firmware_version=""}
type                : ""
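
Picking the ofport out of that record by eye gets tedious when many ports are affected. A minimal sketch in Python, assuming the "column : value" text layout shown above; the helper name parse_ovs_record is made up for illustration and is not part of any OVS tooling:

```python
def parse_ovs_record(text):
    """Parse 'column : value' lines (as printed by `ovs-vsctl list`) into a dict."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only: values such as MAC addresses contain colons.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


# A few lines copied from the record above.
sample = """\
name                : "tap72904524-98"
ofport              : 398
mac_in_use          : "fe:16:3e:1a:3d:87"
"""

record = parse_ovs_record(sample)
ofport = int(record["ofport"])
print(ofport)
```

If only the port number is needed, ovs-vsctl get Interface tap72904524-98 ofport should return it directly without any parsing.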

Check the flow rules

~$ sudo ovs-ofctl dump-flows br-int in_port=398
NXST_FLOW reply (xid=0x4):
cookie=0x583fdd1f00000001, duration=427.391s, table=0, n_packets=118, n_bytes=5336, idle_age=11, priority=2,in_port=398 actions=drop

As shown, everything arriving on this port is dropped
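
The same check can be automated when several ports need to be inspected. The sketch below is only an illustration: it assumes the ovs-ofctl dump-flows text format shown above with a numeric in_port, and the function name drop_flows is hypothetical.

```python
import re


def drop_flows(dump_output, ofport):
    """Return the flow lines matching in_port=<ofport> whose only action is drop."""
    matches = []
    for line in dump_output.splitlines():
        line = line.strip()
        if "actions=" not in line:
            continue  # skip the "NXST_FLOW reply" header and blank lines
        match_part, actions = line.rsplit("actions=", 1)
        # \b keeps in_port=398 from also matching in_port=3981.
        same_port = re.search(r"\bin_port=%d\b" % ofport, match_part)
        if same_port and actions.strip() == "drop":
            matches.append(line)
    return matches


# Output copied from the dump-flows run above.
sample = """\
NXST_FLOW reply (xid=0x4):
cookie=0x583fdd1f00000001, duration=427.391s, table=0, n_packets=118, n_bytes=5336, idle_age=11, priority=2,in_port=398 actions=drop
"""

for flow in drop_flows(sample, 398):
    print(flow)
```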

Now change it back to true

$ neutron port-show 72904524-9814-4212-b714-0ff0763067a4
+-----------------------+------------------------------------------------------------------------------------+
| Field                 | Value                                                                              |
+-----------------------+------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                               |
| allowed_address_pairs |                                                                                    |
| binding:capabilities  | {"port_filter": true}                                                              |
| binding:host_id       | 10-10-10-158                                                                       |
| binding:profile       | {}                                                                                 |
| binding:vif_type      | ovs                                                                                |
| device_id             | 071c9aa0-eef9-4410-97b0-a3a79de790a7                                               |
| device_owner          | compute:lihui.openstack                                                            |
| extra_dhcp_opts       |                                                                                    |
| filtered_ports        |                                                                                    |
| fixed_ips             | {"subnet_id": "84aea34c-8e1c-4841-b392-3cbaca982685", "ip_address": "10.10.188.9"} |
| id                    | 72904524-9814-4212-b714-0ff0763067a4                                               |
| ingress_max_rate      |                                                                                    |
| mac_address           | fa:16:3e:1a:3d:87                                                                  |
| max_bm_pps            |                                                                                    |
| max_pps               |                                                                                    |
| max_rate              |                                                                                    |
| name                  |                                                                                    |
| network_id            | c108345e-494e-4829-959d-8a0a3a129aa8                                               |
| port_security_enabled |                                                                                    |
| prefer_rate           |                                                                                    |
| status                | ACTIVE                                                                             |
| support_azs           | lihui                                                                              |
| tenant_id             | 48bc786dc981495b94d254410b46ede8                                                   |
+-----------------------+------------------------------------------------------------------------------------+

Checking OVS again, the tag is no longer 4095

Port "tap72904524-98"
tag: 34
Interface "tap72904524-98"

Looking at the flow table again, the drop rule is gone

~$ sudo ovs-ofctl dump-flows br-int in_port=398
NXST_FLOW reply (xid=0x4):

With that, the port is back to normal

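Although the reported problem was binding_failed, toggling the admin_state_up field and checking the flow rules alongside it is what made it possible to tell whether packets were being forwarded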
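
The false-then-true toggle can be scripted when many ports are stuck. This is a sketch under stated assumptions rather than a tested procedure: client is expected to behave like a python-neutronclient v2.0 Client (update_port(port_id, body) and show_port(port_id)), and rebind_port and FakeClient are names invented here. FakeClient only imitates what was observed above (the binding turning into ovs after the update) so that the example runs offline.

```python
def rebind_port(client, port_id):
    """Set admin_state_up to False, then True, and return the resulting vif_type."""
    for state in (False, True):
        client.update_port(port_id, {"port": {"admin_state_up": state}})
    port = client.show_port(port_id)["port"]
    return port["binding:vif_type"]


class FakeClient:
    """Offline stand-in for a neutron client; not real server behaviour."""

    def __init__(self):
        self.calls = []
        self.port = {"admin_state_up": True, "binding:vif_type": "binding_failed"}

    def update_port(self, port_id, body):
        self.calls.append(body["port"]["admin_state_up"])
        self.port.update(body["port"])
        # Assumption for the demo: any update makes the server retry the binding.
        self.port["binding:vif_type"] = "ovs"
        return {"port": self.port}

    def show_port(self, port_id):
        return {"port": self.port}


fake = FakeClient()
result = rebind_port(fake, "72904524-9814-4212-b714-0ff0763067a4")
print(result)
```

Against a real deployment it is probably wise to pause between the two updates so the agent has time to process each one, and to confirm the outcome with port-show and dump-flows as was done by hand above.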

Notes:

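I looked at port a040cc66-07c4-43a2-a443-ec5f7c7e5899 on nova 11. The main causes are roughly as follows:
1. nova 11 was rebooted at around 17:32:01 and did not come back until around 17:35:38 (recorded as 12:35:38, presumably a typo given the 17:35:59 log entries below), probably because of some grub configuration change. In any case it was slow to come up, and since the agent down time is 75s, neutron server considered the agent down.
2. A client then requested an update of this port. Because the agent was down, that API call set the binding to binding_failed.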
2016-12-12 17:35:59.936 161755 INFO neutron.wsgi [req-4ddbf8ae-c88a-4af1-9c45-2f98e28e213f dac2288842ec4bcaa1550372df94d21e 82890c8d4c4e4efaaf6ec4bae3007146] 10.185.0.123 "PUT /v2.0/ports/a040cc66-07c4-43a2-a443-ec5f7c7e5899.json HTTP/1.1" status: 200 len: 931 time: 0.1421850
Meanwhile, the nova-compute log on nova11 shows:
2016-12-12 17:35:59.866 5134 DEBUG neutronclient.client [-]
REQ: curl -i http://10.185.0.253:9696/v2.0/ports/a040cc66-07c4-43a2-a443-ec5f7c7e5899.json -X PUT -H "X-Auth-Token: {SHA1}01f663ec1cfaeaf09791d7563e03ac8bf02620f6" -H "Content-Type: application/json" -H "Accept: application/json" -H "User-Agent: python-neutronclient" -d '{"port": {"binding:host_id": "pubt1-nova11.yq.163.org"}}'
http_log_req /usr/lib/python2.7/dist-packages/neutronclient/common/utils.py:212
2016-12-12 17:36:00.010 5134 DEBUG neutronclient.client [-] RESP:{'date': 'Mon, 12 Dec 2016 09:35:59 GMT', 'status': '200', 'content-length': '807', 'content-type': 'application/json; charset=UTF-8'} {"port": {"filtered_ports": null, "ingress_max_rate": null, "allowed_address_pairs": [], "prefer_rate": null, "extra_dhcp_opts": [], "max_bm_pps": null, "device_owner": "compute:pubt1.rds3", "binding:profile": {}, "port_security_enabled": null, "fixed_ips": [{"subnet_id": "fd2b7d96-d69c-4208-afd1-e0f11c014f07", "ip_address": "10.18.194.144"}], "id": "a040cc66-07c4-43a2-a443-ec5f7c7e5899", "binding:vif_type": "binding_failed", "binding:capabilities": {"port_filter": false}, "mac_address": "fa:16:3e:96:a8:5c", "status": "ACTIVE", "binding:host_id": "pubt1-nova11.yq.163.org", "max_rate": null, "max_pps": null, "device_id": "65580549-2b35-4f65-8ee3-618eed34e233", "name": "", "admin_state_up": true, "network_id": "723a2210-3451-48bf-8e21-a01496a82970", "tenant_id": "e8ca93f83b2c4a88822e194e7d4cac51"}}
http_log_resp /usr/lib/python2.7/dist-packages/neutronclient/common/utils.py:218
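So the port update request was issued by nova.
3. The four neutron server nodes received the first report state at 17:36:00 at the earliest, and only then could the agent be marked alive.
4. But the ovs agent on nova 11 fetched the device state (get_device) at 17:36:00 (recorded as 16:36:00, presumably a typo). Because the port was in binding_failed state, the agent concluded that the port was not defined on the server and did not process it any further.
The root cause therefore comes down to this: when a machine stays down for too long, nova's port update does not check for a binding_failed result, and neutron returns 200 (success) anyway, so binding_failed is persisted and can only be fixed by a manual port_update that triggers a rebind.
nova and the ovs agent recover independently, with no enforced ordering between them. At present nova only waits for the number of flows in table=0 of ovs br-int to exceed 1 as its recovery condition, and that condition may not be sufficient.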
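
The missing check on nova's side amounts to looking at binding:vif_type in the response body instead of trusting the HTTP 200. A minimal sketch of such a check, assuming the JSON shape logged above; the function name binding_failed is invented for illustration and this is not nova's actual code:

```python
import json


def binding_failed(response_body):
    """Return True if a port-update response body reports a failed binding."""
    port = json.loads(response_body)["port"]
    return port.get("binding:vif_type") == "binding_failed"


# Trimmed copy of the response logged above (the HTTP status was 200).
body = (
    '{"port": {"id": "a040cc66-07c4-43a2-a443-ec5f7c7e5899", '
    '"binding:vif_type": "binding_failed", "status": "ACTIVE", '
    '"binding:host_id": "pubt1-nova11.yq.163.org", "admin_state_up": true}}'
)

if binding_failed(body):
    print("port update returned 200 but the binding failed; a rebind is needed")
```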
