A Neutron port can end up with binding:vif_type set to binding_failed, sometimes even while an instance is being created together with its network. If it happens during instance creation, the instance goes to ERROR; if only the port's binding:vif_type flips to binding_failed, connectivity on that port breaks. Either way it is a real problem. There are many possible causes, for example the Open vSwitch agent is not alive (or was never restarted), or some expected bridge (br-*) devices are missing, and so on.
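A quick way to rule out the "agent not alive" cause is to query the agent list. Here is a minimal sketch using python-neutronclient; the auth values are placeholders, not credentials from this environment:

# Sketch: list OVS agents that Neutron currently considers dead.
# The auth values below are placeholders for illustration only.
from neutronclient.v2_0 import client

neutron = client.Client(username='admin',
                        password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone:5000/v2.0')

for agent in neutron.list_agents()['agents']:
    if agent['agent_type'] == 'Open vSwitch agent' and not agent['alive']:
        # Any host printed here is a likely source of binding_failed ports.
        print(agent['host'], 'OVS agent is down')

The same check is available from the CLI as neutron agent-list.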
Yesterday a kernel update was rolled out to many nodes and a number of parameters were changed, so all of the nodes had to be rebooted. Afterwards some upper-layer users reported that some private networks worked while others did not. My first thought was that some ports' status might not have come back to ACTIVE, for example stuck in BUILD after a timeout.
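Before looking at a single port, it can help to sweep for ports that never came back. A rough sketch, reusing the client object from the snippet above and assuming this Neutron version accepts a status filter on ports:

# Sketch: find ports that are not ACTIVE after the reboot wave.
# 'neutron' is the client object built in the previous snippet;
# the status filter is assumed to be supported by this API version.
for status in ('BUILD', 'DOWN'):
    for port in neutron.list_ports(status=status)['ports']:
        print(port['id'], port['status'], port.get('binding:vif_type'))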
Looking at the port in question, its details were as follows:
$ neutron port-show 72904524-9814-4212-b714-0ff0763067a4
+-----------------------+------------------------------------------------------------------------------------+
| Field                 | Value                                                                              |
+-----------------------+------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                               |
| allowed_address_pairs |                                                                                    |
| binding:capabilities  | {"port_filter": true}                                                              |
| binding:host_id       | 10-10-10-158                                                                       |
| binding:profile       | {}                                                                                 |
| binding:vif_type      | binding_failed                                                                     |
| device_id             | 071c9aa0-eef9-4410-97b0-a3a79de790a7                                               |
| device_owner          | compute:lihui.openstack                                                            |
| extra_dhcp_opts       |                                                                                    |
| filtered_ports        |                                                                                    |
| fixed_ips             | {"subnet_id": "84aea34c-8e1c-4841-b392-3cbaca982685", "ip_address": "10.10.188.9"} |
| id                    | 72904524-9814-4212-b714-0ff0763067a4                                               |
| ingress_max_rate      |                                                                                    |
| mac_address           | fa:16:3e:1a:3d:87                                                                  |
| max_bm_pps            |                                                                                    |
| max_pps               |                                                                                    |
| max_rate              |                                                                                    |
| name                  |                                                                                    |
| network_id            | c108345e-494e-4829-959d-8a0a3a129aa8                                               |
| port_security_enabled |                                                                                    |
| prefer_rate           |                                                                                    |
| status                | ACTIVE                                                                             |
| support_azs           | lihui                                                                              |
| tenant_id             | 48bc786dc981495b94d254410b46ede8                                                   |
+-----------------------+------------------------------------------------------------------------------------+
So it was not the status that was wrong: the port's binding had failed. Thinking back, the cause must have been the node reboot; some instances' devices hit an error when their ports were re-bound. To fix it, the binding has to be pushed back to ovs, and a simple way to do that is to toggle the port's admin_state_up field: set it to False first, then back to True.
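For the record, the same toggle can be scripted through python-neutronclient instead of the CLI. A minimal sketch, reusing the client object from the first snippet (in the walkthrough below the two steps are done by hand so the intermediate state can be inspected):

# Sketch: re-trigger binding by toggling admin_state_up on the port.
# PORT_ID is the port shown above; 'neutron' is the client object
# from the first snippet.
import time

PORT_ID = '72904524-9814-4212-b714-0ff0763067a4'

neutron.update_port(PORT_ID, {'port': {'admin_state_up': False}})
time.sleep(5)  # give the agent a moment to process the change
neutron.update_port(PORT_ID, {'port': {'admin_state_up': True}})

print(neutron.show_port(PORT_ID)['port']['binding:vif_type'])

Setting admin_state_up to False first, the port now looks like this: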
$ neutron port-show 72904524-9814-4212-b714-0ff0763067a4
+-----------------------+------------------------------------------------------------------------------------+
| Field                 | Value                                                                              |
+-----------------------+------------------------------------------------------------------------------------+
| admin_state_up        | False                                                                              |
| allowed_address_pairs |                                                                                    |
| binding:capabilities  | {"port_filter": true}                                                              |
| binding:host_id       | 10-10-10-158                                                                       |
| binding:profile       | {}                                                                                 |
| binding:vif_type      | ovs                                                                                |
| device_id             | 071c9aa0-eef9-4410-97b0-a3a79de790a7                                               |
| device_owner          | compute:lihui.openstack                                                            |
| extra_dhcp_opts       |                                                                                    |
| filtered_ports        |                                                                                    |
| fixed_ips             | {"subnet_id": "84aea34c-8e1c-4841-b392-3cbaca982685", "ip_address": "10.10.188.9"} |
| id                    | 72904524-9814-4212-b714-0ff0763067a4                                               |
| ingress_max_rate      |                                                                                    |
| mac_address           | fa:16:3e:1a:3d:87                                                                  |
| max_bm_pps            |                                                                                    |
| max_pps               |                                                                                    |
| max_rate              |                                                                                    |
| name                  |                                                                                    |
| network_id            | c108345e-494e-4829-959d-8a0a3a129aa8                                               |
| port_security_enabled |                                                                                    |
| prefer_rate           |                                                                                    |
| status                | DOWN                                                                               |
| support_azs           | lihui                                                                              |
| tenant_id             | 48bc786dc981495b94d254410b46ede8                                                   |
+-----------------------+------------------------------------------------------------------------------------+
While admin_state_up is False, the device no longer forwards packets; capturing on the TAP device, for example, shows nothing at all:
~$ sudo tcpdump -i tap72904524-98 icmp -en
tcpdump: WARNING: tap72904524-98: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap72904524-98, link-type EN10MB (Ethernet), capture size 65535 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
Next, take a look at the device's details in the ovs-vsctl show output:
Port "tap72904524-98" tag: 4095 Interface "tap72904524-98"
The OVS port's VLAN tag has been set to 4095, the "dead VLAN" that the OVS agent assigns to ports it cannot bind, so packets on this port should be dropped. To check whether the flow table really does drop them, first find the port's ofport:
~$ sudo ovs-vsctl list interface tap72904524-98
_uuid               : f900b05d-d8c0-4c58-ad49-c0f64366c063
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : full
external_ids        : {attached-mac="fa:16:3e:1a:3d:87", iface-id="72904524-9814-4212-b714-0ff0763067a4", iface-status=active, vm-id="071c9aa0-eef9-4410-97b0-a3a79de790a7"}
ifindex             : 587
ingress_policing_burst: 0
ingress_policing_rate: 0
lacp_current        : []
link_resets         : 1
link_speed          : 10000000
link_state          : up
mac                 : []
mac_in_use          : "fe:16:3e:1a:3d:87"
mtu                 : 1500
name                : "tap72904524-98"
ofport              : 398
ofport_request      : []
options             : {}
other_config        : {enable_packet_hook="false"}
statistics          : {collisions=0, rx_bytes=10055624, rx_crc_err=0, rx_dropped=0, rx_errors=0, rx_frame_err=0, rx_over_err=0, rx_packets=59688, tx_bytes=11015008, tx_dropped=785, tx_errors=0, tx_packets=171204}
status              : {driver_name=tun, driver_version="1.6", firmware_version=""}
type                : ""
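Only two fields from that dump matter here: ofport (398) and, as seen earlier, the VLAN tag. They can also be fetched directly with ovs-vsctl get; a small Python sketch wrapping that (same interface name as above):

# Sketch: read the VLAN tag and ofport of the tap device via ovs-vsctl.
import subprocess

IFACE = 'tap72904524-98'

tag = subprocess.check_output(
    ['sudo', 'ovs-vsctl', 'get', 'Port', IFACE, 'tag'],
    universal_newlines=True).strip()
ofport = subprocess.check_output(
    ['sudo', 'ovs-vsctl', 'get', 'Interface', IFACE, 'ofport'],
    universal_newlines=True).strip()

print('tag=%s ofport=%s' % (tag, ofport))  # expect tag=4095 while the port is dead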
Now check the flow table rules:
~$ sudo ovs-ofctl dump-flows br-int in_port=398
NXST_FLOW reply (xid=0x4):
 cookie=0x583fdd1f00000001, duration=427.391s, table=0, n_packets=118, n_bytes=5336, idle_age=11, priority=2,in_port=398 actions=drop
Sure enough, everything arriving on this port is dropped.
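If you want to script that check, here is a rough sketch that shells out to ovs-ofctl and looks for a drop action on the port's flows (ofport 398 in this case):

# Sketch: check whether br-int still holds a drop rule for a given in_port.
import subprocess

OFPORT = 398

flows = subprocess.check_output(
    ['sudo', 'ovs-ofctl', 'dump-flows', 'br-int', 'in_port=%d' % OFPORT],
    universal_newlines=True)
dropped = any('actions=drop' in line for line in flows.splitlines())
print('drop rule present: %s' % dropped)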
Then change admin_state_up back to True:
$ neutron port-show 72904524-9814-4212-b714-0ff0763067a4
+-----------------------+------------------------------------------------------------------------------------+
| Field                 | Value                                                                              |
+-----------------------+------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                               |
| allowed_address_pairs |                                                                                    |
| binding:capabilities  | {"port_filter": true}                                                              |
| binding:host_id       | 10-10-10-158                                                                       |
| binding:profile       | {}                                                                                 |
| binding:vif_type      | ovs                                                                                |
| device_id             | 071c9aa0-eef9-4410-97b0-a3a79de790a7                                               |
| device_owner          | compute:lihui.openstack                                                            |
| extra_dhcp_opts       |                                                                                    |
| filtered_ports        |                                                                                    |
| fixed_ips             | {"subnet_id": "84aea34c-8e1c-4841-b392-3cbaca982685", "ip_address": "10.10.188.9"} |
| id                    | 72904524-9814-4212-b714-0ff0763067a4                                               |
| ingress_max_rate      |                                                                                    |
| mac_address           | fa:16:3e:1a:3d:87                                                                  |
| max_bm_pps            |                                                                                    |
| max_pps               |                                                                                    |
| max_rate              |                                                                                    |
| name                  |                                                                                    |
| network_id            | c108345e-494e-4829-959d-8a0a3a129aa8                                               |
| port_security_enabled |                                                                                    |
| prefer_rate           |                                                                                    |
| status                | ACTIVE                                                                             |
| support_azs           | lihui                                                                              |
| tenant_id             | 48bc786dc981495b94d254410b46ede8                                                   |
+-----------------------+------------------------------------------------------------------------------------+
Looking at ovs again, the tag is no longer 4095:
Port "tap72904524-98" tag: 34 Interface "tap72904524-98"
And checking the flow table again, the drop rule has disappeared:
~$ sudo ovs-ofctl dump-flows br-int in_port=398
NXST_FLOW reply (xid=0x4):
~$ sudo ovs-ofctl dump-flows br-int in_port=398
NXST_FLOW reply (xid=0x4):
~$ sudo ovs-ofctl dump-flows br-int in_port=398
NXST_FLOW reply (xid=0x4):
With that, connectivity is back to normal.
So although the symptom was binding_failed, toggling the admin_state_up field is enough to re-trigger the binding, and inspecting the flow table rules along the way tells you whether packets are actually being forwarded.
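Putting the pieces together, here is a rough sketch that finds binding_failed ports bound to a given host and re-triggers their binding. It reuses the client object from the first snippet and needs admin credentials, since the binding:* attributes and the binding:host_id filter are normally admin-only:

# Sketch: re-trigger binding for every binding_failed port on one host.
# HOST is the binding:host_id of the rebooted node shown above;
# 'neutron' is the client object from the first snippet.
import time

HOST = '10-10-10-158'

for port in neutron.list_ports(**{'binding:host_id': HOST})['ports']:
    if port.get('binding:vif_type') != 'binding_failed':
        continue
    print('re-binding port %s' % port['id'])
    neutron.update_port(port['id'], {'port': {'admin_state_up': False}})
    time.sleep(5)  # let the agent pick up the change before flipping back
    neutron.update_port(port['id'], {'port': {'admin_state_up': True}})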
Note:
I also looked at port a040cc66-07c4-43a2-a443-ec5f7c7e5899 on nova 11; the main causes are roughly as follows:

1. nova 11 was rebooted at around 17:32:01 and only came back up at around 17:35:38 (probably slowed down by some grub configuration change?). In any case it came up slowly, and with agent_down_time at 75s, the neutron server concluded that its agent was down.

2. A client then requested an update of this port; because the agent was down, the API set its binding to binding_failed:

2016-12-12 17:35:59.936 161755 INFO neutron.wsgi [req-4ddbf8ae-c88a-4af1-9c45-2f98e28e213f dac2288842ec4bcaa1550372df94d21e 82890c8d4c4e4efaaf6ec4bae3007146] 10.185.0.123 "PUT /v2.0/ports/a040cc66-07c4-43a2-a443-ec5f7c7e5899.json HTTP/1.1" status: 200 len: 931 time: 0.1421850

The nova-compute log on nova 11 shows where that request came from:

2016-12-12 17:35:59.866 5134 DEBUG neutronclient.client [-] REQ: curl -i http://10.185.0.253:9696/v2.0/ports/a040cc66-07c4-43a2-a443-ec5f7c7e5899.json -X PUT -H "X-Auth-Token: {SHA1}01f663ec1cfaeaf09791d7563e03ac8bf02620f6" -H "Content-Type: application/json" -H "Accept: application/json" -H "User-Agent: python-neutronclient" -d '{"port": {"binding:host_id": "pubt1-nova11.yq.163.org"}}' http_log_req /usr/lib/python2.7/dist-packages/neutronclient/common/utils.py:212
2016-12-12 17:36:00.010 5134 DEBUG neutronclient.client [-] RESP:{'date': 'Mon, 12 Dec 2016 09:35:59 GMT', 'status': '200', 'content-length': '807', 'content-type': 'application/json; charset=UTF-8'} {"port": {"filtered_ports": null, "ingress_max_rate": null, "allowed_address_pairs": [], "prefer_rate": null, "extra_dhcp_opts": [], "max_bm_pps": null, "device_owner": "compute:pubt1.rds3", "binding:profile": {}, "port_security_enabled": null, "fixed_ips": [{"subnet_id": "fd2b7d96-d69c-4208-afd1-e0f11c014f07", "ip_address": "10.18.194.144"}], "id": "a040cc66-07c4-43a2-a443-ec5f7c7e5899", "binding:vif_type": "binding_failed", "binding:capabilities": {"port_filter": false}, "mac_address": "fa:16:3e:96:a8:5c", "status": "ACTIVE", "binding:host_id": "pubt1-nova11.yq.163.org", "max_rate": null, "max_pps": null, "device_id": "65580549-2b35-4f65-8ee3-618eed34e233", "name": "", "admin_state_up": true, "network_id": "723a2210-3451-48bf-8e21-a01496a82970", "tenant_id": "e8ca93f83b2c4a88822e194e7d4cac51"}} http_log_resp /usr/lib/python2.7/dist-packages/neutronclient/common/utils.py:218

So the port update request was issued by nova.

3. The earliest report_state received by any of the four neutron server nodes arrived at 17:36:00; only at that point could the agent be marked alive again.

4. Meanwhile the ovs agent on nova 11 fetched the device's state (get_device) at around 17:36:00; because the port was in the binding_failed state, the agent treated it as not defined on the server and never processed it again.

In the end, the root cause is that when a machine stays powered off for too long, nova's port update does not check for a binding_failed result (and neutron returns 200 as if it had succeeded), so the failed binding is persisted and can only be cleared by a manual port update that re-triggers the binding. Since nova and the ovs agent recover independently, with no required ordering between them, and nova currently only waits for the number of flows in br-int table=0 to exceed 1 as its recovery condition, that condition may not be sufficient.
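For reference, that recovery condition amounts to counting the flows in table=0 of br-int. A rough sketch of the check as stated above (an approximation for illustration, not nova's actual code):

# Sketch: the "more than one flow in br-int table=0" readiness check
# described above; an approximation, not nova's actual implementation.
import subprocess

out = subprocess.check_output(
    ['sudo', 'ovs-ofctl', 'dump-flows', 'br-int', 'table=0'],
    universal_newlines=True)
# The first line is the NXST_FLOW reply header; real entries contain "cookie=".
flow_count = len([line for line in out.splitlines() if 'cookie=' in line])
print('br-int table=0 flows: %d, considered recovered: %s'
      % (flow_count, flow_count > 1))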