OpenStack计算资源创建异常日志定位

在云计算虚拟机创建过程中时常会出现异常或者创建失败,而这显然不能跟一般的普通应用软件一样看执行结果,因为往往涉及到很多不同的项目组,而且数量十分庞大,因此会每一个操作都会通过详细的日志记录下来,方便定位各种异常,OpenStack就做到了这一点

比如现在使用方想看看几天前有一个虚拟机出错,但是虚拟机已经删掉了,现在需要知道错误原因,通过日志也是可以查的

用户A:

1:虚拟机已删了

2:租户为7bb024f5e216466d822e426f0444e84d

3:调用创建API的时间点大约在2016-09-09 13:59:36,304

4:availability-zone能够提供

看到这里,与平时出错相比,难点在于虚拟机已不存在,而且不知道UUID,或者说根本不清楚到底有没有成功创建,是API层出错,调度出错还是最终创建计算的时候出错,这都是无从知晓,用户A给出的信息,基本都是创建资源的时候传入的参数,对于定位问题只能协助,无法起到决定性作用,更麻烦的是调用API时间已经过去了一周时间,因此就只有通过日志来进行问题的定位

方法基本如下:

1:首先根据租户ID和调用API的时间戳在该环境下寻找相关API日志,但是要注意API节点不止ONLY ONE,因此每个nova-api节点都要找遍,创建计算资源是POST方法,结合租户ID来进行查询,由于日志都有分割,根据保存的文件时间戳定位到某一个gz文件来进行查询

2016-09-09 13:59:36.323 68188 DEBUG keystoneclient.middleware.auth_token [-] Returning cached token 49463802b1054e51a5b78adc8bf7e262 _cache_get /usr/lib/
python2.7/dist-packages/keystoneclient/middleware/auth_token.py:1050
2016-09-09 13:59:36.323 68188 DEBUG keystoneclient.middleware.auth_token [-] Received request from user: 654dd3cd139543009753ad80a2401881 with project_id
 : 7bb024f5e216466d822e426f0444e84d and roles: Member  _build_user_headers /usr/lib/python2.7/dist-packages/keystoneclient/middleware/auth_token.py:950
2016-09-09 13:59:36.324 68188 DEBUG routes.middleware [-] Matched POST /7bb024f5e216466d822e426f0444e84d/servers __call__ /usr/lib/python2.7/dist-package
s/routes/middleware.py:100
2016-09-09 13:59:36.325 68188 DEBUG routes.middleware [-] Route path: '/{project_id}/servers', defaults: {'action': u'create', 'controller': } __call__ /usr/lib/python2.7/dist-packages/routes/middleware.py:102
2016-09-09 13:59:36.325 68188 DEBUG routes.middleware [-] Match dict: {'action': u'create', 'controller': , 'project_id': u'7bb024f5e216466d822e426f0444e84d'} __call__ /usr/lib/python2.7/dist-packages/routes/middleware.py:103
2016-09-09 13:59:36.325 68188 DEBUG nova.api.openstack.wsgi [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d82
2e426f0444e84d] Action: 'create', body: {"os:scheduler_hints":{"query":"[\"in\",\"$service.availability_zone\",\”xxxx.xxxx3\"]"},"server":{"name":"
xxxx-xxxx-7bb024f5e216-1473400777983-d72efbc3","imageRef":"e2f68bbd-25ab-4d65-a2bc-0ac4948f1deb","flavorRef":"92","personality":[{"path":"/etc/vm_monitor
/info","contents":"eyJzZXJ2aWNlIjoiTkNFIiwib3JpX3VzZXIiOiI3YmIwMjRmNWUyMTY0NjZkODIyZTQyNmYwNDQ0ZTg0ZCIsInJlc291cmNlX3R5cGUiOiJ2bSIsInJlc291cmNlX2lkIjoicHViY2ktTkNFLTdiYjAyNGY1ZTIxNi0xNDczNDAwNzc3OTgzLWQ3MmVmYmMzIiwiYWdncmVnYXRpb25faXRlbXMiOnt9LCJhY2Nlc3NLZXkiOiJhMjNlYjczMDU0ZTM0NmJlOTFlNmM2Zjg1NmFjY2Y0YSIsImFjY2Vzc1NlY3JldCI6ImJhNTU0OWNjNzUyNzQ1MTZiNzZkMGM1YWM4OTc5YzEwIiwibW9uaXRvcldlYlNlcnZlclVybCI6Imh0dHA6Ly8xMC4xODAuMTAuODA6ODE4NiJ9"}],"networks":[{"uuid":"04afc300-3beb-430a-a902-4cb2180a27f9"},{"uuid":"7bb4b018-b077-4499-a4d7-acc9e6a56d10"}]}} _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:940

这一条API时间比较接近,在boot之前还需要获取token,验证image等多花费了少许时间

2:在创建资源这条API日志里,找到reqeustID,进行下一步检索,依旧在这个日志文件当中,找到虚拟机UUID

2016-09-09 13:59:40.004 68188 DEBUG nova.compute.api [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0
444e84d] [instance: 72c04f69-1d5a-4155-808f-2acb3967dec5] block_device_mapping [{'guest_format': None, 'boot_index': 0, 'no_device': None, 'connection_in
fo': None, 'snapshot_id': None, 'volume_size': None, 'device_name': None, 'disk_bus': None, 'image_id': u'e2f68bbd-25ab-4d65-a2bc-0ac4948f1deb', 'source_
type': 'image', 'device_type': 'disk', 'volume_id': None, 'destination_type': 'local', 'delete_on_termination': True}] _update_block_device_mapping /usr/
lib/python2.7/dist-packages/nova/compute/api.py:1107

3:找到了UUID,就比较好说了,这时候可以先根据传入的availability-zone筛选出所有的可用nova-compute计算节点,直接检索UUID,如果能搜到,TRACE信息应该直接给出了错误原因,就可以得到答案;假如计算节点上依旧没有任何该虚拟机的错误信息,那么就又要回到nova-scheduler节点上来找最终的原因

4:在nova-scheduler节点上检索虚拟机UUID查找原因

2016-09-09 13:59:42.989 45197 INFO nova.filters [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0444e8
4d] Filter JsonFilter returned 0 hosts
2016-09-09 13:59:42.989 45197 DEBUG nova.filters [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0444e
84d] Filtering removed all hosts for the request with reservation ID 'r-a924ikix' and instance ID '72c04f69-1d5a-4155-808f-2acb3967dec5'. Filter results:
 [('RetryFilter', [(u'10-177-0-111', u'10-177-0-111'), (u'10-177-0-140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177
-0-142'), (u'10-177-0-184', u'10-177-0-184'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-125', u'10-177-0-125'), (u'10-177-0-143', u'10-177-0-143'),
 (u'10-177-0-112', u'10-177-0-112'), (u'10-177-0-183', u'10-177-0-183'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-18
0-0-160', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('AvailabilityZ
oneFilter', [(u'10-177-0-111', u'10-177-0-111'), (u'10-177-0-140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142
'), (u'10-177-0-184', u'10-177-0-184'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-125', u'10-177-0-125'), (u'10-177-0-143', u'10-177-0-143'), (u'10
-180-0-112', u'10-177-0-112'), (u'10-177-0-183', u'10-177-0-183'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-16
0', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('RamFilter', [(u'10-
180-0-140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-125
', u'10-177-0-125'), (u'10-177-0-143', u'10-177-0-143'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-160', u'10-1
80-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('ComputeFilter', [(u'10-177-0-
140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-143', u'1
0-180-0-143'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-160', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-1
61'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('ImagePropertiesFilter', [(u'10-177-0-140', u'10-177-0-140'), (u'10-177-
0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-143', u'10-177-0-143'), (u'10-177-0-126', u
'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-160', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0
-163'), (u'10-177-0-123', u'10-177-0-123')]), ('JsonFilter', None)] get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:126
2016-09-09 13:59:42.990 45197 INFO nova.filters [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0444e8
4d] Filtering removed all hosts for the request with reservation ID 'r-a924ikix' and instance ID '72c04f69-1d5a-4155-808f-2acb3967dec5'. Filter results:
['RetryFilter: (start: 16, end: 16)', 'AvailabilityZoneFilter: (start: 16, end: 16)', 'RamFilter: (start: 16, end: 12)', 'ComputeFilter: (start: 12, end:
 11)', 'ImagePropertiesFilter: (start: 11, end: 11)', 'JsonFilter: (start: 11, end: 0)']
2016-09-09 13:59:42.991 45197 WARNING nova.scheduler.driver [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d82
2e426f0444e84d] [instance: 72c04f69-1d5a-4155-808f-2acb3967dec5] Setting instance to ERROR state.

可以看到在filter的时候,创建虚拟机调度的时候满足要求的虚拟机个数为0个,最终这个虚拟机state变成了ERROR,具体原因就在中间

从中间可以看到,调度过程中,分别进行了RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter等等逐步过滤,因为用户提交API请求,会根据传入的flavor大小,系统也会来筛选资源足够,或者说更充裕满足要求的节点来进行创建,因此会逐层筛选最优,而这里的具体原因是RamFilter筛选将满足AZ要求的物理节点都给去除了,因此剩余内存小于申请内存,导致最终筛选完毕后满足要求的为0个,创建失败

OpenStack的日志记录十分齐全,定位问题相对辨识度也挺高,特别是各种TRACE信息比较详细,主要有耐心,应该都能找到具体原因,除非不是太容易重现的异常

发表回复