在云计算虚拟机创建过程中时常会出现异常或者创建失败,而这显然不能跟一般的普通应用软件一样看执行结果,因为往往涉及到很多不同的项目组,而且数量十分庞大,因此会每一个操作都会通过详细的日志记录下来,方便定位各种异常,OpenStack就做到了这一点
比如现在使用方想看看几天前有一个虚拟机出错,但是虚拟机已经删掉了,现在需要知道错误原因,通过日志也是可以查的
用户A:
1:虚拟机已删了
2:租户为7bb024f5e216466d822e426f0444e84d
3:调用创建API的时间点大约在2016-09-09 13:59:36,304
4:availability-zone能够提供
看到这里,与平时出错相比,难点在于虚拟机已不存在,而且不知道UUID,或者说根本不清楚到底有没有成功创建,是API层出错,调度出错还是最终创建计算的时候出错,这都是无从知晓,用户A给出的信息,基本都是创建资源的时候传入的参数,对于定位问题只能协助,无法起到决定性作用,更麻烦的是调用API时间已经过去了一周时间,因此就只有通过日志来进行问题的定位
方法基本如下:
1:首先根据租户ID和调用API的时间戳在该环境下寻找相关API日志,但是要注意API节点不止ONLY ONE,因此每个nova-api节点都要找遍,创建计算资源是POST方法,结合租户ID来进行查询,由于日志都有分割,根据保存的文件时间戳定位到某一个gz文件来进行查询
2016-09-09 13:59:36.323 68188 DEBUG keystoneclient.middleware.auth_token [-] Returning cached token 49463802b1054e51a5b78adc8bf7e262 _cache_get /usr/lib/ python2.7/dist-packages/keystoneclient/middleware/auth_token.py:1050 2016-09-09 13:59:36.323 68188 DEBUG keystoneclient.middleware.auth_token [-] Received request from user: 654dd3cd139543009753ad80a2401881 with project_id : 7bb024f5e216466d822e426f0444e84d and roles: Member _build_user_headers /usr/lib/python2.7/dist-packages/keystoneclient/middleware/auth_token.py:950 2016-09-09 13:59:36.324 68188 DEBUG routes.middleware [-] Matched POST /7bb024f5e216466d822e426f0444e84d/servers __call__ /usr/lib/python2.7/dist-package s/routes/middleware.py:100 2016-09-09 13:59:36.325 68188 DEBUG routes.middleware [-] Route path: '/{project_id}/servers', defaults: {'action': u'create', 'controller': } __call__ /usr/lib/python2.7/dist-packages/routes/middleware.py:102 2016-09-09 13:59:36.325 68188 DEBUG routes.middleware [-] Match dict: {'action': u'create', 'controller': , 'project_id': u'7bb024f5e216466d822e426f0444e84d'} __call__ /usr/lib/python2.7/dist-packages/routes/middleware.py:103 2016-09-09 13:59:36.325 68188 DEBUG nova.api.openstack.wsgi [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d82 2e426f0444e84d] Action: 'create', body: {"os:scheduler_hints":{"query":"[\"in\",\"$service.availability_zone\",\”xxxx.xxxx3\"]"},"server":{"name":" xxxx-xxxx-7bb024f5e216-1473400777983-d72efbc3","imageRef":"e2f68bbd-25ab-4d65-a2bc-0ac4948f1deb","flavorRef":"92","personality":[{"path":"/etc/vm_monitor /info","contents":"eyJzZXJ2aWNlIjoiTkNFIiwib3JpX3VzZXIiOiI3YmIwMjRmNWUyMTY0NjZkODIyZTQyNmYwNDQ0ZTg0ZCIsInJlc291cmNlX3R5cGUiOiJ2bSIsInJlc291cmNlX2lkIjoicHViY2ktTkNFLTdiYjAyNGY1ZTIxNi0xNDczNDAwNzc3OTgzLWQ3MmVmYmMzIiwiYWdncmVnYXRpb25faXRlbXMiOnt9LCJhY2Nlc3NLZXkiOiJhMjNlYjczMDU0ZTM0NmJlOTFlNmM2Zjg1NmFjY2Y0YSIsImFjY2Vzc1NlY3JldCI6ImJhNTU0OWNjNzUyNzQ1MTZiNzZkMGM1YWM4OTc5YzEwIiwibW9uaXRvcldlYlNlcnZlclVybCI6Imh0dHA6Ly8xMC4xODAuMTAuODA6ODE4NiJ9"}],"networks":[{"uuid":"04afc300-3beb-430a-a902-4cb2180a27f9"},{"uuid":"7bb4b018-b077-4499-a4d7-acc9e6a56d10"}]}} _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:940
这一条API时间比较接近,在boot之前还需要获取token,验证image等多花费了少许时间
2:在创建资源这条API日志里,找到reqeustID,进行下一步检索,依旧在这个日志文件当中,找到虚拟机UUID
2016-09-09 13:59:40.004 68188 DEBUG nova.compute.api [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0 444e84d] [instance: 72c04f69-1d5a-4155-808f-2acb3967dec5] block_device_mapping [{'guest_format': None, 'boot_index': 0, 'no_device': None, 'connection_in fo': None, 'snapshot_id': None, 'volume_size': None, 'device_name': None, 'disk_bus': None, 'image_id': u'e2f68bbd-25ab-4d65-a2bc-0ac4948f1deb', 'source_ type': 'image', 'device_type': 'disk', 'volume_id': None, 'destination_type': 'local', 'delete_on_termination': True}] _update_block_device_mapping /usr/ lib/python2.7/dist-packages/nova/compute/api.py:1107
3:找到了UUID,就比较好说了,这时候可以先根据传入的availability-zone筛选出所有的可用nova-compute计算节点,直接检索UUID,如果能搜到,TRACE信息应该直接给出了错误原因,就可以得到答案;假如计算节点上依旧没有任何该虚拟机的错误信息,那么就又要回到nova-scheduler节点上来找最终的原因
4:在nova-scheduler节点上检索虚拟机UUID查找原因
2016-09-09 13:59:42.989 45197 INFO nova.filters [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0444e8 4d] Filter JsonFilter returned 0 hosts 2016-09-09 13:59:42.989 45197 DEBUG nova.filters [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0444e 84d] Filtering removed all hosts for the request with reservation ID 'r-a924ikix' and instance ID '72c04f69-1d5a-4155-808f-2acb3967dec5'. Filter results: [('RetryFilter', [(u'10-177-0-111', u'10-177-0-111'), (u'10-177-0-140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177 -0-142'), (u'10-177-0-184', u'10-177-0-184'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-125', u'10-177-0-125'), (u'10-177-0-143', u'10-177-0-143'), (u'10-177-0-112', u'10-177-0-112'), (u'10-177-0-183', u'10-177-0-183'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-18 0-0-160', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('AvailabilityZ oneFilter', [(u'10-177-0-111', u'10-177-0-111'), (u'10-177-0-140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142 '), (u'10-177-0-184', u'10-177-0-184'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-125', u'10-177-0-125'), (u'10-177-0-143', u'10-177-0-143'), (u'10 -180-0-112', u'10-177-0-112'), (u'10-177-0-183', u'10-177-0-183'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-16 0', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('RamFilter', [(u'10- 180-0-140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-125 ', u'10-177-0-125'), (u'10-177-0-143', u'10-177-0-143'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-160', u'10-1 80-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('ComputeFilter', [(u'10-177-0- 140', u'10-177-0-140'), (u'10-177-0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-143', u'1 0-180-0-143'), (u'10-177-0-126', u'10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-160', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-1 61'), (u'10-177-0-163', u'10-177-0-163'), (u'10-177-0-123', u'10-177-0-123')]), ('ImagePropertiesFilter', [(u'10-177-0-140', u'10-177-0-140'), (u'10-177- 0-141', u'10-177-0-141'), (u'10-177-0-142', u'10-177-0-142'), (u'10-177-0-148', u'10-177-0-148'), (u'10-177-0-143', u'10-177-0-143'), (u'10-177-0-126', u '10-177-0-126'), (u'10-177-0-162', u'10-177-0-162'), (u'10-177-0-160', u'10-177-0-160'), (u'10-177-0-161', u'10-177-0-161'), (u'10-177-0-163', u'10-177-0 -163'), (u'10-177-0-123', u'10-177-0-123')]), ('JsonFilter', None)] get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:126 2016-09-09 13:59:42.990 45197 INFO nova.filters [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d822e426f0444e8 4d] Filtering removed all hosts for the request with reservation ID 'r-a924ikix' and instance ID '72c04f69-1d5a-4155-808f-2acb3967dec5'. Filter results: ['RetryFilter: (start: 16, end: 16)', 'AvailabilityZoneFilter: (start: 16, end: 16)', 'RamFilter: (start: 16, end: 12)', 'ComputeFilter: (start: 12, end: 11)', 'ImagePropertiesFilter: (start: 11, end: 11)', 'JsonFilter: (start: 11, end: 0)'] 2016-09-09 13:59:42.991 45197 WARNING nova.scheduler.driver [req-68e928f1-d36c-484f-9f58-6b78e8a04d72 654dd3cd139543009753ad80a2401881 7bb024f5e216466d82 2e426f0444e84d] [instance: 72c04f69-1d5a-4155-808f-2acb3967dec5] Setting instance to ERROR state.
可以看到在filter的时候,创建虚拟机调度的时候满足要求的虚拟机个数为0个,最终这个虚拟机state变成了ERROR,具体原因就在中间
从中间可以看到,调度过程中,分别进行了RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter等等逐步过滤,因为用户提交API请求,会根据传入的flavor大小,系统也会来筛选资源足够,或者说更充裕满足要求的节点来进行创建,因此会逐层筛选最优,而这里的具体原因是RamFilter筛选将满足AZ要求的物理节点都给去除了,因此剩余内存小于申请内存,导致最终筛选完毕后满足要求的为0个,创建失败
OpenStack的日志记录十分齐全,定位问题相对辨识度也挺高,特别是各种TRACE信息比较详细,主要有耐心,应该都能找到具体原因,除非不是太容易重现的异常