Fault Description
After a power system upgrade in a customer's server room was completed, the physical servers were powered back on and returned to a normal state.
When the OpenStack virtual machines were started, one virtual machine failed to start.
Another virtual machine did start, but it could not write to its file system, and its I/O utilization soared
even though there was no actual read or write throughput.
Failure Analysis
1. According to the customer, before the planned power-down they shut down all OpenStack virtual machines
and stopped the OpenStack services, and only then shut down the Ceph and OpenStack physical servers normally.
2. After the servers were powered up, Ceph was started first and then OpenStack.
3. After the OpenStack services were started, all VMs were started manually.
4. No abnormality was noticed at that point. Only after feedback from the application staff was it found that one VM had not started properly
and that another VM could not read or write its file system normally.
Logging in to the OpenStack and Ceph management platforms to check the status showed that one VM in OpenStack was in the following state:
The Ceph status was as follows:
Attempting to reboot the VM in the abnormal state reported the following error:
Troubleshooting
The Ceph status output shows that osd.7 reports slow ops and that 1 PG is in the activating state.
1. Determine the status of the OSD
Looking up the location of the OSD shows that osd.7 belongs to the ceph03 node.
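As a rough illustration of this lookup, the sketch below finds the host that owns an OSD from JSON shaped like the output of `ceph osd tree --format json` (`ceph osd find 7` is a more direct alternative). The sample data is hypothetical and trimmed: only osd.7 and ceph03 come from this incident, and the other hosts, OSD ids and the `host_of_osd` helper are made up for the example.

```python
import json

# Hypothetical, trimmed sample shaped like `ceph osd tree --format json`.
# Only osd.7 and ceph03 are taken from the incident; the rest is invented.
SAMPLE_OSD_TREE = """
{"nodes": [
  {"id": -1, "name": "default", "type": "root", "children": [-3, -5]},
  {"id": -3, "name": "ceph02", "type": "host", "children": [3, 4]},
  {"id": -5, "name": "ceph03", "type": "host", "children": [6, 7]},
  {"id": 3, "name": "osd.3", "type": "osd"},
  {"id": 4, "name": "osd.4", "type": "osd"},
  {"id": 6, "name": "osd.6", "type": "osd"},
  {"id": 7, "name": "osd.7", "type": "osd"}
]}
"""


def host_of_osd(tree, osd_id):
    """Return the name of the host bucket whose children include osd_id."""
    for node in tree.get("nodes", []):
        if node.get("type") == "host" and osd_id in node.get("children", []):
            return node["name"]
    return None  # OSD not found under any host bucket


tree = json.loads(SAMPLE_OSD_TREE)
print(host_of_osd(tree, 7))  # prints "ceph03"
```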
2. Determine the PG status
The PG status shows that pg 7.1d was already in a STUCK state when the cluster was shut down the previous night.
In Ceph, the activating state means that the PG has finished peering but has not yet become active.
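To make the distinction concrete: a PG state is a string of flags joined by "+", and Ceph only processes client requests for a PG once "active" is one of those flags. The sketch below picks out PGs that are not active from a small mapping. The mapping is hypothetical sample data (only pg 7.1d and its activating state come from this incident); on a live cluster, `ceph pg dump_stuck inactive` reports similar information.

```python
def is_active(state):
    """True if 'active' is one of the '+'-joined flags in a PG state string."""
    return "active" in state.split("+")


def inactive_pgs(pg_states):
    """Return the ids of PGs whose state does not include the 'active' flag."""
    return sorted(pgid for pgid, state in pg_states.items() if not is_active(state))


# Hypothetical sample: only 7.1d / activating is taken from the incident.
pg_states = {
    "7.1c": "active+clean",
    "7.1d": "activating",
    "7.1e": "active+clean",
}

print(inactive_pgs(pg_states))  # prints ['7.1d']
```

Splitting on "+" rather than doing a substring match matters here, because "activating" must not be mistaken for "active".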
3. Check the Ceph logs
Check the Ceph log on the ceph03 node, /var/log/ceph/ceph-osd.7.log, which contains the following:
Fault Handling
1. Try restarting the mon service
Restarting the ceph.mon service had no effect.
2. Try repairing the PG
Repairing the PG had no effect.
3. Restart the OSD service
Restarting the OSD service resolved the problem.
After the Ceph issue was resolved, the VM status in OpenStack returned to normal.
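The exact commands used for these three attempts are not shown here. As a hedged sketch, on a package-based (non-containerized) deployment managed by systemd they would typically look like the commands built below. The unit names `ceph-mon@<host>` and `ceph-osd@<id>` are an assumption about the deployment (containerized clusters name their units differently), and `ceph01` is a hypothetical monitor host, since the text does not say which node runs the monitor.

```python
import shlex


def remediation_commands(mon_host, pg_id, osd_id):
    """Return the three attempts, in the order they were tried, as argv lists.

    Assumes systemd units named ceph-mon@<host> and ceph-osd@<id>.
    """
    return [
        ["systemctl", "restart", f"ceph-mon@{mon_host}"],  # 1. restart the monitor
        ["ceph", "pg", "repair", pg_id],                   # 2. ask Ceph to repair the PG
        ["systemctl", "restart", f"ceph-osd@{osd_id}"],    # 3. restart the affected OSD
    ]


# "ceph01" is a placeholder monitor host; 7.1d and osd.7 are from the incident.
for argv in remediation_commands("ceph01", "7.1d", 7):
    print(shlex.join(argv))
```

The commands are only printed, not executed, so they can be reviewed before being run on the relevant node.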
Summary of Experience
1. When a Ceph change requires a shutdown, stop all applications first and then shut down Ceph.
2. After power is restored, first confirm that the Ceph status is normal, and only then start the applications.
3. For day-to-day Ceph operation and maintenance, add more monitoring and establish a performance baseline,
so that effective comparisons can be made when problems are found.
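The second point can be automated with a simple gate that blocks application start-up until Ceph reports HEALTH_OK. The following is a minimal sketch, not a production tool: `wait_for_health_ok` and `ceph_health` are hypothetical helper names, the timeout values are arbitrary, and `ceph_health` assumes that `ceph status --format json` exposes the overall state under `health.status`, which may differ on older releases.

```python
import json
import subprocess
import time


def ceph_health():
    """Read the overall health string from the ceph CLI (assumed JSON layout)."""
    out = subprocess.run(
        ["ceph", "status", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["health"]["status"]


def wait_for_health_ok(get_health, timeout_s=600.0, interval_s=10.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_health() until it returns 'HEALTH_OK' or timeout_s elapses.

    Returns True if the cluster became healthy in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if get_health() == "HEALTH_OK":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demonstration with a fake health source, so no cluster is needed.
fake = iter(["HEALTH_WARN", "HEALTH_WARN", "HEALTH_OK"])
ok = wait_for_health_ok(lambda: next(fake), timeout_s=5, interval_s=0,
                        sleep=lambda s: None)
print(ok)  # prints True
```

In real use, the call would be `wait_for_health_ok(ceph_health)`, and the OpenStack services and VMs would be started only if it returns True. In this incident such a gate would have held start-up back while pg 7.1d was still activating.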