OSD Replacement Procedures
Check which disks need to be put back in with the following procedure.
- To see which OSDs are down, run `ceph osd tree down out`:
```
[09:28][root@p06253939p44623 (production:ceph/beesly/osd*24) ~]# ceph osd tree down out
ID  CLASS WEIGHT     TYPE NAME                     STATUS REWEIGHT PRI-AFF
 -1       5589.18994 root default
 -2       4428.02979     room 0513-R-0050
 -6        917.25500         rack RA09
 -7        131.03999             host p06253939j03957
430          5.45999                 osd.430       down          0 1.00000
-19        131.03999             host p06253939s09190
 24          5.45999                 osd.24        down          0 1.00000
405          5.45999                 osd.405       down          0 1.00000
 -9        786.23901         rack RA13
-11        131.03999             host p06253939b84659
101          5.45999                 osd.101       down          0 1.00000
-32        131.03999             host p06253939u19068
577          5.45999                 osd.577       down          0 1.00000
-14        895.43903         rack RA17
-34        125.58000             host p06253939f99921
742          5.45999                 osd.742       down          0 1.00000
-22        125.58000             host p06253939h70655
646          5.45999                 osd.646       down          0 1.00000
659          5.45999                 osd.659       down          0 1.00000
718          5.45999                 osd.718       down          0 1.00000
-26        131.03999             host p06253939v20205
650          5.45999                 osd.650       down          0 1.00000
-33        131.03999             host p06253939w66726
362          5.45999                 osd.362       down          0 1.00000
654          5.45999                 osd.654       down          0 1.00000
```
- Check the tickets for these machines in ServiceNow. The ones that interest us are named `[GNI] exception.scsi_blockdevice_driver_error_reported` or `[GNI] exception.nonwriteable_filesystems`.
- If the repair service replaced the disk(s), it will be noted in the ticket, and you can continue with the next step.
On the OSD:
LVM formatting using ceph-volume
- Simple format: OSD as a logical volume of one disk
This is sample output of listing the disks in LVM fashion. Notice that each OSD has exactly one device (disk), and that these devices don't use any SSDs for a performance boost. (`ceph-volume lvm list` takes some time to complete.)
```
[13:55][root@p05972678e21448 (production:ceph/erin/osd*30) ~]# ceph-volume lvm list

====== osd.335 ======

  [block]    /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba

      type                  block
      osd id                335
      cluster fsid          eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name          ceph
      osd fsid              c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      encrypted             0
      cephx lockbox secret
      block uuid            PXCHQW-4aXo-isAR-NdYU-3FQ2-E18Q-whJa92
      block device          /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      vdo                   0
      crush device class    None
      devices               /dev/sdw

====== osd.311 ======

  [block]    /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e

      type                  block
      osd id                311
      cluster fsid          eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name          ceph
      osd fsid              1bfad506-c450-4116-8ba5-ac356be87a9e
      encrypted             0
      cephx lockbox secret
      block uuid            O5fYcf-aGW8-NVWC-lr5G-BsuC-Yx3H-WZl24a
      block device          /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e
      vdo                   0
      crush device class    None
      devices               /dev/sdt
```
This is an example of an OSD that uses an SSD for its metadata. It has a `db` part in which the metadata is stored.
```
[14:04][root@p06253939e35392 (production:ceph/dwight/osd*24) ~]# ceph-volume lvm list

====== osd.29 ======

  [block]    /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48

      type                  block
      osd id                29
      cluster fsid          dd535a7e-4647-4bee-853d-f34112615f81
      cluster name          ceph
      osd fsid              dff889e7-5db5-4c5e-9aab-151e8ad17b48
      db device             /dev/sdac3
      encrypted             0
      db uuid               9762cd49-8f1c-4c29-88ca-ff78f6bdd35c
      cephx lockbox secret
      block uuid            HuzwbL-mVvi-Ubve-1C5D-fjeh-dmZq-ivNNnY
      block device          /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48
      crush device class    None
      devices               /dev/sdk

  [  db]    /dev/sdac3

      PARTUUID              9762cd49-8f1c-4c29-88ca-ff78f6bdd35c

====== osd.88 ======

  [block]    /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558

      type                  block
      osd id                88
      cluster fsid          dd535a7e-4647-4bee-853d-f34112615f81
      cluster name          ceph
      osd fsid              f19541f6-42b2-4612-a700-ec5ac8ed4558
      db device             /dev/sdab6
      encrypted             0
      db uuid               f0b652e1-0161-4583-a50b-45a0a2348e9a
      cephx lockbox secret
      block uuid            cHqcZG-wsON-P9Lw-4pTa-R1pd-GUwR-iqCMBg
      block device          /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558
      crush device class    None
      devices               /dev/sdu

  [  db]    /dev/sdab6

      PARTUUID              f0b652e1-0161-4583-a50b-45a0a2348e9a
```
One way is to take an SSD and do simple partitioning, with each partition attached to an OSD. If the SSD part breaks, e.g. the disk fails, all the OSDs that use this SSD are rendered useless, so each of them has to be replaced. There is also a chance that the SSD is formatted through LVM, in which case the metadata database part will look like this:
```
  [  db]    /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85

      type                  db
      osd id                220
      cluster fsid          e7681812-f2b2-41d1-9009-48b00e614153
      cluster name          ceph
      osd fsid              81f9ed48-d27d-44b6-9ac0-f04799b5d0d5
      db device             /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
      encrypted             0
      db uuid               wZnT18-ZcHh-jcKn-pHic-LCrL-Gpqh-Srq1VL
      cephx lockbox secret
      block uuid            z8CwSn-Iap5-3xDX-v0NA-9smI-EHx1-EGxdAR
      block device          /dev/ceph-6bb8b94b-4974-44b0-ae6c-667896807328/osd-a59a8661-c966-443b-9384-b2676a3d42d8
      vdo                   0
      crush device class    None
      devices               /dev/md125
```
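When an SSD db device fails you need every OSD that shares it. A minimal sketch of how to extract them from saved `ceph-volume lvm list` output; the helper name is our own, `ceph-volume.out` follows the convention used below, and `/dev/sdac` is a placeholder for the failed device:

```shell
# Sketch: list every OSD whose metadata db lives on a given (failed) device.
# Assumes ceph-volume.out holds saved `ceph-volume lvm list` output.
osds_on_db_device() {   # usage: osds_on_db_device /dev/sdXX ceph-volume.out
    grep -E '^=|db device' "$2" |
        awk -v dev="$1" '/^=/ {osd=$2} index($0, dev) {print osd}'
}
# e.g. osds_on_db_device /dev/sdac ceph-volume.out
```

Every `osd.N` it prints has to go through the replacement procedure below.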
Replacement procedure: one disk per osd
`ceph-volume lvm list` is slow; save its output to `~/ceph-volume.out` and work with that file instead.
- Check that the failed device exists and that it has actually failed.
- Check whether it is used as a metadata database for OSDs, or as a regular OSD.
- If it is a metadata database:
  - Locate all OSDs that use it (`lvm list` + `grep`)
  - Follow the steps below for each affected OSD
- If it is a regular OSD, treat it as a normal replacement and follow the steps below once.
- Mark out the OSD: `ceph osd out $OSD_ID`
- Destroy the OSD: `ceph osd destroy $OSD_ID --yes-i-really-mean-it`
- Stop the OSD daemon: `systemctl stop ceph-osd@$OSD_ID`
- Unmount the filesystem: `umount /var/lib/ceph/osd/ceph-$OSD_ID`
- If the OSD uses a metadata database (on an SSD):
  - If it is a regular partition, remove the partition.
  - If it is an LV, remove it, e.g. for `/dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85`:
    `lvremove cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85`
- Run `ceph-volume lvm zap /dev/sdXX --destroy`
- In case `ceph-volume` fails to list the defective devices or to zap the disks, you can get the information you need through `lvs -o devices,lv_tags | grep type=block` and use `vgremove` on the OSD block instead.
- In case you can't get any information through `ceph-volume` or `lvs` about the defective devices, list the working OSDs and `umount` the unused directories with:
```
umount `ls -d -1 /var/lib/ceph/osd/* | grep -v -f <(grep -Po '(?<=osd\.)[0-9]+' ceph-volume.out)`
```
- Now wait until the devices have been replaced. Skip this step if they have already been replaced.
- If the OSD had its metadata database elsewhere (on an SSD), and it was on LVM, prepare it now. For naming we use `` cache-`uuid -v4` ``. Recreate the LV you removed earlier with `lvcreate --name $name -l 100%FREE $VG`. LVM has three layers: PVs are the physical devices (e.g. /dev/sda); VGs are volume groups that contain one or more PVs; and LVs are the "partitions" of a VG. For simplicity we use one PV per VG and one LV per VG. If you have more than one LV per VG, recreate them with e.g. `-l 25%VG` (for 4 LVs per VG) instead of `-l 100%FREE`.
- Recreate the OSD using ceph-volume, reusing a destroyed OSD's id from the same host:
```
ceph-volume lvm create --bluestore --data /dev/sdXXX --block.db (VG/LV or ssd partition) --osd-id XXX
```
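The steps above can be sketched end to end as a dry-run script. The OSD id, data device, VG, and LV names are all placeholders, and every command is echoed rather than executed; change `run` only once you have checked the output:

```shell
#!/bin/sh
# Dry-run sketch of the one-disk-per-OSD replacement above.
# OSD_ID, DATA_DEV, VG and LV are placeholders -- adjust before use.
OSD_ID=430
DATA_DEV=/dev/sdq
VG=cephrocks
LV="cache-$(uuid -v4 2>/dev/null || echo example)"   # our naming convention

run() { echo "$@"; }   # replace the body with "$@" to really execute

run ceph osd out "$OSD_ID"
run ceph osd destroy "$OSD_ID" --yes-i-really-mean-it
run systemctl stop "ceph-osd@$OSD_ID"
run umount "/var/lib/ceph/osd/ceph-$OSD_ID"
run lvremove "$VG/$LV"                       # only if the db was an LV
run ceph-volume lvm zap "$DATA_DEV" --destroy
# ... wait here for the physical disk replacement ...
run lvcreate --name "$LV" -l 100%FREE "$VG"  # recreate the db LV
run ceph-volume lvm create --bluestore --data "$DATA_DEV" \
    --block.db "$VG/$LV" --osd-id "$OSD_ID"
```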
Replacement procedure: two disks striped (raid 0) per osd
- Run this script with the defective device: `ceph-scripts/tools/lvm_stripe_remove.sh /dev/sdXXX` (it doesn't take a list of devices).
- The script reports the cleanup it did. You will need the 2nd and 3rd lines, which name the two disks that made up the failed OSD, and the last line, which is the OSD id.
- In case the script fails, you can open it and follow the steps manually; it is documented.
- If you have more than one OSD to replace, repeat steps 1 and 2 for each; step 5 can be done at the end.
- Once all the disks from step 1 have been replaced, pass the full set of them to this script: `ceph-scripts/ceph-volume/striped-osd-prepare.sh /dev/sd[a-f]`. It uses `ls` internally, so you can use wildcards if you don't want to spell out every `/dev/sdX`.
- It will output a list of commands to be executed in order. Run all of them EXCEPT the `ceph-volume lvm create` one. Append the argument `--osd-id XXX` to the `ceph-volume lvm create` line, using the id of the destroyed OSD, and then run it.
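For reference, the striped flow can be sketched the same dry-run way. The device names and OSD id are placeholders, and the final `--data` argument stands in for whatever `striped-osd-prepare.sh` actually prints; commands are echoed, not executed:

```shell
#!/bin/sh
# Dry-run sketch of the striped (raid 0) replacement above.
# /dev/sdc, /dev/sdd and osd id 123 are placeholders.
run() { echo "$@"; }   # replace the body with "$@" to really execute

# Steps 1-2: clean up around the defective device; note the two member
# disks and the osd id from the script's report.
run ceph-scripts/tools/lvm_stripe_remove.sh /dev/sdc

# Step 5: after both disks are physically replaced, generate the
# preparation commands (run its output, except the create line).
run ceph-scripts/ceph-volume/striped-osd-prepare.sh /dev/sdc /dev/sdd

# Finally, append --osd-id to the create line it printed, e.g.:
run ceph-volume lvm create --bluestore --data VG/LV --osd-id 123
```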