What to watch?
There are several channels to watch during your Rota shift:
- Emails to ceph-admins@cern.ch:
  - "Ceph Health Warn" mails.
  - SNOW tickets from IT Repair Service.
  - Prometheus Alerts.
- SNOW tickets assigned to Ceph Service:
  - Here is a link to the tickets needing to be taken: Ceph Assigned
- Ceph Internal Mattermost channel
- General information on clusters (configurations, OSD types, HW, versions): Instance Version Tracking ticket
Taking notes
Each action you take should be noted down in a journal, which is to be linked or attached to the minutes of the Ceph weekly meeting (https://indico.cern.ch/category/9250/) the following week. Use HackMD, Notepad, ...
Keeping the Team Informed
If you have any questions or take any significant actions, keep your colleagues informed in Mattermost.
Common Procedures
- scsi_blockdevice_driver_error_reported
- CephInconsistentPGs
- Ceph PG Unfound
- CephTargetDown
- SSD Replacement
- MDS Slow Ops
- Large omap Objects
scsi_blockdevice_driver_error_reported
Draining a Failing OSD
The IT Repair Service may ask ceph-admins to prepare a disk to be physically removed.
The scripts needed for the replacement procedure may be found under ceph-scripts/tools/ceph-disk-replacement/.
For failing OSDs in the wigner cluster, contact ceph-admins.
- Keep watch ceph status open in a separate window.
- Log in to the machine with the failing drive (the ticket should tell which drive is failing) and run:
  ./drain-osd.sh --dev /dev/sdX
  - For machines in /ceph/erin/osd/castor: you cannot run the script, ask ceph-admins.
- If the output is of the following form, take note of the OSD id <id>:
  ceph osd out osd.<id>
- Else:
  - If the script shows no output: Ceph is unhealthy or the OSD is unsafe to stop; contact ceph-admins.
  - If the script shows broken output (especially a missing <id>): contact ceph-admins.
- Run the following to execute the drain:
  ./drain-osd.sh --dev /dev/sdX | sh
- Once drained (this can take a few hours), we want to prepare the disk for replacement.
- Run:
  ./prepare-for-replacement.sh --dev /dev/sdX
- Continue if the output is of the following form and the OSD id <id> displayed is consistent with what was given by the previous command:
  systemctl stop ceph-osd@<id>
  umount /var/lib/ceph/osd/ceph-<id>
  ceph-volume lvm zap /dev/sdX --destroy
  (note that the --destroy flag will be dropped in case of a FileStore OSD)
- Else:
  - If the script shows no output: Ceph is unhealthy or the OSD is unsafe to stop; contact ceph-admins.
  - If the script shows broken output (especially a missing <id>): contact ceph-admins.
- Run the following to execute:
  ./prepare-for-replacement.sh --dev /dev/sdX | sh
- Now the disk is safe to be physically removed.
- Notify the repair team in the ticket.
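Both drain-osd.sh and prepare-for-replacement.sh follow the same review-then-pipe pattern. As a sketch, a hypothetical wrapper (confirm_and_run is an illustrative name, not part of ceph-scripts) could print the generated commands and require explicit confirmation before executing them:

```shell
# Hypothetical helper, not part of ceph-scripts: show what a generator
# script would run, then require an explicit "y" before piping it to sh.
confirm_and_run() {  # usage: confirm_and_run ./drain-osd.sh --dev /dev/sdX
  cmds=$("$@") || return 1
  if [ -z "$cmds" ]; then
    # Mirrors the runbook rule: no output means Ceph is unhealthy or the
    # OSD is unsafe to stop, so escalate instead of executing anything.
    echo "no output: contact ceph-admins" >&2
    return 1
  fi
  printf '%s\n' "$cmds"
  printf 'Execute these commands? [y/N] '
  read -r ans
  [ "$ans" = "y" ] && printf '%s\n' "$cmds" | sh
}
```

Answering anything but y leaves the cluster untouched, which matches the intent of reviewing the script output before appending | sh.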
Creating a new OSD (on a replacement disk)
When the IT Repair Service has replaced the broken disk with a new one, we have to format that disk with BlueStore to add it back to the cluster:
- Keep watch ceph status open in a separate window.
- Identify the OSD id to use on this OSD:
  - Check your notes from the drain procedure above.
  - Cross-check with ceph osd tree down <-- look for the down OSD on this host; it should match your notes.
- Run ./recreate-osd.sh --dev /dev/sdX and check that the output is according to the following:
  - On beesly cluster:
    ceph-volume lvm zap /dev/sdX
    ceph osd destroy <id> --yes-i-really-mean-it
    ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
  - On gabe cluster:
    ceph-volume lvm zap /dev/sdX
    ceph-volume lvm zap /dev/ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+
    ceph osd destroy <id> --yes-i-really-mean-it
    ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+
  - On erin cluster:
    - Regular case:
      ceph-volume lvm zap /dev/sdX
      ceph osd destroy <id> --yes-i-really-mean-it
      ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
    - ceph/erin/castor/osd: the script cannot be run, contact ceph-admins.
- If the output is satisfactory, run:
  ./recreate-osd.sh --dev /dev/sdX | sh
See OSD Replacement for many more details.
CephInconsistentPGs
Familiarize yourself with the Upstream documentation
Check ceph.log on a ceph/*/mon machine to find the original "cluster [ERR]" line.
The inconsistent PGs generally come in two types:
- deep-scrub: stat mismatch; the solution is to repair the PG.
  Here is an example on ceph/flax:
  2019-02-17 16:23:05.393557 osd.60 osd.60 128.142.161.220:6831/3872729 56 : cluster [ERR] 1.85 deep-scrub : stat mismatch, got 149749/149749 objects, 0/0 clones, 149749/149749 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 135303283738/135303284584 bytes, 0/0 hit_set_archive bytes.
  2019-02-17 16:23:05.393566 osd.60 osd.60 128.142.161.220:6831/3872729 57 : cluster [ERR] 1.85 deep-scrub 1 errors
- candidate had a read error; the solution follows below.
- Notice that the doc says: "If read_error is listed in the errors attribute of a shard, the inconsistency is likely due to disk errors. You might want to check your disk used by that OSD." This is indeed the most common scenario.
Handle a failing disk
In this case, a failing disk returns bogus data during deep scrubbing, and ceph will notice that the replicas are not all consistent with each other. The correct procedure is therefore to remove the failing disk from the cluster, let the PGs backfill, then finally to deep-scrub the inconsistent PG once again.
Here is an example on the ceph/erin cluster, where the monitoring has told us that PG 64.657c is inconsistent:
[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~] grep shard /var/log/ceph/ceph.log
2017-04-12 06:34:26.763000 osd.508 128.142.25.116:6924/4070422 4602 : cluster [ERR] 64.657c shard 187:
soid 64:3ea78883:::1568573986@castorns.27153415189.0000000000000034:head candidate had a read error
A shard in this case refers to the OSD holding the inconsistent object replica; here it is osd.187.
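The shard id can also be extracted mechanically from such [ERR] lines; a small sketch (shard_osd is an illustrative name), assuming single-line log entries on stdin:

```shell
# Print the OSD id named as "shard <N>" in "candidate had a read error"
# cluster-log lines fed on stdin.
shard_osd() {
  grep 'candidate had a read error' | sed -n 's/.*shard \([0-9][0-9]*\).*/\1/p'
}
```

For example, grep shard /var/log/ceph/ceph.log | shard_osd on a mon host would print the failing shard's OSD id.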
Where is osd.187?
[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~]# ceph osd find 187
{
"osd": 187,
"ip": "128.142.25.106:6820\/530456",
"crush_location": {
"host": "p05972678k94093",
"rack": "EC06",
"room": "0513-R-0050",
"root": "default",
"row": "EC"
}
}
On the p05972678k94093 host we first need to find out which /dev/sd* device hosts osd.187.
On BlueStore OSDs we need to check with ceph-volume lvm list or lvs:
[14:38][root@p05972678e32155 (production:ceph/erin/osd*30) ~]# lvs -o +devices,tags | grep 187
osd-block-... ceph-... -wi-ao---- <5.46t /dev/sdm(0) ....,ceph.osd_id=187,....
So we know the failed drive is /dev/sdm; now we can check for disk medium errors:
[09:16][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# grep sdm /var/log/messages
[Wed Apr 12 12:27:59 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 04 00 00 00
[Wed Apr 12 12:27:59 2017] blk_update_request: critical medium error, dev sdm, sector 90638112
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Sense Key : Medium Error [current]
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Add. Sense: Unrecovered read error
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 00 08 00 00
[Wed Apr 12 12:28:02 2017] blk_update_request: critical medium error, dev sdm, sector 90638112
In this case, the disk is clearly failing.
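Counting those medium errors for a suspect device can be done in one line; a sketch with the device name and log file as parameters (medium_errors is an illustrative name):

```shell
# Count "critical medium error" kernel messages for a given device in a
# given log file (e.g. medium_errors sdm /var/log/messages).
medium_errors() {
  grep -c "critical medium error, dev $1," "$2"
}
```

A non-zero count on a drive that is also throwing read errors in deep scrub is a strong hint it should be drained.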
Now check whether that OSD is safe to stop:
[14:41][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# ceph osd ok-to-stop osd.187
OSD(s) 187 are ok to stop without reducing availability, provided there are no other concurrent failures or interventions. 182 PGs are likely to be degraded (but remain available) as a result.
Since it is OK, we stop the osd, umount it, and mark it out.
[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# systemctl stop ceph-osd@187.service
[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# umount /var/lib/ceph/osd/ceph-187
[09:17][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# ceph osd out 187
marked out osd.187.
ceph status should now show the PG in a state like this:
1 active+undersized+degraded+remapped+inconsistent+backfilling
It can take a few tens of minutes to backfill the degraded PG.
Repairing a PG
Once the inconsistent PG is no longer "undersized" or "degraded", use the script at ceph-scripts/tools/scrubbing/autorepair.sh to repair the PG and start the scrubbing immediately.
Now check ceph status... You should see scrubbing+repair already started on the inconsistent PG.
Ceph PG Unfound
The PG unfound condition may be due to a race condition when PGs are scrubbed (see https://tracker.ceph.com/issues/51194), leading to PGs being reported as recovery_unfound.
Upstream documentation is available for general unfound objects.
In case of unfound objects, ceph reports a HEALTH_ERR condition:
# ceph -s
cluster:
id: 687634f1-03b7-415b-aff9-e21e6bedbe7c
health: HEALTH_ERR
1/282983194 objects unfound (0.000%)
Possible data damage: 1 pg recovery_unfound
Degraded data redundancy: 3/848949582 objects degraded (0.000%), 1 pg degraded
services:
mon: 3 daemons, quorum cephdata20-4675e5a59e,cephdata20-44bdbfa86f,cephdata20-83e1d8a16e (age 4h)
mgr: cephdata20-83e1d8a16e(active, since 11w), standbys: cephdata20-4675e5a59e, cephdata20-44bdbfa86f
osd: 576 osds: 575 up (since 9d), 573 in (since 9d)
data:
pools: 3 pools, 17409 pgs
objects: 282.98M objects, 1.1 PiB
usage: 3.2 PiB used, 3.0 PiB / 6.2 PiB avail
pgs: 3/848949582 objects degraded (0.000%)
1/282983194 objects unfound (0.000%)
17342 active+clean
60 active+clean+scrubbing+deep
6 active+clean+scrubbing
1 active+recovery_unfound+degraded
List the PGs in recovery_unfound state:
# ceph pg ls recovery_unfound
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
1.2d09 17232 3 0 1 72106876434 0 0 3373 active+recovery_unfound+degraded 37m 399723'3926620 399723:23220581 [574,671,662]p574 [574,671,662]p574 2023-01-12T13:27:34.752832+0100 2023-01-12T13:27:34.752832+0100
Check the ceph log (cat /var/log/ceph/ceph.log | grep ERR) for IO errors on the primary OSD of the PG. In this case, the disk backing osd.574 is failing with pending sectors (check with smartctl -a <device>):
2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
2023-01-12T13:27:34.752327+0100 osd.574 (osd.574) 776 : cluster [ERR] 1.2d09 deep-scrub 0 missing, 1 inconsistent objects
2023-01-12T13:27:34.752830+0100 osd.574 (osd.574) 777 : cluster [ERR] 1.2d09 repair 1 errors, 1 fixed
2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
Before taking any action, make sure that the versions of the object reported as unfound on the other two OSDs are more recent than the lost one:
- List the unfound object:
  # ceph pg 1.2d09 list_unfound
  {
      "num_missing": 1,
      "num_unfound": 1,
      "objects": [
          {
              "oid": {
                  "oid": "rbd_data.0bee1ae64c9012.00000000000032c4",
                  "key": "",
                  "snapid": -2,
                  "hash": 2152017161,
                  "max": 0,
                  "pool": 1,
                  "namespace": ""
              },
              "need": "399702'3923004",
              "have": "0'0",
              "flags": "none",
              "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
              "locations": []
          }
      ],
      "state": "NotRecovering",
      "available_might_have_unfound": true,
      "might_have_unfound": [],
      "more": false
  }
- The missing object is at version 399702.
- Last osdmap before the read error: e399704
  2023-01-12T13:07:24.463521+0100 mon.cephdata20-4675e5a59e (mon.0) 2714279 : cluster [DBG] osdmap e399704: 576 total, 575 up, 573 in
  2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
- The object goes unfound at: e399710
  2023-01-12T13:27:30.297813+0100 mon.cephdata20-4675e5a59e (mon.0) 2714933 : cluster [DBG] osdmap e399710: 576 total, 575 up, 573 in
  2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
- The two copies on osd.671 and osd.662 are more recent (399709 vs 399702):
  2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
  2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
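The epoch comparison can be scripted instead of read by eye; a sketch (copy_is_newer is an illustrative name) that parses one push-failed log line and compares the unfound object's version against the surviving local copy:

```shell
# Given one "push ... failed because local copy is ..." log line, compare
# the unfound object's epoch with the surviving copy's epoch.
copy_is_newer() {  # usage: copy_is_newer "<log line>"
  pushed=$(printf '%s\n' "$1" | sed -n "s/.* v \([0-9][0-9]*\)'.*/\1/p")
  localv=$(printf '%s\n' "$1" | sed -n "s/.*local copy is \([0-9][0-9]*\)'.*/\1/p")
  if [ "$localv" -gt "$pushed" ]; then
    echo "local copy newer: safe to out the primary"
  else
    echo "local copy NOT newer: contact ceph-admins"
  fi
}
```

Run it on both push-failed lines (one per surviving OSD) before outing the primary.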
If copies are more recent than the lost one:
- Set the primary osd (osd.574) out.
- The recovery_unfound object disappears and backfilling starts.
- Once backfilled, deep-scrub the PG to check for inconsistencies.
CephTargetDown
This is a special alert raised by prometheus. It indicates that, for whatever reason, a target node is no longer exposing its metrics, or the prometheus server is not able to pull them. This does not imply that the node is offline, just that the node endpoint is down for prometheus.
To handle these tickets, first identify the affected target. This information should be in the ticket body.
The following Alerts are in Firing Status:
------------------------------------------------
Target cephpolbo-mon-0.cern.ch:9100 is down
Target cephpolbo-mon-2.cern.ch:9100 is down
Alert Details:
------------------------------------------------
Alertname: TargetDown
Cluster: polbo
Job: node
Monitor: cern
Replica: A
Severity: warning
Then we can go to the targets section in the prometheus dashboard and cross-check the affected node. There you can find more information about the reason it is down.
This could be caused by the following reasons:
- A node is offline or it's being restarted. Follow the normal procedures for understanding why the node is not online (ping, ssh, console access, SNOW ticket search...). Once the node is back, the target should be marked as UP again automatically.
- If a new target was added recently, possibly there are mistakes in the target definition or some connectivity problems, like the port being blocked.
  - Review the target configuration in it-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml and refer to the monitoring guide.
  - Make sure that the firewall configuration allows prometheus to scrape the data through the specified port.
- In ceph, the daemons that expose the metrics are the mgr. It can happen that the mgr hangs and stops exposing the metrics.
  - Check the mgr status and restart it if needed. Don't forget to collect information about the state you found it in for further analysis. If all went well, after 30 seconds the target should be UP again in the prometheus dashboard. To double-check, you can click on the endpoint url of the node and see whether the metrics are now shown.
SSD Replacement
Draining OSDs attached to a failing SSD
To drain the OSDs attached to a failing SSD, run the following command:
$> cd /root/ceph-scripts/tools/ceph-disk-replacement
$> ./ssd-drain-osd.sh --dev /dev/<ssd>
ceph osd out osd.<osd0>;
ceph osd primary-affinity osd.<osd0> 0;
ceph osd out osd.<osd1>;
ceph osd primary-affinity osd.<osd1> 0;
...
ceph osd out osd.<osdN>;
ceph osd primary-affinity osd.<osdN> 0;
If the output is similar to the one above, it is safe to re-run the command with | sh appended, to actually put all the OSDs attached to the SSD out of the cluster.
Prepare for replacement
Once the draining has been started, the OSDs need to be zapped before the SSD can be removed and physically replaced:
$> ./ssd-prepare-for-replacement.sh --dev /dev/<dev> -f
systemctl stop ceph-osd@<osd0>
umount /var/lib/ceph/osd/ceph-<osd0>
ceph-volume lvm zap --destroy --osd-id <osd0>
systemctl stop ceph-osd@<osd1>
umount /var/lib/ceph/osd/ceph-<osd1>
ceph-volume lvm zap --destroy --osd-id <osd1>
...
systemctl stop ceph-osd@<osdN>
umount /var/lib/ceph/osd/ceph-<osdN>
ceph-volume lvm zap --destroy --osd-id <osdN>
Recreate the OSD
TBC
MDS Slow Ops
Check for long ongoing operations on the MDS reporting Slow Ops:
The mon shows a SLOW_OPS warning:
ceph health detail
cat /var/log/ceph/ceph.log | grep SLOW
cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
The affected MDS shows slow request in the logs:
cat /var/log/ceph/ceph-mds.cephcpu21-0c370531cf.log | grep -i SLOW
2022-10-22T09:09:21.473+0200 7fe1b8054700 0 log_channel(cluster) log [WRN] : 30 slow requests, 1 included below; oldest blocked for > 2356.704295 secs
2022-10-22T09:09:21.473+0200 7fe1b8054700 0 log_channel(cluster) log [WRN] : slow request 1924.631928 seconds old, received at 2022-10-22T08:37:16.841403+0200: client_request(client.366059605:743931 getattr AsXsFs #0x10251604c38 2022-10-22T08:37:16.841568+0200 caller_uid=1001710000, caller_gid=0{1001710000,}) currently dispatched
Dump the ongoing ops and check there are some with very long (minutes, hours) age:
ceph daemon mds.`hostname -s` ops | grep age | less
Identify the client with such long ops (age should be >900):
ceph daemon mds.`hostname -s` ops | egrep 'client|age' | less
"description": "client_request(client.364075205:4876 getattr pAsLsXsFs #0x1023f14e5d8 2022-10-16T03:46:40.673900+0200 RETRY=184 caller_uid=0, caller_gid=0{})",
"age": 0.87975248399999995,
"reqid": "client.364075205:4876",
"op_type": "client_request",
"client_info": {
"client": "client.364075205",
Get info on the client:
ceph daemon mds.`hostname -s` client ls id=<THE_ID>
- IP address
- Hostname
- Ceph client version
- Kernel version (in case of a kernel mount)
- Mount point (on the client side)
- Root (aka, the CephFS volume the client mounts)
Evict the client:
ceph tell mds.* client ls id=<THE_ID>
ceph tell mds.* client evict id=<THE_ID>
Large omap objects
On S3 clusters, you may see a HEALTH_WARN message reporting 1 large omap objects.
This is very likely due to bucket index(es) being over full. Example:
"user_id": "warp-tests",
"buckets": [
{
"bucket": "warp-tests",
"tenant": "",
"num_objects": 9993106,
"num_shards": 11,
"objects_per_shard": 908464,
"fill_status": "OVER"
}
]
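The limit-check output above can be filtered down to just the problematic buckets; a sketch matching that layout (overfull_buckets is an illustrative name):

```shell
# From "radosgw-admin bucket limit check" JSON on stdin, print each bucket
# whose index fill_status is WARN or OVER.
overfull_buckets() {
  awk -F'"' '/"bucket":/ { b = $4 }
             /"fill_status":/ && $4 ~ /OVER|WARN/ { print b, $4 }'
}
```

Usage: radosgw-admin bucket limit check | overfull_buckets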
Proceed as follows:
- Check bucket index(es) being over full is the actual problem:
radosgw-admin bucket limit check
- If it is not possible to reshard the bucket, tune osd_deep_scrub_large_omap_object_key_threshold properly (default is 200000; Gabe runs with 500000; read at 42on.com):
  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000
- If it is possible to reshard the bucket, manually reshard any bucket showing fill_status WARN or OVER:
  - keep the number of objects per shard around 50k
  - pick a prime number of shards
  - consider whether the bucket will be ever-growing or owners delete objects. If ever-growing, you may reshard to a high number of shards to avoid (or postpone) resharding in the future.
  radosgw-admin bucket reshard --bucket=warp-tests --num-shards=211
- Check in ceph.log which PG is complaining about the large omap objects and start a deep scrub on it (else the HEALTH_WARN won't go away):
  # zcat /var/log/ceph/ceph.log-20221204.gz | grep -i large
  2022-12-03T06:48:37.975544+0100 osd.179 (osd.179) 996 : cluster [WRN] Large omap object found. Object: 9:22f5fbf8:::.dir.a1035ed2-37be-4e7d-892d-46728bc3d046.285532.1.1:head PG: 9.1fdfaf44 (9.344) Key count: 204639 Size (bytes): 60621488
  2022-12-03T06:48:39.270652+0100 mon.cephdata22-12f31fcca0 (mon.0) 292373 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
  # ceph pg deep-scrub 9.344
  instructing pg 9.344 on osd.179 to deep-scrub
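The rule of thumb above (roughly 50k objects per shard, rounded up to a prime number of shards) can be sketched as a small helper; suggest_shards is an illustrative name:

```shell
# Suggest a shard count for a bucket: ceil(objects / 50000), bumped up to
# the next prime (trial division is fine at these sizes).
suggest_shards() {  # usage: suggest_shards <num_objects>
  n=$((($1 + 49999) / 50000))
  [ "$n" -ge 2 ] || n=2
  while :; do
    prime=1 i=2
    while [ $((i * i)) -le "$n" ]; do
      [ $((n % i)) -eq 0 ] && { prime=0; break; }
      i=$((i + 1))
    done
    [ "$prime" -eq 1 ] && { echo "$n"; return 0; }
    n=$((n + 1))
  done
}
```

For the warp-tests example above, suggest_shards 9993106 yields 211, consistent with the reshard command shown. Scale the input up first if you expect the bucket to keep growing.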