What to watch?

There are several channels to watch during your Rota shift:

  1. Emails to ceph-admins@cern.ch:

    • "Ceph Health Warn" mails.
    • SNOW tickets from IT Repair Service.
    • Prometheus Alerts.
  2. SNOW tickets assigned to Ceph Service:

    • Here is a link to the tickets needing to be taken: Ceph Assigned
  3. Ceph Internal Mattermost channel

  3. General information on clusters (configurations, OSD types, HW, versions): Instance Version Tracking ticket

Taking notes

Each action you take should be noted down in a journal, which is to be linked or attached to the minutes of the Ceph weekly meeting the following week (https://indico.cern.ch/category/9250/). Use HackMD, Notepad, ...

Keeping the Team Informed

If you have any questions or take any significant actions, keep your colleagues informed in Mattermost.

Common Procedures

exception.scsi_blockdevice_driver_error_reported

Draining a Failing OSD

The IT Repair Service may ask ceph-admins to prepare a disk to be physically removed. The scripts needed for the replacement procedure may be found under ceph-scripts/tools/ceph-disk-replacement/.

For failing OSDs in the wigner cluster, contact ceph-admins.

  1. watch ceph status <- keep this open in a separate window.

  2. Login to the machine with a failing drive and run ./drain-osd.sh --dev /dev/sdX (the ticket should tell which drive is failing)

    • For machines in /ceph/erin/osd/castor: You cannot run the script, ask ceph-admins.
    • If the output is of the following form, take note of the OSD id <id>:
    ceph osd out osd.<id>
    
    • Else
      • If the script shows no output: Ceph is unhealthy or OSD is unsafe to stop, contact ceph-admins
      • Else if the script shows a broken output (especially missing <id>): Contact ceph-admins
  3. Run ./drain-osd.sh --dev /dev/sdX | sh

  4. Once the OSD is drained (this can take a few hours; a completion check is sketched at the end of this section), we can prepare the disk for replacement

    • Run ./prepare-for-replacement.sh --dev /dev/sdX
    • Continue if the output is of the following form and the OSD id <id> displayed is consistent with the one given by the previous command:
    systemctl stop ceph-osd@<id>
    umount /var/lib/ceph/osd/ceph-<id>
    ceph-volume lvm zap /dev/sdX --destroy
    
    • (note that the --destroy flag will be dropped in case of a FileStore OSD)

    • Else

      • If the script shows no output: Ceph is unhealthy or OSD is unsafe to stop, contact ceph-admins
      • Else if the script shows a broken output (especially missing <id>): Contact ceph-admins
  5. Run ./prepare-for-replacement.sh --dev /dev/sdX | sh to execute.

  6. Now the disk is safe to be physically removed.

    • Notify the repair team in the ticket
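
To confirm that the drain has actually completed before the disk is pulled, you can ask Ceph whether the OSD could be destroyed without reducing data durability (a minimal sketch, assuming the OSD id <id> noted in step 2; safe-to-destroy only reports OK once all of its PGs have been backfilled elsewhere):

ceph osd safe-to-destroy osd.<id>
ceph osd df tree | grep "osd.<id>"    # the PGS column should have dropped to 0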

Creating a new OSD (on a replacement disk)

When the IT Repair Service has replaced the broken disk with a new one, we have to format that disk with BlueStore to add it back to the cluster:

  1. watch ceph status <- keep this open in a separate window.

  2. Identify the osd id to use on this OSD:

    • Check your notes from the drain procedure above.
    • Cross-check with ceph osd tree down <-- look for the down osd on this host, should match your notes.
  3. Run ./recreate-osd.sh --dev /dev/sdX and check that the output matches the relevant case below:

  • On beesly cluster:
ceph-volume lvm zap /dev/sdX
ceph osd destroy <id> --yes-i-really-mean-it
ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
  • On gabe cluster:
ceph-volume lvm zap /dev/sdX
ceph-volume lvm zap /dev/ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+
ceph osd destroy <id> --yes-i-really-mean-it
ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+ 
  • On erin cluster:

    • Regular case:
    ceph-volume lvm zap /dev/sdX
    ceph osd destroy <id> --yes-i-really-mean-it
    ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
    
    • ceph/erin/castor/osd
      • Script cannot be run, contact ceph-admins.
  4. If the output is satisfactory, run ./recreate-osd.sh --dev /dev/sdX | sh

See OSD Replacement for many more details.

CephInconsistentPGs

Familiarize yourself with the Upstream documentation

Check ceph.log on a ceph/*/mon machine to find the original "cluster [ERR]" line.
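
For example (a sketch; if the event is older, the line may already be in a rotated log):

grep 'cluster \[ERR\]' /var/log/ceph/ceph.log
zgrep 'cluster \[ERR\]' /var/log/ceph/ceph.log-*.gz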

The inconsistent PGs generally come in two types:

  1. deep-scrub: stat mismatch, where the solution is to repair the PG (see the example after this list)
    • Here is an example on ceph/flax:
2019-02-17 16:23:05.393557 osd.60 osd.60 128.142.161.220:6831/3872729 56 : cluster [ERR] 1.85 deep-scrub : stat mismatch, got 149749/149749 objects, 0/0 clones, 149749/149749 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 135303283738/135303284584 bytes, 0/0 hit_set_archive bytes.
2019-02-17 16:23:05.393566 osd.60 osd.60 128.142.161.220:6831/3872729 57 : cluster [ERR] 1.85 deep-scrub 1 errors
  2. candidate had a read error, where the solution follows below.
  • Notice that the doc says: "If read_error is listed in the errors attribute of a shard, the inconsistency is likely due to disk errors. You might want to check your disk used by that OSD." This is indeed the most common scenario.
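
For the stat mismatch case (type 1), issuing a repair on the reported PG is normally enough; the autorepair.sh script mentioned below does this for you. A minimal manual sketch, using the PG id from the flax example above (substitute the PG from your own [ERR] line):

ceph pg repair 1.85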

Handle a failing disk

In this case, a failing disk returns bogus data during deep scrubbing, and ceph will notice that the replicas are not all consistent with each other. The correct procedure is therefore to remove the failing disk from the cluster, let the PGs backfill, then finally to deep-scrub the inconsistent PG once again.

Here is an example on the ceph/erin cluster, where the monitoring has told us that PG 64.657c is inconsistent:

[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~] grep shard /var/log/ceph/ceph.log
2017-04-12 06:34:26.763000 osd.508 128.142.25.116:6924/4070422 4602 : cluster [ERR] 64.657c shard 187:
soid 64:3ea78883:::1568573986@castorns.27153415189.0000000000000034:head candidate had a read error

A shard in this case refers to the OSD holding the inconsistent object replica; in this case it is osd.187.

Where is osd.187?

[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~]# ceph osd find 187
{
   "osd": 187,
   "ip": "128.142.25.106:6820\/530456",
   "crush_location": {
       "host": "p05972678k94093",
       "rack": "EC06",
       "room": "0513-R-0050",
       "root": "default",
       "row": "EC"
   }
}

On the p05972678k94093 host we first need to find out which /dev/sd* device hosts osd.187.

On BlueStore OSDs we need to check with ceph-volume lvm list or lvs:

[14:38][root@p05972678e32155 (production:ceph/erin/osd*30) ~]# lvs -o +devices,tags | grep 187
  osd-block-... ceph-... -wi-ao---- <5.46t        /dev/sdm(0) ....,ceph.osd_id=187,....
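
Alternatively, ceph-volume can list the OSDs on the host together with their backing devices (a sketch; the exact output layout varies between releases):

ceph-volume lvm list    # look for the "osd.187" block; its "devices" line shows /dev/sdm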

So we know the failing drive is /dev/sdm; now we can check for disk medium errors:

[09:16][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# grep sdm /var/log/messages
[Wed Apr 12 12:27:59 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 04 00 00 00
[Wed Apr 12 12:27:59 2017] blk_update_request: critical medium error, dev sdm, sector 90638112
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Sense Key : Medium Error [current]
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Add. Sense: Unrecovered read error
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 00 08 00 00
[Wed Apr 12 12:28:02 2017] blk_update_request: critical medium error, dev sdm, sector 90638112

In this case, the disk is clearly failing.

Now check whether that OSD is safe to stop:

[14:41][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# ceph osd ok-to-stop osd.187
OSD(s) 187 are ok to stop without reducing availability, provided there are no other concurrent failures or interventions. 182 PGs are likely to be degraded (but remain available) as a result.

Since it is OK, we stop the OSD, unmount it, and mark it out.

[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# systemctl stop ceph-osd@187.service
[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# umount /var/lib/ceph/osd/ceph-187
[09:17][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# ceph osd out 187
marked out osd.187.

ceph status should now show the PG is in a state like this:

             1     active+undersized+degraded+remapped+inconsistent+backfilling

It can take a few tens of minutes to backfill the degraded PG.
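
To follow the progress without scanning the full status output, you can list just the PGs that are still backfilling or degraded (a convenience sketch):

ceph pg ls backfilling
ceph pg ls degraded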

Repairing a PG

Once the inconsistent PG is no longer "undersized" or "degraded", use the script at ceph-scripts/tools/scrubbing/autorepair.sh to repair the PG and start the scrubbing immediately.
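
If the script is not at hand, a repair can also be triggered manually (a minimal sketch, using the PG from the erin example above; note that the script may apply additional scrub tuning that this one-liner skips):

ceph pg repair 64.657c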

Now check ceph status... You should see the scrubbing+repair started already on the inconsistent PG.

Ceph PG Unfound

The PG unfound condition may be due to a race condition when PGs are scrubbed (see https://tracker.ceph.com/issues/51194), leading to PGs being reported as recovery_unfound.

Upstream documentation is available for general unfound objects

In case of unfound objects, Ceph reports a HEALTH_ERR condition:

# ceph -s
  cluster:
    id:     687634f1-03b7-415b-aff9-e21e6bedbe7c
    health: HEALTH_ERR
            1/282983194 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 3/848949582 objects degraded (0.000%), 1 pg degraded
 
  services:
    mon: 3 daemons, quorum cephdata20-4675e5a59e,cephdata20-44bdbfa86f,cephdata20-83e1d8a16e (age 4h)
    mgr: cephdata20-83e1d8a16e(active, since 11w), standbys: cephdata20-4675e5a59e, cephdata20-44bdbfa86f
    osd: 576 osds: 575 up (since 9d), 573 in (since 9d)
 
  data:
    pools:   3 pools, 17409 pgs
    objects: 282.98M objects, 1.1 PiB
    usage:   3.2 PiB used, 3.0 PiB / 6.2 PiB avail
    pgs:     3/848949582 objects degraded (0.000%)
             1/282983194 objects unfound (0.000%)
             17342 active+clean
             60    active+clean+scrubbing+deep
             6     active+clean+scrubbing
             1     active+recovery_unfound+degraded

List the PGs in recovery_unfound state

# ceph pg ls recovery_unfound
PG      OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG   STATE                             SINCE  VERSION         REPORTED         UP                 ACTING             SCRUB_STAMP                      DEEP_SCRUB_STAMP
1.2d09    17232         3          0        1  72106876434            0           0  3373  active+recovery_unfound+degraded    37m  399723'3926620  399723:23220581  [574,671,662]p574  [574,671,662]p574  2023-01-12T13:27:34.752832+0100  2023-01-12T13:27:34.752832+0100

Check the ceph log (cat /var/log/ceph/ceph.log | grep ERR) for IO errors on the primary OSD of the PG. In this case, the disk backing osd.574 is failing with pending sectors (check with smartctl -a <device>; a sketch follows the log excerpt below).

2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
2023-01-12T13:27:34.752327+0100 osd.574 (osd.574) 776 : cluster [ERR] 1.2d09 deep-scrub 0 missing, 1 inconsistent objects
2023-01-12T13:27:34.752830+0100 osd.574 (osd.574) 777 : cluster [ERR] 1.2d09 repair 1 errors, 1 fixed
2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
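
To confirm the failing disk behind the primary OSD, a quick SMART check on the OSD host helps (a sketch; <device> is the drive backing osd.574, found e.g. with lvs as shown in the previous section):

smartctl -a <device> | grep -iE 'pending|reallocated|defect|medium'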

Before taking any action, make sure that the versions of the object reported as unfound are more recent on the other two OSDs than the version on the lost one:

  • List unfound object
    # ceph pg 1.2d09 list_unfound
    {
        "num_missing": 1,
        "num_unfound": 1,
        "objects": [
            {
                "oid": {
                    "oid": "rbd_data.0bee1ae64c9012.00000000000032c4",
                    "key": "",
                    "snapid": -2,
                    "hash": 2152017161,
                    "max": 0,
                    "pool": 1,
                    "namespace": ""
                },
                "need": "399702'3923004",
                "have": "0'0",
                "flags": "none",
                "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
                "locations": []
            }
        ],
        "state": "NotRecovering",
        "available_might_have_unfound": true,
        "might_have_unfound": [],
        "more": false
    
  • The missing object is at version 399702
  • Last osd map before read error: e399704
    2023-01-12T13:07:24.463521+0100 mon.cephdata20-4675e5a59e (mon.0) 2714279 : cluster [DBG] osdmap e399704: 576 total, 575 up, 573 in
    2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
    
  • The object goes unfound at: e399710
    2023-01-12T13:27:30.297813+0100 mon.cephdata20-4675e5a59e (mon.0) 2714933 : cluster [DBG] osdmap e399710: 576 total, 575 up, 573 in
    2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
    
  • The two copies on osd.671 and osd.662 are more recent: their local copies are at version 399709, while the unfound object needs 399702:
    2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
    2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
    

If the copies are more recent than the lost one:

  • Set the primary OSD (osd.574) out
  • The recovery_unfound object disappears and backfilling starts
  • Once backfilled, deep-scrub the PG to check for inconsistencies (commands sketched below)
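
In commands, for the example above (a sketch, using osd.574 and PG 1.2d09 from this walkthrough):

ceph osd out osd.574
# wait for backfilling to complete, then:
ceph pg deep-scrub 1.2d09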

CephTargetDown

This is a special alert raised by Prometheus. It indicates that, for whatever reason, a target node is no longer exposing its metrics or the Prometheus server is not able to pull them. This does not imply that the node is offline, just that the node endpoint is down from Prometheus's point of view.

To handle these tickets, first identify the affected target. This information should be in the ticket body.

The following Alerts are in Firing Status:
------------------------------------------------
Target cephpolbo-mon-0.cern.ch:9100 is down
Target cephpolbo-mon-2.cern.ch:9100 is down

Alert Details:
------------------------------------------------
Alertname: TargetDown
Cluster: polbo
Job: node
Monitor: cern
Replica: A
Severity: warning

Then go to the targets section in the Prometheus dashboard and cross-check the affected node. There you can find more information about why it is down.

This can happen for the following reasons:

  • A node is offline or it's being restarted. Follow the normal procedures for understanding why the node is not online (ping, ssh, console access, SNOW ticket search...). Once the node is back, the target should be marked as UP again automatically.
  • If a new target was added recently, there may be a mistake in the target definition or a connectivity problem, such as the port being blocked.
    • Review the target configuration in it-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml and refer to the monitoring guide.
    • Make sure that the firewall configuration allows prometheus to scrape the data through the specified port.
  • In Ceph, the daemons that expose the metrics are the mgrs. It can happen that a mgr hangs and stops exposing metrics.
    • Check the mgr status and restart it if needed (a sketch follows this list). Don't forget to collect information about the state in which you found it for further analysis. If all went well, the target should be UP again in the Prometheus dashboard after about 30 seconds. To double-check, you can open the endpoint URL of the node and see whether the metrics are shown again.
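
A minimal sketch for the mgr case, assuming the usual ceph-mgr@<name> systemd unit on the mon/mgr host (adapt names and ports to the cluster; node metrics are on :9100 as in the example above, while the mgr prometheus module typically listens on :9283):

ceph mgr stat                                 # shows the active mgr name
systemctl status ceph-mgr@<name>              # on the affected host: collect state before restarting
systemctl restart ceph-mgr@<name>
curl -s http://<target-host>:<port>/metrics | head   # verify the endpoint answers again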

SSD Replacement

Draining OSDs attached to a failing SSD

In order to drain the osds attached to a failing SSD, run the following command:

$> cd /root/ceph-scripts/tools/ceph-disk-replacement
$> ./ssd-drain-osd.sh --dev /dev/<ssd>
ceph osd out osd.<osd0>;
ceph osd primary-affinity osd.<osd0> 0;
ceph osd out osd.<osd1>;
ceph osd primary-affinity osd.<osd1> 0;
...
ceph osd out osd.<osdN>;
ceph osd primary-affinity osd.<osdN> 0;

If the output is similar to the one above, it is safe to re-run the command appending | sh to actually take all the OSDs attached to the SSD out of the cluster.

Prepare for replacement

Once the draining has been started, the osds need to be zapped before the ssd can be removed and physically replaced:

$> ./ssd-prepare-for-replacement.sh --dev /dev/<dev> -f
systemctl stop ceph-osd@<osd0>
umount /var/lib/ceph/osd/ceph-<osd0>
ceph-volume lvm zap --destroy --osd-id <osd0>
systemctl stop ceph-osd@<osd1>
umount /var/lib/ceph/osd/ceph-<osd1>
ceph-volume lvm zap --destroy --osd-id <osd1>
...
systemctl stop ceph-osd@<osdN>
umount /var/lib/ceph/osd/ceph-<osdN>
ceph-volume lvm zap --destroy --osd-id <osdN>

Recreate the OSD

TBC

MDS Slow Ops

Check for long ongoing operations on the MDS reporting Slow Ops:

The mon shows SLOW_OPS warning:

ceph health detail

cat /var/log/ceph/ceph.log | grep SLOW
    cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)

The affected MDS shows slow requests in the logs:

cat /var/log/ceph/ceph-mds.cephcpu21-0c370531cf.log | grep -i SLOW
    2022-10-22T09:09:21.473+0200 7fe1b8054700  0 log_channel(cluster) log [WRN] : 30 slow requests, 1 included below; oldest blocked for > 2356.704295 secs
    2022-10-22T09:09:21.473+0200 7fe1b8054700  0 log_channel(cluster) log [WRN] : slow request 1924.631928 seconds old, received at 2022-10-22T08:37:16.841403+0200: client_request(client.366059605:743931 getattr AsXsFs #0x10251604c38 2022-10-22T08:37:16.841568+0200 caller_uid=1001710000, caller_gid=0{1001710000,}) currently dispatched

Dump the ongoing ops and check whether some have a very long age (minutes or hours):

ceph daemon mds.`hostname -s` ops | grep age | less

Identify the client with such long ops (age should be >900):

ceph daemon mds.`hostname -s` ops | egrep 'client|age' | less

    "description": "client_request(client.364075205:4876 getattr pAsLsXsFs #0x1023f14e5d8 2022-10-16T03:46:40.673900+0200 RETRY=184 caller_uid=0, caller_gid=0{})",
    "age": 0.87975248399999995,
        "reqid": "client.364075205:4876",
        "op_type": "client_request",
        "client_info": {
            "client": "client.364075205",

Get info on the client:

ceph daemon mds.`hostname -s` client ls id=<THE_ID>
  • IP address
  • Hostname
  • Ceph client version
  • Kernel version (in case of a kernel mount)
  • Mount point (on the client side)
  • Root (aka, the CephFS volume the client mounts)

Evict the client:

ceph tell mds.* client ls id=<THE_ID>
ceph tell mds.* client evict id=<THE_ID>

Large omap objects

On S3 clusters, you may see a HEALTH_WARN message reporting 1 large omap objects. This is very likely due to one or more bucket indexes being over full. Example:

"user_id": "warp-tests",
"buckets": [
    {
        "bucket": "warp-tests",
        "tenant": "",
        "num_objects": 9993106,
        "num_shards": 11,
        "objects_per_shard": 908464,
        "fill_status": "OVER"
    }
]

Proceed as follows:

  1. Check that an over-full bucket index is actually the problem:
    radosgw-admin bucket limit check
    
  2. If it is not possible to reshard the bucket, tune osd_deep_scrub_large_omap_object_key_threshold appropriately:
    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000
    
    The default is 200000; gabe runs with 500000. Read more at 42on.com
  3. If it is possible to reshard the bucket, manually reshard any bucket showing fill_status WARN or OVER:
    • keep the number of objects per shard around 50k
    • pick a prime number of shards
    • consider whether the bucket will be ever-growing or whether its owners delete objects. If ever-growing, you may reshard to a higher number of shards to avoid (or postpone) resharding again in the future.
    • for the example above: 9993106 objects / 50k per shard ≈ 200 shards, and 211 is the next prime, hence --num-shards=211
    radosgw-admin bucket reshard --bucket=warp-tests --num-shards=211
    
  4. Check in ceph.log which PG is complaining about the large omap objects and start a deep scrub on it (otherwise the HEALTH_WARN won't go away):
    # zcat  /var/log/ceph/ceph.log-20221204.gz | grep -i large
    2022-12-03T06:48:37.975544+0100 osd.179 (osd.179) 996 : cluster [WRN] Large omap object found. Object: 9:22f5fbf8:::.dir.a1035ed2-37be-4e7d-892d-46728bc3d046.285532.1.1:head PG: 9.1fdfaf44 (9.344) Key count: 204639 Size (bytes): 60621488
    2022-12-03T06:48:39.270652+0100 mon.cephdata22-12f31fcca0 (mon.0) 292373 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
    
    # ceph pg deep-scrub 9.344
    instructing pg 9.344 on osd.179 to deep-scrub
    
Improve me!