User's Guide

This chapter provides minimal Ceph user documentation.

Block Storage and CephFS documentation is available at the URLs given in the sections below.

Which Storage Service is Right for Me?

I need an extra drive for my OpenStack VM:

I need a POSIX filesystem shared across a small number of servers:

I need storage which is accessible from lxplus, lxbatch, or the WLCG:

I need to share my files or collaborate with colleagues:

I need HTTP accessible cloud storage for my application:

I need to distribute static software or data globally:

I need to archive data to tape:

Using Block Storage

Block storage is accessible via OpenStack VMs as documented here: https://clouddocs.web.cern.ch/clouddocs/details/volumes.html

Using CephFS

CephFS is made available via OpenStack Manila. See https://clouddocs.web.cern.ch/file_shares/index.html for more info.

Using S3 or SWIFT

S3/Swift is made available via OpenStack. See https://clouddocs.web.cern.ch/object_store/README.html for more info.

Configure aws cli

The aws s3api command is useful for advanced S3 operations, e.g. dealing with object versions. The following explains how to set it up with our s3.cern.ch endpoint.

Setting up aws

All of the information required to set up aws-cli can be found in the .s3cfg file you already use for S3.

$> yum install awscli 
$> aws configure
AWS Access Key ID [None]: <your access key> 
AWS Secret Access Key [None]: <your secret key>
Default region name [None]:
Default output format [None]:

Testing

$> aws --endpoint-url=http://s3.cern.ch s3api list-buckets
{
  "Buckets": [
     {
         "Name": <bucket1>,
         "CreationDate": <timestamp> 
     },
     {
       ....
     }
   ],
   "Owner": {
        "DisplayName": <owner>,
        "ID": <owner id>
    }

}

Delete all object versions

We provide a script to help users make sure all versions of their objects are deleted.

Usage:

$> ./s3-delete-all-object-versions.sh -b <bucket> [-f]
   -b: bucket name to be cleaned up
   -f: if omitted, the script will simply display a summary of actions. Add -f to execute them. 
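
For reference, the operations such a cleanup performs can also be run by hand with aws s3api (a minimal sketch, not the script itself; bucket, key and version id are placeholders):

$> aws --endpoint-url=http://s3.cern.ch s3api list-object-versions --bucket <bucket>
$> aws --endpoint-url=http://s3.cern.ch s3api delete-object --bucket <bucket> --key <key> --version-id <version-id>

A full cleanup loops the delete over every version (and delete marker) returned by the listing.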

Useful links

AWS reference manual

Operator's Guide

Create a Ceph Test Cluster

Create CEPH Cluster

Prepare the hostgroups

  • Log in to Foreman and create the following hostgroups:
    • <my_cluster>, selecting ceph as the parent hostgroup.
    • For the monitors, create the hostgroup mon and select ceph/<my_cluster> as the parent group.
    • For the OSDs, create the hostgroup osd and select ceph/<my_cluster> as the parent group.
    • For the metadata servers, create the hostgroup mds and select ceph/<my_cluster> as the parent.
    • Do the puppet configuration:
      • Clone the repo it-puppet-hostgroup-ceph
      • Create the manifests and data files for the new cluster (use the configuration of another cluster as a base)
      • Remember to create a new uuid for the cluster and put it in /code/hostgroup/ceph/<my_cluster>.yaml
      • Commit, push, submit a merge request, etc.

First Monitor Configuration

  • 1 ) Create one virtual machine for the first monitor following this guide

  • 2 ) Create a mon bootstrap key (from any previous ceph cluster):

      ssh root@ceph<existing-cluster>-mon-XXXX
      ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'
    
    • From aiadm: (maybe you need to ask for permissions to get access to the tbag folder)
      mkdir ~/private/tbag/<my_cluster>
      cd ~/private/tbag/<my_cluster>
      scp root@ceph<existing_cluster>-mon-XXXX:/tmp/keyring.mon .
      tbag set --hg ceph/<my_cluster>/mon keyring.mon --file keyring.mon
    
  • 3 ) Now run puppet on the first mon.

    puppet agent -t -v 
    
  • 4 ) Now copy the admin keyring to tbag (from aiadm):

      scp root@<first_mon>:/etc/ceph/keyring . 
      tbag set --hg ceph/<my_cluster> keyring --file keyring
    
  • 5 ) Now create an MGR bootstrap key on the first mon:

      ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
      ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
    
    • From aiadm:
      scp root@<first_mon>:/tmp/keyring.bootstrap-mgr .
      tbag set --hg ceph/<my_cluster> keyring.bootstrap-mgr --file keyring.bootstrap-mgr
    
  • 6 ) Now create an OSD bootstrap key on the first mon:

       ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
       ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
    
    • From aiadm:
      scp root@<first_mon>:/tmp/keyring.bootstrap-osd .
      tbag set --hg ceph/<my_cluster> keyring.bootstrap-osd --file keyring.bootstrap-osd
    

Add more Monitors and OSDs

  • Follow step 1) to add more mons and OSDs. Everything should install correctly.
  • Prepare and activate the OSD

    /root/ceph-scripts/ceph-disk/ceph-disk-prepare-all
    

NOTE: To set up an OSD on the same machine as the monitor:

  • mkdir /data/a (for example)
  • chown ceph:ceph -R /data
  • ceph-disk prepare --filestore /data/a (ignore the deprecation warnings)
  • ceph-disk activate /data/a

Creating a CEPH cluster

Follow the instructions below to create a new Ceph cluster at CERN.

Prerequisites

  • Access to aiadm.cern.ch
  • Proper GIT configuration
  • Member of ceph administration e-groups
  • OpenStack environment configured, link

Introduction - Hostgroups

First, we have to create the hostgroups in which we want to build our cluster.

The hostgroups provide a layer of abstraction for automatically configuring a
cluster with Puppet. The top-level group, called ceph, ensures that each
machine in this hostgroup has ceph installed, configured and running. The first
sub-hostgroup ensures that each machine communicates with the machines in the
same sub-hostgroup, forming a cluster; these machines will have specific
configuration defined later in this guide. The second sub-hostgroup ensures that
each machine acts according to its corresponding role in the cluster.

For example, we first create our cluster's hostgroup, using the name provided in your task.

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}

Each cluster has its own features, but the two basic sub-hostgroups for any ceph
cluster are mon and osd.

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mon
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/osd

These sub-hostgroups will contain the monitors and the osd hosts.

If the cluster has to use CephFS and/or Rados gateway we need to create the
appropriate sub-hostgroups.

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mds      #for CephFS
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/radosgw  #for the rados gateway

Creating a configuration for your new cluster

Go to gitlab.cern.ch and search for it-puppet-hostgroup-ceph. This repository
contains the configuration for all the machines under the ceph hostgroup. Clone
the repository, create a new branch based on qa, and go to it-puppet-hostgroup-ceph/code/manifests.
From there, you will create the {hg_name}.pp file and the {hg_name} folder.

The {hg_name}.pp should contain the following code: (replace {hg_name} with the cluster's name)

class hg_ceph::{hg_name} {
  include hg_ceph::include::base
}

This will load the basic configuration for ceph on each machine. The {hg_name} folder should contain the *.pp files for the appropriate 2nd sub-hostgroups.

The files under your cluster's folder will have the following basic format:

File {role}.pp:

class hg_ceph::{hg_name}::{role} {
  include hg_ceph::classes::{role}
}

The include will use a configuration template located in it-puppet-hostgroup-ceph/code/manifests/classes

The roles are: mon, mgr, osd, mds and radosgw. It is good to run mon and mgr together, so we usually create a class like the following:

class hg_ceph::{hg_name}::mon {
  include hg_ceph::classes::mon
  include hg_ceph::classes::mgr
}

This code will configure machines in "ceph/{hg_name}/mon" to act as
monitors and mgrs together. After you are done creating the files needed
for your task, your "code/manifests" path should look like this:

# Using kermit as {hg_name}

kermit.pp
kermit/mon.pp
kermit/osd.pp
# Optional, only if requested by the JIRA ticket
kermit/mds.pp
kermit/radosgw.pp

Create a YAML configuration file for the new hostgroup in it-puppet-hostgroup-ceph/data/hostgroup/ceph with name {hg_name}.yaml. This file contains all the basic configuration parameters that are common to all the nodes in the cluster.

ceph::conf::fsid: d3c77094-4d74-4acc-a2bb-1db1e42bb576

ceph::params::release: octopus

lbalias: ceph{hg_name}.cern.ch
hg_ceph::classes::mon::enable_lbalias: false

hg_ceph::classes::mon::enable_health_cron: true
hg_ceph::classes::mon::enable_sls_cron: true

Where:

  • ceph::conf::fsid can be generated with a UUID tool (see below);
  • lbalias is the alias the mons are part of.
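
For example, a fresh fsid can be generated with uuidgen (a sketch; any UUID generator will do):

[user@aiadm]$ uuidgen    # paste the output as ceph::conf::fsid in {hg_name}.yaml
d3c77094-4d74-4acc-a2bb-1db1e42bb576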

Add the new files with git, commit, and push your branch. BEFORE you push, do a git pull --rebase origin qa to avoid any conflicts with your request (a sketch of the sequence follows below). The push output will provide a link to submit a merge request.
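
A minimal sketch of that git sequence (file paths follow the examples above; the branch name is a placeholder):

[user@aiadm]$ git add code/manifests/{hg_name}.pp code/manifests/{hg_name}/ data/hostgroup/ceph/{hg_name}.yaml
[user@aiadm]$ git commit -m "Add {hg_name} cluster configuration"
[user@aiadm]$ git pull --rebase origin qa
[user@aiadm]$ git push origin <your_branch>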

@dvanders is currently the administrator of the repo, so you should assign him the task of reviewing your request and, if all is fine, merging it.

Creating your first monitor node

Follow the instructions to create exactly one monitor here. DO NOT ADD more than one machine to the ceph/{hg_name}/mon hostgroup, otherwise your first monitor will always deadlock and you will need to remove the others and rebuild the first one again.

With TBag authentication

Once we are able to log in to the node, we will need to create the keys to be
able to bootstrap new nodes into the cluster. We first have to create the
initial key, so mons can be created in our new cluster.

[root@ceph{hg_name}-mon-...]$ ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'

Login to aiadm, copy the key from the monitor host and store it on tbag.

[user@aiadm]$ mkdir -p ~/private/tbag/{hg_name}
[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.mon .
[user@aiadm]$ tbag set --hg ceph/{hg_name}/mon keyring.mon --file keyring.mon

Log in to your mon host and run Puppet (puppet agent -t); repeat until you see a running ceph-mon process.

Run the following to disable some warnings and enable some features for ceph:

[root@ceph{hg_name}-mon-...]$ ceph mon enable-msgr2
[root@ceph{hg_name}-mon-...]$ ceph osd set-require-min-compat-client luminous
[root@ceph{hg_name}-mon-...]$ ceph config set mon auth_allow_insecure_global_id_reclaim false

Note that enable-msgr2 will need to be run again after all mons have been created.

We will need to repeat this procedure for the mgr, osd, mds, rgw and rbd-mirror depending on what we need:

[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
# Optional, only if the cluster uses CephFS
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mds mon 'allow profile bootstrap-mds'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mds > /tmp/keyring.bootstrap-mds
# Optional, only if the cluster uses a Rados Gateway
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' -o /tmp/keyring.bootstrap-rgw
# Optional, only if the cluster uses a rbd-mirror
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-rbd-mirror -o /tmp/keyring.bootstrap-rbd-mirror

Login to aiadm, copy the keys from the monitor host and use them with tbag.

Make sure you don't have any excess keys in the /tmp folder (5 max: mon/mgr/osd/mds/rgw).
We don't need to provide the specific subgroup for each key (that would cause confusion); "ceph/{hg_name}" is enough.

[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.* .
[user@aiadm]$ scp {mon_host}:/etc/ceph/keyring .
[user@aiadm]$ for file in *; do tbag set --hg ceph/{hg_name} $file --file $file; done
# Make sure to copy all the generated keys on `/mnt/projectspace/tbag` of `cephadm.cern.ch` as well:
[user@aiadm]$ scp -r . root@cephadm:/mnt/projectspace/tbag/{hg_name}

Now create the other monitors with ai-bs, using the same procedure as for the first one. They will be configured automatically.

Creating manager hosts

The procedure is very similar to the one for the creation of mons:

  • Create new VMs;
  • Add them to the ceph/{hg_name}/mgr hostgroup;
  • Set the right roger state for the new VMs;

Instructions for the creation of mons still hold here, with the necessary changes for mgrs.

As stated above, in some cases it is necessary to colocate mons and mgrs. If so, there is no need to create new machines for mgrs; simply include the mgr class in the mon manifest:

class hg_ceph::{hg_name}::mon {

  include hg_ceph::classes::mon
  include hg_ceph::classes::mgr

}

Creating osd hosts

The OSD hosts will usually be given to you to be prepared by formatting the disks
and adding them to the cluster. The tool used to format the disks is ceph-volume,
and provisioning happens with lvm. Make sure your disks are empty: run pvs and
vgs to check whether they contain any lvm data.

System disks can safely be ignored even if they use lvm. On every host, run
ceph-volume lvm zap {disk} --destroy to zap the data disks and remove any lvm
data (see the sketch below). If your hosts contain only one type of OSD disk
(HDD or SSD), run the following command to provision the OSDs:
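
A minimal sketch of the check-and-zap step described above (device names are examples; double-check you are not touching the system disks):

[root@cephdataYY-...]$ pvs && vgs                               # look for leftover LVM data on the data disks
[root@cephdataYY-...]$ ceph-volume lvm zap /dev/sdc --destroy   # repeat for every data disk that will host an OSD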

# The device list supports shell globbing: to create OSDs from /dev/sdc to /dev/sdz we can try this
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sd[c-z]

You will be prompted to check the OSD creation plan; if you agree with the
proposed changes, input yes to create the OSDs. If you want to automate this
task, pass the --yes parameter to the ceph-volume lvm batch command. If you
have SSDs backing the HDDs to create hybrid OSDs (SSD block.db and HDD
block.data), you will have to run the above command once per SSD:

# 2 SSDs (sda, sdb), 4 HDDs (sdc, sdd, sde, sdf)
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sda /dev/sdc /dev/sdd
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sdb /dev/sde /dev/sdf

The problem with the current lvm batch implementation is that it creates a single
volume group for the block.db part. Therefore, when an SSD fails, the whole set
of OSDs on the host becomes corrupted. To minimize the impact, we run batch once per SSD.

Run ceph osd tree to check whether the OSDs are placed correctly in the tree.
If the OSDs are not placed as described by grep ^crush /etc/ceph/ceph.conf, you
will need to remove the line containing something like update crush on start
and restart the OSDs of that host. You can also create/move/delete buckets, for example:

  • ceph osd crush add-bucket CK13 rack
  • ceph osd crush move CK13 room=0513-R-0050
  • ceph osd crush move 0513-R-0050 root=default
  • ceph osd crush move cephflash21a-ff5578c275 rack=CK13

Now you are one step away from having a functional cluster.
The next step is to create a pool so that the cluster's storage can be used.

Creating the first pool

A pool in ceph is the root namespace of an object store system. A pool has its
own data redundancy schema and access permissions. If cephfs is used, two pools
are created, one for data and one for metadata; to support openstack, various
pools are created for storing images, volumes and shares. To create a pool we
first have to decide what type of data redundancy to use: replicated or EC.
If the task already defines what should happen, you can go to the ceph documentation:

BEFORE you create a pool you first need to create a CRUSH rule that matches
your cluster's schema:

You can get the schema by running ceph osd tree | less.

As an example, the meredith cluster runs with 4+2 EC and the failure domain is rack. Create the required erasure-code-profile with:

[root@cephmeredithmon...]$ ceph osd erasure-code-profile ls
default

[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8

[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2 k=4 m=2 crush-failure-domain=rack --force
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=rack
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

NEVER modify an existing profile. That would change the data placement on disk!
Here we use the --force flag only because the new jera_4plus2 is not used yet.

Now create a CRUSH rule with the defined profile:

[root@cephmeredithmon...]$ ceph osd crush rule create-erasure rack_ec jera_4plus2
created rule rack_ec at 1

[root@cephmeredithmon...]$ ceph osd crush rule ls
replicated_rule
rack_ec

[root@cephmeredithmon...]$ ceph osd crush rule dump rack_ec
{
    "rule_id": 1,
    "rule_name": "rack_ec",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "rack"
        },
        {
            "op": "emit"
        }
    ]
}

The last thing left is to calculate the number of PGs to keep the cluster running
optimally. The Ceph developers recommend 30 to 100 PGs per OSD; keep in mind that
the data redundancy schema counts as a multiplier. For example, with 100 OSDs you
need 3K to 10K PG replicas in total, so with 3x replication you would use 1024 to
2048 PGs in the pool creation command. The number of PGs must be a power of two.
Keep in mind that there may be a need for additional pools, such as "test", which
is created on every cluster for the simple reason of testing.

In general the formula is the following:

MaxPGs = NumOSDs * 100 / ReplicationSize    (if replicated)
MaxPGs = NumOSDs * 100 / (k + m)            (if erasure coded)

Then we use the closest power of two that is less than the above number.
Example on meredith (368 OSDs, EC -- k=4, m=2): MaxPGs=6133 --> MaxPGs=4096
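
A quick sanity check of that formula from the shell (a sketch with the meredith numbers; adjust the OSD count and k/m for your cluster):

[root@cephmeredithmon...]$ python3 -c 'osds, k, m = 368, 4, 2; maxpgs = osds * 100 // (k + m); print(maxpgs, 1 << (maxpgs.bit_length() - 1))'
6133 4096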

Now, let's create the pools following the upstream documentation Create a pool.

We should have at least one test pool and one data pool:

  • Create the test pool. It should always be replicated and not EC:

    [root@cephmeredithmon...]$ ceph osd pool create test 512 512 replicated replicated_rule
    pool 'test' created
    
    [root@cephmeredithmon...]$ ceph osd pool ls detail
    pool 6 'test' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1710 flags hashpspool stripe_width 0 application test
    
  • Create the data pool (named 'rbd_ec_data' here) with EC:

    [root@cephmeredithmon...]$ ceph osd pool create rbd_ec_data 4096 4096 erasure jera_4plus2 rbd_ec_data
    pool 'rbd_ec_data' created
    [root@cephmeredithmon...]$ ceph osd pool ls detail | grep rbd_ec_data
    pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 1554 flags hashpspool stripe_width 16384
    

Finalize cluster configuration

Security Flags on Pools

  1. Make sure the security flags {nodelete, nopgchange, nosizechange} are set for all the pools
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1711 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd
...

If not, set the flags with

[root@cluster_mon]$ ceph osd pool set <pool_name> {nodelete, nopgchange, nosizechange} 1
  2. pg_autoscale_mode should be set to off
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1985 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd

If the output shows anything for autoscale_mode, disable autoscaling with

[root@cluster_mon]$ ceph osd pool set <pool_name> pg_autoscale_mode off
  3. Set the application type for each pool in the cluster
[root@cluster_mon]$ ceph osd pool application enable my_test_pool test
[root@cluster_mon]$ ceph osd pool application enable my_rbd_pool rbd
  4. If relevant, enable the balancer
[root@cluster_mon]$ ceph balancer on
[root@cluster_mon]$ ceph balancer mode upmap
[root@cluster_mon]$ ceph config set mgr mgr/balancer/upmap_max_deviation 1

The parameter upmap_max_deviation is used to spread the PGs more evenly across the OSDs.
Check with

[root@cluster_mon]$ ceph balancer status
{
    "plans": [],
    "active": true,
    "last_optimize_started": "Tue Jan 12 16:47:48 2021",
    "last_optimize_duration": "0:00:00.296960",
    "optimize_result": "Optimization plan created successfully",
    "mode": "upmap"
}

[root@cluster_mon]$ ceph config dump
WHO   MASK LEVEL    OPTION                           VALUE RO 
  mgr      advanced mgr/balancer/active              true     
  mgr      advanced mgr/balancer/mode                upmap    
  mgr      advanced mgr/balancer/upmap_max_deviation 1        

Also, after quite some time spent balancing, the number of PGs per OSD should be evenly distributed.
Focus on the PGS column of the output of ceph osd df tree:

[root@cluster_mon]$ ceph osd df tree

ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE VAR  PGS STATUS TYPE NAME                                
 -1       642.74780        - 643 TiB 414 GiB  46 GiB 505 KiB  368 GiB 642 TiB 0.06 1.00   -        root default                             
 -5       642.74780        - 643 TiB 414 GiB  46 GiB 505 KiB  368 GiB 642 TiB 0.06 1.00   -            room 0513-R-0050                     
 -4        27.94556        -  28 TiB  18 GiB 2.0 GiB     0 B   16 GiB  28 TiB 0.06 1.00   -                rack CK01                        
 -3        27.94556        -  28 TiB  18 GiB 2.0 GiB     0 B   16 GiB  28 TiB 0.06 1.00   -                    host cephflash21a-04f5dd1763 
  0   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  75     up                 osd.0                    
  1   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  69     up                 osd.1                    
  2   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  72     up                 osd.2                    
  3   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  70     up                 osd.3       

Monitoring

Cluster monitoring is offered by:

  • Health crons enabled at the hostgroup level (see the YAML file above):
    • enable_health_cron enables sending the email report that checks the current health status and greps in recent ceph.log
    • enable_sls_cron enables sending metrics to filer-carbon that populate the Ceph Health dashboard
  • Regular polling performed by cephadm.cern.ch
  • Prometheus
  • Watcher clients (CephFS) that mount and test FS availability

To enable polling from cephadm, proceed as follows:

  1. Add the new cluster to it-puppet-hostgroup-ceph/code/manifest/admin.pp. Consider Admin newclusters as a reference merge request. (Note: if you are adding a CephFS cluster, you do not need to add it to the ### BASIC CEPH CLIENTS array.)
  2. Create a client.admin key on the cluster
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.admin mon 'allow *' mgr 'allow *' osd 'allow *' mds 'allow *'
[client.admin]
        key = <the_super_secret_key>
  3. Add the key to tbag in the ceph/admin hostgroup (the secret must contain the full output of the command above)
tbag set --hg ceph/admin <cluster_name>.keyring --file <keyring_filename>
tbag set --hg ceph/admin <cluster_name>.admin.secret
Enter Secret: <paste secret here>
  4. Add the new cluster to it-puppet-module-ceph/data/ceph.yaml, otherwise the clients (cephadm included) will lack the mon hostnames. (Consider Add ryan cluster as a reference merge request.) Double-check you are using the appropriate port.
  5. ssh to cephadm and run puppet a couple of times
  6. Make sure the files <cluster_name>.client.admin.keyring and <cluster_name>.conf exist and show the appropriate content
  7. Check the health of the cluster with
[root@cephadm]# ceph --cluster=<cluster_name> health
HEALTH_OK
  8. Cephadm is also responsible for producing the availability numbers sent to the central IT Service Availability Overview. If the cluster needs to be reported in the IT SAO, add it to ceph-availability-producer.py with a relevant description.

To enable monitoring from Prometheus, add the new cluster to prometheus.yaml. Also, the Prometheus module must be enabled on the MGR (Documentation: https://docs.ceph.com/en/octopus/mgr/prometheus/) for metrics to be retrieved:

ceph mgr module enable prometheus

To ensure a CephFS cluster is represented adequately, there are some unique steps we must take:

  1. Update the it-puppet-module-cephfs README.md and code/data/common.yaml to include the new cluster (Consider add doyle cluster as a reference merge request.)
  2. Update the it-puppet-hostgroup-ceph watchers definition in code/manifests/test/cephfs/watchers.pp to ensure the new cluster is mounted by the watchers. (Consider watchers.pp: add doyle definition as an example merge request.)
  3. SSH to one of the watcher nodes (e.g. cephfs-testc9-d81171f572.cern.ch) and run puppet a few times to synchronise the changes.
  4. Checking cat /proc/mounts | grep ceph for an appropriate systemd mount and navigating to one of the mounted directories lets you examine whether the FS is available.

Details on lbalias for mons

We prefer not to use the load-balancing service and lbclient here (https://configdocs.web.cern.ch/dnslb/): there is no scenario in ceph where we want a mon to disappear from the alias.

We rather use the --LOAD-N- approach to create an alias containing all the mons:

  • Go to network.cern.ch
  • Click on Update information and use the FQDN of the mon machine
    • If prompted, make sure you pick the host interface and not the IPMI one
  • Add "ceph{hg_name}--LOAD-N-" to the list IP Aliases under TCP/IP Interface Information
  • Multiple aliases are supported. Use a comma-separated list
  • Check the changes are correct and submit the request

Benchmarking

Note: What follows is not proper benchmarking, but some quick checks that the cluster works as expected.

Good reading at Benchmarking performance

Rados bench

Start a test on pool 'my_test_pool' with a 10-second duration and a block size of 4096 B:

[root@cluster_mon]$ rados bench -p my_test_pool 10 write -b 4096

hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cephflash21a-a6564a2ee7.cern._1768589
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      8752      8736   34.1231    34.125  0.00130825  0.00182201
    2      16     16913     16897   32.9995   31.8789  0.00104112  0.00189076
    3      15     24678     24663   32.1108   30.3359  0.00139087  0.00194522
    4      16     32189     32173   31.4167   29.3359   0.0209055   0.0019863
    5      16     39595     39579   30.9187   28.9297   0.0209981  0.00201906
    6      16     47263     47247   30.7573   29.9531  0.00138272  0.00203065
    7      16     55169     55153   30.7748   30.8828  0.00121337  0.00202973
    8      16     63070     63054   30.7855   30.8633  0.00133439  0.00202877
    9      15     70408     70393     30.55    28.668  0.00144124  0.00204461
   10      11     78679     78668   30.7271   32.3242  0.00162555  0.00203309
Total time run:         10.0178
Total writes made:      78679
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     30.6793
Stddev Bandwidth:       1.68734
Max bandwidth (MB/sec): 34.125
Min bandwidth (MB/sec): 28.668
Average IOPS:           7853
Stddev IOPS:            431.959
Max IOPS:               8736
Min IOPS:               7339
Average Latency(s):     0.00203504
Stddev Latency(s):      0.00370041
Max latency(s):         0.0702117
Min latency(s):         0.000887922
Cleaning up (deleting benchmark objects)
Removed 78679 objects
Clean up completed and total clean up time :4.93871

RBD bench

Create an RBD image and run some tests on it:

[root@cluster_mon]$ rbd create rbd_ec_meta/enricotest --size 100G --data-pool rbd_ec_data
[root@cluster_mon]$ rbd bench --io-type write rbd_ec_meta/enricotest --io-size 4M --io-total 100G

Once done, delete the image with

[root@cluster_mon]$ rbd ls -p rbd_ec_meta
[root@cluster_mon]$ rbd rm rbd_ec_meta/enricotest

RBD clusters

Create Cinder key for use with OpenStack

All of the above steps lead to a fully functional RADOS Block Device cluster. The only missing step is to create access keys for OpenStack Cinder so that it can use the provided storage.

The upstream documentation on user management (and OpenStack is a user) is available at User Management

To create the relevant access key for OpenStack use the following command:

$ ceph auth get-or-create client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes' mgr 'profile rbd pool=volumes'

which results in creating a user named "cinder" to run rbd commands on the pool named "volumes".

Create an Images pool for use with OpenStack Glance

To store Glance images on ceph, a dedicated pool (pg_num may vary) and cephx keys are needed:

$ ceph osd pool create images 128 128 replicated replicated_rule
$ ceph auth get-or-create client.images mon 'profile rbd' mgr 'profile rbd pool=images' osd 'profile rbd pool=images'

CephFS Clusters

Enabling CephFS consists of creating data and metadata pools for CephFS and a new filesystem. It is also necessary to create metadata servers (either dedicated or colocated with other daemons), otherwise the cluster will show HEALTH_ERR and 1 filesystem offline. See below for the creation of metadata servers.

Follow the upstream documentation at Create a Ceph File System
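
As a rough sketch of what that boils down to (pool names, pg_num values and the crush rule are examples; pick them according to the previous sections):

$ ceph osd pool create cephfs_metadata 128 128 replicated replicated_rule
$ ceph osd pool create cephfs_data 1024 1024 replicated replicated_rule
$ ceph fs new cephfs cephfs_metadata cephfs_data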

Creating metadata servers

Add at least two hosts to ceph/{hg_name}/mds. MDS daemons can be dedicated (preferable for large, busy clusters) or colocated with other daemons (e.g., on the osd hosts, assuming enough memory is available).

As soon as one MDS goes active, the cluster health will go back to HEALTH_OK. It is recommended to have at least 2 nodes running MDSes for failover. One can also consider having a standby-replay MDS to lower the time needed for a failover.

Create Manila key for use with OpenStack

To provision CephFS File Shares via OpenStack Manila, a dedicated cephx key must be provided to the OpenStack team. Create the key with:

$ ceph auth get-or-create client.manila mon 'allow r' mgr 'allow rw'

S3 Clusters

Creating rgw hosts

To provide object storage, Ceph Object Gateway daemons (radosgw) need to be run.

RGWs can run on dedicated machines (by creating new hosts in hostgroup ceph/{hg_name}/rgw) or colocated with existing machines. In both cases, these classes need to be enabled:

Also, you may want to enable:

  • The S3 crons for specific quota and health checks (see include/s3{hourly,daily,weekly}.pp)
  • Traefik log ingestion into the MONIT pipelines for Elasticsearch dashboards (see s3-logging).

Always start with one RGW only and iterate over the configuration until it runs.

Some of the required data pools (default.rgw.control, default.rgw.meta, default.rgw.log, .rgw.root) are automatically created by the RGW at its first run. The creation of some other pools is triggered by specific actions, e.g., making a bucket will create pool default.rgw.buckets.index, pushing the first object will trigger creation of default.rgw.buckets.data.

It is highly recommended to pre-create all pools so that they have the right crush rule, pg_num, etc. before data is written to them. If they get auto-created, they will use the default crush type (replicated), while we typically use erasure coding for object storage. Use an existing cluster as a reference to configure the pools, for example:
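
A sketch of pre-creating the bucket index and data pools (pg_num values and the EC profile name are examples; see the pool creation section above):

$ ceph osd pool create default.rgw.buckets.index 128 128 replicated replicated_rule
$ ceph osd pool create default.rgw.buckets.data 2048 2048 erasure jera_4plus2
$ ceph osd pool application enable default.rgw.buckets.index rgw
$ ceph osd pool application enable default.rgw.buckets.data rgw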

Creating a DNS load-balanced alias

The round-robin based DNS load balancing service is described at DNS Load Balancing.

To create a new load-balanced alias for S3:

  1. Go to https://aiermis.cern.ch/
  2. Add LB Alias by specifying if it needs to be external and the number of hosts to return (Best Hosts)
  3. Configure hg_ceph::classes::lb::lbalias and the relevant RadosGW configuration params accordingly (rgw dns name, rgw dns s3website name, rgw swift url, ...)
  4. To support virtual-host-style bucket addressing (e.g., mybucket.s3.cern.ch), talk to the Network Team to have wildcard DNS enabled on the alias

Integration with OpenStack Keystone


RBD Mirroring

Make sure you have included hg_ceph::classes::rbd_mirror and set up the bootstrap-rbd-mirror keyring.

Adding peers to rbd-mirror

You first have to add a rbd-mirror-peer keyring in the hostgroup ceph.

First, log in to your mon and run the following command:

[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.rbd-mirror-peer mon 'profile rbd-mirror-peer' osd 'profile rbd' -o {hg_name}.client.rbd-mirror-peer.keyring

Copy the keyring to aiadm and create the secret:

[user@aiadm]$ tbag set --hg ceph {hg_name}.client.rbd-mirror-peer.keyring --file {hg_name}.client.rbd-mirror-peer.keyring

Now your cluster can participate with the others already registered to mirror your RBD images! You can now add the following data to register peers for your rbd-mirror daemons:

ceph::rbd_mirror:
  - peer1
  - peer2
  - ...

Peering pools

You first have to enable mirroring on some of your pools: https://docs.ceph.com/en/octopus/rbd/rbd-mirroring/#enable-mirroring. Also check the configuration of the mirroring modes on the same page (journaling feature enabled on the RBD images, image snapshot settings, ...).

And then you can add peers like this:

[root@ceph{hg_name}-rbd-mirror-...]$ rbd mirror pool peer add {pool} client.rbd-mirror-peer@{remote_peer}

What to watch?

There are several channels to watch during your Rota shift:

  1. Emails to ceph-admins@cern.ch:

    • "Ceph Health Warn" mails.
    • SNOW tickets from IT Repair Service.
    • Prometheus Alerts.
  2. SNOW tickets assigned to Ceph Service:

    • Here is a link to the tickets needing to be taken: Ceph Assigned
  3. Ceph Internal Mattermost channel

  4. General information on clusters (configurations, OSD types, HW, versions): Instance Version Tracking ticket

Taking notes

Each action you take should be noted down in a journal, which is to be linked or attached to the minutes of the Ceph weekly meeting the following week (https://indico.cern.ch/category/9250/). Use HackMD, Notepad, ...

Keeping the Team Informed

If you have any questions or take any significant actions, keep your colleagues informed in Mattermost.

Common Procedures

exception.scsi_blockdevice_driver_error_reported

Draining a Failing OSD

The IT Repair Service may ask ceph-admins to prepare a disk to be physically removed. The scripts needed for the replacement procedure may be found under ceph-scripts/tools/ceph-disk-replacement/.

For failing OSDs in the wigner cluster, contact ceph-admins.

  1. watch ceph status <- keep this open in a separate window.

  2. Log in to the machine with the failing drive and run ./drain-osd.sh --dev /dev/sdX (the ticket should tell which drive is failing)

    • For machines in /ceph/erin/osd/castor: You cannot run the script, ask ceph-admins.
    • If the output is of the following form, take note of the OSD id <id>:
    ceph osd out osd.<id>
    
    • Else
      • If the script shows no output: Ceph is unhealthy or OSD is unsafe to stop, contact ceph-admins
      • Else if the script shows a broken output (especially missing <id>): Contact ceph-admins
  3. Run ./drain-osd.sh --dev /dev/sdX | sh

  4. Once drained (can take a few hours), we now want to prepare the disk for replacement

    • Run ./prepare-for-replacement.sh --dev /dev/sdX
    • Continue if the output is of the following form and the OSD id <id> displayed is consistent with what was given by the previous command:
    systemctl stop ceph-osd@<id>
    umount /var/lib/ceph/osd/ceph-<id>
    ceph-volume lvm zap /dev/sdX --destroy
    
    • (note that the --destroy flag will be dropped in case of a FileStore OSD)

    • Else

      • If the script shows no output: Ceph is unhealthy or OSD is unsafe to stop, contact ceph-admins
      • Else if the script shows a broken output (especially missing <id>): Contact ceph-admins
  5. Run ./prepare-for-replacement.sh --dev /dev/sdX | sh to execute.

  6. Now the disk is safe to be physically removed.

    • Notify the repair team in the ticket

Creating a new OSD (on a replacement disk)

When the IT Repair Service has replaced the broken disk with a new one, we have to format that disk with BlueStore to add it back to the cluster:

  1. watch ceph status <- keep this open in a separate window.

  2. Identify the osd id to use on this OSD:

    • Check your notes from the drain procedure above.
    • Cross-check with ceph osd tree down <-- look for the down osd on this host, should match your notes.
  3. Run ./recreate-osd.sh --dev /dev/sdX and check that the output is according to the following:

  • On beesly cluster:
ceph-volume lvm zap /dev/sdX
ceph osd destroy <id> --yes-i-really-mean-it
ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
  • On gabe cluster:
ceph-volume lvm zap /dev/sdX
ceph-volume lvm zap /dev/ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+
ceph osd destroy <id> --yes-i-really-mean-it
ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+ 
  • On erin cluster:

    • Regular case:
    ceph-volume lvm zap /dev/sdX
    ceph osd destroy <id> --yes-i-really-mean-it
    ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
    
    • ceph/erin/castor/osd
      • Script cannot be run, contact ceph-admins.
  4. If the output is satisfactory, run ./recreate-osd.sh --dev /dev/sdX | sh

See OSD Replacement for many more details.

CephInconsistentPGs

Familiarize yourself with the Upstream documentation

Check ceph.log on a ceph/*/mon machine to find the original "cluster [ERR]" line.

The inconsistent PGs generally come in two types:

  1. deep-scrub: stat mismatch, solution is to repair the PG
    • Here is an example on ceph/flax:
2019-02-17 16:23:05.393557 osd.60 osd.60 128.142.161.220:6831/3872729 56 : cluster [ERR] 1.85 deep-scrub : stat mismatch, got 149749/149749 objects, 0/0 clones, 149749/149749 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 135303283738/135303284584 bytes, 0/0 hit_set_archive bytes.
2019-02-17 16:23:05.393566 osd.60 osd.60 128.142.161.220:6831/3872729 57 : cluster [ERR] 1.85 deep-scrub 1 errors
  2. candidate had a read error, solution follows below.
  • Notice that the doc says If read_error is listed in the errors attribute of a shard, the inconsistency is likely due to disk errors. You might want to check your disk used by that OSD. This is indeed the most common scenario.

Handle a failing disk

In this case, a failing disk returns bogus data during deep scrubbing, and ceph will notice that the replicas are not all consistent with each other. The correct procedure is therefore to remove the failing disk from the cluster, let the PGs backfill, then finally to deep-scrub the inconsistent PG once again.

Here is an example on the ceph/erin cluster, where the monitoring has told us that PG 64.657c is inconsistent:

[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~] grep shard /var/log/ceph/ceph.log
2017-04-12 06:34:26.763000 osd.508 128.142.25.116:6924/4070422 4602 : cluster [ERR] 64.657c shard 187:
soid 64:3ea78883:::1568573986@castorns.27153415189.0000000000000034:head candidate had a read error

A shard in this case refers to the OSD that holds the inconsistent object replica; in this case it is osd.187.

Where is osd.187?

[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~]# ceph osd find 187
{
   "osd": 187,
   "ip": "128.142.25.106:6820\/530456",
   "crush_location": {
       "host": "p05972678k94093",
       "rack": "EC06",
       "room": "0513-R-0050",
       "root": "default",
       "row": "EC"
   }
}

On the p05972678k94093 host we first need to find out which /dev/sd* device hosts that osd.187.

On BlueStore OSDs we need to check with ceph-volume lvm list or lvs:

[14:38][root@p05972678e32155 (production:ceph/erin/osd*30) ~]# lvs -o +devices,tags | grep 187
  osd-block-... ceph-... -wi-ao---- <5.46t        /dev/sdm(0) ....,ceph.osd_id=187,....

So we know the failed drive is /dev/sdm, now we can check for disk Medium errors:

[09:16][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# grep sdm /var/log/messages
[Wed Apr 12 12:27:59 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 04 00 00 00
[Wed Apr 12 12:27:59 2017] blk_update_request: critical medium error, dev sdm, sector 90638112
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Sense Key : Medium Error [current]
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Add. Sense: Unrecovered read error
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 00 08 00 00
[Wed Apr 12 12:28:02 2017] blk_update_request: critical medium error, dev sdm, sector 90638112

In this case, the disk is clearly failing.

Now check whether that osd is safe to stop:

[14:41][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# ceph osd ok-to-stop osd.187
OSD(s) 187 are ok to stop without reducing availability, provided there are no other concurrent failures or interventions. 182 PGs are likely to be degraded (but remain available) as a result.

Since it is OK, we stop the osd, umount it, and mark it out.

[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# systemctl stop ceph-osd@187.service
[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# umount /var/lib/ceph/osd/ceph-187
[09:17][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# ceph osd out 187
marked out osd.187.

ceph status should now show the PG is in a state like this:

             1     active+undersized+degraded+remapped+inconsistent+backfilling

It can take a few 10s of minutes to backfill the degraded PG.

Repairing a PG

Once the inconsistent PG is no longer "undersized" or "degraded", use the script at ceph-scripts/tools/scrubbing/autorepair.sh to repair the PG and start the scrubbing immediately.
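
If you ever need to trigger it by hand, the underlying command is the standard PG repair (a sketch, using the PG id from the stat-mismatch example above; prefer the script when available):

# ceph pg repair 1.85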

Now check ceph status... You should see the scrubbing+repair started already on the inconsistent PG.

Ceph PG Unfound

The PG unfound condition may be due to a race condition when PGs are scrubbed (see https://tracker.ceph.com/issues/51194), leading to a PG being reported as recovery_unfound.

Upstream documentation is available for general unfound objects

In case of unfound objects, ceph reports a HEALTH_ERR condition

# ceph -s
  cluster:
    id:     687634f1-03b7-415b-aff9-e21e6bedbe7c
    health: HEALTH_ERR
            1/282983194 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 3/848949582 objects degraded (0.000%), 1 pg degraded
 
  services:
    mon: 3 daemons, quorum cephdata20-4675e5a59e,cephdata20-44bdbfa86f,cephdata20-83e1d8a16e (age 4h)
    mgr: cephdata20-83e1d8a16e(active, since 11w), standbys: cephdata20-4675e5a59e, cephdata20-44bdbfa86f
    osd: 576 osds: 575 up (since 9d), 573 in (since 9d)
 
  data:
    pools:   3 pools, 17409 pgs
    objects: 282.98M objects, 1.1 PiB
    usage:   3.2 PiB used, 3.0 PiB / 6.2 PiB avail
    pgs:     3/848949582 objects degraded (0.000%)
             1/282983194 objects unfound (0.000%)
             17342 active+clean
             60    active+clean+scrubbing+deep
             6     active+clean+scrubbing
             1     active+recovery_unfound+degraded

List the PGs in recovery_unfound state

# ceph pg ls recovery_unfound
PG      OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG   STATE                             SINCE  VERSION         REPORTED         UP                 ACTING             SCRUB_STAMP                      DEEP_SCRUB_STAMP
1.2d09    17232         3          0        1  72106876434            0           0  3373  active+recovery_unfound+degraded    37m  399723'3926620  399723:23220581  [574,671,662]p574  [574,671,662]p574  2023-01-12T13:27:34.752832+0100  2023-01-12T13:27:34.752832+0100

Check the ceph log (cat /var/log/ceph/ceph.log | grep ERR) for IO errors on the primary OSD of the PG. In this case, the disk backing osd.574 is failing with pending sectors (check with smartctl -a <device>)

2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
2023-01-12T13:27:34.752327+0100 osd.574 (osd.574) 776 : cluster [ERR] 1.2d09 deep-scrub 0 missing, 1 inconsistent objects
2023-01-12T13:27:34.752830+0100 osd.574 (osd.574) 777 : cluster [ERR] 1.2d09 repair 1 errors, 1 fixed
2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)

Before taking any action, make sure that the versions of the object reported as unfound on the other two OSDs are more recent than the lost one:

  • List unfound object
    # ceph pg 1.2d09 list_unfound
    {
        "num_missing": 1,
        "num_unfound": 1,
        "objects": [
            {
                "oid": {
                    "oid": "rbd_data.0bee1ae64c9012.00000000000032c4",
                    "key": "",
                    "snapid": -2,
                    "hash": 2152017161,
                    "max": 0,
                    "pool": 1,
                    "namespace": ""
                },
                "need": "399702'3923004",
                "have": "0'0",
                "flags": "none",
                "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
                "locations": []
            }
        ],
        "state": "NotRecovering",
        "available_might_have_unfound": true,
        "might_have_unfound": [],
        "more": false
    
  • The missing object is at version 399702
  • Last osd map before read error: e399704
    2023-01-12T13:07:24.463521+0100 mon.cephdata20-4675e5a59e (mon.0) 2714279 : cluster [DBG] osdmap e399704: 576 total, 575 up, 573 in
    2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
    
  • The object goes unfound at: e399710
    2023-01-12T13:27:30.297813+0100 mon.cephdata20-4675e5a59e (mon.0) 2714933 : cluster [DBG] osdmap e399710: 576 total, 575 up, 573 in
    2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
    
  • The two copies on 671 and 662 are more recent -- 399702 VS 399709:
    2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
    2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
    

If the copies are more recent than the lost one, proceed as follows (see the sketch after this list):

  • Set the primary osd (osd.574) out
  • The recovery_unfound object disappears and backfilling start
  • Once backfilled, deep-scrub the PG to check for inconsistencies
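
A minimal sketch of those steps with the ids from this example (osd.574 and PG 1.2d09):

# ceph osd out 574
# ceph pg ls recovery_unfound      # should become empty once backfilling completes
# ceph pg deep-scrub 1.2d09        # after backfill, re-check the PG for inconsistencies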

CephTargetDown

This is a special alert raised by Prometheus. It indicates that, for whatever reason, a target node is no longer exposing its metrics or the Prometheus server is not able to pull them. This does not imply that the node is offline, just that the node endpoint is down for Prometheus.

To handle these tickets, first identify the affected target. This information should be in the ticket body.

The following Alerts are in Firing Status:
------------------------------------------------
Target cephpolbo-mon-0.cern.ch:9100 is down
Target cephpolbo-mon-2.cern.ch:9100 is down

Alert Details:
------------------------------------------------
Alertname: TargetDown
Cluster: polbo
Job: node
Monitor: cern
Replica: A
Severity: warning

Afterwards, we can go to the Targets section of the Prometheus dashboard and cross-check the affected node. There you can find more information about the reason it is down.

This could be caused by the following reasons:

  • A node is offline or it's being restarted. Follow the normal procedures for understanding why the node is not online (ping, ssh, console access, SNOW ticket search...). Once the node is back, the target should be marked as UP again automatically.
  • If a new target was added recently, possibly there are mistakes in the target definition or some connectivity problems, like the port being blocked.
    • Review the target configuration in it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml and refer to the monitoring guide.
    • Make sure that the firewall configuration allows prometheus to scrape the data through the specified port.
  • In ceph, the daemons that expose the metrics are the mgrs. It can sometimes happen that the mgr hangs and stops exposing the metrics.
    • Check the mgr status and, if needed, restart it (see the sketch below). Don't forget to collect information about the state in which you found it for further analysis. If all went well, after about 30 seconds the target should be UP again in the Prometheus dashboard. To double-check, you can click on the endpoint URL of the node and see whether the metrics are now shown.
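
A sketch of that check-and-restart sequence (assuming, as on our deployments, that the mgr id is the host's short name; adapt as needed):

# ceph mgr stat                               # which mgr is currently active?
# systemctl status ceph-mgr@$(hostname -s)    # on the mgr host: collect state/logs first
# systemctl restart ceph-mgr@$(hostname -s)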

SSD Replacement

Draining OSDs attached to a failing SSD

In order to drain the osds attached to a failing SSD, run the following command:

$> cd /root/ceph-scripts/tools/ceph-disk-replacement
$> ./ssd-drain-osd.sh --dev /dev/<ssd>
ceph osd out osd.<osd0>;
ceph osd primary-affinity osd.<osd0> 0;
ceph osd out osd.<osd1>;
ceph osd primary-affinity osd.<osd1> 0;
...
ceph osd out osd.<osdN>;
ceph osd primary-affinity osd.<osdN> 0;

If the output is similar to the one above, it is safe to re-run the commands with | sh appended to actually take all the osds attached to the ssd out of the cluster.

Prepare for replacement

Once the draining has been started, the osds need to be zapped before the ssd can be removed and physically replaced:

$> ./ssd-prepare-for-replacement.sh --dev /dev/<dev> -f
systemctl stop ceph-osd@<osd0>
umount /var/lib/ceph/osd/ceph-<osd0>
ceph-volume lvm zap --destroy --osd-id <osd0>
systemctl stop ceph-osd@<osd1>
umount /var/lib/ceph/osd/ceph-<osd1>
ceph-volume lvm zap --destroy --osd-id <osd1>
...
systemctl stop ceph-osd@<osdN>
umount /var/lib/ceph/osd/ceph-<osdN>
ceph-volume lvm zap --destroy --osd-id <osdN>

Recreate the OSD

TBC

MDS Slow Ops

Check for long ongoing operations on the MDS reporting Slow Ops:

The mon shows SLOW_OPS warning:

ceph health detail

cat /var/log/ceph/ceph.log | grep SLOW
    cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)

The affected MDS shows slow request in the logs:

cat /var/log/ceph/ceph-mds.cephcpu21-0c370531cf.log | grep -i SLOW
    2022-10-22T09:09:21.473+0200 7fe1b8054700  0 log_channel(cluster) log [WRN] : 30 slow requests, 1 included below; oldest blocked for > 2356.704295 secs
    2022-10-22T09:09:21.473+0200 7fe1b8054700  0 log_channel(cluster) log [WRN] : slow request 1924.631928 seconds old, received at 2022-10-22T08:37:16.841403+0200: client_request(client.366059605:743931 getattr AsXsFs #0x10251604c38 2022-10-22T08:37:16.841568+0200 caller_uid=1001710000, caller_gid=0{1001710000,}) currently dispatched

Dump the ongoing ops and check there are some with very long (minutes, hours) age:

ceph daemon mds.`hostname -s` ops | grep age | less

Identify the client with such long ops (age should be >900):

ceph daemon mds.`hostname -s` ops | egrep 'client|age' | less

    "description": "client_request(client.364075205:4876 getattr pAsLsXsFs #0x1023f14e5d8 2022-10-16T03:46:40.673900+0200 RETRY=184 caller_uid=0, caller_gid=0{})",
    "age": 0.87975248399999995,
        "reqid": "client.364075205:4876",
        "op_type": "client_request",
        "client_info": {
            "client": "client.364075205",

Get info on the client:

ceph daemon mds.`hostname -s` client ls id=<THE_ID>
  • IP address
  • Hostname
  • Ceph client version
  • Kernel version (in case of a kernel mount)
  • Mount point (on the client side)
  • Root (aka, the CephFS volume the client mounts)

Evict the client:

ceph tell mds.* client ls id=<THE_ID>
ceph tell mds.* client evict id=<THE_ID>

Large omap objects

On S3 clusters, you may see a HEALTH_WARN message reporting 1 large omap objects. This is very likely due to one or more bucket indexes being over full. Example:

"user_id": "warp-tests",
"buckets": [
    {
        "bucket": "warp-tests",
        "tenant": "",
        "num_objects": 9993106,
        "num_shards": 11,
        "objects_per_shard": 908464,
        "fill_status": "OVER"
    }
]

Proceed as follows:

  1. Check that over-full bucket indexes are the actual problem:
    radosgw-admin bucket limit check
    
  2. If it is not possible to reshard the bucket, tune osd_deep_scrub_large_omap_object_key_threshold appropriately:
    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000
    
    Default is 200000; Gabe runs with 500000. Read at 42on.com
  3. If it is possible to reshard the bucket, manually reshard any bucket showing fill_status WARN or OVER:
    • keep the number of objects per shard around 50k
    • pick a prime number of shards
    • consider whether the bucket will be ever-growing or whether the owners delete objects. If ever-growing, you may reshard to a high number of shards to avoid (or postpone) resharding in the future.
    radosgw-admin bucket reshard --bucket=warp-tests --num-shards=211
    
  4. Check in ceph.log which PG is complaining about the large omap objects and start a deep scrub on it (else the HEALTH_WARN won't go away):
    # zcat  /var/log/ceph/ceph.log-20221204.gz | grep -i large
    2022-12-03T06:48:37.975544+0100 osd.179 (osd.179) 996 : cluster [WRN] Large omap object found. Object: 9:22f5fbf8:::.dir.a1035ed2-37be-4e7d-892d-46728bc3d046.285532.1.1:head PG: 9.1fdfaf44 (9.344) Key count: 204639 Size (bytes): 60621488
    2022-12-03T06:48:39.270652+0100 mon.cephdata22-12f31fcca0 (mon.0) 292373 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
    
    # ceph pg deep-scrub 9.344
    instructing pg 9.344 on osd.179 to deep-scrub
    

Ceph Clusters

Production Clusters

| Cluster | Lead | Use-case | Mon host (where?) | Release | Version | OS | Racks | IP Services | Power | SSB Upgrades? |
|---|---|---|---|---|---|---|---|---|---|---|
| barn | Enrico | Cinder: cp1, cpio1 | cephbarn (hw) | pacific | 16.2.9-1 | RHEL8 | BA09 | S513-A-IP250 | UPS-4/-C | Yes |
| beesly | Enrico | Glance; Cinder: 1st AZ | cephmon (hw) | pacific | 16.2.9-1 | RHEL8 | CD27-CD30 | S513-C-IP152 | UPS-3/-4i | Yes |
| cta | Roberto | CTA prod | cephcta (hw) | pacific | 16.2.13-5 | RHEL8 | SI36-SI41 | - | | No, Julien Leduc |
| dwight | Zac | Testing + Manila: CephFS Testing | cephmond (vm,abc) | quincy | 17.2.7-1 | Alma8 | CE01-CE03 | S513-C-IP501 | | Yes + Manila MM |
| doyle | | CephFS for DFS Projects | cephdoyle (hw) | quincy | 17.2.7-2 | RHEL9 | CP18, CP19-21, CP22 | S513-C-IP200 | UPS-1 | Yes + Sebast/Giuseppe |
| flax(*) | Abhi | Manila: Meyrin CephFS | cephflax (vm,abc) | pacific | 16.2.9-1 | RHEL8 | BA10,SQ05; CQ18-CQ21; SJ04-SJ07 | S513-A-IP558,S513-V-IP562; S513-C-IP164; S513-V-IP553 | UPS-4/-C,UPS-1; UPS-1; UPS-3 | Yes |
| gabe | Enrico | S3 | cephgabe (hw) | pacific | 16.2.13-5 | RHEL8 | SE04-SE07; SJ04-SJ07 | S513-V-IP808; S513-V-IP553 | UPS-1; UPS-3 | Yes |
| jim | Enrico | HPC BE (CephFS) | cephjim (vm,abc) | quincy | 17.2.7-1 | RHEL8 | SW11-SW15; SX11-SX15 | S513-V-IP194; S513-V-IP193 | UPS-3; UPS-3 | Yes + Nils Hoimyr |
| kelly | Roberto | Cinder: hyperc + CTA preprod | cephkelly (hyperc) | quincy | 17.2.7-1 | RHEL8 | CQ12-CQ22 | S513-C-IP164 | UPS-1 | Yes + Julien Leduc |
| kapoor | Enrico | Cinder: cpio2, cpio3 | cephkapoor (hyperc) | quincy | 17.2.7-1 | RHEL8 | BE10 BE11 BE13 | S513-A-IP22 | UPS-4/-C | Yes |
| levinson | Abhi | Manila: Meyrin CephFS SSD A | cephlevinson (hw) | pacific | 16.2.9-1 | RHEL8 | BA03 BA04 BA05 BA07 | S513-A-IP120 S513-A-IP119 S513-A-IP121 S513-A-IP122 | UPS-4/-C | Yes |
| meredith | Enrico | Cinder: io2, io3 | cephmeredith (hw) | pacific | 16.2.9-1 | RHEL8 | CK01-23 | S513-C-IP562 | UPS-2 | Yes |
| nethub | Enrico | S3 FR + Cinder FR | cephnethub (hw) | pacific | 16.2.13-5 | RHEL8 | HA06-HA09; HB01-HB06 | S773-C-SI180; S773-C-IP200 | EOD104,ESK404; EOD105 (CEPH-1519) | Yes |
| pam | Abhi | Manila: Meyrin CephFS B | cephpam (hw) | pacific | 16.2.9-1 | Alma8 | CP16-19 | S513-C-IP200 | UPS-1 | Yes |
| poc | Enrico | PCC Proof of Concept (CEPH-1382) | cephpoc (hyperc) | quincy | 17.2.7-1 | RHEL9 | SU06 | S513-V-SI263 | | No |
| ryan | Enrico | Cinder: 3rd AZ | cephryan (hw) | pacific | 16.2.9-1 | RHEL8 | CE01-CE03 | S513-C-IP501 | UPS-2 | Yes |
| stanmey | Zachary | S3 multi-site, Meyrin (secondary) | cephstanmey (hw) | reef | 18.2.1-1 | RHEL8 | CP16-24 | S513-C-IP200 | UPS-1 | No |
| stanpre | Zachary | S3 multi-site, Prevessin (master) | cephstanpre (hw) | reef | 18.2.1-1 | Alma8 | HB01-HB06 | S773-C-IP200 | EOD105/0E | No |
| toby | Enrico | Stretch cluster | cephtoby (hw) | pacific | 16.2.9-1 | RHEL8 | CP16-19; SJ04-07 | S513-C-IP200; S513-V-IP553 | UPS-1; UPS-3 | No |
| vance | Enrico | Manila: HPC Theory-QCD | cephvance (hw) | quincy | 17.2.7-1 | Alma8 | CP16-CP17, CP19, CP21, CP23-CP24 | S513-C-IP200 | UPS-1 | Yes + Nils Hoimyr |
| wallace | Enrico | krbd: Oracle DB restore tests | cephwallace (hw) | quincy | 17.2.7-1 | RHEL8 | CP18, CP20, CP22 | S513-C-IP200 | UPS-1 | No, Dmytro Grzudo |
| vault | Enrico | Cinder: 2nd AZ | cephvault (hw) | pacific | 16.2.9-1 | RHEL8 | SE04-SE07 | S513-V-IP808 | UPS-1 | Yes |

Flax locations details:

  • MONs: 3x OpenStack VMs, one in each availability zone
  • MDSes (CPU servers): 50% in barn, 50% in vault
    • cephcpu21-0c370531cf, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21-2456968853, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21-46bb400fc8, BA10, S513-A-IP558
    • cephcpu21-4a93514bf3, BA10, S513-A-IP558
    • cephcpu21b-417b05bfee, BA10, S513-A-IP558
    • cephcpu21b-4ad1d0ae5f, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21b-a703fac16c, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21b-aecbee75a5, BA10, S513-A-IP558
  • Metadata pool: Main room, UPS-1 EOD1*43
  • Data pool: Vault, UPS-3 EOD3*43

Each production cluster has a designated cluster lead, who is the primary contact and responsible for that cluster.

The user-visible "services" provided by the clusters are documented in our Service Availability probe: https://gitlab.cern.ch/ai/it-puppet-hostgroup-ceph/-/blob/qa/code/files/sls/ceph-availability-producer.py#L19

The QoS provided by each user-visible cluster is described in OpenStack docs. Cinder volumes available on multiple AZs are of standard and io1 types.


s3.cern.ch RGWs

Hostname Customer IPv4 IPv6 IPsvc VM IPsvc Real Runs on OpenStack AZ Room Rack Power
cephgabe-rgwxl-325de0fb1d cvmfs 137.138.152.241 2001:1458:d00:13::1e5 S513-C-VM33 0513-C-IP33 P06636663U66968 cern-geneva-a main CH14 UPS-3
cephgabe-rgwxl-86d4c90cc6 cvmfs 137.138.33.24 2001:1458:d00:18::390 S513-V-VM936 0513-V-IP35 P06636688Q51842 cern-geneva-b vault SQ27 UPS-4
cephgabe-rgwxl-8930fc00f8 cvmfs 137.138.151.203 2001:1458:d00:12::3e0 S513-C-VM32 0513-C-IP32 P06636663N63480 cern-geneva-c main CH11 UPS-3
cephgabe-rgwxl-8ee4a698b7 cvmfs 137.138.44.245 2001:1458:d00:1a::24b S513-C-VM933 0513-C-IP33 P06636663J50924 cern-geneva-a main CH16 UPS-3
cephgabe-rgwxl-3e0d67a086 default 188.184.73.131 2001:1458:d00:4e::100:4ae S513-A-VM805 0513-A-IP561 I82006520073152 cern-geneva-c barn BC11 UPS-4/-C
cephgabe-rgwxl-652059ccf1 default 188.185.87.72 2001:1458:d00:3f::100:2bd S513-A-VM559 0513-A-IP559 I82006525008611 cern-geneva-a barn BC06 UPS-4/-C
cephgabe-rgwxl-8e7682cb81 default 137.138.158.145 2001:1458:d00:14::341 S513-V-VM35 0513-V-IP35 P06636688R71189 cern-geneva-b vault SQ28 UPS-4
cephgabe-rgwxl-91b6e0d6dd default 137.138.77.21 2001:1458:d00:1c::405 S513-C-VM931 0513-C-IP33 P06636663M67468 cern-geneva-a main CH13 UPS-3
cephgabe-rgwxl-895920ea1a gitlab 137.138.158.221 2001:1458:d00:14::299 S513-V-VM35 0513-V-IP35 P06636688H41037 cern-geneva-b vault SQ29 UPS-4
cephgabe-rgwxl-9e3981c77a gitlab 137.138.154.49 2001:1458:d00:13::3a S513-C-VM33 0513-C-IP33 P06636663J50924 cern-geneva-a main CH16 UPS-3
cephgabe-rgwxl-dbb0bcc513 gitlab 188.184.102.175 2001:1458:d00:3b::100:2a9 S513-C-VM852 0513-C-IP852 I78724428177369 cern-geneva-c main EK03 UPS-2
cephgabe-rgwxl-26774321ac jec-data 188.185.10.120 2001:1458:d00:63::100:39a S513-V-VM902 0513-V-IP402 I88681450454656 cern-geneva-a vault SP23 UPS-4
cephgabe-rgwxl-a273d35b9d jec-data 188.185.19.171 2001:1458:d00:65::100:32a S513-V-VM406 S513-V-IP406 I88681458914473 cern-geneva-b vault SP27 UPS-4
cephgabe-rgwxl-d91c221898 jec-data 137.138.155.51 2001:1458:d00:13::14d S513-C-VM33 0513-C-IP33 P06636663Y16806 cern-geneva-a main CH15 UPS-3
cephgabe-rgwxl-75569ebe5c prometheus 137.138.149.253 2001:1458:d00:12::52f S513-C-VM32 0513-C-IP32 P06636663G98563 cern-geneva-c main CH04 UPS-3
cephgabe-rgwxl-7658b46c78 prometheus 188.185.9.237 2001:1458:d00:63::100:424 S513-V-VM902 0513-V-IP402 I88681457779137 cern-geneva-a vault SP24 UPS-4
cephgabe-rgwxl-05386c6cdb vistar 188.185.86.117 2001:1458:d00:3f::100:2d9 S513-A-VM559 0513-A-IP559 I82006526449210 cern-geneva-a barn BC05 UPS-4/-C
cephgabe-rgwxl-13f36a01c2 vistar 137.138.33.10 2001:1458:d00:18::1ee S513-V-VM936 0513-V-IP35 P06636688C41209 cern-geneva-b vault SQ29 UPS-4
cephgabe-rgwxl-6da6da7653 vistar 188.184.74.136 2001:1458:d00:4e::100:5d S513-A-VM805 0513-A-IP561 I82006527765435 cern-geneva-c barn BC13 UPS-4/-C

Reviewing a Cluster Status

  1. Check Grafana dashboards for unusual activity, patterns, memory usage:
  • https://filer-carbon.cern.ch/grafana/d/000000001/ceph-dashboard
  • https://filer-carbon.cern.ch/grafana/d/000000108/ceph-osd-mempools
  • https://filer-carbon.cern.ch/grafana/d/uHevna1Mk/ceph-hosts
  • For RGWs: https://filer-carbon.cern.ch/grafana/d/iyLKxjoGk/s3-rgw-perf-dumps
  • For CephFS: https://filer-carbon.cern.ch/grafana/d/000000111/cephfs-detail
  • etc...
  2. Login to the cluster mon and check various things:
  • ceph osd pool ls detail - are the pool flags correct? e.g. nodelete,nopgchange,nosizechange (see the sketch after this list)
  • ceph df - assess amount of free space for capacity planning
  • ceph osd crush rule ls, ceph osd crush rule dump - are the crush rules as expected?
  • ceph balancer status - as expected?
  • ceph osd df tree - are the PGs per OSD balanced and a reasonable number, e.g. < 100?
  • ceph osd tree out, ceph osd tree down - are there any OSDs that are not being replaced properly?
  • ceph config dump - is the configuration as expected?
  • ceph telemetry status - check from the config whether it is on; if not, enable it
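
If any of the protection flags are missing on a pool, they can be re-applied with ceph osd pool set. A minimal sketch, assuming a pool named volumes:

ceph osd pool set volumes nodelete true
ceph osd pool set volumes nopgchange true
ceph osd pool set volumes nosizechange true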

Clusters' priority

In case of a major incident (e.g., power cuts), revive clusters in the following order:

  1. Beesly (RBD1, main, UPS-3/4), Flax (CephFS, everywhere), Gabe (S3, vault, UPS-1/3)
  2. Vault (RBD2, vault, UPS-1), Levinson (CephFS SSD, vault, UPS-1), Meredith (RBD SSD, main, UPS-2)
  3. Ryan (RBD3, main, UPS-2), CTA (ObjectStore, vault, UPS-1)
  4. Jim, Dwight, Kelly, Pam (currently unused)
  5. Barn, Kopano -- should not go down, as they are in critical power
  6. NetHub -- 2nd network hub, Prevessin, diesel-backed (9/10 racks)

Hardware Specs


Test clusters

Cluster Use-case Mon alias Release Version Notes
cslab Test cluster for Network Lab (RQF2068297,CEPH-1348) cephcslab pacific 16.2.13-5 Binds to IPv6 only; 3 hosts Alma8 + 3 RHEL8
miniflax Mini cluster mimicking Flax None (ceph/miniflax/mon) quincy 17.2.7-1
minigabe Mini cluster mimicking Gabe (zone groups) cephminigabe pacific 16.2.13-6 RGW on minigabe-831ffcf9f9; Beast on 8080; RGW DNS: cephminigabe
next RC and Cloud next region testing cephnext01 quincy 17.2.6-4

Preparing a new delivery

Flavor per rack

We now want to have flavors per rack for our Ceph clusters; please remind the Ironic/CF people to create them when a new delivery is installed!

Setting root device hints

We set root device hints on every new delivery so that we can be certain that Ironic installs the OS on the right drive (and if the corresponding drive fails the installation also fails).

There are multiple ways to set root device hints (see the OpenStack documentation). For our recent deliveries setting the model is typically sufficient to have only one possible drive for the root device.

To get the model of the drive you have to boot a node and get it from /sys/class, for instance: cat /sys/class/block/nvme0n1/device/model (you may also ask to get access to Ironic inspection data if it gets more complicated than that).

Then you can set the model on every node of the delivery. For instance, for delivery dl8642293 you would do:

export OS_PROJECT_NAME="IT Ceph Ironic"
openstack baremetal node list -f value | \
    grep dl8642293 | awk '{print $1}' | \
    xargs -L1 openstack baremetal node set --property root_device='{"model": "SAMSUNG MZ1LB960HAJQ-00007"}'

Review the list of matched nodes first (for instance by prepending echo to the openstack baremetal node set command); if it looks correct, run the pipeline to actually set the root device hints.

Check the root device hints were correctly set with:

export OS_PROJECT_NAME="IT Ceph Ironic"
openstack baremetal node list -f value | \
    grep dl8642293 | awk '{print $1}' | \
    xargs -L1 openstack baremetal node show -f json | jq .properties.root_device

Ceph Monitoring

About Ceph Monitoring

The monitoring system in Ceph is based on Grafana, using Prometheus as datasource and the native ceph prometheus plugin as metric exporter. Prometheus node_exporter is used for node metrics (cpu, memory, etc).

For long-term metric storage, Thanos is used to store metrics in S3 (Meyrin).

Access the monitoring system

  • All Ceph monitoring dashboards are available in monit-grafana (Prometheus). Although prometheus is the main datasource for ceph metrics, some plots/dashboards may still require the legacy Graphite datasource.

  • The prometheus server is configured in the host cephprom.cern.ch, hostgroup ceph/prometheus

  • Configuration files (Puppet):

    • it-puppet-hostgroup-ceph/code/manifests/prometheus.pp
    • it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml
    • it-puppet-hostgroup-ceph/data/hostgroup/ceph.yaml
    • Alertmanager templates: it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl
    • Alert definition: it-puppet-hostgroup-ceph/code/files/generated_rules/
  • Thanos infrastructure is under ceph/thanos hostgroup, configured via the corresponding hiera files.

An analogous qa infrastructure is also available, with all components replicated (cephprom-qa, thanos-store-qa, etc). This qa infra is configured by overriding the puppet environment:

  • it-puppet-hostgroup-ceph/data/hostgroup/ceph/environments/qa.yaml

Add/remove a cluster to/from the monitoring system

  • Enable the prometheus mgr module in the cluster:
ceph mgr module enable prometheus

NOTE: Make sure that the port 9283 is accepting connections.
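
As a quick sanity check, you can verify that the active mgr answers on that port (replace the placeholder with one of the cluster's mgr hosts):

curl -s http://<mgr-host>:9283/metrics | head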

Instances that include the hg_ceph::classes::mgr class will be automatically discovered through puppetdb and scraped by prometheus.

  • To ensure that we don't lose metrics during mgr failovers, all the cluster mgrs will be scraped. As a side benefit, we can monitor the online status of the mgrs.
  • Run or wait for a puppet run on cephprom.cern.ch.

Add/remove a node for node metrics (cpu, memory, etc)

Instances that include the prometheus::node_exporter class (anything under ceph top hostgroup) will be automatically discovered through puppetdb and scraped by prometheus.

Add/remove an alert rule to/from the monitoring system

Alerts are defined in yaml files managed by puppet in:

  • it-puppet-hostgroup-ceph/files/prometheus/generated_rules

They are organised by service, so add the alert in the appropriate file (e.g.: ceph alerts in alerts_ceph.yaml). The file rules.yaml is used to add recording rules.

There are 3 notification channels currently: e-mail, SNOW ticket and Mattermost message.

Before creating the alert, make sure you test your query in advance, for example using the Explore panel on Grafana. Once the query is working, proceed with the alert definition.

A prometheus alert could look like this:

rules:
  - alert: "CephOSDReadErrors"
    annotations:
      description: "An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel."
      documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors"
      summary: "Device read errors detected on cluster {{ $labels.cluster }}"
    expr: "ceph_health_detail{name=\"BLUESTORE_SPURIOUS_READ_ERRORS\"} == 1"
    for: "30s"
    labels:
      severity: "warning"
      type: "ceph_default"
  • alert: Mandatory. Name of the alert, which will be part of the subject of the email, the head of the ticket and the title of the mattermost notification. Try to follow the same pattern as the ones already created (CephDAEMONAlert): daemon in uppercase and the rest in camel case.
  • expr: Mandatory. PromQL query that defines the alert. The alert will trigger if the query returns one or more matches. It's a good exercise to use promdash for tuning the query to ensure that it is well formed.
  • for: Mandatory. The alert will be triggered if it stays active for more than the specified time (e.g. 30s, 1m, 1h).
  • annotations:summary: Mandatory. Expresses the actual alert in a concise way.
  • annotations:description: Optional. Allows specifying more detailed information about the alert when the summary is not enough.
  • annotations:documentation: Optional. Allows specifying the url of the documentation/procedure to follow to handle the alert.
  • labels:severity: Mandatory. Defines the notification channel to use, based on the following:
    • warning/critical: Sends an e-mail to ceph-alerts.
    • ticket: Sends an e-mail AND creates a SNOW ticket.
    • mattermost: Sends an e-mail AND sends a Mattermost message to the ceph-bot channel.
  • labels:type: Optional. Allows distinguishing alerts created upstream (ceph_default) from those created by us (ceph_cern). It has no actual implication on the alert functionality.
  • labels:xxxxx: Optional. You can add custom labels that can be used in the template.

NOTES

  • In order for the templating to work as expected, make sure that the labels cluster or job_name are part of the resulting query. In case the query does not preserve labels (like count does), you can specify the label and value manually in the labels section of the alert definition.
  • All annotations, if defined, will appear in the body of the ticket, e-mail or mattermost message generated by the alert.
  • Alerts are evaluated against the local prometheus server, which contains metrics for the last 7 days. Take that into account when defining alerts that evaluate longer periods (like predict_linear). In such cases, you can create the alert in Grafana using the Thanos-LTMS metric datasource (more on that later in this doc).
  • In grafana or promdash you can access the alerts by querying the metric called ALERTS (see the example after this list)
  • For more information about how to define an alert, refer to the Prometheus Documentation
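
For example, a minimal way to list the currently firing alerts from the command line, assuming the Prometheus server listens on the default port 9090 on cephprom:

curl -sG http://cephprom.cern.ch:9090/api/v1/query --data-urlencode 'query=ALERTS{alertstate="firing"}'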

Create / Link procedure/documentation to Prometheus Alert.

Prometheus alerts are pre-configured to show the procedure needed for handling the alert via the annotation procedure_url. This is an optional argument that could be configured per alert rule.

Step 1: Create the procedure in case it does not exist yet.

Update the file rota.md in this repository and add the new procedure. Use this file for convenience, but you can create a new file if needed.

Step 2: Edit the alert rule and link to the procedure.

Edit the alert following instructions above, and add the link to the procedure under the annotations section, under the key documentation, for example:

- alert: "CephMdsTooManyStrays"
    annotations:
      documentation: "http://s3-website.cern.ch/cephdocs/ops/rota.html#cephmdstoomanystrays"
      summary: "The number of strays is above 500K"
    expr: "ceph_mds_cache_num_strays > 500000"
    for: "5m"
    labels:
      severity: "ticket"

Push the changes and the prometheus server will reload automatically, picking up the new changes. Next time the alert is triggered, a link to the procedure will be shown in the alert body.

Silence Alarms

You can use the alertmanager Web Interface to silence alarms during scheduled interventions. Please always specify a reason for silencing the alarms (a JIRA link or ticket would be a plus). Additionally, for the alerts that generate an e-mail, you will find a link to silence it in the email body.

Alert Grouping

Alert grouping is enabled by default, so if the same alert is triggered in different nodes, we only receive one ticket with all involved nodes.

Modifying AlertManager Templates

Both email and SNOW ticket templates are customizable. To do that, you need to edit the following puppet file:

  • it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl

You have to use Golang's Template syntax. The structure of the file is as follows:

{{ define "ceph.email.subject" }}
....
{{ end }}
{{ define "ceph.email.body" }}
....
{{ end }}

For reference check the default AlertManager Templates

In case you add templates make sure that you adapt the AlertManager configuration accordingly:

- name: email
  email_configs:
  - to: ceph-admins@cern.ch
    from: alertmanager@localhost
    smarthost: cernmx.cern.ch:25
    headers:
      Subject: '{{ template "ceph.email.subject" . }}'
    html: '{{ template "ceph.email.body" . }}'

Note A restart of AlertManager is needed for the changes to be applied.

Accessing the prometheus dashboard (promdash)

The prometheus dashboard, or Dashprom, is a powerful interface that allows you to quickly assess the prometheus server status and also provides a quick way of querying metrics. The prometheus dashboard is accessible from this link: Promdash.

  • The prometheus dashboard is useful for:
    • Checking the status of all targets: Target status
    • Check the status of the alerts Alert Status
    • For debug purposes, you can execute PromQL queries directly on the dashboard and change the intervals quickly.
    • In grafana there is an icon just near the metric definition to view the current query in promdash.
    • You can also use the Grafana Explorer.

Note: This will only give you access to the metrics of the last 7 days, refer to the next chapter for accessing older metrics.

Long Term Metric Storage - LTMS

The long-term metrics are kept in the CERN S3 service using Thanos. The bucket is called prometheus-storage and is accessed using the EC2 credentials of Ceph's Openstack project. Accessing these metrics is transparent from Grafana:

  • Metrics of the last 7 days are served directly from prometheus local storage
  • Older metrics are pulled from S3.
  • As metrics in S3 contain downsampled versions (5m, 1h), it is usually much faster than getting metrics from the local prometheus.
  • RAW metrics are also kept, so it is possible to zoom in to the 15-second resolution

Accessing the thanos dashboard

There is a thanos promdash version here, from where you can access all historical metrics. This dashboard has some specific thanos features like deduplication (for use cases with more than one prometheus server scraping the same data) and the possibility of showing downsampled data (thanos stores two downsampled versions of the metrics, with 1h and 5m resolution). This downsampled data is also stored in S3.

Thanos Architecture

You can find more detailed information on the Thanos official webpage, but this is the list of active components in our current setup with a high-level description of what they do:

Sidecar

  • Every time Prometheus dumps its data to disk (by default, every 2 hours), the thanos-sidecar uploads the metrics to the S3 bucket. It also acts as a proxy that serves Prometheus's local data.

Store

  • This is the storage proxy which serves the metrics stored in S3

Querier

  • This component reads the data from store(s) and sidecar(s) and answers PromQL queries using the standard Prometheus HTTP API. This is the component to point monitoring dashboards at.

Compactor

  • This is a detached component which compacts the data in S3 and also creates the downsampled versions.

Operating the Ceph Monitors (ceph-mon)

Adding ceph-mon daemons (VM, jewel/luminous)

Upstream documentation at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/

Create the machine for the mon

Normally we create ceph-mon's as VMs in the ceph/{hg_name}/mon hostgroup.

Example: Adding a monitor to the ceph/test cluster:

  • First, source the IT Ceph Storage Service environment on aiadm: link
  • Then create a virtual machine with the following parameters:
  • main-user/responsible: ceph-admins (the user of the VM)
  • VM Flavor: m2.2xlarge (monitors must withstand heavy loads)
  • OS: Centos7 (the preferred OS used in CERN applications)
  • Hostgroup: ceph/test/mon (Naming convention for puppet configuration)
  • VM name: cephtest-mon- (We use prefix to generate an id)
  • Availability zone: usually cern-geneva-[a/b/c]

Example command: (It will create a VM with the above parameters)

$ ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins --nova-flavor m2.2xlarge
--cc7 -g ceph/test/mon --prefix cephtest-mon-  --nova-availabilityzone cern-geneva-a
--nova-sshkey {your_openstack_key}

This command will create a VM named cephtest-mon-XXXXXXXXXX in the ceph/test/mon hostgroup. Puppet will take care of the initialization of the machine

When you deploy a monitor server, you have to choose an availability zone. We tend to use different availability zones to avoid a single point of failure.

Set roger state and enable alarming

Set the appstate and app_alarmed parameters if necessary

Example: Get the roger data for the VM cephtest-mon-d8788e3256

$ roger show cephtest-mon-d8788e3256

The output should be something similar to this:

[
    {
        "app_alarmed": false,
        "appstate": "build",
        "expires": "",
        "hostname": "cephtest-mon-d8788e3256.cern.ch",
        "hw_alarmed": true,
        "message": "",
        "nc_alarmed": true,
        "os_alarmed": true,
        "update_time": "1506418702",
        "update_time_str": "Tue Sep 26 11:38:22 2017",
        "updated_by": "tmourati",
        "updated_by_puppet": false
    }
]

You need to set the machine's state to "production", so it can be used in production.

The following command will set the target VM to production state:

$ roger update --appstate production --all_alarms=true cephtest-mon-XXXXXXXXXX

Now the roger show {host} should show something like this:

[
    {
        "app_alarmed": true,
        "appstate": "production",
        "..."
    }
]

We now let puppet configure the machine. This will take some time, as it needs about two configuration cycles to apply the desired changes. After the second cycle you can SSH (as root) to the machine to check if everything is ok.

For example you can check the cluster's status with $ ceph -s

You should see the current host in the monitor quorum.
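
For a quick check of the quorum membership (in addition to ceph -s), you can also run:

ceph mon stat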

Details on lbalias for mons

We prefer not to use the load-balancing service and lbclient here (https://configdocs.web.cern.ch/dnslb/). There is no scenario in ceph where we want a mon to disappear from the alias.

For a bare metal node

We rather use the --LOAD-N- approach to create the alias with all the mons:

  • Go to network.cern.ch
  • Click on Update information and use the FQDN of the mon machine
    • If prompted, make sure you pick the host interface and not the IPMI one
  • Add "ceph{hg_name}--LOAD-N-" to the IP Aliases list under TCP/IP Interface Information
  • Multiple aliases are supported. Use a comma-separated list
  • Check the changes are correct and submit the request

For an OpenStack VM

In the case of a VM, we can't directly set an alias, but we can set a property in OpenStack to the same effect:

  • Log onto aiadm or lxplus
  • Set your environment variables to the correct tenant, e.g. `eval $(ai-rc 'Ceph Development')`
    • Check the vars are what you expect with env | grep OS, paying attention to OS_region
  • Set the alias using openstack with openstack server set --property landb-alias=CEPH{hg_name}--LOAD-N- {hostname}, as in the example below
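
For example, a sketch for a mon of a hypothetical test cluster (alias value and hostname are placeholders following the pattern above):

eval $(ai-rc 'Ceph Development')
openstack server set --property landb-alias=CEPHTEST--LOAD-1- cephtest-mon-XXXXXXXXXX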

Removing a ceph-mon daemon (jewel)

Upstream documentation at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/

Prerequisites

  1. The cluster must be in HEALTH_OK state, i.e. the monitor must be in a healthy quorum.
  2. You should have a replacement for the current monitor already in the quorum, and there should be enough monitors so that the cluster stays healthy after one monitor is removed. Normally this means having about 4 monitors in the quorum before starting.

Procedure

  1. Disable puppet: $ puppet agent --disable 'decommissioning mon'
  2. (If needed) remove the DNS alias from this machine and wait until it is so:
- For physical machines, visit http://network.cern.ch → "Update Information".
- For a VM monitor, you can remove the alias from the `landb-alias` property. See [Cloud Docs](https://clouddocs.web.cern.ch/clouddocs/using_openstack/properties.html)
  3. Check if monitor is ok-to-stop: $ ceph mon ok-to-stop <hostname>
  4. Stop the monitor: $ systemctl stop ceph-mon.target. You should now get a HEALTH_WARN status by running $ ceph -s, for example 1 mons down, quorum 1,2,3,4,5.
  5. Remove the monitor's configuration, data and secrets with:
```sh
$ rm /var/lib/ceph/tmp/keyring.mon.*
$ rm -rf /var/lib/ceph/mon/<hostname>
```
  6. Remove the monitor from the ceph cluster:
```sh
$ ceph mon rm <hostname>
removing mon.<hostname> at <IP>:<port>, there will be 5 monitors
```
  7. You should now have a HEALTH_OK status after the monitor removal.
  8. (If monitored by prometheus) remove the hostname from the list of endpoints to monitor. See it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml

For machines hosting only the ceph-mon

  1. Move this machine to a spare hostgroup: $ ai-foreman updatehost -c ceph/spare {hostname}

  2. Run puppet once: $ puppet agent -t

  3. (If physical) Reinstall the server in the ceph/spare hostgroup:

```sh
aiadm> ai-installhost p01001532077488
...
1/1 machine(s) ready to be installed
Please reboot the host(s) to start the installation:
ai-remote-power-control cycle p01001532077488.cern.ch
aiadm> ai-remote-power-control cycle p01001532077488.cern.ch
```

Now the physical machine is installed in the ceph/spare hostgroup.

  4. (If virtual) Kill the VM with: $ ai-kill-vm {hostname}

For machines hosting other ceph-daemons

  1. Move this machine to another hostgroup (e.g., /osd) of the same cluster: $ ai-foreman updatehost -c ceph/<cluster_name>/osd {hostname}
  2. Run puppet to apply the changes: $ puppet agent -t

Operating the Ceph Metadata Servers (ceph-mds)

Adding a ceph-mds daemon (VM, luminous)

Upstream documentation here: http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-mds/

The procedure follows the same pattern as adding a monitor node (create_a_mon) to the cluster.

Make sure you add your mds to the corresponding hostgroup ceph/<cluster>/mds and prepare the Puppet code (check other ceph clusters with cephfs as a reference)

Example for the ceph/mycluster hostgroup:

$ ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins \
 --nova-flavor m2.2xlarge --cc7 -g ceph/<mycluster>/mds --prefix ceph<mycluster>-mds- \
 --nova-availabilityzone cern-geneva-a

Note: When deploying more than one mds, make sure that they are spread across different availability zones.

As written in the upstream documentation, a ceph filesystem needs at least two metadata servers. The first will be the main server that handles the clients' requests and the second one is the backup. Also, don't forget to put the metadata servers into different availability zones, in case a problem occurs at one site.

Because of resource limitations, the flavor of the machines could be m2.xlarge instead of m2.2xlarge. In the ceph/<mycluster> cluster we use 2 m2.2xlarge main servers and one m2.xlarge backup server.

When the machine is available (reachable by the dns service), you can alter its state into production with roger.

$ roger update --appstate production --all_alarms=true ceph<mycluster>-mds-XXXXXXXXXX

After 2-3 runs of puppet, the new mds will be fully configured and will join the cluster.

Using additional metadata servers (luminous)

Upstream documentation here: http://docs.ceph.com/docs/master/cephfs/multimds/

When your cephfs system can't handle the amount of client requests and you notice warnings about the mds or slow requests in ceph status, you may need to use multiple active metadata servers.

After adding an mds to the cluster, you will notice on the mds line of ceph status something like the following:

mds: cephfs-1/1/1 up  {0=cephironic-mds-716dc88600=up:active}, 1 up:standby-replay, 1 up:standby

The 1 up:standby-replay is the backup server and the 1 up:standby that just appeared is the mds we added. To make the standby server active, we need to execute the following:

WARNING: Your cluster may have multiple filesystems, use the right one!

ceph fs set <fs_name> max_mds 2

The name of the ceph filesystem can be retrieved by using $ ceph fs ls and looking for the name: <fs_name> key-value pair.
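
For illustration, the output looks something like this (the names are examples):

$ ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]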

Now your ceph status message should look like this:

...
mds: cephfs-2/2/2 up  {0=cephironic-mds-716dc88600=up:active,1=cephironic-mds-c4fbd7ee74=up:active}, 1 up:standby-replay
...

OSD Replacement Procedures

Check which disks need to be put back in production.

  • To see which osds are down, check with ceph osd tree down out.
[09:28][root@p06253939p44623 (production:ceph/beesly/osd*24) ~]# ceph osd tree down out
ID  CLASS WEIGHT     TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       5589.18994 root default                                             
 -2       4428.02979     room 0513-R-0050                                     
 -6        917.25500         rack RA09                                        
 -7        131.03999             host p06253939j03957                         
430          5.45999                 osd.430            down        0 1.00000
-19        131.03999             host p06253939s09190                         
 24          5.45999                 osd.24             down        0 1.00000
405          5.45999                 osd.405            down        0 1.00000
 -9        786.23901         rack RA13                                        
-11        131.03999             host p06253939b84659                         
101          5.45999                 osd.101            down        0 1.00000
-32        131.03999             host p06253939u19068                         
577          5.45999                 osd.577            down        0 1.00000
-14        895.43903         rack RA17                                        
-34        125.58000             host p06253939f99921                         
742          5.45999                 osd.742            down        0 1.00000
-22        125.58000             host p06253939h70655                         
646          5.45999                 osd.646            down        0 1.00000
659          5.45999                 osd.659            down        0 1.00000
718          5.45999                 osd.718            down        0 1.00000
-26        131.03999             host p06253939v20205                         
650          5.45999                 osd.650            down        0 1.00000
-33        131.03999             host p06253939w66726                         
362          5.45999                 osd.362            down        0 1.00000
654          5.45999                 osd.654            down        0 1.00000
  • Check the tickets for the machines in Service Now. The ones of interest to us are named: [GNI] exception.scsi_blockdevice_driver_error_reported or exception.nonwriteable_filesystems.
    • If the repair service replaced the disk(s), it will be noted in the ticket, so you can continue with the next step.

On the OSD:

LVM formatting using ceph-volume

  • Simple format: osd as logical volume of one disk

This is a sample output of listing the disks in lvm fashion. You will notice that the number of devices (disks) in each osd is one. Also, these devices don't use any ssds for a performance boost.

(Ceph volume listing takes some time to complete)

[13:55][root@p05972678e21448 (production:ceph/erin/osd*30) ~]# ceph-volume lvm list

===== osd.335 ======

  [block]    /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba

      type                      block
      osd id                    335
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      encrypted                 0
      cephx lockbox secret      
      block uuid                PXCHQW-4aXo-isAR-NdYU-3FQ2-E18Q-whJa92
      block device              /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      vdo                       0
      crush device class        None
      devices                   /dev/sdw

===== osd.311 ======

  [block]    /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e

      type                      block
      osd id                    311
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  1bfad506-c450-4116-8ba5-ac356be87a9e
      encrypted                 0
      cephx lockbox secret      
      block uuid                O5fYcf-aGW8-NVWC-lr5G-BsuC-Yx3H-WZl24a
      block device              /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e
      vdo                       0
      crush device class        None
      devices                   /dev/sdt

This is an example of an osd that uses an ssd for its metadata. It has a db part in which the metadata is stored.

[14:04][root@p06253939e35392 (production:ceph/dwight/osd*24) ~]# ceph-volume lvm list


====== osd.29 ======

  [block]    /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48

      type                      block
      osd id                    29
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  dff889e7-5db5-4c5e-9aab-151e8ad17b48
      db device                 /dev/sdac3
      encrypted                 0
      db uuid                   9762cd49-8f1c-4c29-88ca-ff78f6bdd35c
      cephx lockbox secret      
      block uuid                HuzwbL-mVvi-Ubve-1C5D-fjeh-dmZq-ivNNnY
      block device              /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48
      crush device class        None
      devices                   /dev/sdk

  [  db]    /dev/sdac3

      PARTUUID                  9762cd49-8f1c-4c29-88ca-ff78f6bdd35c

====== osd.88 ======

  [block]    /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558

      type                      block
      osd id                    88
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  f19541f6-42b2-4612-a700-ec5ac8ed4558
      db device                 /dev/sdab6
      encrypted                 0
      db uuid                   f0b652e1-0161-4583-a50b-45a0a2348e9a
      cephx lockbox secret      
      block uuid                cHqcZG-wsON-P9Lw-4pTa-R1pd-GUwR-iqCMBg
      block device              /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558
      crush device class        None
      devices                   /dev/sdu

  [  db]    /dev/sdab6

      PARTUUID                  f0b652e1-0161-4583-a50b-45a0a2348e9a

One way is to have an ssd and do simple partitioning. Each partition will be attached to an osd. If the ssd part is broken, e.g. the disk failed, all the osds that use this ssd will be rendered useless; therefore each of those osds has to be replaced. There is also a chance that the ssd is formatted through lvm; in that case, the metadata database part will look like this:

[  db]    /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85

    type                      db
    osd id                    220
    cluster fsid              e7681812-f2b2-41d1-9009-48b00e614153
    cluster name              ceph
    osd fsid                  81f9ed48-d27d-44b6-9ac0-f04799b5d0d5
    db device                 /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
    encrypted                 0
    db uuid                   wZnT18-ZcHh-jcKn-pHic-LCrL-Gpqh-Srq1VL
    cephx lockbox secret      
    block uuid                z8CwSn-Iap5-3xDX-v0NA-9smI-EHx1-EGxdAR
    block device              /dev/ceph-6bb8b94b-4974-44b0-ae6c-667896807328/osd-a59a8661-c966-443b-9384-b2676a3d42d8
    vdo                       0
    crush device class        None
    devices                   /dev/md125

Replacement procedure: one disk per osd

  1. ceph-volume lvm list is slow, save its output to ~/ceph-volume.out and work with that file instead.
  2. Check if the ssd device exists and it is failed.
  3. Check if it is used as a metadata database for osds, or as a regular osd.
    1. If it is a metadata database:
      1. Locate all osds that use it (lvm list + grep)
      2. Follow the procedure for each affected osd
    2. Treat it as a regular osd (normal replacement)
  4. Mark out the osd: ceph osd out $OSD_ID
  5. Destroy the osd: ceph osd destroy $OSD_ID --yes-i-really-mean-it
  6. Stop the osd daemon: systemctl stop ceph-osd@$OSD_ID
  7. Unmount the filesystem: umount /var/lib/ceph/osd/ceph-$OSD_ID
  8. If the osd uses a metadata database (on ssds):
    1. If it is a regular partition, remove the partition
    2. If it's an lvm, remove it:
      1. eg for "/dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85"
      2. lvremove cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
  9. Run ceph-volume lvm zap /dev/sdXX --destroy
  10. In case ceph-volume fails to list the defective devices or zap the disks, you can get the information you need through lvs -o devices,lv_tags | grep type=block and use vgremove instead for the osd block.
  11. In case you can't get any information through ceph-volume or lvs about the defective devices, you should list the working osds and umount the unused folders with:
    $ umount `ls -d -1 /var/lib/ceph/osd/* | grep -v -f <(grep -Po '(?<=osd\.)[0-9]+' ceph-volume.out)`
    
  12. Now you should wait until the devices have been replaced. Skip this step if they have already been replaced.
  13. If the osd had a metadata database used elsewhere (ssd), you should prepare it again in the lvm case. For naming we use cache-`uuid -v4`. Just recreate the lvm you removed at step 8 with: lvcreate --name $name -l 100%FREE $VG. Lvm has three categories: PVs, which are the physical devices (e.g. /dev/sda); VGs, which are the volume groups that contain one or more physical devices; and LVs, which are the "partitions" of VGs. For simplicity we use 1 PV per VG, and one LV per VG. In case you have more than one LV per VG, when you recreate it use, e.g. for 4 LVs per VG, 25%VG instead of 100%FREE.
  14. Recreate the OSD using ceph volume, use a destroyed osd's id from the same host
    $ ceph-volume lvm create --bluestore --data /dev/sdXXX --block.db (VG/LV or ssd partition) --osd-id XXX
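
To summarise the simple case above (one disk per osd, no separate db device), a minimal end-to-end sketch, assuming osd.123 backed by /dev/sdx:

OSD_ID=123
ceph osd out $OSD_ID
ceph osd destroy $OSD_ID --yes-i-really-mean-it
systemctl stop ceph-osd@$OSD_ID
umount /var/lib/ceph/osd/ceph-$OSD_ID
ceph-volume lvm zap /dev/sdx --destroy
# ... wait for the physical disk to be replaced, then recreate the osd reusing the same id:
ceph-volume lvm create --bluestore --data /dev/sdx --osd-id $OSD_ID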
    

Replacement procedure: two disks striped (raid 0) per osd

  1. Run this script with the defective device ceph-scripts/tools/lvm_stripe_remove.sh /dev/sdXXX (it doesn't take a list of devices)
  2. The program will report what cleanup it did, you will need the 2nd and 3rd line, which are the two disks that make the failed osd and the last line which is the osd id.
  3. In case the script fails, you can open it (it is documented) and follow the steps manually.
  4. If you have more than one osd to replace you can repeat steps 1 and 2, as the 5th step can be done at the end.
  5. Pass the set of disks from step 1 after you have all of them working on this script:
    ceph-scripts/ceph-volume/striped-osd-prepare.sh /dev/sd[a-f]
    
    It uses ls inside so you can use wildcards if you don't want to write '/dev/sdX' all the time.
  6. It will output a list of commands to be executed in order, run all EXCEPT THE ceph-volume create one. Add at the end of the ceph-volume create line the argument --osd-id XXX with the number of the destroyed osd id, and run the command.

Retrieve metadata information from Openstack

Ceph is tightly integrated with Openstack, and the latter is the main access point to the storage from the user perspective. As a result, Openstack is the main source of information for the data stored on Ceph: project names, project owners, quotas, etc. Some notable exceptions remain, for example local S3 accounts on Gabe and the whole Nethub cluster.

This page collects some example of what it is possible to retrieve from Openstack to know better the storage items we manage.

The magic "services" project

To gain visibility on the metadata stored by Openstack, you need access to the services project in Openstack. Typically all members of ceph-admins are part of it. services is a special project with storage administrator capabilities that allows retrieving various pieces of information on the whole Openstack instance and on existing projects, compute resources, storage, etc...

Use the services project simply by setting:

OS_PROJECT_NAME=services

Openstack Projects

Get the list of openstack projects with their names and IDs:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack project list | head -n 10
+--------------------------------------+------------------------------------------------------------------+
| ID                                   | Name                                                             |
+--------------------------------------+------------------------------------------------------------------+
| 0000d664-f697-423b-8595-57aea89be355 | Stuff...                                                         |
| 0007808b-2f41-41c5-bd7c-3bd1f1f94cb2 | Other stuff...                                                   |
| 00100a6d-b71c-415d-9dbc-3f78c2b8372a | Stuff continues...                                               |
| 001d902d-f76e-4222-a5d0-ca6529e8221f | ...                                                              |
| 0026e800-f134-4622-b0ef-4a03283a3965 | ...                                                              |
| 00292adf-92ad-4815-966c-a9296266b0a0 | ...                                                              |
| 004b5668-4ebe-418d-83bc-1cdadf059c85 | ...                                                              |

Get details of a project:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack project show 5d8ea54e-697d-446f-98f3-da1ce8f8b833
+-------------+--------------------------------------+
| Field       | Value                                |
+-------------+--------------------------------------+
| chargegroup | af9298f2-041b-0944-7904-3b41fde4f97f |
| chargerole  | default                              |
| description | Ceph Storage Service                 |
| domain_id   | default                              |
| enabled     | True                                 |
| fim-lock    | True                                 |
| fim-skip    | True                                 |
| id          | 5d8ea54e-697d-446f-98f3-da1ce8f8b833 |
| is_domain   | False                                |
| name        | IT Ceph Storage Service              |
| options     | {}                                   |
| parent_id   | default                              |
| tags        | ['s3quota']                          |
| type        | service                              |
+-------------+--------------------------------------+

Identify the owner of a project:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack role assignment list --project 5d8ea54e-697d-446f-98f3-da1ce8f8b833 --names --role owner
+-------+------------------+-------+---------------------------------+--------+--------+-----------+
| Role  | User             | Group | Project                         | Domain | System | Inherited |
+-------+------------------+-------+---------------------------------+--------+--------+-----------+
| owner | dvanders@Default |       | IT Ceph Storage Service@Default |        |        | False     |
+-------+------------------+-------+---------------------------------+--------+--------+-----------+

Openstack Volumes

List the RBD volumes in a project:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack volume list --project 5d8ea54e-697d-446f-98f3-da1ce8f8b833
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+
| ID                                   | Name               | Status    | Size | Attached to                                                   |
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+
| 5143d9e4-8470-4ac4-821e-57ef99f24060 | buildkernel        | in-use    |  200 | Attached to 8afce55e-313f-432c-a764-b0ada783a268 on /dev/vdb  |
| c0f1a9f7-8308-412a-92da-afcc20db3c4c | clickhouse-data-01 | available |  500 |                                                               |
| 53406846-445f-4f47-b4c5-e8558bb1bbed | cephmirror-io1     | in-use    | 3000 | Attached to dfc9a14a-ff4b-490a-ab52-e6c9766205ad on /dev/vdc  |
| c2c31270-0b95-4e28-9ac0-6d9876ea7f32 | metrictank-data-01 | in-use    |  500 | Attached to fbdff7a0-7b5b-47c0-b496-5a8afcc8e528 on /dev/vdb  |
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+

Show details of a volume:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack volume show c0f1a9f7-8308-412a-92da-afcc20db3c4c
+--------------------------------+-------------------------------------------+
| Field                          | Value                                     |
+--------------------------------+-------------------------------------------+
| attachments                    | []                                        |
| availability_zone              | ceph-geneva-1                             |
| bootable                       | false                                     |
| consistencygroup_id            | None                                      |
| created_at                     | 2021-11-04T08:34:51.000000                |
| description                    |                                           |
| encrypted                      | False                                     |
| id                             | c0f1a9f7-8308-412a-92da-afcc20db3c4c      |
| migration_status               | None                                      |
| multiattach                    | False                                     |
| name                           | clickhouse-data-01                        |
| os-vol-host-attr:host          | cci-cinder-qa-w01.cern.ch@beesly#standard |
| os-vol-mig-status-attr:migstat | None                                      |
| os-vol-mig-status-attr:name_id | None                                      |
| os-vol-tenant-attr:tenant_id   | 5d8ea54e-697d-446f-98f3-da1ce8f8b833      |
| properties                     |                                           |
| replication_status             | None                                      |
| size                           | 500                                       |
| snapshot_id                    | None                                      |
| source_volid                   | None                                      |
| status                         | available                                 |
| type                           | io1                                       |
| updated_at                     | 2021-11-04T08:35:15.000000                |
| user_id                        | tmourati                                  |
+--------------------------------+-------------------------------------------+

Show the snapshots for a volume in a project:

[ebocchi@aiadm84 ~]$ OS_PROJECT_NAME=services openstack volume snapshot list --project 79b9e379-f89d-4b3a-9827-632b9bf16e98 --volume d182a910-b40a-4dc0-89b7-890d6fa01efd
+--------------------------------------+-------------------+-------------+-----------+-------+
| ID                                   | Name              | Description | Status    |  Size |
+--------------------------------------+-------------------+-------------+-----------+-------+
| 798d06dc-6af4-420d-89ce-1258104e1e0f | snapv_webstuff03  |             | available | 30000 |
+--------------------------------------+-------------------+-------------+-----------+-------+

Watchers preventing images from being deleted

OpenStack colleagues might report problems purging images:

[root@cci-cinder-u01 ~]# rbd -c /etc/ceph/ceph.conf --id volumes --pool volumes trash ls
2ccb86bd4fca85 volume-3983f035-a47f-46e8-868c-04d2345c3786
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
8df764f0d51e64 volume-eb48e00f-ea31-4d28-91a1-4f8319724da7
99e74530298e95 volume-18fbb3e6-fb37-4547-8d27-dcbc5056c2b2
ebcc84aa45a3da volume-821b9755-dd42-4bf5-a410-384339a2d9f0

[root@cci-cinder-u01 ~]# rbd -c /etc/ceph/ceph.conf --id volumes --pool volumes trash purge
2021-02-17 15:42:46.911 7f674affd700 -1 librbd::image::PreRemoveRequest: 0x7f6744001880 check_image_watchers: image has watchers - not removing
Removing images: 0% complete...failed.

Find out who the watchers are by using the identifier on the left-hand side:

[15:52][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rados listwatchers -p volumes rbd_header.2ccb86bd4fca85
watcher=188.184.103.106:0/964233084 client.634461458 cookie=140076936413376

Get in touch with the owner of the machine. The easiest way to fix stuck watchers is to reboot the machine.

Further information (might require untrash) about the volume can be found with

[18:31][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rbd info volumes/volume-00067659-3d1e-4e22-a5d7-212aba108500
rbd image 'volume-00067659-3d1e-4e22-a5d7-212aba108500':
    size 500 GiB in 128000 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: e8df4c4fe1aa8f
    block_name_prefix: rbd_data.e8df4c4fe1aa8f
    format: 2
    features: layering, striping, exclusive-lock, object-map
    op_features:
    flags:
    stripe unit: 4 MiB
    stripe count: 1

and with (no untrash required)

[18:32][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rados stat -p volumes rbd_header.e8df4c4fe1aa8f
volumes/rbd_header.e8df4c4fe1aa8f mtime 2020-11-23 10:25:56.000000, size 0

Unpurgeable RBD image in trash

We have seen a case of an image in Beesly's trash that cannot be purged:

# rbd --pool volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b

# rbd --pool volumes trash purge
Removing images: 0% complete...failed.
2021-03-10 13:58:42.849 7f78b3fc9c80 -1 librbd::api::Trash: remove:
error: image is pending restoration.

When trying to delete manually, it says there are some watchers, but this is actually not the case:

# rbd --pool volumes trash remove 5afa5e5a07b8bc
rbd: error: image still has watchers2021-03-10 14:00:21.262 7f93ee8f8c80
-1 librbd::api::Trash: remove: error: image is pending restoration.
This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client
to timeout.
Removing image:
0% complete...failed.

# rados listwatchers -p volumes rbd_header.5afa5e5a07b8bc
#

This has been reported upstream. Check:

  • ceph-users with subject "Unpurgeable rbd image from trash"
  • ceph-tracker https://tracker.ceph.com/issues/49716

The original answer was

$ rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc key_file
$ hexedit key_file   ## CHANGE LAST BYTE FROM '01' to '00'
$ rados -p volumes setomapval rbd_trash id_5afa5e5a07b8bc --input-file key_file
$ rbd trash rm --pool volumes 5afa5e5a07b8bc

To unstuck the image and make it purgeable

  1. Get the value for its ID in rbd_trash
# rbd -p volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
[09:42][root@p05517715d82373 (qa:ceph/beesly/mon*2:peon) ~]# rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc key_file
Writing to key_file
  2. Make a safety copy of the original key_file
# cp -vpr key_file key_file_master
  3. Edit the key_file with a hex editor and change the last byte from '01' to '00'
# hexedit key_file
  4. Make sure the edited file contains only that change
# xxd key_file > file
# xxd key_file_master > file_master
# diff file file_master
5c5
< 0000040: 2a60 09c5 d416 00                        *`.....
---
> 0000040: 2a60 09c5 d416 01                        *`.....
  5. Set the edited file to be the new value
# rados -p volumes setomapval rbd_trash id_5afa5e5a07b8bc < key_file
  6. Get it back and check that the last byte is now '00'
# rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc
value (71 bytes) :
00000000  02 01 41 00 00 00 00 2b  00 00 00 76 6f 6c 75 6d  |..A....+...volum|
00000010  65 2d 30 32 64 39 35 39  66 65 2d 61 36 39 33 2d  |e-02d959fe-a693-|
00000020  34 61 63 62 2d 39 35 65  32 2d 63 61 30 34 62 39  |4acb-95e2-ca04b9|
00000030  36 35 33 38 39 62 12 05  2a 60 09 c5 d4 16 12 05  |65389b..*`......|
00000040  2a 60 09 c5 d4 16 00                              |*`.....|
00000047
  7. Now you can finally purge the image
# rbd -p volumes trash purge
Removing images: 100% complete...done.
# rbd -p volumes trash ls
#

Undeletable image due to linked snapshots

We had a ticket (RQF2003413) of a user unable to delete a volume because of linked snapshots.

Dump the RBD info available on Ceph using the volume ID (see openstack_info) of the undeletable volume:

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical info --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd 
rbd image 'volume-d182a910-b40a-4dc0-89b7-890d6fa01efd':
    size 29 TiB in 7680000 objects
    order 22 (4 MiB objects)
    snapshot_count: 1
    id: 457afdd323be829
    block_name_prefix: rbd_data.457afdd323be829
    format: 2
    features: layering
    op_features:
    flags:
    access_timestamp: Fri Mar 25 12:19:12 2022

The snapshot_count reports 1, which indicates one snapshot exists for the volume.

Now, list the snapshots for the undeletable volumes:

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical snap ls --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd 
SNAPID  NAME                                           SIZE    PROTECTED  TIMESTAMP
    37  snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f  29 TiB  yes

In turn, it is possible to create volumes from snapshots. To check if they exist, list the child(ren) volume(s) from snapshots

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical children --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd --snap snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f
cinder-critical/volume-b9d0035f-857c-46b6-b614-4480c462d306

This last one is a brand-new volume that still keeps a reference to the snapshot it originates from:

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical info --image volume-b9d0035f-857c-46b6-b614-4480c462d306
rbd image 'volume-b9d0035f-857c-46b6-b614-4480c462d306':
    size 29 TiB in 7680000 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 7f8067e3510b0d
    block_name_prefix: rbd_data.7f8067e3510b0d
    format: 2
    features: layering, striping, exclusive-lock, object-map
    op_features:
    flags:
    access_timestamp: Fri Mar 25 12:20:51 2022
    modify_timestamp: Fri Mar 25 12:36:48 2022
    parent: cinder-critical/volume-d182a910-b40a-4dc0-89b7-890d6fa01efd@snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f
    overlap: 29 TiB
    stripe unit: 4 MiB
    stripe count: 1

The parent field shows the volume comes from a snapshot, which cannot be deleted as the volume-from-snapshot is implemented as copy-on-write (see overlap: 29 TiB) via RBD layering.

OpenStack can flatten volumes-from-snapshots in case these need to be made independent from the parent. Alternatively, to delete the parent volume, it is required to delete both the volume-from-snapshot and the snapshot.
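
As an illustration of what flattening does at the RBD level, here is a minimal sketch using the image and snapshot names from the example above. In practice the operation should be driven through OpenStack/Cinder so its database stays consistent; the raw rbd commands are shown only to clarify the mechanism.

rbd flatten cinder-critical/volume-b9d0035f-857c-46b6-b614-4480c462d306    # copy the parent data into the clone, dropping the parent reference
rbd snap unprotect cinder-critical/volume-d182a910-b40a-4dc0-89b7-890d6fa01efd@snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f    # only once no children are left
rbd snap rm cinder-critical/volume-d182a910-b40a-4dc0-89b7-890d6fa01efd@snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f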

Large omap object warning due to bucket index over limit

Large omap objects trigger HEALTH WARN messages and can be due to poorly sharded bucket indexes.

The following example reports an over-limit bucket on nethub detected on 2021/05/21.

  1. Look for Large omap object found. in the ceph logs (/var/log/ceph/ceph.log):
2021-05-21 04:34:00.879483 osd.867 (osd.867) 240 : cluster [WRN] Large omap object found. Object: 7:7bae080b:::.dir.fe32212d-631b-44fe-8d35-03f5a3551af1.142704632.29:head PG: 7.d01075de (7.de) Key count: 610010 Size (bytes): 198156342
2021-05-21 04:34:11.622372 mon.cephnethub-data-c116fa59b2 (mon.0) 659324 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)

These lines show that:

  • The pool suffering from the problem is pool number 7
  • The PG suffering is 7.de
  • The affected object is a bucket index shard: the .dir. prefix denotes bucket indexes
  • The affected bucket has id fe32212d-631b-44fe-8d35-03f5a3551af1.142704632.29 (sadly, there is no way to map it to a name)

To verify this is actually a bucket index, one can also check what pool #7 stores:

[14:21][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# ceph osd pool ls detail | grep "pool 7"
pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 30708 lfor 0/0/2063 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0 application rgw
  2. Run radosgw-admin bucket limit check to see how bucket index sharding is doing. It might take a while; it is recommended to dump the output to a file.
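For example (the output path is only illustrative):
radosgw-admin bucket limit check > /root/bucket_limit_check_$(date +%Y%m%d).json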

  3. Check the output of radosgw-admin bucket limit check and look for buckets whose "fill_status" is OVER:

{
    "bucket": "cboxbackproj-cboxbackproj-sftnight-lgdocs",
    "tenant": "",
    "num_objects": 767296,
    "num_shards": 0,
    "objects_per_shard": 767296,
    "fill_status": "OVER 749%"
},
  4. Check in the radosgw logs (use mco to look through all the RGWs) whether the radosgw process has recently tried to reshard the bucket but did not succeed. Example:
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:19:40.316 7fd2ce2a4700  1 check_bucket_shards bucket cboxbackproj-sftnight-lgdocs need resharding  old num shards 0 new num sh
ards 18
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:20:12.624 7fd2cd2a2700  0 NOTICE: resharding operation on bucket index detected, blocking
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:20:12.625 7fd2cd2a2700  0 RGWReshardLock::lock failed to acquire lock on cboxbackproj-sftnight-lgdocs:fe32212d-631b-44fe-8d35-
03f5a3551af1.142705079.19 ret=-16

This only applies if dynamic resharding is enabled:

[14:27][root@cephnethub-data-0509dffff2 (qa:ceph/nethub/traefik*26) ~]# cat /etc/ceph/ceph.conf  | grep resharding
rgw dynamic resharding = true
  5. Reshard the bucket index manually:
radosgw-admin reshard add --bucket cboxbackproj-cboxbackproj-sftnight-lgdocs --num-shards 18
  • The number of shards can be inferred from the logs inspected at point 4. If dynamic resharding is disabled, a little math is required: check the bucket stats (radosgw-admin bucket stats --bucket <bucket_name>) and make sure usage --> rgw.main --> num_objects divided by the number of shards does not exceed 100000 (50000 is recommended).

Example:

[14:29][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# radosgw-admin bucket stats --bucket cboxbackproj-cboxbackproj-sftnight-lgdocs
{
    "bucket": "cboxbackproj-cboxbackproj-sftnight-lgdocs",
[...]
    "usage": {
        "rgw.main": {
            "size": 4985466767640,
            "size_actual": 4987395952640,
            "size_utilized": 4985466767640,
            "size_kb": 4868619891,
            "size_kb_actual": 4870503860,
            "size_kb_utilized": 4868619891,
            "num_objects": 941202
        }
    },
}

with 941202 / 18 = 52289
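
A quick sketch to compute the minimum shard count for a target of ~50000 objects per shard (num_objects taken from the bucket stats above; the 18 shards chosen in this case came from the RGW logs and give ~52k objects per shard, which is also acceptable):

NUM_OBJECTS=941202
TARGET=50000
echo $(( (NUM_OBJECTS + TARGET - 1) / TARGET ))   # ceiling division -> 19 shards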

5b. Once the bucket to be resharded has been added, start the reshard process:

radosgw-admin reshard list
radosgw-admin reshard process
  6. Check after some time that radosgw-admin bucket stats --bucket <bucket_name> reports the right number of shards and that radosgw-admin bucket limit check no longer shows OVER or WARNING for the re-sharded bucket.

  7. To clear the HEALTH_WARN message for the large omap object, start a deep scrub on the affected pg:

[14:31][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# ceph pg deep-scrub 7.de
instructing pg 7.de on osd.867 to deep-scrub

Ceph logging [WRN] evicting unresponsive client

This warning shows that a client stopped responding to messages from the MDS. Sometimes it is harmless (perhaps the client disconnected "uncleanly", e.g. a hard reboot); in other cases it indicates that the client is overloaded or deadlocked on something else.

If the same client is appearing repeatedly, it may be useful to get in touch with the owner of the client machine. (ai-dump <hostname> on aiadm).
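
To map the warning to a client, the MDS session list can be inspected and matched against the client id or hostname reported in the log. A minimal sketch (the MDS daemon name is illustrative; depending on the Ceph release, session ls is available via ceph tell or the admin socket):

ceph tell mds.cephflax-mds-xxxxxxxxxx session ls | less    # from a node with the admin keyring
ceph daemon mds.$(hostname -s) session ls                  # alternatively, on the MDS host itself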

Ceph logging [WRN] clients failing to respond to cache pressure

When the MDS cache is full, it needs to clear inodes from its cache. This normally means that the MDS also has to ask some clients to remove inodes from their caches.

If the client fails to respond to this cache recall request, then Ceph will log this warning.

Clients stuck in this state for an extended period of time can cause issues -- follow up with the machine owner to understand the problem.

Note: Ceph-fuse v13.2.1 has a bug which triggers this issue -- users should update to a newer client release.

Ceph logging [WRN] client session with invalid root denied

This means that a user is trying to mount a Manila share that either doesn't exist or for which they have not yet created a key. It is harmless, but if it repeats, get in touch with the user.

Procedure to unblock hung HPC writes

An HPC client was stuck like this for several hours:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02
failing to respond to capability release client_id: 69092525
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec

Indeed there was a hung write on hpc070.cern.ch:

# cat /sys/kernel/debug/ceph/*/osdc
245540  osd100  1.9443e2a5 1.2a5   [100,1,75]/100  [100,1,75]/100
e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001
0x400024        1 write

I restarted osd.100 and the deadlocked request went away.
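
A sketch of the same recovery, generalised: the osdc dump names the blocked OSD (osd100 above), so locate its host and restart that daemon.

ceph osd find 100                                  # prints the host and CRUSH location of osd.100
ssh root@<osd_host> systemctl restart ceph-osd@100
ceph health detail                                 # the late-release / slow request warning should clear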

S3 Operations notes

Note: If you are looking for the old notes related to the infrastructure based on consul and nomad, please refer to the old documentation.


About the architecture

The CERN S3 service (s3.cern.ch) is provided by the gabe cluster and an arbitrary number of radosgw running on VMs. Each node in the ceph/gabe/radosgw hostgroup also runs a reverse-proxy daemon (Træfik), to spread the load on the VMs running a radosgw and to route traffic to different dedicated RGWs (cvmfs, gitlab, ...).

A second S3 cluster (s3-fr-prevessin-1.cern.ch) is also available in the Prevessin Network Hub (nethub).

Both clusters (as of July 2021) use similar technologies: Ceph, RGWs, Træfik, Logstash, ....

Components

  • RadosGW: Daemon handling S3 requests and interacting with the Ceph cluster
  • Træfik: Handles HTTP(S) requests from the Internet and spreads the load on radosgw daemons.
  • Logstash: Sidecar process that ships the access logs produced by Træfik to the MONIT infrastructure.

Useful documentation

  • Upstream RadosGW documentation: (https://docs.ceph.com/en/nautilus/radosgw/)
  • Upstream documentation on radosgw-admin tool: (https://docs.ceph.com/en/nautilus/man/8/radosgw-admin/)
  • Træfik documentation: (https://docs.traefik.io/)
  • S3 Script guide: (https://gitlab.cern.ch/ceph/ceph-guide/-/blob/master/src/ops/s3-scripts.md)

Dashboards

  • Træfik: http://s3.cern.ch/traefik/ (requires basic auth)
  • ElasticSearch for access logs: https://es-ceph.cern.ch/ (from CERN network only)
  • Various S3 dashboards (and underlying Ceph clusters) on Filer Carbon
  • Buckets rates (and others) on Monit Grafana

Maintenance Tasks

Removal of one Træfik/RGW machine from the cluster

Each machine running Træfik/RGW is:

  • Part of the s3.cern.ch alias (managed by lbclient), with Træfik accepting connections on port 80 and 443 for HTTP and HTTPS, respectively

  • A backend RadosGW for all the Træfiks of the cluster, with the Ceph RadosGW daemon accepting connections on port 8080

  • To remove a machine from s3.cern.ch, touch /etc/nologin or change the roger status to intervention/disabled (roger update --appstate=intervention <hostname>). This will make lbclient return a negative value and the machine will be removed from the alias.

  • To temporarily remove a RadosGW from the list of backends (e.g., for a cluster upgrade), touch /etc/nologin and the RadosGW process will return 503 for requests to /swift/healthcheck. This path is used by the Træfik healthcheck and, if the return code is different from 200, Træfik will stop sending requests to that backend. Wait a few minutes to let in-flight requests complete, then restart the RadosGW process without clients noticing. See the Pull Request implementing the healthcheck disabling path.

  • To permanently remove a RadosGW from the list of backends (e.g., decommissioning), change the Træfik dynamic configuration via puppet in traefik.yaml by removing the machine from the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm:

[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3

 * [ ============================================================> ] 14 / 14

Finished processing 14 / 14 hosts in 114.60 ms

Create a new Træfik/RGW VM

  • Spawn a new VM with the script cephgabe-rgwtraefik-create.sh from aiadm
  • Wait for the VM to be online and run puppet several times so that the configuration is up to date
  • Make sure you have received the email confirming the VM has been added to the firewall set (and so it is reachable from the big Internet)
  • Make sure the new VM serves requests as expected (test IPv4 and IPv6, HTTP and HTTPS):
curl -vs --resolve s3.cern.ch:{80,443}:<ip_address_of_new_VM> http(s)://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
  • Add the VM to the Prometheus s3_lb job (see prometheus puppet config) to monitor its availability and collect statistics on failed (HTTP 50*) requests
  • Change the roger status to production and enable all alarms. The machine will now be part of the s3.cern.ch alias
  • Update the Træfik dynamic configuration via puppet in traefik.yaml by adding the new backend to the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).

Change/Add/Remove the backend RadosGWs

  • Edit the list of backend nodes in the Træfik dynamic configuration via puppet in traefik.yaml by adding/removing/shuffling around the servers. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).
  • If adding/removing, make sure the list of monitored endpoints by Prometheus is up to date. See prometheus puppet config.

Change Træfik TLS certificate

The certificate is provided by CDA. You should ask them to buy a new one with the correct SANs. Once the new certificate is provided, paste it into https://tools.keycdn.com/certificate-chain -- it will return a certificate chain with all the required intermediate certificates. This certificate chain is the one to be put in Teigi and used by Træfik. Please split the chain and check the validity of each certificate with openssl x509 -in <filename> -noout -text. Typically, the root CA certificate, the intermediate certificate and the private key do not change.
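
A minimal sketch to split the downloaded chain and check each certificate's subject and dates (file names are illustrative):

awk '/-----BEGIN CERTIFICATE-----/{n++} {print > ("cert" n ".pem")}' chain.pem
for f in cert*.pem; do openssl x509 -in "$f" -noout -subject -issuer -enddate; done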

Once validated, the certificate chain should be put in Teigi under ceph/gabe/radosgw/traefik:

  • s3_ch_ssl_certificate
  • s3_ch_ssl_private_key

Next, the certificate must be deployed on all machines via puppet. Mcollective can be of help to bulk-run puppet on all the Træfik machines:

[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3

 * [ ============================================================> ] 14 / 14

Finished processing 14 / 14 hosts in 114.60 ms

Last, the certificate must be loaded by Træfik. While the certificate is part of Træfik's dynamic configuration, Træfik does not seem to reload it if the certificate file (distributed via puppet) changes on disk. Puppet will still notify the Træfik service when the certificate file changes (see traefik.pp) to no avail.

Since 2022, a configuration change in Træfik (Traefik: hot-reload certificates when touching (or editing) dynamic file) allows reloading the certificate when the Traefik dynamic configuration file changes. It is sufficient to touch /etc/traefik/traefik.dynamic.conf to have the certificate reloaded, with no need to drain the machine and restart the Traefik process:

  • Make sure the new certificate file is available on the machine (/etc/ssl/certs/radosgw.crt)
  • Tail the logs of the Traefik service: tail -f /var/log/traefik/service.log
  • Touch Traefik's dynamic configuration file: touch /etc/traefik/traefik.dynamic.conf
  • Check the new certificate is in place:
curl -vs --resolve s3.cern.ch:443:<the_ip_address_of_the_machine> https://s3.cern.ch --output /dev/null 2>&1 | grep ^* | grep date
*  start date: Mar  1 00:00:00 2022 GMT
*  expire date: Mar  1 23:59:59 2023 GMT

The same certificates are also used by the Nethub cluster and distributed via Teigi under ceph/nethub/traefik:

  • s3_fr_ssl_certificate
  • s3_fr_ssl_private_key

Quota alerts

There is a daily cronjob that checks S3 user quota usage and sends a list of accounts reaching 90% of their quota. Upon reception of this email, we should get in touch with the user and see if they can (1) free some space by deleting unnecessary data or (2) request more space.

Currently, there are some RGW accounts that come without an associated email address. A way to investigate who owns such an account is to log into aiadm.cern.ch and run the following command (in /root/ceph-scripts/tools/s3-accounting/):

./cern-get-accounting-unit.sh --id `./s3-user-to-accounting-unit.py <rgw account id>`

This will give you the username of the owner of the associated OpenStack tenant, together with their contact email address.

Further notes on s3.cern.ch alias

The s3.cern.ch alias is managed by aiermis and/or by the kermis CLI utility on aiadm:

[ebocchi@aiadm81 ~]$ kermis -a s3 -o read
INFO:kermis:[
    {
        "AllowedNodes": "",
        "ForbiddenNodes": "",
        "alias_name": "s3.cern.ch",
        "behaviour": "mindless",
        "best_hosts": 10,
        "clusters": "none",
        "cnames": [],
        "external": "yes",
        "hostgroup": "ceph/gabe/radosgw",
        "id": 3019,
        "last_modification": "2018-11-01T00:00:00",
        "metric": "cmsfrontier",
        "polling_interval": 300,
        "resource_uri": "/p/api/v1/alias/3019/",
        "statistics": "long",
        "tenant": "golang",
        "ttl": null,
        "user": "dvanders"
    }
]

As of July 2021, the alias returns the 10 best hosts (based on the lbclient score) out of all the machines that are part of the alias, which are typically more. Also, the members of the alias are refreshed every 5 minutes (300 seconds).

Upgrading software

Upgrade mon/mgr/osd

Follow the procedure defined for the other Ceph clusters. In a nutshell:

  • Start with mons, then mgrs. OSDs go last.
  • If upgrading OSDs, ceph osd set {noin, noout}
  • yum update to update the packages (check that the ceph package is actually upgraded)
  • systemctl restart ceph-{mon, mgr, osd}
  • Always make sure the daemons came back alive and all OSDs re-peered before continuing with the next machine
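
A minimal command sketch of the sequence above for an OSD host (flag and unit names as in stock Ceph packaging; adapt the target for mon/mgr hosts):

ceph osd set noout && ceph osd set noin       # as per the list above, only for OSD upgrades
yum update -y ceph                            # check the ceph package version actually changed
systemctl restart ceph-osd.target             # ceph-mon.target / ceph-mgr.target on mon/mgr hosts
ceph -s                                       # wait until all daemons rejoin and PGs are active+clean
ceph osd unset noin && ceph osd unset noout   # once the whole cluster is done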

Upgrading RGW

To safely upgrade the RadosGW, touch /etc/nologin to have it return 503 to the healthcheck probes from Træfik (see the healthcheck disabling path above). This drains the RadosGW: no new requests are sent to it and in-flight ones finish gently.

After a few minutes, one can assume there are no more in-flight requests and the RadosGW can be updated and restarted (restart the radosgw service, e.g. systemctl restart ceph-radosgw.target). Make sure the RadosGW came back alive by tailing the log at /var/log/ceph/ceph-client.rgw.*; it should still return 503 to the Træfik healthchecks. Now remove /etc/nologin and check that requests flow with 200.
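
Putting the above together, a sketch of the drain/upgrade cycle on a single RadosGW host (port 8080 and the healthcheck path are those described earlier; package and unit names follow standard Ceph packaging):

touch /etc/nologin
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/swift/healthcheck   # expect 503
sleep 300                                    # give in-flight requests time to complete
yum update -y ceph-radosgw
systemctl restart ceph-radosgw.target
tail -n 50 /var/log/ceph/ceph-client.rgw.*   # confirm the daemon came back alive
rm -f /etc/nologin                           # re-enable the backend; requests should now flow with 200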

Upgrading Træfik

To safely upgrade Træfik, the frontend machine must be removed from the load-balanced alias by touching /etc/nologin (this will also disable the RadosGW due to the healthcheck disabling path -- see above). Wait for some time and make sure no (or little) traffic is handled by Træfik by checking its access logs (/var/log/traefik/access.log). Some clients (e.g., GitLab, CBack) are particularly sticky and rarely re-resolve the alias to IPs -- there is nothing you can do to push those clients away.

When no (or little) traffic goes through Træfik, update the traefik::version parameter and run puppet. The new Træfik binary will be installed on the host and the service will be restarted.

Check with curl that Træfik works as expected. Example:

$ curl -vs --resolve s3.cern.ch:80:188.184.74.136 http://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
* Added s3.cern.ch:80:188.184.74.136 to DNS cache
* Hostname s3.cern.ch was found in DNS cache
*   Trying 188.184.74.136:80...
* TCP_NODELAY set
* Connected to s3.cern.ch (188.184.74.136) port 80 (#0)
> GET /cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished HTTP/1.1
> Host: s3.cern.ch
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Bucket: cvmfs-atlas
< Cache-Control: max-age=61
< Content-Length: 601
< Content-Type: application/x-cvmfs
< Date: Fri, 22 Apr 2022 14:45:27 GMT
< Etag: "b5dbc3633d7bb27d10610f5f1079a192"
< Last-Modified: Fri, 22 Apr 2022 14:11:10 GMT
< X-Amz-Request-Id: tx00000000000000143ffd3-006262bf87-28e3e206-default
< X-Rgw-Object-Type: Normal
< 
Ca5b48a4ed8f0ca46b79584104564da32b42a1c45
B1385472
Rd41d8cd98f00b204e9800998ecf8427e
D240
S103476
Gno
Ano
Natlas.cern.ch
{...cut...}
* Connection #0 to host s3.cern.ch left intact

If successful, allow the machine to join the load-balanced pool by removing /etc/nologin.

S3 radosgw-admin operations

radosgw-admin is used to manage users, quotas, buckets, indexes, and all other aspects of the radosgw service.

Create a user

End-users get S3 quota from OpenStack (see Object Storage).

In special cases (e.g., Atlas Event Index, CVMFS Stratum 0s, GitLab, Indico, ...), we create users that exist only in Ceph and are not managed by OpenStack. To create a new user of this kind, it is needed to know user_id, email address, display name, quota (optional).

Create the user with:

radosgw-admin user create --uid=<user_id> --email=<email_address> --display-name=<display_name>

To set a quota for the user:

radosgw-admin quota set --quota-scope=user --uid=<user_id> --max-size=<quota>
radosgw-admin quota enable --quota-scope=user --uid=<user_id>

Example:

radosgw-admin user create --uid=myuser --email="myuser@cern.ch" --display-name="myuser"
radosgw-admin quota set --quota-scope=user --uid=myuser --max-size=500G
radosgw-admin quota enable --quota-scope=user --uid=myuser

Change user quota

It is sufficient to set the updated quota value for the user:

radosgw-admin quota set --quota-scope=user --uid=<user_id> --max-size=<quota>
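
To verify that the new limit is in place (a sketch; <user_id> is a placeholder and the quota appears in the user_quota section of the user info output):

radosgw-admin user info --uid=<user_id> | grep -A 6 '"user_quota"'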

Bucket resharding

RGW shards bucket indices over several objects. The default number of shards per index is 32 in our clusters. It is best practice to keep the number of objects per shard below 100000. You can check the compliance across all buckets with radosgw-admin bucket limit check.

If there is a bucket with "fill_status": "OVER 100.000000%" then it should be resharded. E.g.

> radosgw-admin bucket reshard --bucket=lhcbdev-test --num-shards=128
tenant: 
bucket name: lhcbdev-test
old bucket instance id: 61c59385-085d-4caa-9070-63a3868dccb6.24333603.1
new bucket instance id: 61c59385-085d-4caa-9070-63a3868dccb6.76824996.1
total entries: 1000 2000 ... 8599000 8599603
2019-06-17 09:27:47.718979 7f2b7665adc0  1 execute INFO: reshard of bucket "lhcbdev-test" from "lhcbdev-test:61c59385-085d-4caa-9070-63a3868dccb6.24333603.1" to "lhcbdev-test:61c59385-085d-4caa-9070-63a3868dccb6.76824996.1" completed successfully

SWIFT protocol for quota information

It is convenient to use the SWIFT protocol to retrieve quota information.

  1. Create the SWIFT user as a subuser:
radosgw-admin subuser create --uid=<user_id> --subuser=<user_id>:swift --access=full

This generates a secret key that can be used on the client side to authenticate with SWIFT.

  2. On clients, install the swift package (provided in the OpenStack repo on linuxsoft) and retrieve quota information with:
swift \
    -V 1 \
    -A https://s3.cern.ch/auth/v1.0 \
    -U <user_id>:swift \
    -K <secret_key> \
    stat 

S3 logging

Access logs from the Træfik reverse-proxy are collected by a side-car process called fluentbit, which pushes them to the Monit Logs infrastructure. From there they are filtered and enriched by Logstash running on Monit Marathon, and eventually pushed to HDFS (/project/monitoring/archive/s3/logs) and to Elasticsearch for storage and visualization.

fluentbit on S3 RadosGWs

Since late April 2022, we use fluentbit on the RadosGW+Træfik frontends as it is much lighter on memory than Logstash (which we were using previously).

fluentbit tails the log files produced by Træfik (both the HTTP access logs and the Træfik daemon logs), adds a few fields and context through metadata, and pushes the records to the Monit Logs infrastructure at monit-logs-s3.cern.ch:10013/s3 using TLS encryption.

It is installed via puppet (example for Gabe) by using the shared class fluentbit.pp, responsible for the installation and configuration of the fluentbit service.

fluentbit on the RadosGWs+Træfik frontends is configured to tail two input files, namely the access (/var/log/traefik/access.log) and the daemon (/var/log/traefik/service.log) logs of Træfik. Logs from the access (daemon) file are tagged as traefik.access.* (traefik.service.*), labelled as s3_access (s3_daemon). Before sending to the Monit infrastructure, the message is prepared to define the payload data and metadata (see monit.lua):

  • producer is s3 (used to build path on HDFS) -- must be whitelisted on the Monit infra;
  • type defines if the logs are access or daemon (used to build path on HDFS);
  • index_prefix defines the index for the logs (it is used by Logstash on Monit Marathon and on Elasticsearch).

Logstash on Monit Marathon

Logstash is the tool that reads the aggregated log stream from Kafka, does most of the transformation and writes to Elasticsearch.

This Logstash process runs in a Docker container on the Monit Marathon cluster (see Applications --> storage --> s3logs-to-es). For debugging purposes, stdout and stderr of the container are available on monit-spark-master.cern.ch:5050/ -- they are not accessible from Marathon.

The Dockerfile, configuration pipeline, etc., are stored in s3logs-to-es.

This Logstash instance:

  • removes the additional fields introduced by the Monit infrastructure (metadata unused by us)
  • parses the original message as json document
  • adds costing information
  • adds geographical information of the client IP (geoIP)
  • copies a subset of fields relevant for CSIR to a different index
  • ...and pushes the results (full logs, and CSIR stripped version) to Elasticsearch

Elasticsearch

We finally have our dedicated Elasticsearch instance managed by the Elasticsearch Service.

There is not much to configure on our side, just a few useful links and the endpoint config repository.

Data is kept for:

  • 10 days on fast SSD storage, local to the ES cluster
  • another 20 days (30 total) on Ceph storage
  • 13 months (stripped-down version, some fields are filtered out -- see below) for CSIR purposes

Indexes on ES must start with ceph_s3. This is the only whitelisted pattern, and hence the only one allowed. We currently use different indexes:

  • ceph_s3_access: Access logs for Gabe (s3.cern.ch)
  • ceph_s3_daemon: Traefik service logs for Gabe
  • ceph_s3_access-csir: Stripped down version of Gabe access logs for CSIR, retained for 13 months
  • ceph_s3_fr_access: Access logs of Nethub (s3-fr-prevessin-1.cern.ch)
  • ceph_s3_fr_daemon: Traefik service logs for Nethub
  • ceph_s3_fr_access-csir: Stripped down version of Nethub access logs for CSIR, retained for 13 months

ES is also a data source for Monit grafana dashboards:

  • Grafana uses basic auth to ES with user ceph_ro:<password> (The password is stored in Teigi: ceph/gabe/es-ceph_ro-password)
  • ES must have the internal user ceph_ro configured with permissions to read ceph* indexes

HDFS

HDFS is solely used as a storage backend to keep the logs for 13 months for CSIR purposes. As of July 2021, HDFS stores the full logs (to be verified that they do not eat too much space on HDFS). To check/read logs on HDFS, you must have access to the HDFS cluster (see prerequisites); then, from lxplus:

source /cvmfs/sft.cern.ch/lcg/views/LCG_99/x86_64-centos7-gcc8-opt/setup.sh
source /cvmfs/sft.cern.ch/lcg/etc/hadoop-confext/hadoop-swan-setconf.sh analytix 3.2 spark3
kinit
hdfs dfs -ls /project/monitoring/archive/s3/logs

Centos Stream 8 migration

All the information regarding centos stream 8 can be found in this document.

Upgrading from Centos 8 in place

  1. Create new CS8 nodes with representative configurations and validate

  2. Enable the upgrade (top-level hostgroup, sub-hostgroup, etc)

    base::migrate::stream8: true
    
  3. Follow the instructions

    • Run Puppet twice.
    • Run distro-sync.
    • Reboot.
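
On each node, the instructions above boil down to roughly the following (a sketch; the exact repositories and parameters come from the migration document linked above):

puppet agent -t ; puppet agent -t   # two Puppet runs to pick up the Stream 8 configuration
dnf distro-sync -y                  # move the installed packages to their CentOS Stream 8 versions
reboot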

CephFS Backup via cback

CephFS backups are currently added on demand. Request a backup by opening a ticket to the Ceph Service.

Backup Characteristics

  • Stored in Nethub cluster (Prevessin, FR)
  • Snapshot based. Not point-in-time consistent (no CephFS snapshots, no fsfreeze or similar)
  • By default, we keep the last 7 daily snapshots, the last 5 weekly snapshots and the last 6 monthly snapshots.
  • Backup repositories are encrypted.

Add new backup job

  • Use the following procedure Link
  • Enabled clusters: flax, levinson, pam, doyle

Restore data

cback repos and documentation

See cback.docs.cern.ch/.

Improve me !