User's Guide

This chapter provides minimal Ceph user documentation.

Block Storage and CephFS documentation is available at the URLs given in the sections below.

Which Storage Service is Right for Me?

I need an extra drive for my OpenStack VM:

I need a POSIX filesystem shared across a small number of servers:

I need storage which is accessible from lxplus, lxbatch, or the WLCG:

I need to share my files or collaborate with colleagues:

I need HTTP accessible cloud storage for my application:

I need to distribute static software or data globally:

I need to archive data to tape:

Using Block Storage

Block storage is accessible via OpenStack VMs as documented here: https://clouddocs.web.cern.ch/clouddocs/details/volumes.html

Using CephFS

CephFS is made available via OpenStack Manila. See https://clouddocs.web.cern.ch/file_shares/index.html for more info.

Using S3 or SWIFT

S3/Swift is made available via OpenStack. See https://clouddocs.web.cern.ch/object_store/README.html for more info.

Configure aws cli

The aws s3api command is useful for advanced S3 operations, e.g. dealing with object versions. The following explains how to set it up with our s3.cern.ch endpoint.

Setting up aws

All of the information required to set up aws-cli can be found in the .s3cfg file you already use for S3.

$> yum install awscli 
$> aws configure
AWS Access Key ID [None]: <your access key> 
AWS Secret Access Key [None]: <your secret key>
Default region name [None]:
Default output format [None]:

Testing

$> aws --endpoint-url=http://s3.cern.ch s3api list-buckets
{
  "Buckets": [
     {
         "Name": <bucket1>,
         "CreationDate": <timestamp> 
     },
     {
       ....
     }
   ],
   "Owner": {
        "DisplayName": <owner>,
        "ID": <owner id>
    }

}

Delete all object versions

We provide a script to help users make sure all versions of their objects are deleted.

Usage:

$> ./s3-delete-all-object-versions.sh -b <bucket> [-f]
   -b: bucket name to be cleaned up
   -f: if omitted, the script will simply display a summary of actions. Add -f to execute them. 
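
For reference, the operations such a cleanup performs can also be run by hand with aws s3api (a minimal sketch, not the script itself; bucket, key and version id are placeholders):

$> aws --endpoint-url=http://s3.cern.ch s3api list-object-versions --bucket <bucket>
$> aws --endpoint-url=http://s3.cern.ch s3api delete-object --bucket <bucket> --key <key> --version-id <version-id>

A full cleanup loops the delete over every version (and delete marker) returned by the listing.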

Useful links

AWS reference manual

Operator's Guide

Create a Ceph Test Cluster

Create CEPH Cluster

Prepare the hostgroups

  • Log in to Foreman and create the following hostgroups:
    • <my_cluster>, selecting ceph as the parent hostgroup.
    • For the monitors, create the hostgroup mon and select ceph/<my_cluster> as the parent group.
    • For the OSDs, create the hostgroup osd and select ceph/<my_cluster> as the parent group.
    • For the metadata servers, create the hostgroup mds and select ceph/<my_cluster> as the parent.
    • Do the puppet configuration:
      • Clone the repo it-puppet-hostgroup-ceph
      • Create the manifests and data files for the new cluster (use the configuration of another cluster as a base)
      • Remember to create a new uuid for the cluster and put it in /code/hostgroup/ceph/<my_cluster>.yaml
      • Commit, push, submit a merge request, etc.

First Monitor Configuration

  • 1 ) Create one virtual machine for the first monitor following this guide

  • 2 ) Create a mon bootstrap key (from any previous ceph cluster):

      ssh root@ceph<existing-cluster>-mon-XXXX
      ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'
    
    • From aiadm: (maybe you need to ask for permissions to get access to the tbag folder)
      mkdir ~/private/tbag/<my_cluster>
      cd ~/private/tbag/<my_cluster>
      scp root@ceph<existing_cluster>-mon-XXXX:/tmp/keyring.mon .
      tbag set --hg ceph/<my_cluster>/mon keyring.mon --file keyring.mon
    
  • 3 ) Now run puppet on the first mon.

    puppet agent -t -v 
    
  • 4 ) Now copy the admin keyring to tbag (from aiadm):

      scp root@<first_mon>:/etc/ceph/keyring . 
      tbag set --hg ceph/<my_cluster> keyring --file keyring
    
  • 5 ) Now create an MGR bootstrap key on the first mon:

      ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
      ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
    
    • From aiadm:
      scp root@<first_mon>:/tmp/keyring.bootstrap-mgr .
      tbag set --hg ceph/<my_cluster> keyring.bootstrap-mgr --file keyring.bootstrap-mgr
    
  • 6 ) Now create an OSD bootstrap key on the first mon:

       ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
       ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
    
    • From aiadm:
      scp root@<first_mon>:/tmp/keyring.bootstrap-osd .
      tbag set --hg ceph/<my_cluster> keyring.bootstrap-osd --file keyring.bootstrap-osd
    

Add more Monitors and OSDs

  • Follow step 1) to add more mons and OSDs. Everything should install correctly.
  • Prepare and activate the OSD

    /root/ceph-scripts/ceph-disk/ceph-disk-prepare-all
    

NOTE: To set up an OSD on the same machine as the monitor:

  • mkdir /data/a (for example)
  • chown ceph:ceph -R /data
  • ceph-disk prepare --filestore /data/a (ignore the deprecation warnings)
  • ceph-disk activate /data/a

Creating a CEPH cluster

Follow the instructions below to create a new Ceph cluster at CERN.

Prerequisites

  • Access to aiadm.cern.ch
  • Proper GIT configuration
  • Member of ceph administration e-groups
  • OpenStack environment configured, link

Introduction - Hostgroups

First, we have to create the hostgroups in which we want to build our cluster.

The hostgroups provide a layer of abstraction for automatically configuring a
cluster with Puppet. The top-level group, called ceph, ensures that each
machine in this hostgroup has ceph installed, configured and running. The first
sub-hostgroup ensures that each machine communicates with the machines in the
same sub-hostgroup, forming a cluster; these machines will have specific
configuration defined later in this guide. The second sub-hostgroup ensures that
each machine acts according to its corresponding role in the cluster.

For example, we first create our cluster's hostgroup, using the name provided in your task.

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}

Each cluster has its own features, but the two basic sub-hostgroups for any ceph
cluster are mon and osd.

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mon
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/osd

These sub-hostgroups will contain the monitors and the osd hosts.

If the cluster has to use CephFS and/or Rados gateway we need to create the
appropriate sub-hostgroups.

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mds      #for CephFS
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/radosgw  #for the rados gateway

Creating a configuration for your new cluster

Go to gitlab.cern.ch and search for it-puppet-hostgroup-ceph. This repository
contains the configuration for all the machines under the ceph hostgroup. Clone
the repository, create a new branch based on qa, and go to it-puppet-hostgroup-ceph/code/manifests.
From there, you will create the {hg_name}.pp file and the {hg_name} folder.

The {hg_name}.pp should contain the following code: (replace {hg_name} with the cluster's name)

class hg_ceph::{hg_name} {
  include hg_ceph::include::base
}

This will load the basic configuration for ceph on each machine. The {hg_name} folder should contain the *.pp files for the appropriate 2nd sub-hostgroups.

The files under your cluster's folder will have the following basic format:

File {role}.pp:

class hg_ceph::{hg_name}::{role} {
  include hg_ceph::classes::{role}
}

The include will use a configuration template located in it-puppet-hostgroup-ceph/code/manifests/classes

The roles are: mon, mgr, osd, mds and radosgw. It is good to run mon and mgr together, so we usually create a class like the following:

class hg_ceph::{hg_name}::mon {
  include hg_ceph::classes::mon
  include hg_ceph::classes::mgr
}

This code will configure machines in "ceph/{hg_name}/mon" to act as
monitors and mgrs together. After you are done creating the files needed
for your task, your "code/manifests" path should look like this:

# Using kermit as {hg_name}

kermit.pp
kermit/mon.pp
kermit/osd.pp
# Optional, only if requested by the JIRA ticket
kermit/mds.pp
kermit/radosgw.pp

Create a YAML configuration file for the new hostgroup in it-puppet-hostgroup-ceph/data/hostgroup/ceph with name {hg_name}.yaml. This file contains all the basic configuration parameters that are common to all the nodes in the cluster.

ceph::conf::fsid: d3c77094-4d74-4acc-a2bb-1db1e42bb576

ceph::params::release: octopus

lbalias: ceph{hg_name}.cern.ch
hg_ceph::classes::mon::enable_lbalias: false

hg_ceph::classes::mon::enable_health_cron: true
hg_ceph::classes::mon::enable_sls_cron: true

Where:

  • ceph::conf::fsid can be generated with a UUID tool (see below);
  • lbalias is the alias the mons are part of.
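
For example, a fresh fsid can be generated with uuidgen (a sketch; any UUID generator will do):

[user@aiadm]$ uuidgen    # paste the output as ceph::conf::fsid in {hg_name}.yaml
d3c77094-4d74-4acc-a2bb-1db1e42bb576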

Add the new files with git, commit, and push your branch. BEFORE you push, do a git pull --rebase origin qa to avoid any conflicts with your request (a sketch of the sequence follows below). The push output will provide a link to submit a merge request.
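
A minimal sketch of that git sequence (file paths follow the examples above; the branch name is a placeholder):

[user@aiadm]$ git add code/manifests/{hg_name}.pp code/manifests/{hg_name}/ data/hostgroup/ceph/{hg_name}.yaml
[user@aiadm]$ git commit -m "Add {hg_name} cluster configuration"
[user@aiadm]$ git pull --rebase origin qa
[user@aiadm]$ git push origin <your_branch>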

@dvanders is currently the administrator of the repo, so you should assign him the task of reviewing your request and, if all is fine, merging it.

Creating your first monitor node

Follow the instructions to create exactly one monitor here. DO NOT ADD more than one machine to the ceph/{hg_name}/mon hostgroup, otherwise your first monitor will always deadlock and you will need to remove the others and rebuild the first one again.

With TBag authentication

Once we are able to log in to the node, we will need to create the keys to be
able to bootstrap new nodes into the cluster. We first have to create the
initial key, so mons can be created in our new cluster.

[root@ceph{hg_name}-mon-...]$ ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'

Login to aiadm, copy the key from the monitor host and store it on tbag.

[user@aiadm]$ mkdir -p ~/private/tbag/{hg_name}
[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.mon .
[user@aiadm]$ tbag set --hg ceph/{hg_name}/mon keyring.mon --file keyring.mon

Log in to your mon host and run Puppet (puppet agent -t); repeat until you see a running ceph-mon process.

Run the following to disable some warnings and enable some features for ceph:

[root@ceph{hg_name}-mon-...]$ ceph mon enable-msgr2
[root@ceph{hg_name}-mon-...]$ ceph osd set-require-min-compat-client luminous
[root@ceph{hg_name}-mon-...]$ ceph config set mon auth_allow_insecure_global_id_reclaim false

Note that enable-msgr2 will need to be run again after all mons have been created.

We will need to repeat this procedure for the mgr, osd, mds, rgw and rbd-mirror depending on what we need:

[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
# Optional, only if the cluster uses CephFS
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mds mon 'allow profile bootstrap-mds'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mds > /tmp/keyring.bootstrap-mds
# Optional, only if the cluster uses a Rados Gateway
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' -o /tmp/keyring.bootstrap-rgw
# Optional, only if the cluster uses a rbd-mirror
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-rbd-mirror -o /tmp/keyring.bootstrap-rbd-mirror

Login to aiadm, copy the keys from the monitor host and use them with tbag.

Make sure you don't have any excess keys in the /tmp folder (5 max: mon/mgr/osd/mds/rgw).
We don't need to provide the specific subgroup for each key (that would cause confusion); "ceph/{hg_name}" is enough.

[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.* .
[user@aiadm]$ scp {mon_host}:/etc/ceph/keyring .
[user@aiadm]$ for file in *; do tbag set --hg ceph/{hg_name} $file --file $file; done
# Make sure to copy all the generated keys on `/mnt/projectspace/tbag` of `cephadm.cern.ch` as well:
[user@aiadm]$ scp -r . root@cephadm:/mnt/projectspace/tbag/{hg_name}

Now create the other monitors with ai-bs, using the same procedure as for the first one. They will be configured automatically.

Creating manager hosts

The procedure is very similar to the one for the creation of mons:

  • Create new VMs;
  • Add them to the ceph/{hg_name}/mgr hostgroup;
  • Set the right roger state for the new VMs;

Instructions for the creation of mons still hold here, with the necessary changes for mgrs.

As stated above, in some cases it is necessary to colocate mons and mgrs. If so, there is no need to create new machines for mgrs; simply include the mgr class in the mon manifest:

class hg_ceph::{hg_name}::mon {

  include hg_ceph::classes::mon
  include hg_ceph::classes::mgr

}

Creating osd hosts

The OSD hosts will usually be given to you to be prepared by formatting the disks
and adding them to the cluster. The tool used to format the disks is ceph-volume,
and provisioning happens with lvm. Make sure your disks are empty: run pvs and
vgs to check whether they contain any lvm data.

System disks can safely be ignored even if they use lvm. On every host, run
ceph-volume lvm zap {disk} --destroy to zap the data disks and remove any lvm
data (see the sketch below). If your hosts contain only one type of OSD disk
(HDD or SSD), run the following command to provision the OSDs:
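
A minimal sketch of the check-and-zap step described above (device names are examples; double-check you are not touching the system disks):

[root@cephdataYY-...]$ pvs && vgs                               # look for leftover LVM data on the data disks
[root@cephdataYY-...]$ ceph-volume lvm zap /dev/sdc --destroy   # repeat for every data disk that will host an OSD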

# The device list supports shell globbing: to create OSDs from /dev/sdc to /dev/sdz we can try this
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sd[c-z]

You will be prompted to check the OSD creation plan; if you agree with the
proposed changes, input yes to create the OSDs. If you want to automate this
task, pass the --yes parameter to the ceph-volume lvm batch command. If you
have SSDs backing the HDDs to create hybrid OSDs (SSD block.db and HDD
block.data), you will have to run the above command once per SSD:

# 2 SSDs (sda, sdb), 4 HDDs (sdc, sdd, sde, sdf)
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sda /dev/sdc /dev/sdd
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sdb /dev/sde /dev/sdf

The problem with the current lvm batch implementation is that it creates a single
volume group for the block.db part. Therefore, when an SSD fails, the whole set
of OSDs on the host becomes corrupted. To minimize the impact, we run batch once per SSD.

Run ceph osd tree to check whether the OSDs are placed correctly in the tree.
If the OSDs are not placed as described by grep ^crush /etc/ceph/ceph.conf, you
will need to remove the line containing something like update crush on start
and restart the OSDs of that host. You can also create/move/delete buckets, for example:

  • ceph osd crush add-bucket CK13 rack
  • ceph osd crush move CK13 room=0513-R-0050
  • ceph osd crush move 0513-R-0050 root=default
  • ceph osd crush move cephflash21a-ff5578c275 rack=CK13

Now you are one step away from having a functional cluster.
The next step is to create a pool so that the cluster's storage can be used.

Creating the first pool

A pool in ceph is the root namespace of an object store system. A pool has its
own data redundancy schema and access permissions. If cephfs is used, two pools
are created, one for data and one for metadata; to support openstack, various
pools are created for storing images, volumes and shares. To create a pool we
first have to decide what type of data redundancy to use: replicated or EC.
If the task already defines what should happen, you can go to the ceph documentation:

BEFORE you create a pool you first need to create a CRUSH rule that matches
your cluster's schema:

You can get the schema by running ceph osd tree | less.

As an example, the meredith cluster runs with 4+2 EC and the failure domain is rack. Create the required erasure-code-profile with:

[root@cephmeredithmon...]$ ceph osd erasure-code-profile ls
default

[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8

[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2 k=4 m=2 crush-failure-domain=rack --force
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=rack
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

NEVER modify an existing profile. That would change the data placement on disk!
Here we use the --force flag only because the new jera_4plus2 is not used yet.

Now create a CRUSH rule with the defined profile:

[root@cephmeredithmon...]$ ceph osd crush rule create-erasure rack_ec jera_4plus2
created rule rack_ec at 1

[root@cephmeredithmon...]$ ceph osd crush rule ls
replicated_rule
rack_ec

[root@cephmeredithmon...]$ ceph osd crush rule dump rack_ec
{
    "rule_id": 1,
    "rule_name": "rack_ec",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "rack"
        },
        {
            "op": "emit"
        }
    ]
}

The last thing left is to calculate the number of PGs to keep the cluster running
optimally. The Ceph developers recommend 30 to 100 PGs per OSD; keep in mind that
the data redundancy schema counts as a multiplier. For example, with 100 OSDs you
need 3K to 10K PG replicas in total, so with 3x replication you would use 1024 to
2048 PGs in the pool creation command. The number of PGs must be a power of two.
Keep in mind that there may be a need for additional pools, such as "test", which
is created on every cluster for the simple reason of testing.

In general the formula is the following:

MaxPGs = NumOSDs * 100 / ReplicationSize    (if replicated)
MaxPGs = NumOSDs * 100 / (k + m)            (if erasure coded)

Then we use the closest power of two that is less than the above number.
Example on meredith (368 OSDs, EC -- k=4, m=2): MaxPGs=6133 --> MaxPGs=4096
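
A quick sanity check of that formula from the shell (a sketch with the meredith numbers; adjust the OSD count and k/m for your cluster):

[root@cephmeredithmon...]$ python3 -c 'osds, k, m = 368, 4, 2; maxpgs = osds * 100 // (k + m); print(maxpgs, 1 << (maxpgs.bit_length() - 1))'
6133 4096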

Now, let's create the pools following the upstream documentation Create a pool.

We should have at least one test pool and one data pool:

  • Create the test pool. It should always be replicated and not EC:

    [root@cephmeredithmon...]$ ceph osd pool create test 512 512 replicated replicated_rule
    pool 'test' created
    
    [root@cephmeredithmon...]$ ceph osd pool ls detail
    pool 6 'test' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1710 flags hashpspool stripe_width 0 application test
    
  • Create the data pool (named 'rbd_ec_data' here) with EC:

    [root@cephmeredithmon...]$ ceph osd pool create rbd_ec_data 4096 4096 erasure jera_4plus2 rbd_ec_data
    pool 'rbd_ec_data' created
    [root@cephmeredithmon...]$ ceph osd pool ls detail | grep rbd_ec_data
    pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 1554 flags hashpspool stripe_width 16384
    

Finalize cluster configuration

Security Flags on Pools

  1. Make sure the security flags {nodelete, nopgchange, nosizechange} are set for all the pools
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1711 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd
...

If not, set the flags with

[root@cluster_mon]$ ceph osd pool set <pool_name> {nodelete, nopgchange, nosizechange} 1
  2. pg_autoscale_mode should be set to off
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1985 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd

If the output shows anything for autoscale_mode, disable autoscaling with

[root@cluster_mon]$ ceph osd pool set <pool_name> pg_autoscale_mode off
  3. Set the application type for each pool in the cluster
[root@cluster_mon]$ ceph osd pool application enable my_test_pool test
[root@cluster_mon]$ ceph osd pool application enable my_rbd_pool rbd
  4. If relevant, enable the balancer
[root@cluster_mon]$ ceph balancer on
[root@cluster_mon]$ ceph balancer mode upmap
[root@cluster_mon]$ ceph config set mgr mgr/balancer/upmap_max_deviation 1

The parameter upmap_max_deviation is used to spread the PGs more evenly across the OSDs.
Check with

[root@cluster_mon]$ ceph balancer status
{
    "plans": [],
    "active": true,
    "last_optimize_started": "Tue Jan 12 16:47:48 2021",
    "last_optimize_duration": "0:00:00.296960",
    "optimize_result": "Optimization plan created successfully",
    "mode": "upmap"
}

[root@cluster_mon]$ ceph config dump
WHO   MASK LEVEL    OPTION                           VALUE RO 
  mgr      advanced mgr/balancer/active              true     
  mgr      advanced mgr/balancer/mode                upmap    
  mgr      advanced mgr/balancer/upmap_max_deviation 1        

Also, after quite some time spent balancing, the number of PGs per OSD should be evenly distributed.
Focus on the PGS column of the output of ceph osd df tree:

[root@cluster_mon]$ ceph osd df tree

ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE VAR  PGS STATUS TYPE NAME                                
 -1       642.74780        - 643 TiB 414 GiB  46 GiB 505 KiB  368 GiB 642 TiB 0.06 1.00   -        root default                             
 -5       642.74780        - 643 TiB 414 GiB  46 GiB 505 KiB  368 GiB 642 TiB 0.06 1.00   -            room 0513-R-0050                     
 -4        27.94556        -  28 TiB  18 GiB 2.0 GiB     0 B   16 GiB  28 TiB 0.06 1.00   -                rack CK01                        
 -3        27.94556        -  28 TiB  18 GiB 2.0 GiB     0 B   16 GiB  28 TiB 0.06 1.00   -                    host cephflash21a-04f5dd1763 
  0   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  75     up                 osd.0                    
  1   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  69     up                 osd.1                    
  2   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  72     up                 osd.2                    
  3   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  70     up                 osd.3       

Monitoring

Cluster monitoring is offered by:

  • Health crons enabled at the hostgroup level (see the YAML file above):
    • enable_health_cron enables sending the email report that checks the current health status and greps in recent ceph.log
    • enable_sls_cron enables sending metrics to filer-carbon that populate the Ceph Health dashboard
  • Regular polling performed by cephadm.cern.ch
  • Prometheus
  • Watcher clients (CephFS) that mount and test FS availability

To enable polling from cephadm, proceed as follows:

  1. Add the new cluster to it-puppet-hostgroup-ceph/code/manifest/admin.pp. Consider Admin newclusters as a reference merge request. (Note: if you are adding a CephFS cluster, you do not need to add it to the ### BASIC CEPH CLIENTS array.)
  2. Create a client.admin key on the cluster
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.admin mon 'allow *' mgr 'allow *' osd 'allow *' mds 'allow *'
[client.admin]
        key = <the_super_secret_key>
  3. Add the key to tbag in the ceph/admin hostgroup (the secret must contain the full output of the command above)
tbag set --hg ceph/admin <cluster_name>.keyring --file <keyring_filename>
tbag set --hg ceph/admin <cluster_name>.admin.secret
Enter Secret: <paste secret here>
  4. Add the new cluster to it-puppet-module-ceph/data/ceph.yaml, otherwise the clients (cephadm included) will lack the mon hostnames. (Consider Add ryan cluster as a reference merge request.) Double-check you are using the appropriate port.
  5. ssh to cephadm and run puppet a couple of times
  6. Make sure the files <cluster_name>.client.admin.keyring and <cluster_name>.conf exist and show the appropriate content
  7. Check the health of the cluster with
[root@cephadm]# ceph --cluster=<cluster_name> health
HEALTH_OK
  8. Cephadm is also responsible for producing the availability numbers sent to the central IT Service Availability Overview. If the cluster needs to be reported in the IT SAO, add it to ceph-availability-producer.py with a relevant description.

To enable monitoring from Prometheus, add the new cluster to prometheus.yaml. Also, the Prometheus module must be enabled on the MGR (Documentation: https://docs.ceph.com/en/octopus/mgr/prometheus/) for metrics to be retrieved:

ceph mgr module enable prometheus

To ensure a CephFS cluster is represented adequately, there are some unique steps we must take:

  1. Update the it-puppet-module-cephfs README.md and code/data/common.yaml to include the new cluster (Consider add doyle cluster as a reference merge request.)
  2. Update the it-puppet-hostgroup-ceph watchers definition in code/manifests/test/cephfs/watchers.pp to ensure the new cluster is mounted by the watchers. (Consider watchers.pp: add doyle definition as an example merge request.)
  3. SSH to one of the watcher nodes (e.g. cephfs-testc9-d81171f572.cern.ch) and run puppet a few times to synchronise the changes.
  4. Checking cat /proc/mounts | grep ceph for an appropriate systemd mount and navigating to one of the mounted directories lets you examine whether the FS is available.

Details on lbalias for mons

We prefer not to use the load-balancing service and lbclient here (https://configdocs.web.cern.ch/dnslb/): there is no scenario in ceph where we want a mon to disappear from the alias.

We rather use the --LOAD-N- approach to create an alias containing all the mons:

  • Go to network.cern.ch
  • Click on Update information and use the FQDN of the mon machine
    • If prompted, make sure you pick the host interface and not the IPMI one
  • Add "ceph{hg_name}--LOAD-N-" to the list IP Aliases under TCP/IP Interface Information
  • Multiple aliases are supported. Use a comma-separated list
  • Check the changes are correct and submit the request

Benchmarking

Note: What follows is not proper benchmarking, but some quick checks that the cluster works as expected.

Good reading at Benchmarking performance

Rados bench

Start a test on pool 'my_test_pool' with a 10-second duration and a block size of 4096 B:

[root@cluster_mon]$ rados bench -p my_test_pool 10 write -b 4096

hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cephflash21a-a6564a2ee7.cern._1768589
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      8752      8736   34.1231    34.125  0.00130825  0.00182201
    2      16     16913     16897   32.9995   31.8789  0.00104112  0.00189076
    3      15     24678     24663   32.1108   30.3359  0.00139087  0.00194522
    4      16     32189     32173   31.4167   29.3359   0.0209055   0.0019863
    5      16     39595     39579   30.9187   28.9297   0.0209981  0.00201906
    6      16     47263     47247   30.7573   29.9531  0.00138272  0.00203065
    7      16     55169     55153   30.7748   30.8828  0.00121337  0.00202973
    8      16     63070     63054   30.7855   30.8633  0.00133439  0.00202877
    9      15     70408     70393     30.55    28.668  0.00144124  0.00204461
   10      11     78679     78668   30.7271   32.3242  0.00162555  0.00203309
Total time run:         10.0178
Total writes made:      78679
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     30.6793
Stddev Bandwidth:       1.68734
Max bandwidth (MB/sec): 34.125
Min bandwidth (MB/sec): 28.668
Average IOPS:           7853
Stddev IOPS:            431.959
Max IOPS:               8736
Min IOPS:               7339
Average Latency(s):     0.00203504
Stddev Latency(s):      0.00370041
Max latency(s):         0.0702117
Min latency(s):         0.000887922
Cleaning up (deleting benchmark objects)
Removed 78679 objects
Clean up completed and total clean up time :4.93871

RBD bench

Create an RBD image and run some tests on it:

[root@cluster_mon]$ rbd create rbd_ec_meta/enricotest --size 100G --data-pool rbd_ec_data
[root@cluster_mon]$ rbd bench --io-type write rbd_ec_meta/enricotest --io-size 4M --io-total 100G

Once done, delete the image with

[root@cluster_mon]$ rbd ls -p rbd_ec_meta
[root@cluster_mon]$ rbd rm rbd_ec_meta/enricotest

RBD clusters

Create Cinder key for use with OpenStack

All of the above steps lead to a fully functional RADOS Block Device cluster. The only missing step is to create access keys for OpenStack Cinder so that it can use the provided storage.

The upstream documentation on user management (and OpenStack is a user) is available at User Management

To create the relevant access key for OpenStack use the following command:

$ ceph auth get-or-create client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes' mgr 'profile rbd pool=volumes'

which results in creating a user named "cinder" to run rbd commands on the pool named "volumes".

Create an Images pool for use with OpenStack Glance

To store Glance images on ceph, a dedicated pool (pg_num may vary) and cephx keys are needed:

$ ceph osd pool create images 128 128 replicated replicated_rule
$ ceph auth get-or-create client.images mon 'profile rbd' mgr 'profile rbd pool=images' osd 'profile rbd pool=images'

CephFS Clusters

Enabling CephFS consists of creating data and metadata pools for CephFS and a new filesystem. It is also necessary to create metadata servers (either dedicated or colocated with other daemons), otherwise the cluster will show HEALTH_ERR and 1 filesystem offline. See below for the creation of metadata servers.

Follow the upstream documentation at Create a Ceph File System
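
As a rough sketch of what that boils down to (pool names, pg_num values and the crush rule are examples; pick them according to the previous sections):

$ ceph osd pool create cephfs_metadata 128 128 replicated replicated_rule
$ ceph osd pool create cephfs_data 1024 1024 replicated replicated_rule
$ ceph fs new cephfs cephfs_metadata cephfs_data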

Creating metadata servers

Add at least two hosts to ceph/{hg_name}/mds. MDS daemons can be dedicated (preferable for large, busy clusters) or colocated with other daemons (e.g., on the osd hosts, assuming enough memory is available).

As soon as one MDS goes active, the cluster health will go back to HEALTH_OK. It is recommended to have at least 2 nodes running MDSes for failover. One can also consider having a standby-replay MDS to lower the time needed for a failover.

Create Manila key for use with OpenStack

To provision CephFS File Shares via OpenStack Manila, a dedicated cephx key must be provided to the OpenStack team. Create the key with:

$ ceph auth get-or-create client.manila mon 'allow r' mgr 'allow rw'

S3 Clusters

Creating rgw hosts

To provide object storage, Ceph Object Gateway daemons (radosgw) need to be run.

RGWs can run on dedicated machines (by creating new hosts in hostgroup ceph/{hg_name}/rgw) or colocated with existing machines. In both cases, these classes need to be enabled:

Also, you may want to enable:

  • The S3 crons for specific quota and health checks (see include/s3{hourly,daily,weekly}.pp)
  • Traefik log ingestion into the MONIT pipelines for Elasticsearch dashboards (see s3-logging).

Always start with one RGW only and iterate over the configuration until it runs.

Some of the required data pools (default.rgw.control, default.rgw.meta, default.rgw.log, .rgw.root) are automatically created by the RGW at its first run. The creation of some other pools is triggered by specific actions, e.g., making a bucket will create pool default.rgw.buckets.index, pushing the first object will trigger creation of default.rgw.buckets.data.

It is highly recommended to pre-create all pools so that they have the right crush rule, pg_num, etc. before data is written to them. If they get auto-created, they will use the default crush type (replicated), while we typically use erasure coding for object storage. Use an existing cluster as a reference to configure the pools, for example:
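
A sketch of pre-creating the bucket index and data pools (pg_num values and the EC profile name are examples; see the pool creation section above):

$ ceph osd pool create default.rgw.buckets.index 128 128 replicated replicated_rule
$ ceph osd pool create default.rgw.buckets.data 2048 2048 erasure jera_4plus2
$ ceph osd pool application enable default.rgw.buckets.index rgw
$ ceph osd pool application enable default.rgw.buckets.data rgw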

Creating a DNS load-balanced alias

The round-robin based DNS load balancing service is described at DNS Load Balancing.

To create a new load-balanced alias for S3:

  1. Go to https://aiermis.cern.ch/
  2. Add LB Alias by specifying if it needs to be external and the number of hosts to return (Best Hosts)
  3. Configure hg_ceph::classes::lb::lbalias and the relevant RadosGW configuration params accordingly (rgw dns name, rgw dns s3website name, rgw swift url, ...)
  4. To support virtual-host-style bucket addressing (e.g., mybucket.s3.cern.ch), talk to the Network Team to have wildcard DNS enabled on the alias

Integration with OpenStack Keystone


RBD Mirroring

Make sure you have included hg_ceph::classes::rbd_mirror and set up the bootstrap-rbd-mirror keyring.

Adding peers to rbd-mirror

You first have to add a rbd-mirror-peer keyring in the hostgroup ceph.

First, log in to your mon and run the following command:

[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.rbd-mirror-peer mon 'profile rbd-mirror-peer' osd 'profile rbd' -o {hg_name}.client.rbd-mirror-peer.keyring

Copy the keyring to aiadm and create the secret:

[user@aiadm]$ tbag set --hg ceph {hg_name}.client.rbd-mirror-peer.keyring --file {hg_name}.client.rbd-mirror-peer.keyring

Now your cluster can participate with the others already registered to mirror your RBD images! You can now add the following data to register peers for your rbd-mirror daemons:

ceph::rbd_mirror:
  - peer1
  - peer2
  - ...

Peering pools

You first have to enable mirroring on some of your pools: https://docs.ceph.com/en/octopus/rbd/rbd-mirroring/#enable-mirroring. Also check the configuration of the mirroring modes on the same page (journaling feature enabled on the RBD images, image snapshot settings, ...).

And then you can add peers like this:

[root@ceph{hg_name}-rbd-mirror-...]$ rbd mirror pool peer add {pool} client.rbd-mirror-peer@{remote_peer}

What to watch?

There are several channels to watch during your Rota shift:

  1. Emails to ceph-admins@cern.ch:

    • "Ceph Health Warn" mails.
    • SNOW tickets from IT Repair Service.
    • Prometheus Alerts.
  2. SNOW tickets assigned to Ceph Service:

    • Here is a link to the tickets needing to be taken: Ceph Assigned
  3. Ceph Internal Mattermost channel

  4. General information on clusters (configurations, OSD types, HW, versions): Instance Version Tracking ticket

Taking notes

Each action you take should be noted down in a journal, which is to be linked or attached to the minutes of the Ceph weekly meeting the following week (https://indico.cern.ch/category/9250/). Use HackMD, Notepad, ...

Keeping the Team Informed

If you have any questions or take any significant actions, keep your colleagues informed in Mattermost.

Common Procedures

exception.scsi_blockdevice_driver_error_reported

Draining a Failing OSD

The IT Repair Service may ask ceph-admins to prepare a disk to be physically removed. The scripts needed for the replacement procedure may be found under ceph-scripts/tools/ceph-disk-replacement/.

For failing OSDs in the wigner cluster, contact ceph-admins.

  1. watch ceph status <- keep this open in a separate window.

  2. Log in to the machine with the failing drive and run ./drain-osd.sh --dev /dev/sdX (the ticket should tell which drive is failing)

    • For machines in /ceph/erin/osd/castor: You cannot run the script, ask ceph-admins.
    • If the output is of the following form, take note of the OSD id <id>:
    ceph osd out osd.<id>
    
    • Else
      • If the script shows no output: Ceph is unhealthy or OSD is unsafe to stop, contact ceph-admins
      • Else if the script shows a broken output (especially missing <id>): Contact ceph-admins
  3. Run ./drain-osd.sh --dev /dev/sdX | sh

  4. Once drained (can take a few hours), we now want to prepare the disk for replacement

    • Run ./prepare-for-replacement.sh --dev /dev/sdX
    • Continue if the output is of the following form and the OSD id <id> displayed is consistent with what was given by the previous command:
    systemctl stop ceph-osd@<id>
    umount /var/lib/ceph/osd/ceph-<id>
    ceph-volume lvm zap /dev/sdX --destroy
    
    • (note that the --destroy flag will be dropped in case of a FileStore OSD)

    • Else

      • If the script shows no output: Ceph is unhealthy or OSD is unsafe to stop, contact ceph-admins
      • Else if the script shows a broken output (especially missing <id>): Contact ceph-admins
  5. Run ./prepare-for-replacement.sh --dev /dev/sdX | sh to execute.

  6. Now the disk is safe to be physically removed.

    • Notify the repair team in the ticket

Creating a new OSD (on a replacement disk)

When the IT Repair Service has replaced the broken disk with a new one, we have to format that disk with BlueStore to add it back to the cluster:

  1. watch ceph status <- keep this open in a separate window.

  2. Identify the osd id to use on this OSD:

    • Check your notes from the drain procedure above.
    • Cross-check with ceph osd tree down <-- look for the down osd on this host, should match your notes.
  3. Run ./recreate-osd.sh --dev /dev/sdX and check that the output is according to the following:

  • On beesly cluster:
ceph-volume lvm zap /dev/sdX
ceph osd destroy <id> --yes-i-really-mean-it
ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
  • On gabe cluster:
ceph-volume lvm zap /dev/sdX
ceph-volume lvm zap /dev/ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+
ceph osd destroy <id> --yes-i-really-mean-it
ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db ceph-block-dbs-[0-6a-f]+/osd-block-db-[0-6a-z-]+ 
  • On erin cluster:

    • Regular case:
    ceph-volume lvm zap /dev/sdX
    ceph osd destroy <id> --yes-i-really-mean-it
    ceph-volume lvm create --osd-id <id> --data /dev/sdX --block.db /dev/sdY
    
    • ceph/erin/castor/osd
      • Script cannot be run, contact ceph-admins.
  4. If the output is satisfactory, run ./recreate-osd.sh --dev /dev/sdX | sh

See OSD Replacement for many more details.

CephInconsistentPGs

Familiarize yourself with the Upstream documentation

Check ceph.log on a ceph/*/mon machine to find the original "cluster [ERR]" line.

The inconsistent PGs generally come in two types:

  1. deep-scrub: stat mismatch, solution is to repair the PG
    • Here is an example on ceph/flax:
2019-02-17 16:23:05.393557 osd.60 osd.60 128.142.161.220:6831/3872729 56 : cluster [ERR] 1.85 deep-scrub : stat mismatch, got 149749/149749 objects, 0/0 clones, 149749/149749 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 135303283738/135303284584 bytes, 0/0 hit_set_archive bytes.
2019-02-17 16:23:05.393566 osd.60 osd.60 128.142.161.220:6831/3872729 57 : cluster [ERR] 1.85 deep-scrub 1 errors
  2. candidate had a read error, solution follows below.
  • Notice that the doc says If read_error is listed in the errors attribute of a shard, the inconsistency is likely due to disk errors. You might want to check your disk used by that OSD. This is indeed the most common scenario.

Handle a failing disk

In this case, a failing disk returns bogus data during deep scrubbing, and ceph will notice that the replicas are not all consistent with each other. The correct procedure is therefore to remove the failing disk from the cluster, let the PGs backfill, then finally to deep-scrub the inconsistent PG once again.

Here is an example on the ceph/erin cluster, where the monitoring has told us that PG 64.657c is inconsistent:

[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~] grep shard /var/log/ceph/ceph.log
2017-04-12 06:34:26.763000 osd.508 128.142.25.116:6924/4070422 4602 : cluster [ERR] 64.657c shard 187:
soid 64:3ea78883:::1568573986@castorns.27153415189.0000000000000034:head candidate had a read error

A shard in this case refers to the OSD that holds the inconsistent object replica; in this case it is osd.187.

Where is osd.187?

[09:16][root@cepherin0 (production:ceph/erin/mon*1) ~]# ceph osd find 187
{
   "osd": 187,
   "ip": "128.142.25.106:6820\/530456",
   "crush_location": {
       "host": "p05972678k94093",
       "rack": "EC06",
       "room": "0513-R-0050",
       "root": "default",
       "row": "EC"
   }
}

On the p05972678k94093 host we first need to find out which /dev/sd* device hosts that osd.187.

On BlueStore OSDs we need to check with ceph-volume lvm list or lvs:

[14:38][root@p05972678e32155 (production:ceph/erin/osd*30) ~]# lvs -o +devices,tags | grep 187
  osd-block-... ceph-... -wi-ao---- <5.46t        /dev/sdm(0) ....,ceph.osd_id=187,....

So we know the failed drive is /dev/sdm, now we can check for disk Medium errors:

[09:16][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# grep sdm /var/log/messages
[Wed Apr 12 12:27:59 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 04 00 00 00
[Wed Apr 12 12:27:59 2017] blk_update_request: critical medium error, dev sdm, sector 90638112
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Sense Key : Medium Error [current]
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] Add. Sense: Unrecovered read error
[Wed Apr 12 12:28:02 2017] sd 1:0:10:0: [sdm] CDB: Read(16) 88 00 00 00 00 00 05 67 07 20 00 00 00 08 00 00
[Wed Apr 12 12:28:02 2017] blk_update_request: critical medium error, dev sdm, sector 90638112

In this case, the disk is clearly failing.

Now check whether that osd is safe to stop:

[14:41][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# ceph osd ok-to-stop osd.187
OSD(s) 187 are ok to stop without reducing availability, provided there are no other concurrent failures or interventions. 182 PGs are likely to be degraded (but remain available) as a result.

Since it is OK, we stop the osd, umount it, and mark it out.

[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# systemctl stop ceph-osd@187.service
[09:17][root@p05972678k94093 (production:ceph/erin/osd*30) ~]# umount /var/lib/ceph/osd/ceph-187
[09:17][root@p05972678k94093 (production:ceph/erin/osd*29) ~]# ceph osd out 187
marked out osd.187.

ceph status should now show the PG is in a state like this:

             1     active+undersized+degraded+remapped+inconsistent+backfilling

It can take a few 10s of minutes to backfill the degraded PG.

Repairing a PG

Once the inconsistent PG is no longer "undersized" or "degraded", use the script at ceph-scripts/tools/scrubbing/autorepair.sh to repair the PG and start the scrubbing immediately.
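
If you ever need to trigger it by hand, the underlying command is the standard PG repair (a sketch, using the PG id from the stat-mismatch example above; prefer the script when available):

# ceph pg repair 1.85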

Now check ceph status... You should see the scrubbing+repair started already on the inconsistent PG.

Ceph PG Unfound

The PG unfound condition may be due to a race condition when PGs are scrubbed (see https://tracker.ceph.com/issues/51194), leading to a PG being reported as recovery_unfound.

Upstream documentation is available for general unfound objects

In case of unfound objects, ceph reports a HEALTH_ERR condition

# ceph -s
  cluster:
    id:     687634f1-03b7-415b-aff9-e21e6bedbe7c
    health: HEALTH_ERR
            1/282983194 objects unfound (0.000%)
            Possible data damage: 1 pg recovery_unfound
            Degraded data redundancy: 3/848949582 objects degraded (0.000%), 1 pg degraded
 
  services:
    mon: 3 daemons, quorum cephdata20-4675e5a59e,cephdata20-44bdbfa86f,cephdata20-83e1d8a16e (age 4h)
    mgr: cephdata20-83e1d8a16e(active, since 11w), standbys: cephdata20-4675e5a59e, cephdata20-44bdbfa86f
    osd: 576 osds: 575 up (since 9d), 573 in (since 9d)
 
  data:
    pools:   3 pools, 17409 pgs
    objects: 282.98M objects, 1.1 PiB
    usage:   3.2 PiB used, 3.0 PiB / 6.2 PiB avail
    pgs:     3/848949582 objects degraded (0.000%)
             1/282983194 objects unfound (0.000%)
             17342 active+clean
             60    active+clean+scrubbing+deep
             6     active+clean+scrubbing
             1     active+recovery_unfound+degraded

List the PGs in recovery_unfound state

# ceph pg ls recovery_unfound
PG      OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG   STATE                             SINCE  VERSION         REPORTED         UP                 ACTING             SCRUB_STAMP                      DEEP_SCRUB_STAMP
1.2d09    17232         3          0        1  72106876434            0           0  3373  active+recovery_unfound+degraded    37m  399723'3926620  399723:23220581  [574,671,662]p574  [574,671,662]p574  2023-01-12T13:27:34.752832+0100  2023-01-12T13:27:34.752832+0100

Check the ceph log (cat /var/log/ceph/ceph.log | grep ERR) for IO errors on the primary OSD of the PG. In this case, the disk backing osd.574 is failing with pending sectors (check with smartctl -a <device>)

2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
2023-01-12T13:27:34.752327+0100 osd.574 (osd.574) 776 : cluster [ERR] 1.2d09 deep-scrub 0 missing, 1 inconsistent objects
2023-01-12T13:27:34.752830+0100 osd.574 (osd.574) 777 : cluster [ERR] 1.2d09 repair 1 errors, 1 fixed
2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)

Before taking any action, make sure that the versions of the object reported as unfound on the other two OSDs are more recent than the lost one:

  • List unfound object
    # ceph pg 1.2d09 list_unfound
    {
        "num_missing": 1,
        "num_unfound": 1,
        "objects": [
            {
                "oid": {
                    "oid": "rbd_data.0bee1ae64c9012.00000000000032c4",
                    "key": "",
                    "snapid": -2,
                    "hash": 2152017161,
                    "max": 0,
                    "pool": 1,
                    "namespace": ""
                },
                "need": "399702'3923004",
                "have": "0'0",
                "flags": "none",
                "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
                "locations": []
            }
        ],
        "state": "NotRecovering",
        "available_might_have_unfound": true,
        "might_have_unfound": [],
        "more": false
    
  • The missing object is at version 399702
  • Last osd map before read error: e399704
    2023-01-12T13:07:24.463521+0100 mon.cephdata20-4675e5a59e (mon.0) 2714279 : cluster [DBG] osdmap e399704: 576 total, 575 up, 573 in
    2023-01-12T13:07:39.543780+0100 osd.574 (osd.574) 775 : cluster [ERR] 1.2d09 shard 574 soid 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head : candidate had a read error
    
  • The object goes unfound at: e399710
    2023-01-12T13:27:30.297813+0100 mon.cephdata20-4675e5a59e (mon.0) 2714933 : cluster [DBG] osdmap e399710: 576 total, 575 up, 573 in
    2023-01-12T13:27:39.196815+0100 mgr.cephdata20-83e1d8a16e (mgr.340798757) 3573768 : cluster [DBG] pgmap v3244401: 17409 pgs: 1 active+recovery_unfound+degraded, 12 active+clean+scrubbing, 62 active+clean+scrubbing+deep, 17334 active+clean; 1.1 PiB data, 3.2 PiB used, 3.0 PiB / 6.2 PiB avail; 224 MiB/s rd, 340 MiB/s wr, 4.83k op/s; 3/848927301 objects degraded (0.000%); 1/282975767 objects unfound (0.000%)
    
  • The two copies on 671 and 662 are more recent -- 399702 VS 399709:
    2023-01-12T13:27:34.783419+0100 osd.671 (osd.671) 3686 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
    2023-01-12T13:27:34.809509+0100 osd.662 (osd.662) 4313 : cluster [ERR] 1.2d09 push 1:90b4a201:::rbd_data.0bee1ae64c9012.00000000000032c4:head v 399702'3923004 failed because local copy is 399709'3924819
    

If the copies are more recent than the lost one, proceed as follows (see the sketch after this list):

  • Set the primary osd (osd.574) out
  • The recovery_unfound object disappears and backfilling start
  • Once backfilled, deep-scrub the PG to check for inconsistencies
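
A minimal sketch of those steps with the ids from this example (osd.574 and PG 1.2d09):

# ceph osd out 574
# ceph pg ls recovery_unfound      # should become empty once backfilling completes
# ceph pg deep-scrub 1.2d09        # after backfill, re-check the PG for inconsistencies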

CephTargetDown

This is a special alert raised by Prometheus. It indicates that, for whatever reason, a target node is no longer exposing its metrics or the Prometheus server is not able to pull them. This does not imply that the node is offline, just that the node endpoint is down for Prometheus.

To handle these tickets, first identify the affected target. This information should be in the ticket body.

The following Alerts are in Firing Status:
------------------------------------------------
Target cephpolbo-mon-0.cern.ch:9100 is down
Target cephpolbo-mon-2.cern.ch:9100 is down

Alert Details:
------------------------------------------------
Alertname: TargetDown
Cluster: polbo
Job: node
Monitor: cern
Replica: A
Severity: warning

Afterwards, we can go to the Targets section of the Prometheus dashboard and cross-check the affected node. There you can find more information about the reason it is down.

This could be caused by the following reasons:

  • A node is offline or it's being restarted. Follow the normal procedures for understanding why the node is not online (ping, ssh, console access, SNOW ticket search...). Once the node is back, the target should be marked as UP again automatically.
  • If a new target was added recently, possibly there are mistakes in the target definition or some connectivity problems, like the port being blocked.
    • Review the target configuration in it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml and refer to the monitoring guide.
    • Make sure that the firewall configuration allows prometheus to scrape the data through the specified port.
  • In ceph, the daemons that expose the metrics are the mgrs. It can sometimes happen that the mgr hangs and stops exposing the metrics.
    • Check the mgr status and, if needed, restart it (see the sketch below). Don't forget to collect information about the state in which you found it for further analysis. If all went well, after about 30 seconds the target should be UP again in the Prometheus dashboard. To double-check, you can click on the endpoint URL of the node and see whether the metrics are now shown.
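
A sketch of that check-and-restart sequence (assuming, as on our deployments, that the mgr id is the host's short name; adapt as needed):

# ceph mgr stat                               # which mgr is currently active?
# systemctl status ceph-mgr@$(hostname -s)    # on the mgr host: collect state/logs first
# systemctl restart ceph-mgr@$(hostname -s)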

SSD Replacement

Draining OSDs attached to a failing SSD

In order to drain the osds attached to a failing SSD, run the following command:

$> cd /root/ceph-scripts/tools/ceph-disk-replacement
$> ./ssd-drain-osd.sh --dev /dev/<ssd>
ceph osd out osd.<osd0>;
ceph osd primary-affinity osd.<osd0> 0;
ceph osd out osd.<osd1>;
ceph osd primary-affinity osd.<osd1> 0;
...
ceph osd out osd.<osdN>;
ceph osd primary-affinity osd.<osdN> 0;

If the output is similar to the one above, it is safe to re-run the commands with | sh appended to actually take all the osds attached to the ssd out of the cluster.

Prepare for replacement

Once the draining has been started, the osds need to be zapped before the ssd can be removed and physically replaced:

$> ./ssd-prepare-for-replacement.sh --dev /dev/<dev> -f
systemctl stop ceph-osd@<osd0>
umount /var/lib/ceph/osd/ceph-<osd0>
ceph-volume lvm zap --destroy --osd-id <osd0>
systemctl stop ceph-osd@<osd1>
umount /var/lib/ceph/osd/ceph-<osd1>
ceph-volume lvm zap --destroy --osd-id <osd1>
...
systemctl stop ceph-osd@<osdN>
umount /var/lib/ceph/osd/ceph-<osdN>
ceph-volume lvm zap --destroy --osd-id <osdN>

Recreate the OSD

TBC

MDS Slow Ops

Check for long ongoing operations on the MDS reporting Slow Ops:

The mon shows SLOW_OPS warning:

ceph health detail

cat /var/log/ceph/ceph.log | grep SLOW
    cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)

The affected MDS shows slow request in the logs:

cat /var/log/ceph/ceph-mds.cephcpu21-0c370531cf.log | grep -i SLOW
    2022-10-22T09:09:21.473+0200 7fe1b8054700  0 log_channel(cluster) log [WRN] : 30 slow requests, 1 included below; oldest blocked for > 2356.704295 secs
    2022-10-22T09:09:21.473+0200 7fe1b8054700  0 log_channel(cluster) log [WRN] : slow request 1924.631928 seconds old, received at 2022-10-22T08:37:16.841403+0200: client_request(client.366059605:743931 getattr AsXsFs #0x10251604c38 2022-10-22T08:37:16.841568+0200 caller_uid=1001710000, caller_gid=0{1001710000,}) currently dispatched

Dump the ongoing ops and check there are some with very long (minutes, hours) age:

ceph daemon mds.`hostname -s` ops | grep age | less

Identify the client with such long ops (age should be >900):

ceph daemon mds.`hostname -s` ops | egrep 'client|age' | less

    "description": "client_request(client.364075205:4876 getattr pAsLsXsFs #0x1023f14e5d8 2022-10-16T03:46:40.673900+0200 RETRY=184 caller_uid=0, caller_gid=0{})",
    "age": 0.87975248399999995,
        "reqid": "client.364075205:4876",
        "op_type": "client_request",
        "client_info": {
            "client": "client.364075205",

Get info on the client:

ceph daemon mds.`hostname -s` client ls id=<THE_ID>
  • IP address
  • Hostname
  • Ceph client version
  • Kernel version (in case of a kernel mount)
  • Mount point (on the client side)
  • Root (aka, the CephFS volume the client mounts)

Evict the client:

ceph tell mds.* client ls id=<THE_ID>
ceph tell mds.* client evict id=<THE_ID>

Large omap objects

On S3 clusters, you may see a HEALTH_WARN message reporting 1 large omap objects. This is very likely due to one or more bucket indexes being over full. Example:

"user_id": "warp-tests",
"buckets": [
    {
        "bucket": "warp-tests",
        "tenant": "",
        "num_objects": 9993106,
        "num_shards": 11,
        "objects_per_shard": 908464,
        "fill_status": "OVER"
    }
]

Proceed as follows:

  1. Check that over-full bucket indexes are the actual problem:
    radosgw-admin bucket limit check
    
  2. If it is not possible to reshard the bucket, tune osd_deep_scrub_large_omap_object_key_threshold appropriately:
    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000
    
    Default is 200000; Gabe runs with 500000. Read at 42on.com
  3. If it is possible to reshard the bucket, manually reshard any bucket showing fill_status WARN or OVER:
    • keep the number of objects per shard around 50k
    • pick a prime number of shards
    • consider whether the bucket will be ever-growing or whether the owners delete objects. If ever-growing, you may reshard to a high number of shards to avoid (or postpone) resharding in the future.
    radosgw-admin bucket reshard --bucket=warp-tests --num-shards=211
    
  4. Check in ceph.log which PG is complaining about the large omap objects and start a deep scrub on it (else the HEALTH_WARN won't go away):
    # zcat  /var/log/ceph/ceph.log-20221204.gz | grep -i large
    2022-12-03T06:48:37.975544+0100 osd.179 (osd.179) 996 : cluster [WRN] Large omap object found. Object: 9:22f5fbf8:::.dir.a1035ed2-37be-4e7d-892d-46728bc3d046.285532.1.1:head PG: 9.1fdfaf44 (9.344) Key count: 204639 Size (bytes): 60621488
    2022-12-03T06:48:39.270652+0100 mon.cephdata22-12f31fcca0 (mon.0) 292373 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
    
    # ceph pg deep-scrub 9.344
    instructing pg 9.344 on osd.179 to deep-scrub
    

Ceph Clusters

Production Clusters

| Cluster | Lead | Use-case | Mon host (where?) | Release | Version | OS | Racks | IP Services | Power | SSB Upgrades? |
|---|---|---|---|---|---|---|---|---|---|---|
| barn | Enrico | Cinder: cp1, cpio1 | cephbarn (hw) | pacific | 16.2.9-1 | RHEL8 | BA09 | S513-A-IP250 | UPS-4/-C | Yes |
| beesly | Enrico | Glance; Cinder: 1st AZ | cephmon (hw) | pacific | 16.2.9-1 | RHEL8 | CD27-CD30 | S513-C-IP152 | UPS-3/-4i | Yes |
| cta | Roberto | CTA prod | cephcta (hw) | pacific | 16.2.13-5 | RHEL8 | SI36-SI41 | - | | No, Julien Leduc |
| dwight | Zac | Testing + Manila: CephFS Testing | cephmond (vm,abc) | quincy | 17.2.7-1 | Alma8 | CE01-CE03 | S513-C-IP501 | | Yes + Manila MM |
| doyle | | CephFS for DFS Projects | cephdoyle (hw) | quincy | 17.2.7-2 | RHEL9 | CP18, CP19-21, CP22 | S513-C-IP200 | UPS-1 | Yes + Sebast/Giuseppe |
| flax(*) | Abhi | Manila: Meyrin CephFS | cephflax (vm,abc) | pacific | 16.2.9-1 | RHEL8 | BA10,SQ05; CQ18-CQ21; SJ04-SJ07 | S513-A-IP558,S513-V-IP562; S513-C-IP164; S513-V-IP553 | UPS-4/-C,UPS-1; UPS-1; UPS-3 | Yes |
| gabe | Enrico | S3 | cephgabe (hw) | pacific | 16.2.13-5 | RHEL8 | SE04-SE07; SJ04-SJ07 | S513-V-IP808; S513-V-IP553 | UPS-1; UPS-3 | Yes |
| jim | Enrico | HPC BE (CephFS) | cephjim (vm,abc) | quincy | 17.2.7-1 | RHEL8 | SW11-SW15; SX11-SX15 | S513-V-IP194; S513-V-IP193 | UPS-3; UPS-3 | Yes + Nils Hoimyr |
| kelly | Roberto | Cinder: hyperc + CTA preprod | cephkelly (hyperc) | quincy | 17.2.7-1 | RHEL8 | CQ12-CQ22 | S513-C-IP164 | UPS-1 | Yes + Julien Leduc |
| kapoor | Enrico | Cinder: cpio2, cpio3 | cephkapoor (hyperc) | quincy | 17.2.7-1 | RHEL8 | BE10 BE11 BE13 | S513-A-IP22 | UPS-4/-C | Yes |
| levinson | Abhi | Manila: Meyrin CephFS SSD A | cephlevinson (hw) | pacific | 16.2.9-1 | RHEL8 | BA03 BA04 BA05 BA07 | S513-A-IP120 S513-A-IP119 S513-A-IP121 S513-A-IP122 | UPS-4/-C | Yes |
| meredith | Enrico | Cinder: io2, io3 | cephmeredith (hw) | pacific | 16.2.9-1 | RHEL8 | CK01-23 | S513-C-IP562 | UPS-2 | Yes |
| nethub | Enrico | S3 FR + Cinder FR | cephnethub (hw) | pacific | 16.2.13-5 | RHEL8 | HA06-HA09; HB01-HB06 | S773-C-SI180; S773-C-IP200 | EOD104,ESK404; EOD105 (CEPH-1519) | Yes |
| pam | Abhi | Manila: Meyrin CephFS B | cephpam (hw) | pacific | 16.2.9-1 | Alma8 | CP16-19 | S513-C-IP200 | UPS-1 | Yes |
| poc | Enrico | PCC Proof of Concept (CEPH-1382) | cephpoc (hyperc) | quincy | 17.2.7-1 | RHEL9 | SU06 | S513-V-SI263 | | No |
| ryan | Enrico | Cinder: 3rd AZ | cephryan (hw) | pacific | 16.2.9-1 | RHEL8 | CE01-CE03 | S513-C-IP501 | UPS-2 | Yes |
| stanmey | Zachary | S3 multi-site, Meyrin (secondary) | cephstanmey (hw) | reef | 18.2.1-1 | RHEL8 | CP16-24 | S513-C-IP200 | UPS-1 | No |
| stanpre | Zachary | S3 multi-site, Prevessin (master) | cephstanpre (hw) | reef | 18.2.1-1 | Alma8 | HB01-HB06 | S773-C-IP200 | EOD105/0E | No |
| toby | Enrico | Stretch cluster | cephtoby (hw) | pacific | 16.2.9-1 | RHEL8 | CP16-19; SJ04-07 | S513-C-IP200; S513-V-IP553 | UPS-1; UPS-3 | No |
| vance | Enrico | Manila: HPC Theory-QCD | cephvance (hw) | quincy | 17.2.7-1 | Alma8 | CP16-CP17, CP19, CP21, CP23-CP24 | S513-C-IP200 | UPS-1 | Yes + Nils Hoimyr |
| wallace | Enrico | krbd: Oracle DB restore tests | cephwallace (hw) | quincy | 17.2.7-1 | RHEL8 | CP18, CP20, CP22 | S513-C-IP200 | UPS-1 | No, Dmytro Grzudo |
| vault | Enrico | Cinder: 2nd AZ | cephvault (hw) | pacific | 16.2.9-1 | RHEL8 | SE04-SE07 | S513-V-IP808 | UPS-1 | Yes |

Flax locations details:

  • MONs: 3x OpenStack VMs, one in each availability zone
  • MDSes (CPU servers): 50% in barn, 50% in vault
    • cephcpu21-0c370531cf, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21-2456968853, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21-46bb400fc8, BA10, S513-A-IP558
    • cephcpu21-4a93514bf3, BA10, S513-A-IP558
    • cephcpu21b-417b05bfee, BA10, S513-A-IP558
    • cephcpu21b-4ad1d0ae5f, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21b-a703fac16c, SQ05, S513-V-IP562, UPS 1 (EOD1*43)
    • cephcpu21b-aecbee75a5, BA10, S513-A-IP558
  • Metadata pool: Main room, UPS-1 EOD1*43
  • Data pool: Vault, UPS-3 EOD3*43

Each production cluster has a designated cluster lead, who is the primary contact and responsible for that cluster.

The user-visible "services" provided by the clusters are documented in our Service Availability probe: https://gitlab.cern.ch/ai/it-puppet-hostgroup-ceph/-/blob/qa/code/files/sls/ceph-availability-producer.py#L19

The QoS provided by each user-visible cluster is described in OpenStack docs. Cinder volumes available on multiple AZs are of standard and io1 types.


s3.cern.ch RGWs

Hostname Customer IPv4 IPv6 IPsvc VM IPsvc Real Runs on OpenStack AZ Room Rack Power
cephgabe-rgwxl-325de0fb1d cvmfs 137.138.152.241 2001:1458:d00:13::1e5 S513-C-VM33 0513-C-IP33 P06636663U66968 cern-geneva-a main CH14 UPS-3
cephgabe-rgwxl-86d4c90cc6 cvmfs 137.138.33.24 2001:1458:d00:18::390 S513-V-VM936 0513-V-IP35 P06636688Q51842 cern-geneva-b vault SQ27 UPS-4
cephgabe-rgwxl-8930fc00f8 cvmfs 137.138.151.203 2001:1458:d00:12::3e0 S513-C-VM32 0513-C-IP32 P06636663N63480 cern-geneva-c main CH11 UPS-3
cephgabe-rgwxl-8ee4a698b7 cvmfs 137.138.44.245 2001:1458:d00:1a::24b S513-C-VM933 0513-C-IP33 P06636663J50924 cern-geneva-a main CH16 UPS-3
cephgabe-rgwxl-3e0d67a086 default 188.184.73.131 2001:1458:d00:4e::100:4ae S513-A-VM805 0513-A-IP561 I82006520073152 cern-geneva-c barn BC11 UPS-4/-C
cephgabe-rgwxl-652059ccf1 default 188.185.87.72 2001:1458:d00:3f::100:2bd S513-A-VM559 0513-A-IP559 I82006525008611 cern-geneva-a barn BC06 UPS-4/-C
cephgabe-rgwxl-8e7682cb81 default 137.138.158.145 2001:1458:d00:14::341 S513-V-VM35 0513-V-IP35 P06636688R71189 cern-geneva-b vault SQ28 UPS-4
cephgabe-rgwxl-91b6e0d6dd default 137.138.77.21 2001:1458:d00:1c::405 S513-C-VM931 0513-C-IP33 P06636663M67468 cern-geneva-a main CH13 UPS-3
cephgabe-rgwxl-895920ea1a gitlab 137.138.158.221 2001:1458:d00:14::299 S513-V-VM35 0513-V-IP35 P06636688H41037 cern-geneva-b vault SQ29 UPS-4
cephgabe-rgwxl-9e3981c77a gitlab 137.138.154.49 2001:1458:d00:13::3a S513-C-VM33 0513-C-IP33 P06636663J50924 cern-geneva-a main CH16 UPS-3
cephgabe-rgwxl-dbb0bcc513 gitlab 188.184.102.175 2001:1458:d00:3b::100:2a9 S513-C-VM852 0513-C-IP852 I78724428177369 cern-geneva-c main EK03 UPS-2
cephgabe-rgwxl-26774321ac jec-data 188.185.10.120 2001:1458:d00:63::100:39a S513-V-VM902 0513-V-IP402 I88681450454656 cern-geneva-a vault SP23 UPS-4
cephgabe-rgwxl-a273d35b9d jec-data 188.185.19.171 2001:1458:d00:65::100:32a S513-V-VM406 S513-V-IP406 I88681458914473 cern-geneva-b vault SP27 UPS-4
cephgabe-rgwxl-d91c221898 jec-data 137.138.155.51 2001:1458:d00:13::14d S513-C-VM33 0513-C-IP33 P06636663Y16806 cern-geneva-a main CH15 UPS-3
cephgabe-rgwxl-75569ebe5c prometheus 137.138.149.253 2001:1458:d00:12::52f S513-C-VM32 0513-C-IP32 P06636663G98563 cern-geneva-c main CH04 UPS-3
cephgabe-rgwxl-7658b46c78 prometheus 188.185.9.237 2001:1458:d00:63::100:424 S513-V-VM902 0513-V-IP402 I88681457779137 cern-geneva-a vault SP24 UPS-4
cephgabe-rgwxl-05386c6cdb vistar 188.185.86.117 2001:1458:d00:3f::100:2d9 S513-A-VM559 0513-A-IP559 I82006526449210 cern-geneva-a barn BC05 UPS-4/-C
cephgabe-rgwxl-13f36a01c2 vistar 137.138.33.10 2001:1458:d00:18::1ee S513-V-VM936 0513-V-IP35 P06636688C41209 cern-geneva-b vault SQ29 UPS-4
cephgabe-rgwxl-6da6da7653 vistar 188.184.74.136 2001:1458:d00:4e::100:5d S513-A-VM805 0513-A-IP561 I82006527765435 cern-geneva-c barn BC13 UPS-4/-C

Reviewing a Cluster Status

  1. Check Grafana dashboards for unusual activity, patterns, memory usage:
  • https://filer-carbon.cern.ch/grafana/d/000000001/ceph-dashboard
  • https://filer-carbon.cern.ch/grafana/d/000000108/ceph-osd-mempools
  • https://filer-carbon.cern.ch/grafana/d/uHevna1Mk/ceph-hosts
  • For RGWs: https://filer-carbon.cern.ch/grafana/d/iyLKxjoGk/s3-rgw-perf-dumps
  • For CephFS: https://filer-carbon.cern.ch/grafana/d/000000111/cephfs-detail
  • etc...
  2. Login to the cluster mon and check various things:
  • ceph osd pool ls detail - are the pool flags correct? e.g. nodelete,nopgchange,nosizechange (see the sketch after this list)
  • ceph df - assess amount of free space for capacity planning
  • ceph osd crush rule ls, ceph osd crush rule dump - are the crush rules as expected?
  • ceph balancer status - as expected?
  • ceph osd df tree - are the PGs per OSD balanced and a reasonable number, e.g. < 100?
  • ceph osd tree out, ceph osd tree down - are there any OSDs that are not being replaced properly?
  • ceph config dump - is the configuration as expected?
  • ceph telemetry status - check from the config whether it is on; if not, enable it
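
If any of the protection flags are missing on a pool, they can be re-applied with ceph osd pool set. A minimal sketch, assuming a pool named volumes:

ceph osd pool set volumes nodelete true
ceph osd pool set volumes nopgchange true
ceph osd pool set volumes nosizechange true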

Clusters' priority

In case of a major incident (e.g., power cuts), revive clusters in the following order:

  1. Beesly (RBD1, main, UPS-3/4), Flax (CephFS, everywhere), Gabe (S3, vault, UPS-1/3)
  2. Vault (RBD2, vault, UPS-1), Levinson (CephFS SSD, vault, UPS-1), Meredith (RBD SSD, main, UPS-2)
  3. Ryan (RBD3, main, UPS-2), CTA (ObjectStore, vault, UPS-1)
  4. Jim, Dwight, Kelly, Pam (currently unused)
  5. Barn, Kopano -- should not go down, as they are in critical power
  6. NetHub -- 2nd network hub, Prevessin, diesel-backed (9/10 racks)

Hardware Specs


Test clusters

Cluster Use-case Mon alias Release Version Notes
cslab Test cluster for Network Lab (RQF2068297,CEPH-1348) cephcslab pacific 16.2.13-5 Binds to IPv6 only; 3 hosts Alma8 + 3 RHEL8
miniflax Mini cluster mimicking Flax None (ceph/miniflax/mon) quincy 17.2.7-1
minigabe Mini cluster mimicking Gabe (zone groups) cephminigabe pacific 16.2.13-6 RGW on minigabe-831ffcf9f9; Beast on 8080; RGW DNS: cephminigabe
next RC and Cloud next region testing cephnext01 quincy 17.2.6-4

Preparing a new delivery

Flavor per rack

We now want to have flavors per rack for our Ceph clusters; please remind the Ironic/CF people to create them when a new delivery is installed!

Setting root device hints

We set root device hints on every new delivery so that we can be certain that Ironic installs the OS on the right drive (and if the corresponding drive fails the installation also fails).

There are multiple ways to set root device hints (see the OpenStack documentation). For our recent deliveries setting the model is typically sufficient to have only one possible drive for the root device.

To get the model of the drive you have to boot a node and get it from /sys/class, for instance: cat /sys/class/block/nvme0n1/device/model (you may also ask to get access to Ironic inspection data if it gets more complicated than that).

Then you can set the model on every node of the delivery. For instance, for delivery dl8642293 you would do:

export OS_PROJECT_NAME="IT Ceph Ironic"
openstack baremetal node list -f value | \
    grep dl8642293 | awk '{print $1}' | \
    xargs -L1 openstack baremetal node set --property root_device='{"model": "SAMSUNG MZ1LB960HAJQ-00007"}'

Review the list of matched nodes first (for instance by prepending echo to the openstack baremetal node set command); if it looks correct, run the pipeline to actually set the root device hints.

Check the root device hints were correctly set with:

export OS_PROJECT_NAME="IT Ceph Ironic"
openstack baremetal node list -f value | \
    grep dl8642293 | awk '{print $1}' | \
    xargs -L1 openstack baremetal node show -f json | jq .properties.root_device

Ceph Monitoring

About Ceph Monitoring

The monitoring system in Ceph is based on Grafana, using Prometheus as datasource and the native ceph prometheus plugin as metric exporter. Prometheus node_exporter is used for node metrics (cpu, memory, etc).

For long-term metric storage, Thanos is used to store metrics in S3 (Meyrin).

Access the monitoring system

  • All Ceph monitoring dashboards are available in monit-grafana (Prometheus). Although prometheus is the main datasource for ceph metrics, some plots/dashboards may still require the legacy Graphite datasource.

  • The prometheus server is configured in the host cephprom.cern.ch, hostgroup ceph/prometheus

  • Configuration files (Puppet):

    • it-puppet-hostgroup-ceph/code/manifests/prometheus.pp
    • it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml
    • it-puppet-hostgroup-ceph/data/hostgroup/ceph.yaml
    • Alertmanager templates: it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl
    • Alert definition: it-puppet-hostgroup-ceph/code/files/generated_rules/
  • Thanos infrastructure is under ceph/thanos hostgroup, configured via the corresponding hiera files.

An analogous qa infrastructure is also available, with all components replicated (cephprom-qa, thanos-store-qa, etc). This qa infra is configured by overriding the puppet environment:

  • it-puppet-hostgroup-ceph/data/hostgroup/ceph/environments/qa.yaml

Add/remove a cluster to/from the monitoring system

  • Enable the prometheus mgr module in the cluster:
ceph mgr module enable prometheus

NOTE: Make sure that the port 9283 is accepting connections.
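
As a quick sanity check, you can verify that the active mgr answers on that port (replace the placeholder with one of the cluster's mgr hosts):

curl -s http://<mgr-host>:9283/metrics | head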

Instances that include the hg_ceph::classes::mgr class will be automatically discovered through puppetdb and scraped by prometheus.

  • To ensure that we don't lose metrics during mgr failovers, all the cluster mgrs will be scraped. As a side benefit, we can monitor the online status of the mgrs.
  • Run or wait for a puppet run on cephprom.cern.ch.

Add/remove a node for node metrics (cpu, memory, etc)

Instances that include the prometheus::node_exporter class (anything under ceph top hostgroup) will be automatically discovered through puppetdb and scraped by prometheus.

Add/remove an alert rule to/from the monitoring system

Alerts are defined in yaml files managed by puppet in:

  • it-puppet-hostgroup-ceph/files/prometheus/generated_rules

They are organised by service, so add the alert in the appropriate file (e.g.: ceph alerts in alerts_ceph.yaml). The file rules.yaml is used to add recording rules.

There are 3 notification channels currently: e-mail, SNOW ticket and Mattermost message.

Before creating the alert, make sure you test your query in advance, for example using the Explore panel on Grafana. Once the query is working, proceed with the alert definition.

A prometheus alert could look like this:

rules:
  - alert: "CephOSDReadErrors"
    annotations:
      description: "An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel."
      documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors"
      summary: "Device read errors detected on cluster {{ $labels.cluster }}"
    expr: "ceph_health_detail{name=\"BLUESTORE_SPURIOUS_READ_ERRORS\"} == 1"
    for: "30s"
    labels:
      severity: "warning"
      type: "ceph_default"
  • alert: Mandatory. Name of the alert, which will be part of the subject of the email, the head of the ticket and the title of the mattermost notification. Try to follow the same pattern as the ones already created (CephDAEMONAlert): daemon in uppercase and the rest in camel case.
  • expr: Mandatory. PromQL query that defines the alert. The alert will trigger if the query returns one or more matches. It's a good exercise to use promdash for tuning the query to ensure that it is well formed.
  • for: Mandatory. The alert will be triggered if it stays active for more than the specified time (e.g. 30s, 1m, 1h).
  • annotations:summary: Mandatory. Expresses the actual alert in a concise way.
  • annotations:description: Optional. Allows specifying more detailed information about the alert when the summary is not enough.
  • annotations:documentation: Optional. Allows specifying the url of the documentation/procedure to follow to handle the alert.
  • labels:severity: Mandatory. Defines the notification channel to use, based on the following:
    • warning/critical: Sends an e-mail to ceph-alerts.
    • ticket: Sends an e-mail AND creates a SNOW ticket.
    • mattermost: Sends an e-mail AND sends a Mattermost message to the ceph-bot channel.
  • labels:type: Optional. Allows distinguishing alerts created upstream (ceph_default) from those created by us (ceph_cern). It has no actual implication on the alert functionality.
  • labels:xxxxx: Optional. You can add custom labels that can be used in the template.

NOTES

  • In order for the templating to work as expected, make sure that the labels cluster or job_name are part of the resulting query. In case the query does not preserve labels (like count does), you can specify the label and value manually in the labels section of the alert definition.
  • All annotations, if defined, will appear in the body of the ticket, e-mail or mattermost message generated by the alert.
  • Alerts are evaluated against the local prometheus server, which contains metrics for the last 7 days. Take that into account when defining alerts that evaluate longer periods (like predict_linear). In such cases, you can create the alert in Grafana using the Thanos-LTMS metric datasource (more on that later in this doc).
  • In grafana or promdash you can access the alerts by querying the metric called ALERTS (see the example after this list)
  • For more information about how to define an alert, refer to the Prometheus Documentation
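
For example, a minimal way to list the currently firing alerts from the command line, assuming the Prometheus server listens on the default port 9090 on cephprom:

curl -sG http://cephprom.cern.ch:9090/api/v1/query --data-urlencode 'query=ALERTS{alertstate="firing"}'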

Create / Link procedure/documentation to Prometheus Alert.

Prometheus alerts are pre-configured to show the procedure needed for handling the alert via the annotation procedure_url. This is an optional argument that could be configured per alert rule.

Step 1: Create the procedure in case it does not exist yet.

Update the file rota.md in this repository and add the new procedure. Use this file for convenience, but you can create a new file if needed.

Step 2: Edit the alert rule and link to the procedure.

Edit the alert following instructions above, and add the link to the procedure under the annotations section, under the key documentation, for example:

- alert: "CephMdsTooManyStrays"
    annotations:
      documentation: "http://s3-website.cern.ch/cephdocs/ops/rota.html#cephmdstoomanystrays"
      summary: "The number of strays is above 500K"
    expr: "ceph_mds_cache_num_strays > 500000"
    for: "5m"
    labels:
      severity: "ticket"

Push the changes and the prometheus server will reload automatically, picking up the new changes. Next time the alert is triggered, a link to the procedure will be shown in the alert body.

Silence Alarms

You can use the alertmanager Web Interface to silence alarms during scheduled interventions. Please always specify a reason for silencing the alarms (a JIRA link or ticket would be a plus). Additionally, for the alerts that generate an e-mail, you will find a link to silence it in the email body.

Alert Grouping

Alert grouping is enabled by default, so if the same alert is triggered in different nodes, we only receive one ticket with all involved nodes.

Modifying AlertManager Templates

Both email and SNOW ticket templates are customizable. To do that, you need to edit the following puppet file:

  • it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl

You have to use Golang's Template syntax. The structure of the file is as follows:

{{ define "ceph.email.subject" }}
....
{{ end }}
{{ define "ceph.email.body" }}
....
{{ end }}

For reference check the default AlertManager Templates

In case you add templates make sure that you adapt the AlertManager configuration accordingly:

- name: email
  email_configs:
  - to: ceph-admins@cern.ch
    from: alertmanager@localhost
    smarthost: cernmx.cern.ch:25
    headers:
      Subject: '{{ template "ceph.email.subject" . }}'
    html: '{{ template "ceph.email.body" . }}'

Note A restart of AlertManager is needed for the changes to be applied.

Accessing the prometheus dashboard (promdash)

The prometheus dashboard, or Dashprom, is a powerful interface that allows you to quickly assess the prometheus server status and also provides a quick way of querying metrics. The prometheus dashboard is accessible from this link: Promdash.

  • The prometheus dashboard is useful for:
    • Checking the status of all targets: Target status
    • Check the status of the alerts Alert Status
    • For debug purposes, you can execute PromQL queries directly on the dashboard and change the intervals quickly.
    • In grafana there is an icon just near the metric definition to view the current query in promdash.
    • You can also use the Grafana Explorer.

Note: This will only give you access to the metrics of the last 7 days, refer to the next chapter for accessing older metrics.

Long Term Metric Storage - LTMS

The long-term metrics are kept in the CERN S3 service using Thanos. The bucket is called prometheus-storage and is accessed using the EC2 credentials of Ceph's Openstack project. Accessing these metrics is transparent from Grafana:

  • Metrics of the last 7 days are served directly from prometheus local storage
  • Older metrics are pulled from S3.
  • As metrics in S3 contain downsampled versions (5m, 1h), it is usually much faster than getting metrics from the local prometheus.
  • RAW metrics are also kept, so it is possible to zoom in to the 15-second resolution

Accessing the thanos dashboard

There is a thanos promdash version here, from where you can access all historical metrics. This dashboard has some specific thanos features like deduplication (for use cases with more than one prometheus server scraping the same data) and the possibility of showing downsampled data (thanos stores two downsampled versions of the metrics, with 1h and 5m resolution). This downsampled data is also stored in S3.

Thanos Architecture

You can find more detailed information on the Thanos official webpage, but this is the list of active components in our current setup with a high-level description of what they do:

Sidecar

  • Every time Prometheus dumps its data to disk (by default, every 2 hours), the thanos-sidecar uploads the metrics to the S3 bucket. It also acts as a proxy that serves Prometheus's local data.

Store

  • This is the storage proxy which serves the metrics stored in S3

Querier

  • This component reads the data from store(s) and sidecar(s) and answers PromQL queries using the standard Prometheus HTTP API. This is the component to point monitoring dashboards at.

Compactor

  • This is a detached component which compacts the data in S3 and also creates the downsampled versions.

Operating the Ceph Monitors (ceph-mon)

Adding ceph-mon daemons (VM, jewel/luminous)

Upstream documentation at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/

Create the machine for the mon

Normally we create ceph-mon's as VMs in the ceph/{hg_name}/mon hostgroup.

Example: Adding a monitor to the ceph/test cluster:

  • First, source the IT Ceph Storage Service environment on aiadm: link
  • Then create a virtual machine with the following parameters:
  • main-user/responsible: ceph-admins (the user of the VM)
  • VM Flavor: m2.2xlarge (monitors must withstand heavy loads)
  • OS: Centos7 (the preferred OS used in CERN applications)
  • Hostgroup: ceph/test/mon (Naming convention for puppet configuration)
  • VM name: cephtest-mon- (We use prefix to generate an id)
  • Availability zone: usually cern-geneva-[a/b/c]

Example command: (It will create a VM with the above parameters)

$ ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins --nova-flavor m2.2xlarge
--cc7 -g ceph/test/mon --prefix cephtest-mon-  --nova-availabilityzone cern-geneva-a
--nova-sshkey {your_openstack_key}

This command will create a VM named cephtest-mon-XXXXXXXXXX in the ceph/test/mon hostgroup. Puppet will take care of the initialization of the machine

When you deploy a monitor server, you have to choose an availability zone. We tend to use different availability zones to avoid a single point of failure.

Set roger state and enable alarming

Set the appstate and app_alarmed parameters if necessary

Example: Get the roger data for the VM cephtest-mon-d8788e3256

$ roger show cephtest-mon-d8788e3256

The output should be something similar to this:

[
    {
        "app_alarmed": false,
        "appstate": "build",
        "expires": "",
        "hostname": "cephtest-mon-d8788e3256.cern.ch",
        "hw_alarmed": true,
        "message": "",
        "nc_alarmed": true,
        "os_alarmed": true,
        "update_time": "1506418702",
        "update_time_str": "Tue Sep 26 11:38:22 2017",
        "updated_by": "tmourati",
        "updated_by_puppet": false
    }
]

You need to set the machine's state to "production", so it can be used in production.

The following command will set the target VM to production state:

$ roger update --appstate production --all_alarms=true cephtest-mon-XXXXXXXXXX

Now the roger show {host} should show something like this:

[
    {
        "app_alarmed": true,
        "appstate": "production",
        "..."
    }
]

We now let puppet configure the machine. This will take some time, as it needs about two configuration cycles to apply the desired changes. After the second cycle you can SSH (as root) to the machine to check if everything is ok.

For example you can check the cluster's status with $ ceph -s

You should see the current host in the monitor quorum.
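
For a quick check of the quorum membership (in addition to ceph -s), you can also run:

ceph mon stat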

Details on lbalias for mons

We prefer not to use the load-balancing service and lbclient here (https://configdocs.web.cern.ch/dnslb/). There is no scenario in ceph where we want a mon to disappear from the alias.

For a bare metal node

We rather use the --LOAD-N- approach to create the alias with all the mons:

  • Go to network.cern.ch
  • Click on Update information and use the FQDN of the mon machine
    • If prompted, make sure you pick the host interface and not the IPMI one
  • Add "ceph{hg_name}--LOAD-N-" to the IP Aliases list under TCP/IP Interface Information
  • Multiple aliases are supported. Use a comma-separated list
  • Check the changes are correct and submit the request

For an OpenStack VM

In the case of a VM, we can't directly set an alias, but we can set a property in OpenStack to the same effect:

  • Log onto aiadm or lxplus
  • Set your environment variables to the correct tenant, e.g. `eval $(ai-rc 'Ceph Development')`
    • Check the vars are what you expect with env | grep OS, paying attention to OS_region
  • Set the alias using openstack with openstack server set --property landb-alias=CEPH{hg_name}--LOAD-N- {hostname}, as in the example below
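
For example, a sketch for a mon of a hypothetical test cluster (alias value and hostname are placeholders following the pattern above):

eval $(ai-rc 'Ceph Development')
openstack server set --property landb-alias=CEPHTEST--LOAD-1- cephtest-mon-XXXXXXXXXX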

Removing a ceph-mon daemon (jewel)

Upstream documentation at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/

Prerequisites

  1. The cluster must be in HEALTH_OK state, i.e. the monitor must be in a healthy quorum.
  2. You should have a replacement for the current monitor already in the quorum, and there should be enough monitors so that the cluster stays healthy after one monitor is removed. Normally this means having about 4 monitors in the quorum before starting.

Procedure

  1. Disable puppet: $ puppet agent --disable 'decommissioning mon'
  2. (If needed) remove the DNS alias from this machine and wait until it is so:
- For physical machines, visit http://network.cern.ch → "Update Information".
- For a VM monitor, you can remove the alias from the `landb-alias` property. See [Cloud Docs](https://clouddocs.web.cern.ch/clouddocs/using_openstack/properties.html)
  3. Check if monitor is ok-to-stop: $ ceph mon ok-to-stop <hostname>
  4. Stop the monitor: $ systemctl stop ceph-mon.target. You should now get a HEALTH_WARN status by running $ ceph -s, for example 1 mons down, quorum 1,2,3,4,5.
  5. Remove the monitor's configuration, data and secrets with:
```sh
$ rm /var/lib/ceph/tmp/keyring.mon.*
$ rm -rf /var/lib/ceph/mon/<hostname>
```
  6. Remove the monitor from the ceph cluster:
```sh
$ ceph mon rm <hostname>
removing mon.<hostname> at <IP>:<port>, there will be 5 monitors
```
  7. You should now have a HEALTH_OK status after the monitor removal.
  8. (If monitored by prometheus) remove the hostname from the list of endpoints to monitor. See it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml

For machines hosting only the ceph-mon

  1. Move this machine to a spare hostgroup: $ ai-foreman updatehost -c ceph/spare {hostname}

  2. Run puppet once: $ puppet agent -t

  3. (If physical) Reinstall the server in the ceph/spare hostgroup:

```sh
aiadm> ai-installhost p01001532077488
...
1/1 machine(s) ready to be installed
Please reboot the host(s) to start the installation:
ai-remote-power-control cycle p01001532077488.cern.ch
aiadm> ai-remote-power-control cycle p01001532077488.cern.ch
```

Now the physical machine is installed in the ceph/spare hostgroup.

  4. (If virtual) Kill the VM with: $ ai-kill-vm {hostname}

For machines hosting other ceph-daemons

  1. Move this machine to another hostgroup (e.g., /osd) of the same cluster: $ ai-foreman updatehost -c ceph/<cluster_name>/osd {hostname}
  2. Run puppet to apply the changes: $ puppet agent -t

Operating the Ceph Metadata Servers (ceph-mds)

Adding a ceph-mds daemon (VM, luminous)

Upstream documentation here: http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-mds/

The procedure follows the same pattern as adding a monitor node (create_a_mon) to the cluster.

Make sure you add your mds to the corresponding hostgroup ceph/<cluster>/mds and prepare the Puppet code (check other ceph clusters with cephfs as a reference)

Example for the ceph/mycluster hostgroup:

$ ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins \
 --nova-flavor m2.2xlarge --cc7 -g ceph/<mycluster>/mds --prefix ceph<mycluster>-mds- \
 --nova-availabilityzone cern-geneva-a

Note: When deploying more than one mds, make sure that they are spread across different availability zones.

As written in the upstream documentation, a ceph filesystem needs at least two metadata servers. The first will be the main server that handles the clients' requests and the second one is the backup. Also, don't forget to put the metadata servers into different availability zones, in case a problem occurs at one site.

Because of resource limitations, the flavor of the machines could be m2.xlarge instead of m2.2xlarge. In the ceph/<mycluster> cluster we use 2 m2.2xlarge main servers and one m2.xlarge backup server.

When the machine is available (reachable by the dns service), you can alter its state into production with roger.

$ roger update --appstate production --all_alarms=true ceph<mycluster>-mds-XXXXXXXXXX

After 2-3 runs of puppet, the new mds will be fully configured and will join the cluster.

Using additional metadata servers (luminous)

Upstream documentation here: http://docs.ceph.com/docs/master/cephfs/multimds/

When your cephfs system can't handle the amount of client requests and you notice warnings about the mds or slow requests in ceph status, you may need to use multiple active metadata servers.

After adding an mds to the cluster, you will notice on the mds line of ceph status something like the following:

mds: cephfs-1/1/1 up  {0=cephironic-mds-716dc88600=up:active}, 1 up:standby-replay, 1 up:standby

The 1 up:standby-replay is the backup server and the 1 up:standby that just appeared is the mds we added. To make the standby server active, we need to execute the following:

WARNING: Your cluster may have multiple filesystems, use the right one!

ceph fs set <fs_name> max_mds 2

The name of the ceph filesystem can be retrieved by using $ ceph fs ls and looking for the name: <fs_name> key-value pair.
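
For illustration, the output looks something like this (the names are examples):

$ ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]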

Now your ceph status message should look like this:

...
mds: cephfs-2/2/2 up  {0=cephironic-mds-716dc88600=up:active,1=cephironic-mds-c4fbd7ee74=up:active}, 1 up:standby-replay
...

OSD Replacement Procedures

Check which disks need to be put back in production.

  • To see which osds are down, check with ceph osd tree down out.
[09:28][root@p06253939p44623 (production:ceph/beesly/osd*24) ~]# ceph osd tree down out
ID  CLASS WEIGHT     TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       5589.18994 root default                                             
 -2       4428.02979     room 0513-R-0050                                     
 -6        917.25500         rack RA09                                        
 -7        131.03999             host p06253939j03957                         
430          5.45999                 osd.430            down        0 1.00000
-19        131.03999             host p06253939s09190                         
 24          5.45999                 osd.24             down        0 1.00000
405          5.45999                 osd.405            down        0 1.00000
 -9        786.23901         rack RA13                                        
-11        131.03999             host p06253939b84659                         
101          5.45999                 osd.101            down        0 1.00000
-32        131.03999             host p06253939u19068                         
577          5.45999                 osd.577            down        0 1.00000
-14        895.43903         rack RA17                                        
-34        125.58000             host p06253939f99921                         
742          5.45999                 osd.742            down        0 1.00000
-22        125.58000             host p06253939h70655                         
646          5.45999                 osd.646            down        0 1.00000
659          5.45999                 osd.659            down        0 1.00000
718          5.45999                 osd.718            down        0 1.00000
-26        131.03999             host p06253939v20205                         
650          5.45999                 osd.650            down        0 1.00000
-33        131.03999             host p06253939w66726                         
362          5.45999                 osd.362            down        0 1.00000
654          5.45999                 osd.654            down        0 1.00000
  • Check the tickets for the machines in Service Now. The ones of interest to us are named: [GNI] exception.scsi_blockdevice_driver_error_reported or exception.nonwriteable_filesystems.
    • If the repair service replaced the disk(s), it will be noted in the ticket, so you can continue with the next step.

On the OSD:

LVM formatting using ceph-volume

  • Simple format: osd as logical volume of one disk

This is a sample output of listing the disks in lvm fashion. You will notice that the number of devices (disks) in each osd is one. Also, these devices don't use any ssds for a performance boost.

(Ceph volume listing takes some time to complete)

[13:55][root@p05972678e21448 (production:ceph/erin/osd*30) ~]# ceph-volume lvm list

===== osd.335 ======

  [block]    /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba

      type                      block
      osd id                    335
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      encrypted                 0
      cephx lockbox secret      
      block uuid                PXCHQW-4aXo-isAR-NdYU-3FQ2-E18Q-whJa92
      block device              /dev/ceph-d0befafb-3e7c-4ffc-ab25-ef01c48e69ac/osd-block-c2fd8d2e-8a38-42b7-a03c-9285f2b973ba
      vdo                       0
      crush device class        None
      devices                   /dev/sdw

===== osd.311 ======

  [block]    /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e

      type                      block
      osd id                    311
      cluster fsid              eecca9ab-161c-474c-9521-0e5118612dbb
      cluster name              ceph
      osd fsid                  1bfad506-c450-4116-8ba5-ac356be87a9e
      encrypted                 0
      cephx lockbox secret      
      block uuid                O5fYcf-aGW8-NVWC-lr5G-BsuC-Yx3H-WZl24a
      block device              /dev/ceph-b7be4aa7-0c7f-4786-a214-66116420a2cc/osd-block-1bfad506-c450-4116-8ba5-ac356be87a9e
      vdo                       0
      crush device class        None
      devices                   /dev/sdt

This is an example of an osd that uses an ssd for its metadata. It has a db part in which the metadata is stored.

[14:04][root@p06253939e35392 (production:ceph/dwight/osd*24) ~]# ceph-volume lvm list


====== osd.29 ======

  [block]    /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48

      type                      block
      osd id                    29
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  dff889e7-5db5-4c5e-9aab-151e8ad17b48
      db device                 /dev/sdac3
      encrypted                 0
      db uuid                   9762cd49-8f1c-4c29-88ca-ff78f6bdd35c
      cephx lockbox secret      
      block uuid                HuzwbL-mVvi-Ubve-1C5D-fjeh-dmZq-ivNNnY
      block device              /dev/ceph-f06dffa2-a9d8-47da-9af2-b4d4f9260557/osd-block-dff889e7-5db5-4c5e-9aab-151e8ad17b48
      crush device class        None
      devices                   /dev/sdk

  [  db]    /dev/sdac3

      PARTUUID                  9762cd49-8f1c-4c29-88ca-ff78f6bdd35c

====== osd.88 ======

  [block]    /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558

      type                      block
      osd id                    88
      cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
      cluster name              ceph
      osd fsid                  f19541f6-42b2-4612-a700-ec5ac8ed4558
      db device                 /dev/sdab6
      encrypted                 0
      db uuid                   f0b652e1-0161-4583-a50b-45a0a2348e9a
      cephx lockbox secret      
      block uuid                cHqcZG-wsON-P9Lw-4pTa-R1pd-GUwR-iqCMBg
      block device              /dev/ceph-14f3cd75-764b-4fa4-96a3-c2976d0ad0e5/osd-block-f19541f6-42b2-4612-a700-ec5ac8ed4558
      crush device class        None
      devices                   /dev/sdu

  [  db]    /dev/sdab6

      PARTUUID                  f0b652e1-0161-4583-a50b-45a0a2348e9a

One way is to have an ssd and do simple partitioning. Each partition will be attached to an osd. If the ssd part is broken, e.g. the disk failed, all the osds that use this ssd will be rendered useless; therefore each of those osds has to be replaced. There is also a chance that the ssd is formatted through lvm; in that case, the metadata database part will look like this:

[  db]    /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85

    type                      db
    osd id                    220
    cluster fsid              e7681812-f2b2-41d1-9009-48b00e614153
    cluster name              ceph
    osd fsid                  81f9ed48-d27d-44b6-9ac0-f04799b5d0d5
    db device                 /dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
    encrypted                 0
    db uuid                   wZnT18-ZcHh-jcKn-pHic-LCrL-Gpqh-Srq1VL
    cephx lockbox secret      
    block uuid                z8CwSn-Iap5-3xDX-v0NA-9smI-EHx1-EGxdAR
    block device              /dev/ceph-6bb8b94b-4974-44b0-ae6c-667896807328/osd-a59a8661-c966-443b-9384-b2676a3d42d8
    vdo                       0
    crush device class        None
    devices                   /dev/md125

Replacement procedure: one disk per osd

  1. ceph-volume lvm list is slow, save its output to ~/ceph-volume.out and work with that file instead.
  2. Check if the ssd device exists and it is failed.
  3. Check if it is used as a metadata database for osds, or as a regular osd.
    1. If it is a metadata database:
      1. Locate all osds that use it (lvm list + grep)
      2. Follow the procedure for each affected osd
    2. Treat it as a regular osd (normal replacement)
  4. Mark out the osd: ceph osd out $OSD_ID
  5. Destroy the osd: ceph osd destroy $OSD_ID --yes-i-really-mean-it
  6. Stop the osd daemon: systemctl stop ceph-osd@$OSD_ID
  7. Unmount the filesystem: umount /var/lib/ceph/osd/ceph-$OSD_ID
  8. If the osd uses a metadata database (on ssds):
    1. If it is a regular partition, remove the partition
    2. If it's an lvm, remove it:
      1. eg for "/dev/cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85"
      2. lvremove cephrocks/cache-b0bf5753-07b7-40eb-bf3a-cc6f35de1d85
  9. Run ceph-volume lvm zap /dev/sdXX --destroy
  10. In case ceph-volume fails to list the defective devices or zap the disks, you can get the information you need through lvs -o devices,lv_tags | grep type=block and use vgremove instead for the osd block.
  11. In case you can't get any information through ceph-volume or lvs about the defective devices, you should list the working osds and umount the unused folders with:
    $ umount `ls -d -1 /var/lib/ceph/osd/* | grep -v -f <(grep -Po '(?<=osd\.)[0-9]+' ceph-volume.out)`
    
  12. Now you should wait until the devices have been replaced. Skip this step if they have already been replaced.
  13. If the osd had a metadata database used elsewhere (ssd), you should prepare it again in the lvm case. For naming we use cache-`uuid -v4`. Just recreate the lvm you removed at step 8 with: lvcreate --name $name -l 100%FREE $VG. Lvm has three categories: PVs, which are the physical devices (e.g. /dev/sda); VGs, which are the volume groups that contain one or more physical devices; and LVs, which are the "partitions" of VGs. For simplicity we use 1 PV per VG, and one LV per VG. In case you have more than one LV per VG, when you recreate it use, e.g. for 4 LVs per VG, 25%VG instead of 100%FREE.
  14. Recreate the OSD using ceph volume, use a destroyed osd's id from the same host
    $ ceph-volume lvm create --bluestore --data /dev/sdXXX --block.db (VG/LV or ssd partition) --osd-id XXX
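
To summarise the simple case above (one disk per osd, no separate db device), a minimal end-to-end sketch, assuming osd.123 backed by /dev/sdx:

OSD_ID=123
ceph osd out $OSD_ID
ceph osd destroy $OSD_ID --yes-i-really-mean-it
systemctl stop ceph-osd@$OSD_ID
umount /var/lib/ceph/osd/ceph-$OSD_ID
ceph-volume lvm zap /dev/sdx --destroy
# ... wait for the physical disk to be replaced, then recreate the osd reusing the same id:
ceph-volume lvm create --bluestore --data /dev/sdx --osd-id $OSD_ID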
    

Replacement procedure: two disks striped (raid 0) per osd

  1. Run this script with the defective device ceph-scripts/tools/lvm_stripe_remove.sh /dev/sdXXX (it doesn't take a list of devices)
  2. The program will report what cleanup it did, you will need the 2nd and 3rd line, which are the two disks that make the failed osd and the last line which is the osd id.
  3. In case the script fails, you can open it (it is documented) and follow the steps manually.
  4. If you have more than one osd to replace you can repeat steps 1 and 2, as the 5th step can be done at the end.
  5. Pass the set of disks from step 1 after you have all of them working on this script:
    ceph-scripts/ceph-volume/striped-osd-prepare.sh /dev/sd[a-f]
    
    It uses ls inside so you can use wildcards if you don't want to write '/dev/sdX' all the time.
  6. It will output a list of commands to be executed in order, run all EXCEPT THE ceph-volume create one. Add at the end of the ceph-volume create line the argument --osd-id XXX with the number of the destroyed osd id, and run the command.

Retrieve metadata information from Openstack

Ceph is tightly integrated with Openstack, and the latter is the main access point to the storage from the user perspective. As a result, Openstack is the main source of information for the data stored on Ceph: project names, project owners, quotas, etc. Some notable exceptions remain, for example local S3 accounts on Gabe and the whole Nethub cluster.

This page collects some example of what it is possible to retrieve from Openstack to know better the storage items we manage.

The magic "services" project

To gain visibility on the metadata stored by Openstack, you need access to the services project in Openstack. Typically all members of ceph-admins are part of it. services is a special project with storage administrator capabilities that allows retrieving various pieces of information on the whole Openstack instance and on existing projects, compute resources, storage, etc...

Use the services project simply by setting:

OS_PROJECT_NAME=services

Openstack Projects

Get the list of openstack projects with their names and IDs:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack project list | head -n 10
+--------------------------------------+------------------------------------------------------------------+
| ID                                   | Name                                                             |
+--------------------------------------+------------------------------------------------------------------+
| 0000d664-f697-423b-8595-57aea89be355 | Stuff...                                                         |
| 0007808b-2f41-41c5-bd7c-3bd1f1f94cb2 | Other stuff...                                                   |
| 00100a6d-b71c-415d-9dbc-3f78c2b8372a | Stuff continues...                                               |
| 001d902d-f76e-4222-a5d0-ca6529e8221f | ...                                                              |
| 0026e800-f134-4622-b0ef-4a03283a3965 | ...                                                              |
| 00292adf-92ad-4815-966c-a9296266b0a0 | ...                                                              |
| 004b5668-4ebe-418d-83bc-1cdadf059c85 | ...                                                              |

Get details of a project:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack project show 5d8ea54e-697d-446f-98f3-da1ce8f8b833
+-------------+--------------------------------------+
| Field       | Value                                |
+-------------+--------------------------------------+
| chargegroup | af9298f2-041b-0944-7904-3b41fde4f97f |
| chargerole  | default                              |
| description | Ceph Storage Service                 |
| domain_id   | default                              |
| enabled     | True                                 |
| fim-lock    | True                                 |
| fim-skip    | True                                 |
| id          | 5d8ea54e-697d-446f-98f3-da1ce8f8b833 |
| is_domain   | False                                |
| name        | IT Ceph Storage Service              |
| options     | {}                                   |
| parent_id   | default                              |
| tags        | ['s3quota']                          |
| type        | service                              |
+-------------+--------------------------------------+

Identify the owner of a project:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack role assignment list --project 5d8ea54e-697d-446f-98f3-da1ce8f8b833 --names --role owner
+-------+------------------+-------+---------------------------------+--------+--------+-----------+
| Role  | User             | Group | Project                         | Domain | System | Inherited |
+-------+------------------+-------+---------------------------------+--------+--------+-----------+
| owner | dvanders@Default |       | IT Ceph Storage Service@Default |        |        | False     |
+-------+------------------+-------+---------------------------------+--------+--------+-----------+

Openstack Volumes

List the RBD volumes in a project:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack volume list --project 5d8ea54e-697d-446f-98f3-da1ce8f8b833
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+
| ID                                   | Name               | Status    | Size | Attached to                                                   |
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+
| 5143d9e4-8470-4ac4-821e-57ef99f24060 | buildkernel        | in-use    |  200 | Attached to 8afce55e-313f-432c-a764-b0ada783a268 on /dev/vdb  |
| c0f1a9f7-8308-412a-92da-afcc20db3c4c | clickhouse-data-01 | available |  500 |                                                               |
| 53406846-445f-4f47-b4c5-e8558bb1bbed | cephmirror-io1     | in-use    | 3000 | Attached to dfc9a14a-ff4b-490a-ab52-e6c9766205ad on /dev/vdc  |
| c2c31270-0b95-4e28-9ac0-6d9876ea7f32 | metrictank-data-01 | in-use    |  500 | Attached to fbdff7a0-7b5b-47c0-b496-5a8afcc8e528 on /dev/vdb  |
+--------------------------------------+--------------------+-----------+------+---------------------------------------------------------------+

Show details of a volume:

[ebocchi@aiadm81 ~]$ OS_PROJECT_NAME=services openstack volume show c0f1a9f7-8308-412a-92da-afcc20db3c4c
+--------------------------------+-------------------------------------------+
| Field                          | Value                                     |
+--------------------------------+-------------------------------------------+
| attachments                    | []                                        |
| availability_zone              | ceph-geneva-1                             |
| bootable                       | false                                     |
| consistencygroup_id            | None                                      |
| created_at                     | 2021-11-04T08:34:51.000000                |
| description                    |                                           |
| encrypted                      | False                                     |
| id                             | c0f1a9f7-8308-412a-92da-afcc20db3c4c      |
| migration_status               | None                                      |
| multiattach                    | False                                     |
| name                           | clickhouse-data-01                        |
| os-vol-host-attr:host          | cci-cinder-qa-w01.cern.ch@beesly#standard |
| os-vol-mig-status-attr:migstat | None                                      |
| os-vol-mig-status-attr:name_id | None                                      |
| os-vol-tenant-attr:tenant_id   | 5d8ea54e-697d-446f-98f3-da1ce8f8b833      |
| properties                     |                                           |
| replication_status             | None                                      |
| size                           | 500                                       |
| snapshot_id                    | None                                      |
| source_volid                   | None                                      |
| status                         | available                                 |
| type                           | io1                                       |
| updated_at                     | 2021-11-04T08:35:15.000000                |
| user_id                        | tmourati                                  |
+--------------------------------+-------------------------------------------+

Show the snapshots for a volume in a project:

[ebocchi@aiadm84 ~]$ OS_PROJECT_NAME=services openstack volume snapshot list --project 79b9e379-f89d-4b3a-9827-632b9bf16e98 --volume d182a910-b40a-4dc0-89b7-890d6fa01efd
+--------------------------------------+-------------------+-------------+-----------+-------+
| ID                                   | Name              | Description | Status    |  Size |
+--------------------------------------+-------------------+-------------+-----------+-------+
| 798d06dc-6af4-420d-89ce-1258104e1e0f | snapv_webstuff03  |             | available | 30000 |
+--------------------------------------+-------------------+-------------+-----------+-------+

Watchers preventing images from being deleted

OpenStack colleagues might report problems purging images:

[root@cci-cinder-u01 ~]# rbd -c /etc/ceph/ceph.conf --id volumes --pool volumes trash ls
2ccb86bd4fca85 volume-3983f035-a47f-46e8-868c-04d2345c3786
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
8df764f0d51e64 volume-eb48e00f-ea31-4d28-91a1-4f8319724da7
99e74530298e95 volume-18fbb3e6-fb37-4547-8d27-dcbc5056c2b2
ebcc84aa45a3da volume-821b9755-dd42-4bf5-a410-384339a2d9f0

[root@cci-cinder-u01 ~]# rbd -c /etc/ceph/ceph.conf --id volumes --pool volumes trash purge
2021-02-17 15:42:46.911 7f674affd700 -1 librbd::image::PreRemoveRequest: 0x7f6744001880 check_image_watchers: image has watchers - not removing
Removing images: 0% complete...failed.

Find out who the watchers are by using the identifier on the left-hand side:

[15:52][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rados listwatchers -p volumes rbd_header.2ccb86bd4fca85
watcher=188.184.103.106:0/964233084 client.634461458 cookie=140076936413376

Get in touch with the owner of the machine. The easiest way to fix stuck watchers is to reboot the machine.

Further information (might require untrash) about the volume can be found with

[18:31][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rbd info volumes/volume-00067659-3d1e-4e22-a5d7-212aba108500
rbd image 'volume-00067659-3d1e-4e22-a5d7-212aba108500':
    size 500 GiB in 128000 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: e8df4c4fe1aa8f
    block_name_prefix: rbd_data.e8df4c4fe1aa8f
    format: 2
    features: layering, striping, exclusive-lock, object-map
    op_features:
    flags:
    stripe unit: 4 MiB
    stripe count: 1

and with (no untrash required)

[18:32][root@p05517715y58557 (production:ceph/beesly/mon*2:peon) ~]# rados stat -p volumes rbd_header.e8df4c4fe1aa8f
volumes/rbd_header.e8df4c4fe1aa8f mtime 2020-11-23 10:25:56.000000, size 0

Unpurgeable RBD image in trash

We have seen a case of an image in Beesly's trash that cannot be purged:

# rbd --pool volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b

# rbd --pool volumes trash purge
Removing images: 0% complete...failed.
2021-03-10 13:58:42.849 7f78b3fc9c80 -1 librbd::api::Trash: remove:
error: image is pending restoration.

When trying to delete manually, it says there are some watchers, but this is actually not the case:

# rbd --pool volumes trash remove 5afa5e5a07b8bc
rbd: error: image still has watchers2021-03-10 14:00:21.262 7f93ee8f8c80
-1 librbd::api::Trash: remove: error: image is pending restoration.
This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client
to timeout.
Removing image:
0% complete...failed.

# rados listwatchers -p volumes rbd_header.5afa5e5a07b8bc
#

This has been reported upstream. Check:

  • ceph-users with subject "Unpurgeable rbd image from trash"
  • ceph-tracker https://tracker.ceph.com/issues/49716

The original answer was

$ rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc key_file
$ hexedit key_file   ## CHANGE LAST BYTE FROM '01' to '00'
$ rados -p volumes setomapval rbd_trash id_5afa5e5a07b8bc --input-file key_file
$ rbd trash rm --pool volumes 5afa5e5a07b8bc

To unstuck the image and make it purgeable

  1. Get the value for its ID in rbd_trash
# rbd -p volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
[09:42][root@p05517715d82373 (qa:ceph/beesly/mon*2:peon) ~]# rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc key_file
Writing to key_file
  2. Make a safety copy of the original key_file
# cp -vpr key_file key_file_master
  3. Edit the key_file with a hex editor and change the last byte from '01' to '00'
# hexedit key_file
  4. Make sure the edited file contains only that change
# xxd key_file > file
# xxd key_file_master > file_master
# diff file file_master
5c5
< 0000040: 2a60 09c5 d416 00                        *`.....
---
> 0000040: 2a60 09c5 d416 01                        *`.....
  5. Set the edited file to be the new value
# rados -p volumes setomapval rbd_trash id_5afa5e5a07b8bc < key_file
  6. Get it back and check that the last byte is now '00'
# rados -p volumes getomapval rbd_trash id_5afa5e5a07b8bc
value (71 bytes) :
00000000  02 01 41 00 00 00 00 2b  00 00 00 76 6f 6c 75 6d  |..A....+...volum|
00000010  65 2d 30 32 64 39 35 39  66 65 2d 61 36 39 33 2d  |e-02d959fe-a693-|
00000020  34 61 63 62 2d 39 35 65  32 2d 63 61 30 34 62 39  |4acb-95e2-ca04b9|
00000030  36 35 33 38 39 62 12 05  2a 60 09 c5 d4 16 12 05  |65389b..*`......|
00000040  2a 60 09 c5 d4 16 00                              |*`.....|
00000047
  7. Now you can finally purge the image
# rbd -p volumes trash purge
Removing images: 100% complete...done.
# rbd -p volumes trash ls
#

Undeletable image due to linked snapshots

We had a ticket (RQF2003413) of a user unable to delete a volume because of linked snapshots.

Dump the RBD info available on Ceph using the volume ID (see openstack_info) of the undeletable volume:

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical info --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd 
rbd image 'volume-d182a910-b40a-4dc0-89b7-890d6fa01efd':
    size 29 TiB in 7680000 objects
    order 22 (4 MiB objects)
    snapshot_count: 1
    id: 457afdd323be829
    block_name_prefix: rbd_data.457afdd323be829
    format: 2
    features: layering
    op_features:
    flags:
    access_timestamp: Fri Mar 25 12:19:12 2022

The snapshot_count reports 1, which indicates one snapshot exists for the volume.

Now, list the snapshots for the undeletable volumes:

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical snap ls --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd 
SNAPID  NAME                                           SIZE    PROTECTED  TIMESTAMP
    37  snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f  29 TiB  yes

In turn, it is possible to create volumes from snapshots. To check if they exist, list the child(ren) volume(s) from snapshots

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical children --image volume-d182a910-b40a-4dc0-89b7-890d6fa01efd --snap snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f
cinder-critical/volume-b9d0035f-857c-46b6-b614-4480c462d306

This last one is a brand-new volume that still keeps a reference to the snapshot it originates from:

[root@cephdata21b-226814ead6 (qa:ceph/beesly/mon*50:peon) ~]# rbd --pool cinder-critical info --image volume-b9d0035f-857c-46b6-b614-4480c462d306
rbd image 'volume-b9d0035f-857c-46b6-b614-4480c462d306':
    size 29 TiB in 7680000 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 7f8067e3510b0d
    block_name_prefix: rbd_data.7f8067e3510b0d
    format: 2
    features: layering, striping, exclusive-lock, object-map
    op_features:
    flags:
    access_timestamp: Fri Mar 25 12:20:51 2022
    modify_timestamp: Fri Mar 25 12:36:48 2022
    parent: cinder-critical/volume-d182a910-b40a-4dc0-89b7-890d6fa01efd@snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f
    overlap: 29 TiB
    stripe unit: 4 MiB
    stripe count: 1

The parent field shows the volume comes from a snapshot, which cannot be deleted as the volume-from-snapshot is implemented as copy-on-write (see overlap: 29 TiB) via RBD layering.

OpenStack can flatten volumes-from-snapshots in case these need to be made independent from the parent. Alternatively, to delete the parent volume, it is required to delete both the volume-from-snapshot and the snapshot.
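
As an illustration of what flattening does at the RBD level, here is a minimal sketch using the image and snapshot names from the example above. In practice the operation should be driven through OpenStack/Cinder so its database stays consistent; the raw rbd commands are shown only to clarify the mechanism.

rbd flatten cinder-critical/volume-b9d0035f-857c-46b6-b614-4480c462d306    # copy the parent data into the clone, dropping the parent reference
rbd snap unprotect cinder-critical/volume-d182a910-b40a-4dc0-89b7-890d6fa01efd@snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f    # only once no children are left
rbd snap rm cinder-critical/volume-d182a910-b40a-4dc0-89b7-890d6fa01efd@snapshot-798d06dc-6af4-420d-89ce-1258104e1e0f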

Large omap object warning due to bucket index over limit

Large omap objects trigger HEALTH WARN messages and can be due to poorly sharded bucket indexes.

The following example reports an over-limit bucket on nethub detected on 2021/05/21.

  1. Look for Large omap object found. in the ceph logs (/var/log/ceph/ceph.log):
2021-05-21 04:34:00.879483 osd.867 (osd.867) 240 : cluster [WRN] Large omap object found. Object: 7:7bae080b:::.dir.fe32212d-631b-44fe-8d35-03f5a3551af1.142704632.29:head PG: 7.d01075de (7.de) Key count: 610010 Size (bytes): 198156342
2021-05-21 04:34:11.622372 mon.cephnethub-data-c116fa59b2 (mon.0) 659324 : cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)

These lines show that:

  • The pool suffering from the problem is pool number 7
  • The PG suffering is 7.de
  • The affected object is a bucket index shard: the .dir. prefix denotes bucket indexes
  • The affected bucket has id fe32212d-631b-44fe-8d35-03f5a3551af1.142704632.29 (sadly, there is no way to map it to a name)

To verify this is actually a bucket index, one can also check what pool #7 stores:

[14:21][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# ceph osd pool ls detail | grep "pool 7"
pool 7 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 30708 lfor 0/0/2063 flags hashpspool,nodelete,nopgchange,nosizechange stripe_width 0 application rgw
  2. Run radosgw-admin bucket limit check to see how bucket index sharding is doing. It might take a while; it is recommended to dump the output to a file.
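For example (the output path is only illustrative):
radosgw-admin bucket limit check > /root/bucket_limit_check_$(date +%Y%m%d).json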

  3. Check the output of radosgw-admin bucket limit check and look for buckets whose "fill_status" is OVER:

{
    "bucket": "cboxbackproj-cboxbackproj-sftnight-lgdocs",
    "tenant": "",
    "num_objects": 767296,
    "num_shards": 0,
    "objects_per_shard": 767296,
    "fill_status": "OVER 749%"
},
  4. Check in the radosgw logs (use mco to look through all the RGWs) whether the radosgw process has recently tried to reshard the bucket but did not succeed. Example:
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:19:40.316 7fd2ce2a4700  1 check_bucket_shards bucket cboxbackproj-sftnight-lgdocs need resharding  old num shards 0 new num sh
ards 18
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:20:12.624 7fd2cd2a2700  0 NOTICE: resharding operation on bucket index detected, blocking
ceph-client.rgw.cephnethub-data-0509dffff2.log-20210514.gz:2021-05-13 12:20:12.625 7fd2cd2a2700  0 RGWReshardLock::lock failed to acquire lock on cboxbackproj-sftnight-lgdocs:fe32212d-631b-44fe-8d35-
03f5a3551af1.142705079.19 ret=-16

This only applies if dynamic resharding is enabled:

[14:27][root@cephnethub-data-0509dffff2 (qa:ceph/nethub/traefik*26) ~]# cat /etc/ceph/ceph.conf  | grep resharding
rgw dynamic resharding = true
  5. Reshard the bucket index manually:
radosgw-admin reshard add --bucket cboxbackproj-cboxbackproj-sftnight-lgdocs --num-shards 18
  • The number of shards can be inferred from the logs inspected at point 4. If dynamic resharding is disabled, a little math is required: check the bucket stats (radosgw-admin bucket stats --bucket <bucket_name>) and make sure usage --> rgw.main --> num_objects divided by the number of shards does not exceed 100000 (50000 is recommended).

Example:

[14:29][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# radosgw-admin bucket stats --bucket cboxbackproj-cboxbackproj-sftnight-lgdocs
{
    "bucket": "cboxbackproj-cboxbackproj-sftnight-lgdocs",
[...]
    "usage": {
        "rgw.main": {
            "size": 4985466767640,
            "size_actual": 4987395952640,
            "size_utilized": 4985466767640,
            "size_kb": 4868619891,
            "size_kb_actual": 4870503860,
            "size_kb_utilized": 4868619891,
            "num_objects": 941202
        }
    },
}

with 941202 / 18 = 52289
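
A quick sketch to compute the minimum shard count for a target of ~50000 objects per shard (num_objects taken from the bucket stats above; the 18 shards chosen in this case came from the RGW logs and give ~52k objects per shard, which is also acceptable):

NUM_OBJECTS=941202
TARGET=50000
echo $(( (NUM_OBJECTS + TARGET - 1) / TARGET ))   # ceiling division -> 19 shards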

5b. Once the bucket to be resharded has been added, start the reshard process:

radosgw-admin reshard list
radosgw-admin reshard process
  6. Check after some time that radosgw-admin bucket stats --bucket <bucket_name> reports the right number of shards and that radosgw-admin bucket limit check no longer shows OVER or WARNING for the re-sharded bucket.

  7. To clear the HEALTH_WARN message for the large omap object, start a deep scrub on the affected pg:

[14:31][root@cephnethub-data-98ab89f75a (production:ceph/nethub/mon*27:peon) ~]# ceph pg deep-scrub 7.de
instructing pg 7.de on osd.867 to deep-scrub

Ceph logging [WRN] evicting unresponsive client

This warning shows that a client stopped responding to messages from the MDS. Sometimes it is harmless (perhaps the client disconnected "uncleanly", e.g. a hard reboot); in other cases it indicates that the client is overloaded or deadlocked on something else.

If the same client is appearing repeatedly, it may be useful to get in touch with the owner of the client machine. (ai-dump <hostname> on aiadm).
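
To map the warning to a client, the MDS session list can be inspected and matched against the client id or hostname reported in the log. A minimal sketch (the MDS daemon name is illustrative; depending on the Ceph release, session ls is available via ceph tell or the admin socket):

ceph tell mds.cephflax-mds-xxxxxxxxxx session ls | less    # from a node with the admin keyring
ceph daemon mds.$(hostname -s) session ls                  # alternatively, on the MDS host itself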

Ceph logging [WRN] clients failing to respond to cache pressure

When the MDS cache is full, it needs to clear inodes from its cache. This normally means that the MDS also has to ask some clients to remove inodes from their caches.

If the client fails to respond to this cache recall request, then Ceph will log this warning.

Clients stuck in this state for an extended period of time can cause issues -- follow up with the machine owner to understand the problem.

Note: Ceph-fuse v13.2.1 has a bug which triggers this issue -- users should update to a newer client release.

Ceph logging [WRN] client session with invalid root denied

This means that a user is trying to mount a Manila share that either doesn't exist or for which they have not yet created a key. It is harmless, but if it repeats, get in touch with the user.

Procedure to unblock hung HPC writes

An HPC client was stuck like this for several hours:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02
failing to respond to capability release client_id: 69092525
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec

Indeed there was a hung write on hpc070.cern.ch:

# cat /sys/kernel/debug/ceph/*/osdc
245540  osd100  1.9443e2a5 1.2a5   [100,1,75]/100  [100,1,75]/100
e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.00000001
0x400024        1 write

I restarted osd.100 and the deadlocked request went away.
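
A sketch of the same recovery, generalised: the osdc dump names the blocked OSD (osd100 above), so locate its host and restart that daemon.

ceph osd find 100                                  # prints the host and CRUSH location of osd.100
ssh root@<osd_host> systemctl restart ceph-osd@100
ceph health detail                                 # the late-release / slow request warning should clear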

S3 Operations notes

Note: If you are looking for the old notes related to the infrastructure based on consul and nomad, please refer to the old documentation.


About the architecture

The CERN S3 service (s3.cern.ch) is provided by the gabe cluster and an arbitrary number of radosgw running on VMs. Each node in the ceph/gabe/radosgw hostgroup also runs a reverse-proxy daemon (Træfik), to spread the load on the VMs running a radosgw and to route traffic to different dedicated RGWs (cvmfs, gitlab, ...).

A second S3 cluster (s3-fr-prevessin-1.cern.ch) is also available in the Prevessin Network Hub (nethub).

Both clusters (as of July 2021) use similar technologies: Ceph, RGWs, Træfik, Logstash, ....

Components

  • RadosGW: Daemon handling S3 requests and interacting with the Ceph cluster
  • Træfik: Handles HTTP(S) requests from the Internet and spreads the load on radosgw daemons.
  • Logstash: Sidecar process that ships the access logs produced by Træfik to the MONIT infrastructure.

Useful documentation

  • Upstream RadosGW documentation: (https://docs.ceph.com/en/nautilus/radosgw/)
  • Upstream documentation on radosgw-admin tool: (https://docs.ceph.com/en/nautilus/man/8/radosgw-admin/)
  • Træfik documentation: (https://docs.traefik.io/)
  • S3 Script guide: (https://gitlab.cern.ch/ceph/ceph-guide/-/blob/master/src/ops/s3-scripts.md)

Dashboards

  • Træfik: http://s3.cern.ch/traefik/ (requires basic auth)
  • ElasticSearch for access logs: https://es-ceph.cern.ch/ (from CERN network only)
  • Various S3 dashboards (and underlying Ceph clusters) on Filer Carbon
  • Buckets rates (and others) on Monit Grafana

Maintenance Tasks

Removal of one Træfik/RGW machine from the cluster

Each machine running Træfik/RGW is:

  • Part of the s3.cern.ch alias (managed by lbclient), with Træfik accepting connections on port 80 and 443 for HTTP and HTTPS, respectively

  • A backend RadosGW for all the Træfiks of the cluster, with the Ceph RadosGW daemon accepting connections on port 8080

  • To remove a machine from s3.cern.ch, touch /etc/nologin or change the roger status to intervention/disabled (roger update --appstate=intervention <hostname>). This will make lbclient return a negative value and the machine will be removed from the alias.

  • To temporarily remove a RadosGW from the list of backends (e.g., for a cluster upgrade), touch /etc/nologin and the RadosGW process will return 503 for requests to /swift/healthcheck. This path is used by the Træfik healthcheck and, if the return code is different from 200, Træfik will stop sending requests to that backend. Wait a few minutes to let in-flight requests complete, then restart the RadosGW process without clients noticing. See the Pull Request implementing the healthcheck disabling path.

  • To permanently remove a RadosGW from the list of backends (e.g., decommissioning), change the Træfik dynamic configuration via puppet in traefik.yaml by removing the machine from the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm:

[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3

 * [ ============================================================> ] 14 / 14

Finished processing 14 / 14 hosts in 114.60 ms

Create a new Træfik/RGW VM

  • Spawn a new VM with the script cephgabe-rgwtraefik-create.sh from aiadm
  • Wait for the VM to be online and run puppet several times so that the configuration is up to date
  • Make sure you have received the email confirming the VM has been added to the firewall set (and so it is reachable from the big Internet)
  • Make sure the new VM serves requests as expected (test IPv4 and IPv6, HTTP and HTTPS):
curl -vs --resolve s3.cern.ch:{80,443}:<ip_address_of_new_VM> http(s)://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
  • Add the VM to the Prometheus s3_lb job (see prometheus puppet config) to monitor its availability and collect statistics on failed (HTTP 50*) requests
  • Change the roger status to production and enable all alarms. The machine will now be part of the s3.cern.ch alias
  • Update the Træfik dynamic configuration via puppet in traefik.yaml by adding the new backend to the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).

Change/Add/Remove the backend RadosGWs

  • Edit the list of backend nodes in the Træfik dynamic configuration via puppet in traefik.yaml by adding/removing/shuffling around the servers. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).
  • If adding/removing, make sure the list of monitored endpoints by Prometheus is up to date. See prometheus puppet config.

Change Træfik TLS certificate

The certificate is provided by CDA. You should ask them to buy a new one with the correct SANs. Once the new certificate is provided, paste it into https://tools.keycdn.com/certificate-chain -- it will return a certificate chain with all the required intermediate certificates. This certificate chain is the one to be put in Teigi and used by Træfik. Please split the chain and check the validity of each certificate with openssl x509 -in <filename> -noout -text. Typically, the root CA certificate, the intermediate certificate and the private key do not change.
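
A minimal sketch to split the downloaded chain and check each certificate's subject and dates (file names are illustrative):

awk '/-----BEGIN CERTIFICATE-----/{n++} {print > ("cert" n ".pem")}' chain.pem
for f in cert*.pem; do openssl x509 -in "$f" -noout -subject -issuer -enddate; done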

Once validated, the certificate chain should be put in Teigi under ceph/gabe/radosgw/traefik:

  • s3_ch_ssl_certificate
  • s3_ch_ssl_private_key

Next, the certificate must be deployed on all machines via puppet. Mcollective can be of help to bulk-run puppet on all the Træfik machines:

[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3

 * [ ============================================================> ] 14 / 14

Finished processing 14 / 14 hosts in 114.60 ms

Last, the certificate must be loaded by Træfik. While the certificate is part of Træfik's dynamic configuration, Træfik does not seem to reload it if the certificate file (distributed via puppet) changes on disk. Puppet will still notify the Træfik service when the certificate file changes (see traefik.pp) to no avail.

Since 2022, a configuration change in Træfik (Traefik: hot-reload certificates when touching (or editing) dynamic file) allows reloading the certificate when the Traefik dynamic configuration file changes. It is sufficient to touch /etc/traefik/traefik.dynamic.conf to have the certificate reloaded, with no need to drain the machine and restart the Traefik process:

  • Make sure the new certificate file is available on the machine (/etc/ssl/certs/radosgw.crt)
  • Tail the logs of the Traefik service: tail -f /var/log/traefik/service.log
  • Touch Traefik's dynamic configuration file: touch /etc/traefik/traefik.dynamic.conf
  • Check the new certificate is in place:
curl -vs --resolve s3.cern.ch:443:<the_ip_address_of_the_machine> https://s3.cern.ch --output /dev/null 2>&1 | grep ^* | grep date
*  start date: Mar  1 00:00:00 2022 GMT
*  expire date: Mar  1 23:59:59 2023 GMT

The same certificates are also used by the Nethub cluster and distributed via Teigi under ceph/nethub/traefik:

  • s3_fr_ssl_certificate
  • s3_fr_ssl_private_key

Quota alerts

There is a daily cronjob that checks S3 user quota usage and sends a list of accounts reaching 90% of their quota. Upon reception of this email, we should get in touch with the user and see if they can (1) free some space by deleting unnecessary data or (2) request more space.

Currently, there are some RGW accounts that come without an associated email address. A way to investigate who owns such an account is to log into aiadm.cern.ch and run the following command (in /root/ceph-scripts/tools/s3-accounting/):

./cern-get-accounting-unit.sh --id `./s3-user-to-accounting-unit.py <rgw account id>`

This will give you the username of the owner of the associated OpenStack tenant, together with their contact email address.

Further notes on s3.cern.ch alias

The s3.cern.ch alias is managed by aiermis and/or by the kermis CLI utility on aiadm:

[ebocchi@aiadm81 ~]$ kermis -a s3 -o read
INFO:kermis:[
    {
        "AllowedNodes": "",
        "ForbiddenNodes": "",
        "alias_name": "s3.cern.ch",
        "behaviour": "mindless",
        "best_hosts": 10,
        "clusters": "none",
        "cnames": [],
        "external": "yes",
        "hostgroup": "ceph/gabe/radosgw",
        "id": 3019,
        "last_modification": "2018-11-01T00:00:00",
        "metric": "cmsfrontier",
        "polling_interval": 300,
        "resource_uri": "/p/api/v1/alias/3019/",
        "statistics": "long",
        "tenant": "golang",
        "ttl": null,
        "user": "dvanders"
    }
]

As of July 2021, the alias returns the 10 best hosts (based on the lbclient score) out of all the machines that are part of the alias, which are typically more. Also, the members of the alias are refreshed every 5 minutes (300 seconds).

Upgrading software

Upgrade mon/mgr/osd

Follow the procedure defined for the other Ceph clusters. In a nutshell:

  • Start with mons, then mgrs. OSDs go last.
  • If upgrading OSDs, ceph osd set {noin, noout}
  • yum update to update the packages (check that the ceph package is actually upgraded)
  • systemctl restart ceph-{mon, mgr, osd}
  • Always make sure the daemons came back alive and all OSDs re-peered before continuing with the next machine
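
A minimal command sketch of the sequence above for an OSD host (flag and unit names as in stock Ceph packaging; adapt the target for mon/mgr hosts):

ceph osd set noout && ceph osd set noin       # as per the list above, only for OSD upgrades
yum update -y ceph                            # check the ceph package version actually changed
systemctl restart ceph-osd.target             # ceph-mon.target / ceph-mgr.target on mon/mgr hosts
ceph -s                                       # wait until all daemons rejoin and PGs are active+clean
ceph osd unset noin && ceph osd unset noout   # once the whole cluster is done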

Upgrading RGW

To safely upgrade the RadosGW, touch /etc/nologin to have it return 503 to the healthcheck probes from Træfik (see the healthcheck disabling path above). This drains the RadosGW: no new requests are sent to it and in-flight ones finish gently.

After a few minutes, one can assume there are no more in-flight requests and the RadosGW can be updated and restarted (restart the radosgw service, e.g. systemctl restart ceph-radosgw.target). Make sure the RadosGW came back alive by tailing the log at /var/log/ceph/ceph-client.rgw.*; it should still return 503 to the Træfik healthchecks. Now remove /etc/nologin and check that requests flow with 200.
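
Putting the above together, a sketch of the drain/upgrade cycle on a single RadosGW host (port 8080 and the healthcheck path are those described earlier; package and unit names follow standard Ceph packaging):

touch /etc/nologin
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/swift/healthcheck   # expect 503
sleep 300                                    # give in-flight requests time to complete
yum update -y ceph-radosgw
systemctl restart ceph-radosgw.target
tail -n 50 /var/log/ceph/ceph-client.rgw.*   # confirm the daemon came back alive
rm -f /etc/nologin                           # re-enable the backend; requests should now flow with 200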

Upgrading Træfik

To safely upgrade Træfik, the frontend machine must be removed from the load-balanced alias by touching /etc/nologin (this will also disable the RadosGW due to the healthcheck disabling path -- see above). Wait for some time and make sure no (or little) traffic is handled by Træfik by checking its access logs (/var/log/traefik/access.log). Some clients (e.g., GitLab, CBack) are particularly sticky and rarely re-resolve the alias to IPs -- there is nothing you can do to push those clients away.

When no (or little) traffic goes through Træfik, update the traefik::version parameter and run puppet. The new Træfik binary will be installed on the host and the service will be restarted.

Check with curl that Træfik works as expected. Example:

$ curl -vs --resolve s3.cern.ch:80:188.184.74.136 http://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
* Added s3.cern.ch:80:188.184.74.136 to DNS cache
* Hostname s3.cern.ch was found in DNS cache
*   Trying 188.184.74.136:80...
* TCP_NODELAY set
* Connected to s3.cern.ch (188.184.74.136) port 80 (#0)
> GET /cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished HTTP/1.1
> Host: s3.cern.ch
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Bucket: cvmfs-atlas
< Cache-Control: max-age=61
< Content-Length: 601
< Content-Type: application/x-cvmfs
< Date: Fri, 22 Apr 2022 14:45:27 GMT
< Etag: "b5dbc3633d7bb27d10610f5f1079a192"
< Last-Modified: Fri, 22 Apr 2022 14:11:10 GMT
< X-Amz-Request-Id: tx00000000000000143ffd3-006262bf87-28e3e206-default
< X-Rgw-Object-Type: Normal
< 
Ca5b48a4ed8f0ca46b79584104564da32b42a1c45
B1385472
Rd41d8cd98f00b204e9800998ecf8427e
D240
S103476
Gno
Ano
Natlas.cern.ch
{...cut...}
* Connection #0 to host s3.cern.ch left intact

If successful, allow the machine to join the load-balanced pool by removing /etc/nologin.

S3 radosgw-admin operations

radosgw-admin is used to manage users, quotas, buckets, indexes, and all other aspects of the radosgw service.

Create a user

End-users get S3 quota from OpenStack (see Object Storage).

In special cases (e.g., Atlas Event Index, CVMFS Stratum 0s, GitLab, Indico, ...), we create users that exist only in Ceph and are not managed by OpenStack. To create a new user of this kind, it is needed to know user_id, email address, display name, quota (optional).

Create the user with:

radosgw-admin user create --uid=<user_id> --email=<email_address> --display-name=<display_name>

To set a quota for the user:

radosgw-admin quota set --quota-scope=user --uid=<user_id> --max-size=<quota>
radosgw-admin quota enable --quota-scope=user --uid=<user_id>

Example:

radosgw-admin user create --uid=myuser --email="myuser@cern.ch" --display-name="myuser"
radosgw-admin quota set --quota-scope=user --uid=myuser --max-size=500G
radosgw-admin quota enable --quota-scope=user --uid=myuser

Change user quota

It is sufficient to set the updated quota value for the user:

radosgw-admin quota set --quota-scope=user --uid=<user_id> --max-size=<quota>
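
To verify that the new limit is in place (a sketch; <user_id> is a placeholder and the quota appears in the user_quota section of the user info output):

radosgw-admin user info --uid=<user_id> | grep -A 6 '"user_quota"'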

Bucket resharding

RGW shards bucket indices over several objects. The default number of shards per index is 32 in our clusters. It is best practice to keep the number of objects per shard below 100000. You can check the compliance across all buckets with radosgw-admin bucket limit check.

If there is a bucket with "fill_status": "OVER 100.000000%" then it should be resharded. E.g.

> radosgw-admin bucket reshard --bucket=lhcbdev-test --num-shards=128
tenant: 
bucket name: lhcbdev-test
old bucket instance id: 61c59385-085d-4caa-9070-63a3868dccb6.24333603.1
new bucket instance id: 61c59385-085d-4caa-9070-63a3868dccb6.76824996.1
total entries: 1000 2000 ... 8599000 8599603
2019-06-17 09:27:47.718979 7f2b7665adc0  1 execute INFO: reshard of bucket "lhcbdev-test" from "lhcbdev-test:61c59385-085d-4caa-9070-63a3868dccb6.24333603.1" to "lhcbdev-test:61c59385-085d-4caa-9070-63a3868dccb6.76824996.1" completed successfully

SWIFT protocol for quota information

It is convenient to use the SWIFT protocol to retrieve quota information.

  1. Create the SWIFT user as a subuser:
radosgw-admin subuser create --uid=<user_id> --subuser=<user_id>:swift --access=full

This generates a secret key that can be used on the client side to authenticate with SWIFT.

  2. On clients, install the swift package (provided in the OpenStack repo on linuxsoft) and retrieve quota information with:
swift \
    -V 1 \
    -A https://s3.cern.ch/auth/v1.0 \
    -U <user_id>:swift \
    -K <secret_key> \
    stat 

S3 logging

Access logs from the Træfik reverse-proxy are collected by a side-car process called fluentbit, which pushes them to the Monit Logs infrastructure. From there they are filtered and enriched by Logstash running on Monit Marathon, and eventually pushed to HDFS (/project/monitoring/archive/s3/logs) and to Elasticsearch for storage and visualization.

fluentbit on S3 RadosGWs

Since late April 2022, we use fluentbit on the RadosGW+Træfik frontends as it is much lighter on memory than Logstash (which we were using previously).

fluentbit tails the log files produced by Træfik (both the HTTP access logs and the Træfik daemon logs), adds a few fields and context through metadata, and pushes the records to the Monit Logs infrastructure at monit-logs-s3.cern.ch:10013/s3 using TLS encryption.

It is installed via puppet (example for Gabe) by using the shared class fluentbit.pp, responsible for the installation and configuration of the fluentbit service.

fluentbit on the RadosGWs+Træfik frontends is configured to tail two input files, namely the access (/var/log/traefik/access.log) and the daemon (/var/log/traefik/service.log) logs of Træfik. Logs from the access (daemon) file are tagged as traefik.access.* (traefik.service.*), labelled as s3_access (s3_daemon). Before sending to the Monit infrastructure, the message is prepared to define the payload data and metadata (see monit.lua):

  • producer is s3 (used to build path on HDFS) -- must be whitelisted on the Monit infra;
  • type defines if the logs are access or daemon (used to build path on HDFS);
  • index_prefix defines the index for the logs (it is used by Logstash on Monit Marathon and on Elasticsearch).

Logstash on Monit Marathon

Logstash is the tool that reads the aggregated log stream from Kafka, does most of the transformation and writes to Elasticsearch.

This Logstash process runs in a Docker container on the Monit Marathon cluster (see Applications --> storage --> s3logs-to-es). For debugging purposes, stdout and stderr of the container are available on monit-spark-master.cern.ch:5050/ -- they are not accessible from Marathon.

The Dockerfile, configuration pipeline, etc., are stored in s3logs-to-es.

This Logstash instance:

  • removes the additional fields introduced by the Monit infrastructure (metadata unused by us)
  • parses the original message as json document
  • adds costing information
  • adds geographical information of the client IP (geoIP)
  • copies a subset of fields relevant for CSIR to a different index
  • ...and pushes the results (full logs, and CSIR stripped version) to Elasticsearch

Elasticsearch

We finally have our dedicated Elasticsearch instance managed by the Elasticsearch Service.

There is not much to configure on our side, just a few useful links and the endpoint config repository.

Data is kept for:

  • 10 days on fast SSD storage, local to the ES cluster
  • another 20 days (30 total) on Ceph storage
  • 13 months (stripped-down version, some fields are filtered out -- see below) for CSIR purposes

Indexes on ES must start with ceph_s3. This is the only whitelisted pattern, and hence the only one allowed. We currently use different indexes:

  • ceph_s3_access: Access logs for Gabe (s3.cern.ch)
  • ceph_s3_daemon: Traefik service logs for Gabe
  • ceph_s3_access-csir: Stripped down version of Gabe access logs for CSIR, retained for 13 months
  • ceph_s3_fr_access: Access logs of Nethub (s3-fr-prevessin-1.cern.ch)
  • ceph_s3_fr_daemon: Traefik service logs for Nethub
  • ceph_s3_fr_access-csir: Stripped down version of Nethub access logs for CSIR, retained for 13 months

ES is also a data source for Monit grafana dashboards:

  • Grafana uses basic auth to ES with user ceph_ro:<password> (The password is stored in Teigi: ceph/gabe/es-ceph_ro-password)
  • ES must have the internal user ceph_ro configured with permissions to read ceph* indexes

HDFS

HDFS is solely used as a storage backend to keep the logs for 13 months for CSIR purposes. As of July 2021, HDFS stores the full logs (to be verified that they do not eat too much space on HDFS). To check/read logs on HDFS, you must have access to the HDFS cluster (see prerequisites); then, from lxplus:

source /cvmfs/sft.cern.ch/lcg/views/LCG_99/x86_64-centos7-gcc8-opt/setup.sh
source /cvmfs/sft.cern.ch/lcg/etc/hadoop-confext/hadoop-swan-setconf.sh analytix 3.2 spark3
kinit
hdfs dfs -ls /project/monitoring/archive/s3/logs

Centos Stream 8 migration

All the information regarding centos stream 8 can be found in this document.

Upgrading from Centos 8 in place

  1. Create new CS8 nodes with representative configurations and validate

  2. Enable the upgrade (top-level hostgroup, sub-hostgroup, etc)

    base::migrate::stream8: true
    
  3. Follow the instructions

    • Run Puppet twice.
    • Run distro-sync.
    • Reboot.
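
On each node, the instructions above boil down to roughly the following (a sketch; the exact repositories and parameters come from the migration document linked above):

puppet agent -t ; puppet agent -t   # two Puppet runs to pick up the Stream 8 configuration
dnf distro-sync -y                  # move the installed packages to their CentOS Stream 8 versions
reboot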

CephFS Backup via cback

CephFS backups are currently added on demand. Request a backup by opening a ticket to the Ceph Service.

Backup Characteristics

  • Stored in Nethub cluster (Prevessin, FR)
  • Snapshot based. Not point-in-time consistent (no CephFS snapshots, no fsfreeze or similar)
  • By default, we keep the last 7 daily snapshots, the last 5 weekly snapshots and the last 6 monthly snapshots.
  • Backup repositories are encrypted.

Add new backup job

  • Use the following procedure Link
  • Enabled clusters: flax, levinson, pam, doyle

Restore data

cback repos and documentation

See cback.docs.cern.ch/.

Improve me !