Creating a Ceph cluster

Follow the instructions below to create a new Ceph cluster at CERN.

Prerequisites

  • Access to aiadm.cern.ch
  • Proper Git configuration
  • Membership of the Ceph administration e-groups
  • OpenStack environment configured, link

Introduction - Hostgroups

First, we have to create the hostgroups in which we want to build our cluster.

The hostgroups provide a layer of abstraction for automatically configuring a
cluster with Puppet. The top-level group, called ceph, ensures that each
machine in this hostgroup has Ceph installed, configured and running. The first
sub-hostgroup ensures that each machine communicates with the machines in the
same sub-hostgroup, forming a cluster; these machines carry the specific
configuration defined later in this guide. The second sub-hostgroup ensures
that each machine acts according to its corresponding role in the cluster.

For example, we first create our cluster's hostgroup, using the name provided by your task:

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}

While each cluster has its own features, the two basic sub-hostgroups for a Ceph
cluster are mon and osd:

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mon
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/osd

These sub-hostgroups will contain the monitor and OSD hosts.

If the cluster has to provide CephFS and/or a RADOS Gateway, we need to create the
appropriate sub-hostgroups.

[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mds      #for CephFS
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/radosgw  #for the rados gateway

Creating a configuration for your new cluster

Go to gitlab.cern.ch and search for it-puppet-hostgroup-ceph. This repository
contains the configuration for all the machines under the ceph hostgroup. Clone
the repository, create a new branch based on qa, and go to it-puppet-hostgroup-ceph/code/manifests.
There, you will create the {hg_name}.pp file and the {hg_name} folder.
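
A minimal sketch of these steps on aiadm (the repository namespace and branch name are illustrative):

[user@aiadm]$ git clone https://gitlab.cern.ch/<namespace>/it-puppet-hostgroup-ceph.git
[user@aiadm]$ cd it-puppet-hostgroup-ceph
[user@aiadm]$ git checkout -b my-new-cluster origin/qa
[user@aiadm]$ cd code/manifests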

The {hg_name}.pp file should contain the following code (replace {hg_name} with the cluster's name):

class hg_ceph::{hg_name} {
  include hg_ceph::include::base
}

This will load the basic configuration for Ceph on each machine. The {hg_name} folder should contain the *.pp files for the appropriate second-level sub-hostgroups.

The files under your cluster's folder will have the following basic format:

File {role}.pp:

class hg_ceph::{hg_name}::{role} {
  include hg_ceph::classes::{role}
}

The include will use a configuration template located in it-puppet-hostgroup-ceph/code/manifests/classes

The roles are: mon, mgr, osd, mds and radosgw. It is good to run mon and mgr together, so we usually create the mon class like this:

class hg_ceph::{hg_name}::mon {
  include hg_ceph::classes::mon
  include hg_ceph::classes::mgr
}

The code above will configure machines in "ceph/{hg_name}/mon" to act as
monitors and mgrs together. After you are done creating the files needed
for your task, your "code/manifests" path should look like this:

# Using kermit as {hg_name}

kermit.pp
kermit/mon.pp
kermit/osd.pp
# Optional, only if requested by the JIRA ticket
kermit/mds.pp
kermit/radosgw.pp

Create a YAML configuration file for the new hostgroup in it-puppet-hostgroup-ceph/data/hostgroup/ceph with the name {hg_name}.yaml. This file contains all the basic configuration parameters that are common to all the nodes in the cluster.

ceph::conf::fsid: d3c77094-4d74-4acc-a2bb-1db1e42bb576

ceph::params::release: octopus

lbalias: ceph{hg_name}.cern.ch
hg_ceph::classes::mon::enable_lbalias: false

hg_ceph::classes::mon::enable_health_cron: true
hg_ceph::classes::mon::enable_sls_cron: true

Where:

  • ceph::conf::fsid can be generated with a uuid tool (see the example below);
  • lbalias is the alias the mons are part of.
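
A minimal sketch of generating the fsid, assuming the uuidgen utility is available:

[user@aiadm]$ uuidgen

Paste the resulting UUID into ceph::conf::fsid.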

Git add the files, commit, and push your branch. BEFORE you push, do a git pull --rebase origin qa to avoid any conflicts with your request. The command line will print a link to submit a merge request.
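
For example, a possible sequence from code/manifests, using kermit as {hg_name} and an illustrative branch name:

[user@aiadm]$ git add kermit.pp kermit/ ../../data/hostgroup/ceph/kermit.yaml
[user@aiadm]$ git commit -m "Add kermit cluster configuration"
[user@aiadm]$ git pull --rebase origin qa
[user@aiadm]$ git push origin my-new-cluster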

@dvanders is currently the administrator of the repo, so you should assign him the task of reviewing your request and merging it.

Creating your first monitor node

Follow the instructions to create exactly one monitor here. DO NOT ADD more than one machine to the ceph/{hg_name}/mon hostgroup, otherwise your first monitor will deadlock and you will need to remove the others and rebuild the first one.

With TBag authentication

Once we are able to log in to the node, we will need to create the keys to be
able to bootstrap new nodes into the cluster. We first have to create the
initial key, so mons can be created in our new cluster.

[root@ceph{hg_name}-mon-...]$ ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'

Log in to aiadm, copy the key from the monitor host and store it in tbag.

[user@aiadm]$ mkdir -p ~/private/tbag/{hg_name}
[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.mon .
[user@aiadm]$ tbag set --hg ceph/{hg_name}/mon keyring.mon --file keyring.mon

Log in to your mon host and run Puppet with puppet agent -t; repeat until you see a running ceph-mon process.
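
A possible way to repeat this until the mon daemon appears (a simple sketch, not required):

[root@ceph{hg_name}-mon-...]$ until pgrep -x ceph-mon; do puppet agent -t; done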

Run the following to disable some warnings and enable some features for Ceph:

[root@ceph{hg_name}-mon-...]$ ceph mon enable-msgr2
[root@ceph{hg_name}-mon-...]$ ceph osd set-require-min-compat-client luminous
[root@ceph{hg_name}-mon-...]$ ceph config set mon auth_allow_insecure_global_id_reclaim false

Note that enable-msgr2 will need to be run again after all mons have been created.

We will need to repeat this procedure for the mgr, osd, mds, rgw and rbd-mirror depending on what we need:

[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
# Optional, only if the cluster uses CephFS
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mds mon 'allow profile bootstrap-mds'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mds > /tmp/keyring.bootstrap-mds
# Optional, only if the cluster uses a Rados Gateway
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' -o /tmp/keyring.bootstrap-rgw
# Optional, only if the cluster uses a rbd-mirror
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-rbd-mirror -o /tmp/keyring.bootstrap-rbd-mirror

Log in to aiadm, copy the keys from the monitor host and store them with tbag.

Make sure you don't have any excess keys in the /tmp folder (5 max: mon/mgr/osd/mds/rgw).
We don't need to specify the role sub-hostgroup for each key, because that would cause confusion; "ceph/{hg_name}" is enough.

[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.* .
[user@aiadm]$ scp {mon_host}:/etc/ceph/keyring .
[user@aiadm]$ for file in *; do tbag set --hg ceph/{hg_name} $file --file $file; done
# Make sure to copy all the generated keys on `/mnt/projectspace/tbag` of `cephadm.cern.ch` as well:
[user@aiadm]$ scp -r . root@cephadm:/mnt/projectspace/tbag/{hg_name}

Now we create the other monitors with ai-bs, following the same procedure as for the first one. The other monitors will be configured automatically.

Creating manager hosts

The procedure is very similar to the one for the creation of mons:

  • Create new VMs;
  • Add them to the ceph/{hg_name}/mgr hostgroup;
  • Set the right roger state for the new VMs;

Instructions for the creation of mons still hold here, with the necessary changes for mgrs.

As stated above, in some cases it is necessary to colocate mons and mgrs. If so, there is no need to create new machines for mgrs; simply include the mgr class in the mon manifest:

class hg_ceph::{hg_name}::mon {

  include hg_ceph::classes::mon
  include hg_ceph::classes::mgr

}

Creating osd hosts

The OSD hosts will usually be given to you to prepare by formatting the disks
and adding them to the cluster. The tool used to format the disks is ceph-volume,
and the provisioning happens with LVM. Make sure your disks are empty: run pvs and
vgs to check whether they contain any LVM data.
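
A quick sketch of the check (the device name is an example):

[root@cephdataYY-...]$ pvs
[root@cephdataYY-...]$ vgs
[root@cephdataYY-...]$ lsblk /dev/sdc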

We can safely ignore the system disks in case they use LVM. On every
host, run ceph-volume lvm zap {disk} --destroy to zap the data disks and remove any
LVM data. If your hosts contain only one type of disk for OSDs (only HDDs or
only SSDs), we can run the following command to provision our OSDs:

# The device list supports shell globbing: to create OSDs from /dev/sdc to /dev/sdz we can try this
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sd[c-z]

You will be prompted to check the OSD creation plan; if you agree with the
proposed changes, input yes to create the OSDs. If you are trying to
automate this task, you can pass the --yes parameter to the ceph-volume lvm batch
command. If you have SSDs backing the HDDs to create hybrid OSDs (SSD
block.db and HDD block.data), you will have to run the above command once per SSD:

# 2 SSDs (sda, sdb) and 4 HDDs (sdc, sdd, sde, sdf)
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sda /dev/sdc /dev/sdd
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sdb /dev/sde /dev/sdf

The problem with the current lvm batch implementation is that it creates a single
volume group for the block.db part. Therefore, when an SSD fails, the whole set
of OSDs on the host becomes corrupted. To minimize the cost of such a failure, we run one batch per SSD.

Run ceph osd tree to check whether the OSDs are placed correctly in the tree.
If the OSDs are not placed as described by grep ^crush /etc/ceph/ceph.conf, you
will need to remove the line containing something like update crush on start
and restart the OSDs of that host. You can also create/move/delete buckets, for example:

  • ceph osd crush add-bucket CK13 rack
  • ceph osd crush move CK13 room=0513-R-0050
  • ceph osd crush move 0513-R-0050 root=default
  • ceph osd crush move cephflash21a-ff5578c275 rack=CK13

Now you are one step away from having a functional cluster.
The next step is to create a pool so we can use the storage of our cluster.

Creating the first pool

A pool in Ceph is the root namespace of an object store system. A pool has its
own data redundancy schema and access permissions. If CephFS is used, two
pools are created, one for data and one for metadata; to support
OpenStack, various pools are created for storing images, volumes and shares.
To create a pool, we first have to understand which type of data redundancy we
should use: replicated or EC (erasure coded). If the task already defines what
should happen, you can go straight to the Ceph documentation.

BEFORE you create a pool, you first need to create a CRUSH rule that matches
your cluster's schema. You can get the schema by running ceph osd tree | less.

As an example, the meredith cluster runs with 4+2 EC and the failure domain is rack. Create the required erasure-code-profile with:

[root@cephmeredithmon...]$ ceph osd erasure-code-profile ls
default

[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8

[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2 k=4 m=2 crush-failure-domain=rack --force
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=rack
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

NEVER modify an existing profile. That would change the data placement on disk!
Here we use the --force flag only because the new jera_4plus2 is not used yet.

Now create a CRUSH rule with the defined profile:

[root@cephmeredithmon...]$ ceph osd crush rule create-erasure rack_ec jera_4plus2
created rule rack_ec at 1

[root@cephmeredithmon...]$ ceph osd crush rule ls
replicated_rule
rack_ec

[root@cephmeredithmon...]$ ceph osd crush rule dump rack_ec
{
    "rule_id": 1,
    "rule_name": "rack_ec",
    "ruleset": 1,
    "type": 3,
    "min_size": 3,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "rack"
        },
        {
            "op": "emit"
        }
    ]
}

The last thing left is to calculate the number of PGs to keep the cluster running optimally. The Ceph developers recommend 30 to 100 PGs per OSD; keep in mind that the data
redundancy schema counts as a multiplier. For example, if you have 100 OSDs you
will need 3,000 to 10,000 PG replicas in total. The number of PGs must be a power of
two, so we will use at least 1024 (x3) to 2048 (x3) PGs in the pool creation
command. Keep in mind that there may be a need for additional pools, such as
"test", which is created on every cluster for the profound reason of testing.

In general the formula is the following:

MaxPGs = \begin{cases}
\mathit{NumOSDs} \times 100 / \mathit{ReplicationSize} & \text{if replicated} \\
\mathit{NumOSDs} \times 100 / (k+m) & \text{if erasure coded}
\end{cases}

Then we use the closest power of two below that number.
Example on meredith (368 OSDs, EC with k=4, m=2): MaxPGs = 6133 --> MaxPGs = 4096.
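
A quick way to reproduce this calculation from the shell (numbers are the meredith example above):

[user@aiadm]$ python3 -c 'import math; max_pgs = 368*100//(4+2); print(max_pgs, 2**int(math.log2(max_pgs)))'
6133 4096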

Now, let's create the pools following the upstream documentation Create a pool.

We should have at least one test pool and one data pool:

  • Create the test pool. It should always be replicated and not EC:

    [root@cephmeredithmon...]$ ceph osd pool create test 512 512 replicated replicated_rule
    pool 'test' created
    
    [root@cephmeredithmon...]$ ceph osd pool ls detail
    pool 6 'test' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1710 flags hashpspool stripe_width 0 application test
    
  • Create the data pool (named 'rbd_ec_data' here) with EC:

    [root@cephmeredithmon...]$ ceph osd pool create rbd_ec_data 4096 4096 erasure jera_4plus2 rbd_ec_data
    pool 'rbd_ec_data' created
    [root@cephmeredithmon...]$ ceph osd pool ls detail | grep rbd_ec_data
    pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 1554 flags hashpspool stripe_width 16384
    

Finalize cluster configuration

Security Flags on Pools

  1. Make sure the security flags {nodelete, nopgchange, nosizechange} are set for all the pools:
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1711 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd
...

If not, set the flags with

[root@cluster_mon]$ ceph osd pool set <pool_name> {nodelete, nopgchange, nosizechange} 1
  2. pg_autoscale_mode should be set to off:
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1985 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd

If the output shows anything for autoscale_mode, disable autoscaling with

[root@cluster_mon]$ ceph osd pool set <pool_name> pg_autoscale_mode off
  3. Set the application type for each pool in the cluster:
[root@cluster_mon]$ ceph osd pool application enable my_test_pool test
[root@cluster_mon]$ ceph osd pool application enable my_rbd_pool rbd
  4. If relevant, enable the balancer:
[root@cluster_mon]$ ceph balancer on
[root@cluster_mon]$ ceph balancer mode upmap
[root@cluster_mon]$ ceph config set mgr mgr/balancer/upmap_max_deviation 1

The parameter upmap_max_deviation is used to spread the PGs more evenly across the OSDs.
Check with

[root@cluster_mon]$ ceph balancer status
{
    "plans": [],
    "active": true,
    "last_optimize_started": "Tue Jan 12 16:47:48 2021",
    "last_optimize_duration": "0:00:00.296960",
    "optimize_result": "Optimization plan created successfully",
    "mode": "upmap"
}

[root@cluster_mon]$ ceph config dump
WHO   MASK LEVEL    OPTION                           VALUE RO 
  mgr      advanced mgr/balancer/active              true     
  mgr      advanced mgr/balancer/mode                upmap    
  mgr      advanced mgr/balancer/upmap_max_deviation 1        

Also, after quite some time spent balancing, the number of PGs per OSD should be even.
Focus on the PGS column of the output of ceph osd df tree:

[root@cluster_mon]$ ceph osd df tree

ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE VAR  PGS STATUS TYPE NAME                                
 -1       642.74780        - 643 TiB 414 GiB  46 GiB 505 KiB  368 GiB 642 TiB 0.06 1.00   -        root default                             
 -5       642.74780        - 643 TiB 414 GiB  46 GiB 505 KiB  368 GiB 642 TiB 0.06 1.00   -            room 0513-R-0050                     
 -4        27.94556        -  28 TiB  18 GiB 2.0 GiB     0 B   16 GiB  28 TiB 0.06 1.00   -                rack CK01                        
 -3        27.94556        -  28 TiB  18 GiB 2.0 GiB     0 B   16 GiB  28 TiB 0.06 1.00   -                    host cephflash21a-04f5dd1763 
  0   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  75     up                 osd.0                    
  1   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  69     up                 osd.1                    
  2   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  72     up                 osd.2                    
  3   ssd   1.74660  1.00000 1.7 TiB 1.1 GiB 127 MiB     0 B    1 GiB 1.7 TiB 0.06 1.00  70     up                 osd.3       

Monitoring

Cluster monitoring is offered by:

  • Health crons enabled at the hostgroup level (see the YAML file above):
    • enable_health_cron enables sending the email report that checks the current health status and greps the recent ceph.log
    • enable_sls_cron enables sending metrics to filer-carbon that populate the Ceph Health dashboard
  • Regular polling performed by cephadm.cern.ch
  • Prometheus
  • Watcher clients (CephFS) that mount and test FS availability

To enable polling from cephadm, proceed as follows:

  1. Add the new cluster to it-puppet-hostgroup-ceph/code/manifests/admin.pp. Consider "Admin newclusters" as a reference merge request. (Note: if you are adding a CephFS cluster, you do not need to add it to the ### BASIC CEPH CLIENTS array.)
  2. Create a client.admin key on the cluster
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.admin mon 'allow *' mgr 'allow *' osd 'allow *' mds 'allow *'
[client.admin]
        key = <the_super_secret_key>
  3. Add the key to tbag in the ceph/admin hostgroup (the secret must contain the full output of the command above):
tbag set --hg ceph/admin <cluster_name>.keyring --file <keyring_filename>
tbag set --hg ceph/admin <cluster_name>.admin.secret
Enter Secret: <paste secret here>
  4. Add the new cluster to it-puppet-module-ceph/data/ceph.yaml, otherwise the clients (cephadm included) will lack the mon hostnames. (Consider "Add ryan cluster" as a reference merge request.) Double-check you are using the appropriate port.
  5. ssh to cephadm and run Puppet a couple of times.
  6. Make sure the files <cluster_name>.client.admin.keyring and <cluster_name>.conf exist and show the appropriate content.
  7. Check the health of the cluster with:
[root@cephadm]# ceph --cluster=<cluster_name> health
HEALTH_OK
  8. Cephadm is also responsible for producing the availability numbers sent to the central IT Service Availability Overview. If the cluster needs to be reported in IT SAO, add it to ceph-availability-producer.py with a relevant description.

To enable monitoring from Prometheus, add the new cluster to prometheus.yaml. Also, the Prometheus module must be enabled on the MGR (Documentation: https://docs.ceph.com/en/octopus/mgr/prometheus/) for metrics to be retrieved:

ceph mgr module enable prometheus

To ensure a CephFS cluster is represented adequately, there are some additional steps we must take:

  1. Update the it-puppet-module-cephfs README.md and code/data/common.yaml to include the new cluster. (Consider "add doyle cluster" as a reference merge request.)
  2. Update the it-puppet-hostgroup-ceph watchers definition in code/manifests/test/cephfs/watchers.pp to ensure the new cluster is mounted by the watchers. (Consider "watchers.pp: add doyle definition" as an example merge request.)
  3. SSH to one of the watcher nodes (e.g. cephfs-testc9-d81171f572.cern.ch) and run puppet a few times to synchronise the changes.
  4. Check cat /proc/mounts | grep ceph for an appropriate systemd mount and navigate into one of the mounted directories to verify the FS is available, as sketched below.
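
A minimal sketch of this check on a watcher node (the mount point shown by /proc/mounts will vary per cluster):

[root@cephfs-testc9-...]$ grep ceph /proc/mounts
[root@cephfs-testc9-...]$ ls <mount_point_from_previous_command>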

Details on lbalias for mons

We prefer not to use the load-balancing service and lbclient here (https://configdocs.web.cern.ch/dnslb/). There is no scenario in Ceph where we want a mon to disappear from the alias.

We rather use the --LOAD-N- approach to create the alias with all the mons:

  • Go to network.cern.ch
  • Click on Update Information and use the FQDN of the mon machine
    • If prompted, make sure you pick the host interface and not the IPMI one
  • Add "ceph{hg_name}--LOAD-N-" to the list IP Aliases under TCP/IP Interface Information
  • Multiple aliases are supported. Use a comma-separated list
  • Check the changes are correct and submit the request

Benchmarking

Note: What follows is not proper benchmarking, but some quick hints that the cluster works as expected.

Good reading at Benchmarking performance

Rados bench

Start a test on pool 'my_test_pool' with a duration of 10 s and a block size of 4096 B:

[root@cluster_mon]$ rados bench -p my_test_pool 10 write -b 4096

hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cephflash21a-a6564a2ee7.cern._1768589
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      8752      8736   34.1231    34.125  0.00130825  0.00182201
    2      16     16913     16897   32.9995   31.8789  0.00104112  0.00189076
    3      15     24678     24663   32.1108   30.3359  0.00139087  0.00194522
    4      16     32189     32173   31.4167   29.3359   0.0209055   0.0019863
    5      16     39595     39579   30.9187   28.9297   0.0209981  0.00201906
    6      16     47263     47247   30.7573   29.9531  0.00138272  0.00203065
    7      16     55169     55153   30.7748   30.8828  0.00121337  0.00202973
    8      16     63070     63054   30.7855   30.8633  0.00133439  0.00202877
    9      15     70408     70393     30.55    28.668  0.00144124  0.00204461
   10      11     78679     78668   30.7271   32.3242  0.00162555  0.00203309
Total time run:         10.0178
Total writes made:      78679
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     30.6793
Stddev Bandwidth:       1.68734
Max bandwidth (MB/sec): 34.125
Min bandwidth (MB/sec): 28.668
Average IOPS:           7853
Stddev IOPS:            431.959
Max IOPS:               8736
Min IOPS:               7339
Average Latency(s):     0.00203504
Stddev Latency(s):      0.00370041
Max latency(s):         0.0702117
Min latency(s):         0.000887922
Cleaning up (deleting benchmark objects)
Removed 78679 objects
Clean up completed and total clean up time :4.93871

RBD bench

Create an RBD image and run some tests on it:

[root@cluster_mon]$ rbd create rbd_ec_meta/enricotest --size 100G --data-pool rbd_ec_data
[root@cluster_mon]$ rbd bench --io-type write rbd_ec_meta/enricotest --io-size 4M --io-total 100G

Once done, delete the image with

[root@cluster_mon]$ rbd ls -p rbd_ec_meta
[root@cluster_mon]$ rbd rm rbd_ec_meta/enricotest

RBD clusters

Create Cinder key for use with OpenStack

All of the above steps result in a fully functional RADOS Block Device cluster. The only missing step is to create access keys for OpenStack Cinder so that it can use the provided storage.

The upstream documentation on user management (and OpenStack is a user) is available at User Management

To create the relevant access key for OpenStack use the following command:

$ ceph auth get-or-create client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes' mgr 'profile rbd pool=volumes'

which creates a user named "cinder" allowed to run rbd commands on the pool named "volumes".

Create an Images pool for use with OpenStack Glance

To store Glance images on ceph, a dedicated pool (pg_num may vary) and cephx keys are needed:

$ ceph osd pool create images 128 128 replicated replicated_rule
$ ceph auth get-or-create client.images mon 'profile rbd' mgr 'profile rbd pool=images' osd 'profile rbd pool=images'

CephFS Clusters

Enabling CephFS consists of creating data and metadata pools for CephFS and a new filesystem. It is also necessary to create metadata servers (either dedicated or colocated with other daemons), otherwise the cluster will show HEALTH_ERR and report 1 filesystem offline. See below for the creation of metadata servers.

Follow the upstream documentation at Create a Ceph File System
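
A minimal sketch of the upstream procedure (pool names, pg_num values and the replicated schema are illustrative; adapt to your cluster):

[root@cluster_mon]$ ceph osd pool create cephfs_data 1024 1024 replicated replicated_rule
[root@cluster_mon]$ ceph osd pool create cephfs_metadata 128 128 replicated replicated_rule
[root@cluster_mon]$ ceph fs new cephfs cephfs_metadata cephfs_data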

Creating metadata servers

Add at least two hosts to ceph/{hg_name}/mds. MDS daemons can be dedicated (preferable for large, busy clusters) or colocated with other daemons (e.g., on the OSD hosts, assuming enough memory is available).

As soon as one MDS goes active, the cluster health will go back to HEALTH_OK. It is recommended to have at least 2 nodes running MDSes for failover. One can also consider a standby-replay MDS to lower the time needed for a failover.
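
A sketch of enabling standby-replay on an existing filesystem (the filesystem name is an example):

[root@cluster_mon]$ ceph fs set cephfs allow_standby_replay true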

Create Manila key for use with OpenStack

To provision CephFS file shares via OpenStack Manila, a dedicated cephx key must be provided to the OpenStack team. Create the key with:

$ ceph auth get-or-create client.manila mon 'allow r' mgr 'allow rw'

S3 Clusters

Creating rgw hosts

To provide object storage, you need to run Ceph Object Gateway daemons (radosgw).

RGWs can run on dedicated machines (by creating new hosts in the hostgroup ceph/{hg_name}/rgw) or be colocated with existing machines. In both cases, these classes need to be enabled:

Also, you may want to enable:

  • The S3 crons for specific quota and health checks (see include/s3{hourly,daily,weekly}.pp)
  • Traefik log ingestion into the MONIT pipelines for Elasticsearch dashboards (see s3-logging).

Always start with one RGW only and iterate over the configuration until it runs.

Some of the required pools (default.rgw.control, default.rgw.meta, default.rgw.log, .rgw.root) are automatically created by the RGW on its first run. The creation of some other pools is triggered by specific actions: e.g., creating a bucket will create the pool default.rgw.buckets.index, and pushing the first object will trigger the creation of default.rgw.buckets.data.

It is highly recommended to pre-create all pools so that they have the right crush rule, pg_num, etc. before data is written to them. If they get auto-created, they will use the default crush type (replicated), while we typically use erasure coding for object storage. Use an existing cluster as a reference to configure the pools.
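
A minimal sketch of pre-creating the bucket pools (pool names follow the RGW defaults; the EC profile and pg_num values are illustrative):

[root@cluster_mon]$ ceph osd pool create default.rgw.buckets.data 1024 1024 erasure jera_4plus2
[root@cluster_mon]$ ceph osd pool application enable default.rgw.buckets.data rgw
[root@cluster_mon]$ ceph osd pool create default.rgw.buckets.index 64 64 replicated replicated_rule
[root@cluster_mon]$ ceph osd pool application enable default.rgw.buckets.index rgw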

Creating a DNS load-balanced alias

The round-robin based DNS load balancing service is described at DNS Load Balancing.

To create a new load-balanced alias for S3:

  1. Go to https://aiermis.cern.ch/
  2. Add the LB alias, specifying whether it needs to be external and the number of hosts to return (Best Hosts)
  3. Configure hg_ceph::classes::lb::lbalias and the relevant RadosGW configuration params accordingly (rgw dns name, rgw dns s3website name, rgw swift url, ...)
  4. To support virtual-host-style bucket addressing (e.g., mybucket.s3.cern.ch), talk to the Network Team to have wildcard DNS enabled on the alias

Integration with OpenStack Keystone


RBD Mirroring

Make sure you have included hg_ceph::classes::rbd_mirror and set up the bootstrap-rbd-mirror keyring.

Adding peers to rbd-mirror

You first have to add an rbd-mirror-peer keyring in the ceph hostgroup.

First, go to your mon and run the following command:

[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.rbd-mirror-peer mon 'profile rbd-mirror-peer' osd 'profile rbd' -o {hg_name}.client.rbd-mirror-peer.keyring

Copy the keyring to aiadm and create the secret:

[user@aiadm]$ tbag set --hg ceph {hg_name}.client.rbd-mirror-peer.keyring --file {hg_name}.client.rbd-mirror-peer.keyring

Now your cluster can participate with the others already registered to mirror your RBD images! You can now add the following data to register peers for your rbd-mirror daemons:

ceph::rbd_mirror:
  - peer1
  - peer2
  - ...

Peering pools

You first have to enable mirroring on some of your pools: https://docs.ceph.com/en/octopus/rbd/rbd-mirroring/#enable-mirroring. Also check the configuration of the mirroring modes on the same page (journaling feature enabled on the RBD images, image snapshot settings, ...).
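
A minimal sketch of enabling pool-level mirroring in image mode (the pool name is an example):

[root@ceph{hg_name}-mon-...]$ rbd mirror pool enable {pool} image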

And then you can add peers like this:

[root@ceph{hg_name}-rbd-mirror-...]$ rbd mirror pool peer add {pool} client.rbd-mirror-peer@{remote_peer}