Creating a CEPH cluster
Table of Contents
- Introduction
- Puppet configuration
- Creating monitor hosts
- Creating manager hosts
- Creating osd hosts
- Creating the first pool
- Finalize cluster configuration
- RBD Clusters
- CephFS Clusters
- S3 Clusters
Follow the instructions below to create a new Ceph cluster at CERN.
Prerequisites
- Access to aiadm.cern.ch
- Proper GIT configuration
- Member of ceph administration e-groups
- OpenStack environment configured, link
Introduction - Hostgroups
First, we have to create the hostgroups in which we want to build our cluster.
The hostgroups provide a layer of abstraction for configuring a
cluster automatically using Puppet. The top-level group, called ceph, ensures that each
machine in this hostgroup has Ceph installed, configured and running. The first
sub-hostgroup ensures that each machine communicates with the machines in the
same sub-hostgroup, forming a cluster. These machines will
have specific configuration defined later in this guide. The second sub-hostgroup
ensures that each machine acts according to its role in the cluster.
First, create the cluster's hostgroup, using the name provided by your task:
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}
As each cluster has its own features, the two basic sub-hostgroups for a ceph
cluster are mon and osd.
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mon
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/osd
These sub-hostgroups will contain the monitors and the osd hosts.
If the cluster has to use CephFS and/or Rados gateway we need to create the
appropriate sub-hostgroups.
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/mds #for CephFS
[user@aiadm]$ ai-foreman addhostgroup ceph/{hg_name}/radosgw #for the rados gateway
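The hostgroup commands above can be scripted. A dry-run sketch (the hg_name value and the role list are assumptions to adapt to your task) that prints the commands before running them for real:

```shell
# Dry-run: print the ai-foreman commands for the base hostgroup and the
# usual sub-hostgroups. Drop the `echo` to actually run them on aiadm.
hg_name=kermit
cmds=""
for sub in "" mon osd mds radosgw; do
  cmds="${cmds}ai-foreman addhostgroup ceph/${hg_name}${sub:+/$sub}
"
done
printf '%s' "$cmds"
```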
Creating a configuration for your new cluster
Go to gitlab.cern.ch and search for it-puppet-hostgroup-ceph. This repository
contains the configuration for all the machines under the ceph hostgroup. Clone
the repository, create a new branch based on qa, and go to it-puppet-hostgroup-ceph/code/manifests.
From there, you will create the {hg_name}.pp file and the {hg_name} folder.
The {hg_name}.pp file should contain the following code (replace {hg_name}
with the cluster's name):
class hg_ceph::{hg_name} {
include hg_ceph::include::base
}
This will load the basic configuration for ceph on each machine. The
{hg_name} folder should contain the *.pp files for the appropriate 2nd
sub-hostgroups.
The files under your cluster's folder will have the following basic format:
File {role}.pp:
class hg_ceph::{hg_name}::{role} {
include hg_ceph::classes::{role}
}
The include will use a configuration template located in
it-puppet-hostgroup-ceph/code/manifests/classes
The roles are: mon, mgr, osd, mds and radosgw. It is good to run both mon and mgr together, so we usually create the following class e.g.:
class hg_ceph::{hg_name}::mon {
include hg_ceph::classes::mon
include hg_ceph::classes::mgr
}
The above code will configure machines in "ceph/{hg_name}/mon" to act as
monitors and mgrs together. After you are done creating the needed files
for your task, your "code/manifests" path should look like this:
# Using kermit as {hg_name}
kermit.pp
kermit/mon.pp
kermit/osd.pp
# Optional, only if requested by the JIRA ticket
kermit/mds.pp
kermit/radosgw.pp
Create a YAML configuration file for the new hostgroup in
it-puppet-hostgroup-ceph/data/hostgroup/ceph
with name {hg_name}.yaml. This
file contains all the basic configuration parameters that are common to all
the nodes in the cluster.
ceph::conf::fsid: d3c77094-4d74-4acc-a2bb-1db1e42bb576
ceph::params::release: octopus
lbalias: ceph{hg_name}.cern.ch
hg_ceph::classes::mon::enable_lbalias: false
hg_ceph::classes::mon::enable_health_cron: true
hg_ceph::classes::mon::enable_sls_cron: true
Where:
- ceph::conf::fsid can be generated with the uuid tool;
- lbalias is the alias the mons are part of.
Git add the new files, commit and push your branch. BEFORE you push,
do a git pull --rebase origin qa to avoid any conflicts with your request.
The command line will provide a link to submit a merge request.
@dvanders is currently the administrator of the repo, so you should assign him the task to check your request and eventually merge it.
Creating your first monitor node
Follow the instructions to create exactly one monitor here.
DO NOT ADD more than one machine to the ceph/{hg_name}/mon hostgroup,
otherwise your first monitor will always deadlock and you will need to remove
the others and rebuild the first one again.
With TBag authentication
Once we are able to log in to the node, we need to create the keys to be
able to bootstrap new nodes into the cluster. We first have to create the
initial key, so mons can be created in our new cluster.
[root@ceph{hg_name}-mon-...]$ ceph-authtool --create-keyring /tmp/keyring.mon --gen-key -n mon. --cap mon 'allow *'
Login to aiadm, copy the key from the monitor host and store it on tbag.
[user@aiadm]$ mkdir -p ~/private/tbag/{hg_name}
[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.mon .
[user@aiadm]$ tbag set --hg ceph/{hg_name}/mon keyring.mon --file keyring.mon
Login to your mon host and run puppet (puppet agent -t); repeat until you see a running ceph-mon process.
Run the following to disable some warnings and enable some features for ceph:
[root@ceph{hg_name}-mon-...]$ ceph mon enable-msgr2
[root@ceph{hg_name}-mon-...]$ ceph osd set-require-min-compat-client luminous
[root@ceph{hg_name}-mon-...]$ ceph config set mon auth_allow_insecure_global_id_reclaim false
Note that enable-msgr2
will need to be run again after all mons have been created.
We will need to repeat this procedure for the mgr, osd, mds, rgw and rbd-mirror depending on what we need:
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mgr mon 'allow profile bootstrap-mgr'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mgr > /tmp/keyring.bootstrap-mgr
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-osd mon 'allow profile bootstrap-osd'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-osd > /tmp/keyring.bootstrap-osd
# Optional, only if the cluster uses CephFS
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create-key client.bootstrap-mds mon 'allow profile bootstrap-mds'
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-mds > /tmp/keyring.bootstrap-mds
# Optional, only if the cluster uses a Rados Gateway
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' -o /tmp/keyring.bootstrap-rgw
# Optional, only if the cluster uses a rbd-mirror
[root@ceph{hg_name}-mon-...]$ ceph auth get client.bootstrap-rbd-mirror -o /tmp/keyring.bootstrap-rbd-mirror
Login to aiadm, copy the keys from the monitor host and store them with tbag.
Make sure you don't have any excess keys in the /tmp folder (5 max: mon/mgr/osd/mds/rgw).
We don't provide the specific subgroup for each key, as that would cause confusion;
"ceph/{hg_name}" is enough.
[user@aiadm]$ cd ~/private/tbag/{hg_name}
[user@aiadm]$ scp {mon_host}:/tmp/keyring.* .
[user@aiadm]$ scp {mon_host}:/etc/ceph/keyring .
[user@aiadm]$ for file in *; do tbag set --hg ceph/{hg_name} $file --file $file; done
# Make sure to copy all the generated keys on `/mnt/projectspace/tbag` of `cephadm.cern.ch` as well:
[user@aiadm]$ scp -r . root@cephadm:/mnt/projectspace/tbag/{hg_name}
Now we create the other monitors with ai-bs, using the same procedure as for the first one.
The other monitors will be configured automatically.
Creating manager hosts
The procedure is very similar to the one for the creation of mons:
- Create new VMs;
- Add them to the ceph/{hg_name}/mgr hostgroup;
- Set the right roger state for the new VMs;
Instructions for the creation of mons still hold here, with the necessary changes for mgrs.
As stated above, in some cases it is necessary to colocate mons and mgrs. If so, there is no need to create new machines for mgrs; simply include the mgr class in the mon manifest:
class hg_ceph::{hg_name}::mon {
include hg_ceph::classes::mon
include hg_ceph::classes::mgr
}
Creating osd hosts
The OSD hosts will usually be handed over to you to be prepared by formatting the disks
and adding them to the cluster. The tool used to format the disks is ceph-volume.
Provisioning happens with LVM. Make sure your disks are empty: run pvs and
vgs to check whether they contain any LVM data.
We can safely ignore the system disks in case they are used with LVM. On every
host, run ceph-volume lvm zap {disk} --destroy to zap the disks and remove any
LVM data. If your hosts contain only one type of disk for OSDs (HDD or SSD),
run the following command to provision the OSDs:
# It works like the ls command, if we need to create OSDS from /dev/sdc to /dev/sdz we can try this
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sd[c-z]
You will be prompted to review the OSD creation plan; if you agree with the
proposed changes, input yes to create the OSDs. If you are trying to
automate this task you can pass the --yes parameter to the ceph-volume lvm batch
command. If you have SSDs backing the HDDs to create hybrid OSDs (with
SSD block.DB and HDD block.data), you will have to run the above command once per SSD:
# 2 SSDs sda sdb 4HDDs sdc sdd sde sdf
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sda /dev/sdc /dev/sdd
[root@cephdataYY-...]$ ceph-volume lvm batch /dev/sdb /dev/sde /dev/sdf
The problem with the current lvm batch implementation is that it creates a single
volume group for the block.DB part. Therefore, when an SSD fails, the whole set
of OSDs in the host becomes corrupted. To minimize the cost of an SSD failure, we run batch per SSD.
Run ceph osd tree to check whether the OSDs are placed correctly in the tree.
If the OSDs are not placed as described, check with grep ^crush /etc/ceph/ceph.conf:
you may need to remove the line containing something like update crush on start
and restart the OSDs of that host.
You can also create/move/delete buckets with (examples):
ceph osd crush add-bucket CK13 rack
ceph osd crush move CK13 room=0513-R-0050
ceph osd crush move 0513-R-0050 root=default
ceph osd crush move cephflash21a-ff5578c275 rack=CK13
Now you are one step away from having a functional cluster.
The next step is to create a pool so we can use the storage of our cluster.
Creating the first pool
A pool in ceph is the root namespace of an object store system. A pool has its
own data redundancy schema and access permissions. If CephFS is used, two
pools are created, one for data and one for metadata; to support
OpenStack, various pools are created for storing images, volumes and shares.
To create a pool we first have to understand which type of data redundancy we
should use: replicated or EC. If the task already defines what should happen,
then you can go to the ceph documentation.
BEFORE you create a pool, you first need to create a CRUSH rule that matches
your cluster's schema. You can get the schema by running ceph osd tree | less.
As an example, the meredith cluster runs with 4+2 EC and the failure domain is rack. Create the required erasure-code-profile with:
[root@cephmeredithmon...]$ ceph osd erasure-code-profile ls
default
[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8
[root@cephmeredithmon...]$ ceph osd erasure-code-profile set jera_4plus2 k=4 m=2 crush-failure-domain=rack --force
[root@cephmeredithmon...]$ ceph osd erasure-code-profile get jera_4plus2
crush-device-class=
crush-failure-domain=rack
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
NEVER modify an existing profile. That would change the data placement on disk!
Here we use the --force
flag only because the new jera_4plus2
is not used yet.
Now create a CRUSH rule with the defined profile:
[root@cephmeredithmon...]$ ceph osd crush rule create-erasure rack_ec jera_4plus2
created rule rack_ec at 1
[root@cephmeredithmon...]$ ceph osd crush rule ls
replicated_rule
rack_ec
[root@cephmeredithmon...]$ ceph osd crush rule dump rack_ec
{
"rule_id": 1,
"rule_name": "rack_ec",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 6,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "rack"
},
{
"op": "emit"
}
]
}
The last thing left is to calculate the number of PGs to keep the cluster running optimally.
The Ceph developers recommend 30 to 100 PGs per OSD; keep in mind that the data
redundancy schema counts as a multiplier. For example, if you have 100 OSDs you
will need at least 3K to 10K PGs. The number of PGs must be a power of
two. So, we will use at least 1024(x3) to 2048(x3) PGs in the pool creation
command. Keep in mind that there may be a need for additional pools, such as
"test", which is created on every cluster for the profound reason of testing.
In general the formula is the following:
MaxPGs = \begin{cases}
NumOSDs*100/ReplicationSize &\text{if } replicated \\
NumOSDs*100/(k+m) &\text{if } erasure\ coded
\end{cases}
Then we use the closest power of two, which is less than the above number.
Example on meredith (368 OSDs, EC -- k=4, m=2): MaxPGs=6133 --> MaxPGs=4096
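The meredith calculation above can be sketched in shell (numbers are the ones from the example, not live cluster data):

```shell
# Recommended pg_num: NumOSDs*100/(k+m), rounded down to a power of two.
num_osds=368; k=4; m=2
max_pgs=$(( num_osds * 100 / (k + m) ))   # integer division: 6133
pg_num=1
while [ $(( pg_num * 2 )) -le "$max_pgs" ]; do
  pg_num=$(( pg_num * 2 ))                # largest power of two <= max_pgs
done
echo "MaxPGs=${max_pgs} -> pg_num=${pg_num}"
```

For a replicated pool, replace `(k + m)` with the replication size, as in the formula above.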
Now, let's create the pools following the upstream documentation Create a pool.
We should have at least one test pool and one data pool:
- Create the test pool. It should always be replicated and not EC:
[root@cephmeredithmon...]$ ceph osd pool create test 512 512 replicated replicated_rule
pool 'test' created
[root@cephmeredithmon...]$ ceph osd pool ls detail
pool 6 'test' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 1710 flags hashpspool stripe_width 0 application test
- Create the data pool (named 'rbd_ec_data' here) with EC:
[root@cephmeredithmon...]$ ceph osd pool create rbd_ec_data 4096 4096 erasure jera_4plus2 rbd_ec_data
pool 'rbd_ec_data' created
[root@cephmeredithmon...]$ ceph osd pool ls detail | grep rbd_ec_data
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 1554 flags hashpspool stripe_width 16384
Finalize cluster configuration
Security Flags on Pools
- Make sure the security flags {nodelete, nopgchange, nosizechange} are set for all the pools
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1711 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd
...
If not, set the flags with
[root@cluster_mon]$ ceph osd pool set <pool_name> {nodelete, nopgchange, nosizechange} 1
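A quick way to check the flags is to match them against the `ceph osd pool ls detail` output. A sketch, fed here from a sample line captured above instead of a live cluster:

```shell
# Check a `ceph osd pool ls detail` line for the three security flags.
line="pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange stripe_width 16384"
out=""
for flag in nodelete nopgchange nosizechange; do
  case "$line" in
    *"$flag"*) out="${out}${flag}: set
" ;;
    *)         out="${out}${flag}: MISSING
" ;;
  esac
done
printf '%s' "$out"
```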
- pg_autoscale_mode should be set to off
[root@cluster_mon]$ ceph osd pool ls detail
pool 4 'rbd_ec_data' erasure size 6 min_size 5 crush_rule 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 1985 lfor 0/0/1559 flags hashpspool,ec_overwrites,nodelete,nopgchange,nosizechange,selfmanaged_snaps stripe_width 16384 application rbd
If the output shows anything for autoscale_mode
, disable autoscaling with
[root@cluster_mon]$ ceph osd pool set <pool_name> pg_autoscale_mode off
- Set the application type for each pool in the cluster
[root@cluster_mon]$ ceph osd pool application enable my_test_pool test
[root@cluster_mon]$ ceph osd pool application enable my_rbd_pool rbd
- If relevant, enable the balancer
[root@cluster_mon]$ ceph balancer on
[root@cluster_mon]$ ceph balancer mode upmap
[root@cluster_mon]$ ceph config set mgr mgr/balancer/upmap_max_deviation 1
The parameter upmap_max_deviation
is used to spread the PGs more evenly across the OSDs.
Check with
[root@cluster_mon]$ ceph balancer status
{
"plans": [],
"active": true,
"last_optimize_started": "Tue Jan 12 16:47:48 2021",
"last_optimize_duration": "0:00:00.296960",
"optimize_result": "Optimization plan created successfully",
"mode": "upmap"
}
[root@cluster_mon]$ ceph config dump
WHO MASK LEVEL OPTION VALUE RO
mgr advanced mgr/balancer/active true
mgr advanced mgr/balancer/mode upmap
mgr advanced mgr/balancer/upmap_max_deviation 1
Also, after quite some time spent balancing, the number of PGs per OSD should be even.
Focus on the PGS
column of the output of ceph osd df tree
[root@cluster_mon]$ ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 642.74780 - 643 TiB 414 GiB 46 GiB 505 KiB 368 GiB 642 TiB 0.06 1.00 - root default
-5 642.74780 - 643 TiB 414 GiB 46 GiB 505 KiB 368 GiB 642 TiB 0.06 1.00 - room 0513-R-0050
-4 27.94556 - 28 TiB 18 GiB 2.0 GiB 0 B 16 GiB 28 TiB 0.06 1.00 - rack CK01
-3 27.94556 - 28 TiB 18 GiB 2.0 GiB 0 B 16 GiB 28 TiB 0.06 1.00 - host cephflash21a-04f5dd1763
0 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 75 up osd.0
1 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 69 up osd.1
2 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 72 up osd.2
3 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 70 up osd.3
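The PGS spread can be extracted from that output with awk. A sketch, fed here from the sample lines above instead of a live cluster (on a real cluster, pipe `ceph osd df tree` in directly):

```shell
# Compute min/max/spread of the PGS column (third-from-last field on
# each osd line) to see how evenly the balancer has placed PGs.
sample='0 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 75 up osd.0
1 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 69 up osd.1
2 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 72 up osd.2
3 ssd 1.74660 1.00000 1.7 TiB 1.1 GiB 127 MiB 0 B 1 GiB 1.7 TiB 0.06 1.00 70 up osd.3'
result=$(printf '%s\n' "$sample" | awk '/osd\./ {
  pgs = $(NF-2)
  if (min == "" || pgs < min) min = pgs
  if (max == "" || pgs > max) max = pgs
} END { print "min=" min " max=" max " spread=" max - min }')
echo "$result"
```

With upmap_max_deviation set to 1, the spread should converge to a small number after the balancer has run for a while.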
Monitoring
Cluster monitoring is offered by:
- Health crons enabled at the hostgroup level (see the YAML file above):
  - enable_health_cron enables sending the email report that checks the current health status and greps in the recent ceph.log
  - enable_sls_cron enables sending metrics to filer-carbon that populate the Ceph Health dashboard
- Regular polling performed by cephadm.cern.ch
- Prometheus
- Watcher clients (CephFS) that mount and test FS availability
To enable polling from cephadm, proceed as follows:
- Add the new cluster to it-puppet-hostgroup-ceph/code/manifests/admin.pp. Consider Admin newclusters as a reference merge request. (Note: if you are adding a CephFS cluster, you do not need to add it to the ### BASIC CEPH CLIENTS array.)
- Create a client.admin key on the cluster:
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.admin mon 'allow *' mgr 'allow *' osd 'allow *' mds 'allow *'
[client.admin]
key = <the_super_secret_key>
- Add the key to tbag in the ceph/admin hostgroup (the secret must contain the full output of the command above):
tbag set --hg ceph/admin <cluster_name>.keyring --file <keyring_filename>
tbag set --hg ceph/admin <cluster_name>.admin.secret
Enter Secret: <paste secret here>
- Add the new cluster to it-puppet-module-ceph/data/ceph.yaml, otherwise the clients (cephadm included) will lack the mon hostname. (Consider Add ryan cluster as a reference merge request.) Double-check you are using the appropriate port.
- SSH to cephadm and run puppet a couple of times.
- Make sure the files <cluster_name>.client.admin.keyring and <cluster_name>.conf exist and show the appropriate content.
- Check the health of the cluster with:
[root@cephadm]# ceph --cluster=<cluster_name> health
HEALTH_OK
- Cephadm is also responsible for producing the availability numbers sent to the central IT Service Availability Overview. If the cluster needs to be reported in IT SAO, add it to ceph-availability-producer.py with a relevant description.
To enable monitoring from Prometheus, add the new cluster to prometheus.yaml. Also, the Prometheus module must be enabled on the MGR (Documentation: https://docs.ceph.com/en/octopus/mgr/prometheus/) for metrics to be retrieved:
ceph mgr module enable prometheus
To ensure a CephFS cluster is represented adequately, there are some additional steps to take:
- Update the it-puppet-module-cephfs README.md and code/data/common.yaml to include the new cluster (consider add doyle cluster as a reference merge request).
- Update the it-puppet-hostgroup-ceph watchers definition in code/manifests/test/cephfs/watchers.pp to ensure the new cluster is mounted by the watchers (consider watchers.pp: add doyle definition as an example merge request).
- SSH to one of the watcher nodes (e.g. cephfs-testc9-d81171f572.cern.ch) and run puppet a few times to synchronise the changes.
- Check cat /proc/mounts | grep ceph for an appropriate systemd mount and navigate to one of the directories within / to verify the FS is available.
Details on lbalias for mons
We prefer not to use the load-balancing service and lbclient here (https://configdocs.web.cern.ch/dnslb/):
there is no scenario in ceph where we want a mon to disappear from the alias.
We rather use the --load-N- approach to create the alias with all the mons:
- Go to network.cern.ch
- Click on Update information and use the FQDN of the mon machine
  - If prompted, make sure you use the host interface and not the IPMI one
- Add "ceph{hg_name}--LOAD-N-" to the list of IP Aliases under TCP/IP Interface Information
  - Multiple aliases are supported; use a comma-separated list
- Check the changes are correct and submit the request
Benchmarking
Note: What follows is not proper benchmarking, but some quick checks that the cluster works as expected.
Good reading at Benchmarking performance
Rados bench
Start a test on pool 'my_test_pool' with 10 s duration and block size 4096 B:
[root@cluster_mon]$ rados bench -p my_test_pool 10 write -b 4096
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cephflash21a-a6564a2ee7.cern._1768589
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 8752 8736 34.1231 34.125 0.00130825 0.00182201
2 16 16913 16897 32.9995 31.8789 0.00104112 0.00189076
3 15 24678 24663 32.1108 30.3359 0.00139087 0.00194522
4 16 32189 32173 31.4167 29.3359 0.0209055 0.0019863
5 16 39595 39579 30.9187 28.9297 0.0209981 0.00201906
6 16 47263 47247 30.7573 29.9531 0.00138272 0.00203065
7 16 55169 55153 30.7748 30.8828 0.00121337 0.00202973
8 16 63070 63054 30.7855 30.8633 0.00133439 0.00202877
9 15 70408 70393 30.55 28.668 0.00144124 0.00204461
10 11 78679 78668 30.7271 32.3242 0.00162555 0.00203309
Total time run: 10.0178
Total writes made: 78679
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 30.6793
Stddev Bandwidth: 1.68734
Max bandwidth (MB/sec): 34.125
Min bandwidth (MB/sec): 28.668
Average IOPS: 7853
Stddev IOPS: 431.959
Max IOPS: 8736
Min IOPS: 7339
Average Latency(s): 0.00203504
Stddev Latency(s): 0.00370041
Max latency(s): 0.0702117
Min latency(s): 0.000887922
Cleaning up (deleting benchmark objects)
Removed 78679 objects
Clean up completed and total clean up time :4.93871
RBD bench
Create an RBD image and run some tests on it:
[root@cluster_mon]$ rbd create rbd_ec_meta/enricotest --size 100G --data-pool rbd_ec_data
[root@cluster_mon]$ rbd bench --io-type write rbd_ec_meta/enricotest --io-size 4M --io-total 100G
Once done, delete the image with
[root@cluster_mon]$ rbd ls -p rbd_ec_meta
[root@cluster_mon]$ rbd rm rbd_ec_meta/enricotest
RBD clusters
Create Cinder key for use with OpenStack
All of the above steps result in a fully functional Rados Block cluster. The only missing step is to create access keys for OpenStack Cinder so that it can use the provided storage.
The upstream documentation on user management (and OpenStack is a user) is available at User Management
To create the relevant access key for OpenStack use the following command:
$ ceph auth get-or-create client.cinder mon 'profile rbd' osd 'profile rbd pool=volumes' mgr 'profile rbd pool=volumes'
which creates a user named "cinder" allowed to run rbd commands on the pool named "volumes".
Create an Images pool for use with OpenStack Glance
To store Glance images on ceph, a dedicated pool (pg_num may vary) and cephx keys are needed:
$ ceph osd pool create images 128 128 replicated replicated_rule
$ ceph auth get-or-create client.images mon 'profile rbd' mgr 'profile rbd pool=images' osd 'profile rbd pool=images'
CephFS Clusters
Enabling CephFS consists of creating data and metadata pools for CephFS and a new filesystem.
You also need to create metadata servers (either dedicated or colocated with other daemons),
otherwise the cluster will show HEALTH_ERR and 1 filesystem offline. See below for the creation of metadata servers.
Follow the upstream documentation at Create a Ceph File System
Creating metadata servers
Add at least two hosts to ceph/{hg_name}/mds.
MDS daemons can be dedicated (preferable for large, busy clusters) or colocated with other daemons (e.g., on the osd host, assuming enough memory is available).
As soon as one MDS goes active, the cluster health will go back to HEALTH_OK
.
It is recommended to have at least 2 nodes running MDSes for failover.
One can also consider having a stand-by replay MDS to lower the time needed for a failover.
Create Manila key for use with OpenStack
To provision CephFS File Shares via OpenStack Manila, a dedicated cephx key must be provided to the OpenStack team. Create the key with:
$ ceph auth get-or-create client.manila mon 'allow r' mgr 'allow rw'
S3 Clusters
Creating rgw hosts
To provide object storage, you need to run Ceph Object Gateway daemons (radosgw).
RGWs can run on dedicated machines (by creating new hosts in hostgroup ceph/{hg_name}/rgw) or colocated with existing machines.
In both cases, these classes need to be enabled:
- The radosgw class (radosgw.pp)
- The lb class (lb.pp)
- The traefik class (traefik.pp)
Also, you may want to enable:
- The S3 crons for specific quota and health checks (see include/s3{hourly,daily,weekly}.pp)
- Traefik log ingestion into the MONIT pipelines for ElasticSearch dashboards (see s3-logging)
Always start with one RGW only and iterate over the configuration until it runs.
Some of the required data pools (default.rgw.control
, default.rgw.meta
, default.rgw.log
, .rgw.root
)
are automatically created by the RGW at its first run. The creation of some other pools
is triggered by specific actions, e.g., making a bucket will create pool default.rgw.buckets.index
,
pushing the first object will trigger creation of default.rgw.buckets.data
.
It is highly recommended to pre-create all pools so that they have the right crush rule,
pg_num, etc. before data is written to them. If they get auto-created, they will use
the default crush type (replicated), while we typically use erasure coding for object storage.
Use an existing cluster as a reference to configure pools.
Creating a DNS load-balanced alias
The round-robin based DNS load balancing service is described at DNS Load Balancing.
To create a new load-balanced alias for S3:
- Go to https://aiermis.cern.ch/
- Add LB Alias, specifying whether it needs to be external and the number of hosts to return (Best Hosts)
- Configure hg_ceph::classes::lb::lbalias and the relevant RadosGW configuration params accordingly (rgw dns name, rgw dns s3website name, rgw swift url, ...)
- To support virtual-host-style bucket addresses (i.e., mybucket.s3.cern.ch), talk to the Network Team to have wildcard DNS enabled on the alias
Integration with OpenStack Keystone
RBD Mirroring
Make sure you have included hg_ceph::classes::rbd_mirror and set up the
bootstrap-rbd-mirror keyring.
Adding peers to rbd-mirror
You first have to add an rbd-mirror-peer keyring in the ceph hostgroup.
On your mon, run the following command:
[root@ceph{hg_name}-mon-...]$ ceph auth get-or-create client.rbd-mirror-peer mon 'profile rbd-mirror-peer' osd 'profile rbd' -o {hg_name}.client.rbd-mirror-peer.keyring
Copy the keyring to aiadm and create the secret:
[user@aiadm]$ tbag set --hg ceph {hg_name}.client.rbd-mirror-peer.keyring --file {hg_name}.client.rbd-mirror-peer.keyring
Now your cluster can participate with the others already registered to mirror your RBD images! You can now add the following data to register peers for your rbd-mirror daemons:
ceph::rbd_mirror:
- peer1
- peer2
- ...
Peering pools
You first have to enable mirroring for some of your pools: https://docs.ceph.com/en/octopus/rbd/rbd-mirroring/#enable-mirroring. Also check the configuration of the mirroring modes on the same page (journaling feature enabled on the RBD images, image snapshot settings, ...).
Then you can add peers like this:
[root@ceph{hg_name}-rbd-mirror-...]$ rbd mirror pool peer add {pool} client.rbd-mirror-peer@{remote_peer}