S3 Operations notes

Note: If you are looking for the old notes related to the infrastructure based on Consul and Nomad, please refer to the old documentation.


About the architecture

The CERN S3 service (s3.cern.ch) is provided by the gabe cluster and an arbitrary number of radosgw daemons running on VMs. Each node in the ceph/gabe/radosgw hostgroup also runs a reverse-proxy daemon (Træfik) to spread the load across the VMs running a radosgw and to route traffic to dedicated RGWs (cvmfs, gitlab, ...).

A second S3 cluster (s3-fr-prevessin-1.cern.ch) is also available at the Prevessin Network Hub (nethub).

Both clusters (as of July 2021) use similar technologies: Ceph, RGWs, Træfik, Logstash, ...

Components

  • RadosGW: Daemon handling S3 requests and interacting with the Ceph cluster.
  • Træfik: Handles HTTP(S) requests from the Internet and spreads the load on radosgw daemons.
  • Logstash: Sidecar process that ships the access logs produced by Træfik to the MONIT infrastructure.

Useful documentation

  • Upstream RadosGW documentation: (https://docs.ceph.com/en/nautilus/radosgw/)
  • Upstream documentation on radosgw-admin tool: (https://docs.ceph.com/en/nautilus/man/8/radosgw-admin/)
  • Træfik documentation: (https://docs.traefik.io/)
  • S3 Script guide: (https://gitlab.cern.ch/ceph/ceph-guide/-/blob/master/src/ops/s3-scripts.md)

Dashboards

  • Træfik: http://s3.cern.ch/traefik/ (requires basic auth)
  • ElasticSearch for access logs: https://es-ceph.cern.ch/ (from CERN network only)
  • Various S3 dashboards (and underlying Ceph clusters) on Filer Carbon
  • Buckets rates (and others) on Monit Grafana

Maintenance Tasks

Removal of one Træfik/RGW machine from the cluster

Each machine running Træfik/RGW is:

  • Part of the s3.cern.ch alias (managed by lbclient), with Træfik accepting connections on port 80 and 443 for HTTP and HTTPS, respectively

  • A backend RadosGW for all the Træfiks of the cluster, with the Ceph RadosGW daemon accepting connections on port 8080

  • To remove a machine from s3.cern.ch, touch /etc/nologin or change the roger status to intervention/disabled (roger update --appstate=intervention <hostname>). This will make lbclient return a negative value and the machine will be removed from the alias.

  • To temporarily remove a RadosGW from the list of backends (e.g., for a cluster upgrade), touch /etc/nologin: the RadosGW process will then return 503 for requests to /swift/healthcheck. This path is used by the Træfik healthcheck and, if the return code differs from 200, Træfik stops sending requests to that backend. Wait a few minutes to let in-flight requests complete, then restart the RadosGW process without clients noticing (see the sketch at the end of this section). See the Pull Request implementing the healthcheck-disabling path.

  • To permanently remove a RadosGW from the list of backends (e.g., decommissioning), change the Træfik dynamic configuration via puppet in traefik.yaml by removing the machine from the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm:

[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3

 * [ ============================================================> ] 14 / 14

Finished processing 14 / 14 hosts in 114.60 ms
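For example, a minimal drain-and-verify sequence on the node itself (the healthcheck port and path are those described above; the roger command is the one from the bullet above):

touch /etc/nologin                                    # lbclient now returns a negative value
roger update --appstate=intervention $(hostname)      # alternative to /etc/nologin for leaving the alias
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/swift/healthcheck
# expect 503 while draining; Træfik stops routing requests to this backend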

Create a new Træfik/RGW VM

  • Spawn a new VM with the script cephgabe-rgwtraefik-create.sh from aiadm
  • Wait for the VM to be online and run puppet several times so that the configuration is up to date
  • Make sure you have received the email confirming the VM has been added to the firewall set (and is thus reachable from the public Internet)
  • Make sure the new VM serves requests as expected (test IPv4 and IPv6, HTTP and HTTPS; see the combined check after this list):
curl -vs --resolve s3.cern.ch:{80,443}:<ip_address_of_new_VM> http(s)://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
  • Add the VM to the Prometheus s3_lb job (see prometheus puppet config) to monitor its availability and collect statistics on failed (HTTP 50*) requests
  • Change the roger status to production and enable all alarms. The machine will now be part of the s3.cern.ch alias
  • Update the Træfik dynamic configuration via puppet in traefik.yaml by adding the new backend to the servers list. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).
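A combined check could look like the following sketch (the addresses are placeholders; since curl 7.57, IPv6 addresses can be passed to --resolve in brackets):

for ip in <ipv4_of_new_VM> '[<ipv6_of_new_VM>]'; do
  curl -vs --resolve "s3.cern.ch:80:${ip}" http://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
  curl -vs --resolve "s3.cern.ch:443:${ip}" https://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
done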

Change/Add/Remove the backend RadosGWs

  • Edit the list of backend nodes in the Træfik dynamic configuration via puppet in traefik.yaml by adding/removing/shuffling around the servers. The new configuration must be applied by all the other nodes. To run puppet everywhere, use mco from aiadm (see above).
  • If adding/removing, make sure the list of monitored endpoints by Prometheus is up to date. See prometheus puppet config.

Change Træfik TLS certificate

The certificate is provided by CDA. You should ask them to buy a new one with the correct SANs. Once the new certificate is provided, paste it into https://tools.keycdn.com/certificate-chain -- it will return a certificate chain with all the required intermediate certificates. This certificate chain is the one to put in Teigi and to be used by Træfik. Split it and check the validity of each certificate with openssl x509 -in <filename> -noout -text. Typically, the root CA certificate, the intermediate certificate and the private key do not change.
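One possible way to split the chain and inspect each piece (a sketch; file names are illustrative):

# split chain.pem into one file per certificate (cert-00, cert-01, ...)
csplit -z -f cert- chain.pem '/-----BEGIN CERTIFICATE-----/' '{*}'

# print subject, issuer and validity dates of each piece
for f in cert-*; do
  openssl x509 -in "$f" -noout -subject -issuer -dates
done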

Once validated, it should be put in Teigi under ceph/gabe/radosgw/traefik:

  • s3_ch_ssl_certificate
  • s3_ch_ssl_private_key
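Assuming the standard tbag CLI for Teigi on aiadm, uploading the new material could look like the following (the exact flags are an assumption; check tbag --help):

tbag set --hg ceph/gabe/radosgw/traefik s3_ch_ssl_certificate --file chain.pem   # file names are illustrative
tbag set --hg ceph/gabe/radosgw/traefik s3_ch_ssl_private_key --file radosgw.key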

Next, the certificate must be deployed on all machines via puppet. Mcollective can be of help to bulk-run puppet on all the Træfik machines:

[ebocchi@aiadm81 ~]$ mco puppet runonce -T ceph -F hostgroup=ceph/gabe/radosgw/traefik --dt=3

 * [ ============================================================> ] 14 / 14

Finished processing 14 / 14 hosts in 114.60 ms

Last, the certificate must be loaded by Træfik. While the certificate is part of Træfik's dynamic configuration, Træfik does not seem to reload it if the certificate file (distributed via puppet) changes on disk. Puppet will still notify the Træfik service when the certificate file changes (see traefik.pp) to no avail.

Since 2022, a configuration change in Træfik (Traefik: hot-reload certificates when touching (or editing) dynamic file) allows reloading the certificate when the Traefik dynamic configuration file changes. It is sufficient to touch /etc/traefik/traefik.dynamic.conf to have the certificate reloaded, with no need to drain the machine and restart the Traefik process:

  • Make sure the new certificate file is available on the machine (/etc/ssl/certs/radosgw.crt)
  • Tail the logs of the Traefik service: tail -f /var/log/traefik/service.log
  • Touch Traefik's dynamic configuration file: touch /etc/traefik/traefik.dynamic.conf
  • Check the new certificate is in place:
curl -vs --resolve s3.cern.ch:443:<the_ip_address_of_the_machine> https://s3.cern.ch --output /dev/null 2>&1 | grep '^\*' | grep date
*  start date: Mar  1 00:00:00 2022 GMT
*  expire date: Mar  1 23:59:59 2023 GMT

The same certificates are also used by the Nethub cluster and distributed via Teigi under ceph/nethub/traefik:

  • s3_fr_ssl_certificate
  • s3_fr_ssl_private_key
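The same validity check as above can be run against the Nethub endpoint:

curl -vs https://s3-fr-prevessin-1.cern.ch --output /dev/null 2>&1 | grep '^\*' | grep date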

Quota alerts

There is a daily cronjob that checks S3 user quota usage and sends a list of accounts that have reached 90% of their quota. Upon receiving this email, we should get in touch with the user and see if they can (1) free some space by deleting unnecessary data or (2) request more space.
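To inspect a single account by hand, the standard radosgw-admin commands show quota settings and current usage (a sketch; the uid placeholder is the rgw account id):

radosgw-admin user info --uid=<rgw account id>                 # quota settings under user_quota / bucket_quota
radosgw-admin user stats --uid=<rgw account id> --sync-stats   # current usage in bytes and objects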

Currently, some RGW accounts come without an associated email address. A way to investigate who owns such an account is to log into aiadm.cern.ch and run the following command (in /root/ceph-scripts/tools/s3-accounting/):

./cern-get-accounting-unit.sh --id `./s3-user-to-accounting-unit.py <rgw account id>`

This will give you the username of the owner of the associated OpenStack tenant, together with the contact email address.

Further notes on s3.cern.ch alias

The s3.cern.ch alias is managed by aiermis and/or by the kermis CLI utility on aiadm:

[ebocchi@aiadm81 ~]$ kermis -a s3 -o read
INFO:kermis:[
    {
        "AllowedNodes": "",
        "ForbiddenNodes": "",
        "alias_name": "s3.cern.ch",
        "behaviour": "mindless",
        "best_hosts": 10,
        "clusters": "none",
        "cnames": [],
        "external": "yes",
        "hostgroup": "ceph/gabe/radosgw",
        "id": 3019,
        "last_modification": "2018-11-01T00:00:00",
        "metric": "cmsfrontier",
        "polling_interval": 300,
        "resource_uri": "/p/api/v1/alias/3019/",
        "statistics": "long",
        "tenant": "golang",
        "ttl": null,
        "user": "dvanders"
    }
]

As of July 2021, the alias returns the 10 best hosts (based on the lbclient score) out of all the machines that are part of the alias, which are typically more than 10. Also, the members of the alias are refreshed every 5 minutes (300 seconds).
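To see which members the alias currently returns:

dig +short s3.cern.ch         # the A records currently returned (up to 10)
dig +short s3.cern.ch AAAA    # same for IPv6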

Upgrading software

Upgrade mon/mgr/osd

Follow the procedure defined for the other Ceph clusters. In a nutshell:

  • Start with mons, then mgrs. OSDs go last.
  • If upgrading OSDs, ceph osd set {noin,noout}
  • yum update to update the packages (check that the ceph package is actually upgraded)
  • systemctl restart ceph-{mon,mgr,osd}
  • Always make sure the daemons come back alive and all OSDs have re-peered before continuing with the next machine
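As a sketch, the per-host sequence might look like this (unit names follow the usual ceph-<daemon>@<id> systemd convention and may differ on your hosts):

ceph osd set noin && ceph osd set noout      # only when OSDs are involved
yum update                                   # verify the ceph package version actually changes
systemctl restart ceph-mon@$(hostname -s)    # on a mon
systemctl restart ceph-mgr@$(hostname -s)    # on a mgr
systemctl restart ceph-osd.target            # on an OSD host: restarts all local OSDs
ceph status                                  # wait for daemons to rejoin and PGs to re-peer
ceph osd unset noin && ceph osd unset noout  # once the whole cluster is done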

Upgrading RGW

To safely upgrade the RadosGW, touch /etc/nologin to have it return 503 to the healthcheck probes from Træfik (see more about the healthcheck-disabling path above). This drains the RadosGW: no new requests are sent to it, and in-flight ones finish gently.

After a few minutes, one can assume there are no more in-flight requests and the RadosGW can be updated and restarted: systemctl restart ceph-radosgw.target. Make sure the RadosGW came back alive by tailing the log at /var/log/ceph/ceph-client.rgw.*; it should still return 503 to the Træfik healthchecks. Now remove /etc/nologin and check that requests flow with 200.
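Put together, a drain-update-restart sketch (the systemd unit name is an assumption; adapt it to the one actually installed):

touch /etc/nologin                            # healthcheck returns 503; Træfik drains this backend
sleep 300                                     # let in-flight requests complete
yum update                                    # check the radosgw package was actually upgraded
systemctl restart ceph-radosgw.target         # assumption: adapt to the local unit name
tail /var/log/ceph/ceph-client.rgw.*          # confirm the daemon came back alive
rm /etc/nologin                               # rejoin: healthcheck returns 200 again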

Upgrading Træfik

To safely upgrade Træfik, the frontend machine must be removed from the load-balanced alias by touching /etc/nologin (this will also disable the RadosGW due to the healthcheck-disabling path -- see above). Wait for some time and make sure no (or little) traffic is handled by Træfik by checking its access logs (/var/log/traefik/access.log). Some clients (e.g., GitLab, CBack) are particularly sticky and rarely re-resolve the alias to IPs -- there is nothing you can do to push those clients away.

When no (or little) traffic goes through Træfik, update the traefik::version parameter and run puppet. The new Træfik binary will be installed on the host and the service will be restarted.
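To gauge the remaining traffic and to confirm the new binary, something like this can help (a sketch):

timeout 60 tail -n0 -f /var/log/traefik/access.log | wc -l   # requests seen in one minute
traefik version                                              # version of the installed binary after puppet ran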

Check with curl that Træfik works as expected. Example:

$ curl -vs --resolve s3.cern.ch:80:188.184.74.136 http://s3.cern.ch/cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished --output -
* Added s3.cern.ch:80:188.184.74.136 to DNS cache
* Hostname s3.cern.ch was found in DNS cache
*   Trying 188.184.74.136:80...
* TCP_NODELAY set
* Connected to s3.cern.ch (188.184.74.136) port 80 (#0)
> GET /cvmfs-atlas/cvmfs/atlas.cern.ch/.cvmfspublished HTTP/1.1
> Host: s3.cern.ch
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Bucket: cvmfs-atlas
< Cache-Control: max-age=61
< Content-Length: 601
< Content-Type: application/x-cvmfs
< Date: Fri, 22 Apr 2022 14:45:27 GMT
< Etag: "b5dbc3633d7bb27d10610f5f1079a192"
< Last-Modified: Fri, 22 Apr 2022 14:11:10 GMT
< X-Amz-Request-Id: tx00000000000000143ffd3-006262bf87-28e3e206-default
< X-Rgw-Object-Type: Normal
< 
Ca5b48a4ed8f0ca46b79584104564da32b42a1c45
B1385472
Rd41d8cd98f00b204e9800998ecf8427e
D240
S103476
Gno
Ano
Natlas.cern.ch
{...cut...}
* Connection #0 to host s3.cern.ch left intact

If successful, allow the machine to join the load-balanced pool by removing /etc/nologin.

Improve me !