Ceph Monitoring

About Ceph Monitoring

The monitoring system in Ceph is based on Grafana, using Prometheus as the datasource and the native Ceph prometheus plugin as the metric exporter. Prometheus node_exporter is used for node metrics (CPU, memory, etc.).

For long-term metric storage, Thanos is used to store metrics in S3 (Meyrin).

Access the monitoring system

  • All Ceph monitoring dashboards are available in monit-grafana (Prometheus). Although Prometheus is the main datasource for Ceph metrics, some plots/dashboards may still require the legacy Graphite datasource.

  • The Prometheus server is configured on the host cephprom.cern.ch, hostgroup ceph/prometheus

  • Configuration files (Puppet):

    • it-puppet-hostgroup-ceph/code/manifests/prometheus.pp
    • it-puppet-hostgroup-ceph/data/hostgroup/ceph/prometheus.yaml
    • it-puppet-hostgroup-ceph/data/hostgroup/ceph.yaml
    • Alertmanager templates: it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl
    • Alert definition: it-puppet-hostgroup-ceph/code/files/generated_rules/
  • Thanos infrastructure is under the ceph/thanos hostgroup, configured via the corresponding hiera files.

An analogous QA infrastructure is also available, with all components replicated (cephprom-qa, thanos-store-qa, etc.). This QA infrastructure is configured by overriding the Puppet environment:

  • it-puppet-hostgroup-ceph/data/hostgroup/ceph/environments/qa.yaml

Add/remove a cluster to/from the monitoring system

  • Enable the prometheus mgr module in the cluster:
ceph mgr module enable prometheus

NOTE: Make sure that port 9283 is accepting connections.
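
As a quick sanity check, you can query the metrics endpoint exposed by the mgr directly; a minimal sketch, assuming you run it from a host that can reach the active mgr (the host name below is a placeholder):

# Placeholder host: replace with the node running the active mgr of the cluster
curl -s http://ceph-mgr-host.cern.ch:9283/metrics | head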

Instances that include the hg_ceph::classes::mgr class will be automatically discovered through puppetdb and scraped by prometheus.

  • To ensure that we don't lose metrics during mgr failovers, all the cluster mgrs will be scraped. As a side benefit, we can monitor the online status of the mgrs.
  • Run or wait for a puppet run on cephprom.cern.ch.
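
For reference, a PuppetDB-based scrape job could look roughly like the sketch below. This is only an illustration (the real configuration is generated by the Puppet files listed above); the PuppetDB URL and the PQL query are assumptions:

scrape_configs:
  - job_name: ceph-mgr                               # illustrative job name
    puppetdb_sd_configs:
      - url: https://puppetdb.example.cern.ch        # placeholder PuppetDB URL
        # assumed PQL query matching hosts that include the mgr class
        query: 'resources { type = "Class" and title = "Hg_ceph::Classes::Mgr" }'
        port: 9283                                   # ceph-mgr prometheus module port
        refresh_interval: 5m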

Add/remove a node for node metrics (cpu, memory, etc)

Instances that include the prometheus::node_exporter class (anything under the ceph top-level hostgroup) will be automatically discovered through puppetdb and scraped by prometheus.
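
Once a node is discovered, its node_exporter metrics can be queried like any other; for example, a standard PromQL expression for per-node CPU utilisation (the metric name is standard node_exporter, label names depend on our scrape configuration):

# CPU utilisation (%) per instance over the last 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))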

Add/remove an alert rule to/from the monitoring system

Alerts are defined in yaml files managed by puppet in:

  • it-puppet-hostgroup-ceph/files/prometheus/generated_rules

They are organised by service, so add the alert to the appropriate file (e.g. Ceph alerts in alerts_ceph.yaml). The file rules.yaml is used to add recording rules.

There are currently three notification channels: e-mail, SNOW ticket and Mattermost message.

Before creating the alert, make sure you test your query in advance, for example using the Explore panel on Grafana. Once the query is working, proceed with the alert definition.

A prometheus alert could look like this:

rules:
  - alert: "CephOSDReadErrors"
    annotations:
      description: "An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel."
      documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors"
      summary: "Device read errors detected on cluster {{ $labels.cluster }}"
    expr: "ceph_health_detail{name=\"BLUESTORE_SPURIOUS_READ_ERRORS\"} == 1"
    for: "30s"
    labels:
      severity: "warning"
      type: "ceph_default"
  • alert: Mandatory. Name of the alert, which will be part of the subject of the e-mail, the head of the ticket and the title of the Mattermost notification. Try to follow the same pattern as the ones already created (CephDAEMONAlert): daemon in uppercase and the rest in camel case.
  • expr: Mandatory. PromQL query that defines the alert. The alert will trigger if the query returns one or more matches. It's a good exercise to use promdash for tuning the query to ensure that it is well formed.
  • for: Mandatory. The alert will be triggered if it stays active for more than the specified time (e.g. 30s, 1m, 1h).
  • annotations:summary: Mandatory. Expresses the actual alert in a concise way.
  • annotations:description: Optional. Allows specifying more detailed information about the alert when the summary is not enough.
  • annotations:documentation: Optional. Allows specifying the URL of the documentation/procedure to follow to handle the alert.
  • labels:severity: Mandatory. Defines the notification channel to use, based on the following:
    • warning/critical: Sends an e-mail to ceph-alerts.
    • ticket: Sends an e-mail AND creates an SNOW ticket.
    • mattermost: Sends an e-mail AND sends a Mattermost message to the ceph-bot channel.
  • labels:type: Optional. Allows distinguishing alerts created upstream (ceph_default) from those created by us (ceph_cern). It has no actual implication on the alert functionality.
  • labels:xxxxx: Optional. You can add custom labels that can be used in the template.

NOTES

  • In order for the templating to work as expected, make sure that the labels cluster or job_name are part of the resulting query. In case the query does not preserve labels (e.g. aggregations like count), you can manually specify the label and value in the labels section of the alert definition, as shown in the sketch after this list.
  • All annotations, if defined, will appear in the body of the ticket, e-mail or mattermost message generated by the alert.
  • Alerts are evaluated against the local prometheus server, which contains metrics for the last 7 days. Take that into account when defining alerts that evaluate longer periods (like predict_linear). In such cases, you can create the alert in Grafana using the Thanos-LTMS metric datasource (more on that later in this doc).
  • In Grafana or promdash you can access the alerts by querying the metric called ALERTS
  • For more information about how to define an alert, refer to the Prometheus Documentation
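
As an illustration of the note on labels above, here is a minimal sketch of an alert whose expression aggregates with count (which drops labels) and therefore sets the cluster label manually; the alert name, threshold and cluster value are hypothetical:

rules:
  - alert: "CephOSDsDownExample"                # hypothetical alert name
    annotations:
      summary: "Less than 3 OSDs are up on cluster {{ $labels.cluster }}"
    expr: "count(ceph_osd_up == 1) < 3"         # count() drops the cluster label
    for: "5m"
    labels:
      severity: "warning"
      cluster: "example"                        # set manually so the templating still works
      type: "ceph_cern"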

Create / Link procedure/documentation to a Prometheus Alert

Prometheus alerts are pre-configured to show the procedure needed for handling the alert via the annotation procedure_url. This is an optional annotation that can be configured per alert rule.

Step 1: Create the procedure in case it does not exist yet.

Update the file rota.md in this repository and add the new procedure. Use this file for convenience, but you can create a new file if needed.

Step 2: Edit the alert rule and link to the procedure.

Edit the alert following the instructions above, and add the link to the procedure under the annotations section, under the key documentation, for example:

- alert: "CephMdsTooManyStrays"
    annotations:
      documentation: "http://s3-website.cern.ch/cephdocs/ops/rota.html#cephmdstoomanystrays"
      summary: "The number of strays is above 500K"
    expr: "ceph_mds_cache_num_strays > 500000"
    for: "5m"
    labels:
      severity: "ticket"

Push the changes and the Prometheus server will reload automatically, picking up the new changes. The next time the alert is triggered, a link to the procedure will be shown in the alert body.

Silence Alarms

You can use the alertmanager Web Interface to silence alarms during scheduled interventions. Please always specify a reason for silencing the alarms (a JIRA link or ticket would be a plus). Additionally, for the alerts that generate an e-mail, you will find a link to silence it in the email body.
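
Silences can also be created from the command line with amtool; a minimal sketch, assuming amtool is available and using a placeholder Alertmanager URL and ticket reference:

# Silence a specific alert for 2 hours during an intervention
amtool silence add alertname=CephOSDReadErrors \
  --alertmanager.url=http://alertmanager.example.cern.ch:9093 \
  --duration=2h --author="$USER" --comment="Scheduled intervention, see CEPH-XXXX"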

Alert Grouping

Alert grouping is enabled by default, so if the same alert is triggered in different nodes, we only receive one ticket with all involved nodes.
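
For reference, grouping is driven by the group_by setting of the Alertmanager route; the snippet below is an illustrative sketch only, the actual values live in the Puppet-managed configuration:

route:
  receiver: email
  group_by: ['alertname']   # one notification per alert name, listing all affected instances
  group_wait: 30s           # illustrative timings
  group_interval: 5m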

Modifying AlertManager Templates

Both email and SNOW ticket templates are customizable. To do that, you need to edit the following Puppet file:

  • it-puppet-hostgroup-ceph/code/files/prometheus/am-templates/ceph.tmpl

You have to use Golang's Template syntax. The structure of the file is as follows:

{{ define "ceph.email.subject" }}
....
{{ end }}
{{ define "ceph.email.body" }}
....
{{ end }}
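
As an example, the subject template could be filled in along these lines; a minimal sketch, assuming the alerts carry cluster and severity labels (adjust to the labels actually present):

{{ define "ceph.email.subject" }}[{{ .CommonLabels.severity }}] {{ .GroupLabels.alertname }} on cluster {{ .CommonLabels.cluster }}{{ end }}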

For reference, check the default AlertManager Templates.

In case you add templates, make sure that you adapt the AlertManager configuration accordingly:

- name: email
  email_configs:
  - to: ceph-admins@cern.ch
    from: alertmanager@localhost
    smarthost: cernmx.cern.ch:25
    headers:
      Subject: '{{ template "ceph.email.subject" . }}'
    html: '{{ template "ceph.email.body" . }}'

Note: A restart of AlertManager is needed for the changes to be applied.
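
Note also that the template file must be loaded by AlertManager; an illustrative sketch of the relevant stanza (the path is an assumption, the real one is managed by Puppet):

templates:
  - '/etc/alertmanager/am-templates/ceph.tmpl'   # assumed path of the deployed template file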

Accessing the prometheus dashboard (promdash)

The prometheus dashboard, or Promdash, is a powerful interface that allows you to quickly assess the prometheus server status and also provides a quick way of querying metrics. The prometheus dashboard is accessible from this link: Promdash.

  • The prometheus dashboard is useful for:
    • Checking the status of all targets: Target status
    • Check the status of the alerts: Alert Status
    • For debug purposes, you can execute PromQL queries directly on the dashboard and change the intervals quickly.
    • In grafana there is an icon just near the metric definition to view the current query in promdash.
    • You can also use the Grafana Explorer.

Note: This will only give you access to the metrics of the last 7 days; refer to the next chapter for accessing older metrics.
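
For example, to list the alerts currently firing you can run the following query in promdash or in the Grafana Explore panel (ALERTS is a metric exposed by Prometheus itself):

# All alerts currently in the firing state
ALERTS{alertstate="firing"}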

Long Term Metric Storage - LTMS

The long-term metrics are kept in the CERN S3 service using Thanos. The bucket is called prometheus-storage and is accessed using the EC2 credentials of Ceph's OpenStack project. Accessing these metrics is transparent from Grafana:

  • Metrics of the last 7 days are served directly from prometheus local storage
  • Older metrics are pulled from S3.
  • As metrics in S3 contain downsampled versions (5m, 1h), it is usually much faster than getting metrics from the local prometheus.
  • RAW metrics are also kept, so it is possible to zoom in to the 15-second resolution.

Accessing the thanos dashboard

There is a thanos promdash version here, from where you can access all historical metrics. This dashboard has some specific thanos features like deduplication (for use cases with more than one prometheus server scraping the same data) and the possibility of showing downsampled data (thanos stores two downsampled versions of the metrics, with 1h and 5m resolution). This downsampled data is also stored in S3.

Thanos Architecture

You can find more detailed information on the Thanos official webpage, but this is the list of active components in our current setup and a high-level description of what they do:

Sidecar

  • Every time Prometheus dumps its data to disk (by default, every 2 hours), the thanos-sidecar uploads the metrics to the S3 bucket. It also acts as a proxy that serves Prometheus’s local data.

Store

  • This is the storage proxy which serves the metrics stored in S3

Querier

  • This component reads the data from the store(s) and sidecar(s) and answers PromQL queries using the standard Prometheus HTTP API. This is the component that monitoring dashboards have to point to.

Compactor

  • This is a detached component which compacts the data in S3 and also creates the downsampled versions.
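
All these components share the same object storage configuration; here is a minimal sketch of what the Thanos objstore file could look like for this setup (the endpoint and the way the credentials are injected are assumptions, only the bucket name comes from this documentation):

type: S3
config:
  bucket: prometheus-storage        # bucket mentioned above
  endpoint: s3.cern.ch              # assumed CERN S3 endpoint
  access_key: "<EC2 access key>"    # EC2 credentials of Ceph's OpenStack project
  secret_key: "<EC2 secret key>"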