Restic CephFS Backup (Automatic)

This guide covers the basic architecture and operation of the distributed backup system built on restic.

All the puppet configuration is under the following hostgroup structure:

ceph/restic/
ceph/restic/agent
ceph/restic/agent/backup

The code of the different scripts resides in the following git repository:

cback backup agents / cli

Architecture

These are the components of the current system and their roles:

cephrestic-backup-NN (cephrestic-backup.cern.ch)

Stateless nodes and the actual workers of the system. Each of these nodes runs a restic agent, which is always running and checks for new backup jobs every 5 seconds. When a job is found, the agent performs the backup, copying files from CephFS to S3.
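As a quick sanity check that an agent process is alive on a worker node, you can inspect its systemd unit (a sketch; instance 1 is just an example, see Scaling the System below):

[rvalverd@cephrestic-backup-01]$ systemctl status cback-backup@1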

cback-switch

This daemon runs every hour, at a random minute, on every agent and changes the status of backups that have been Completed for more than 24 hours back to Pending (see the Operating section). It does the same for the prune mechanism, setting to Pending all jobs that have not been pruned in the last week.
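For example, to see which jobs the switch has already moved back into the queue, you can list the pending ones:

cback backup ls pending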

S3 Storage

This is where the backups are stored. Each user has their own bucket, named cboxback-<user_name> (cboxbackproj-<svc_account> for projects). Every restic agent has the s3cmd utility installed and configured, so the existing buckets can be listed with:

s3cmd ls

CAUTION: Needless to say, deleting the S3 bucket will delete all backup data and snapshot information. The backup job won't fail; instead, a fresh backup will be triggered. So take care while operating on the bucket directly and, if needed, disable the related backup job first with cback backup disable <id>.
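As an illustration (the user name and bucket follow the naming convention above and are only examples), a safe way to inspect a single user's bucket is to disable its job first and then list the bucket contents:

cback backup disable <id>
s3cmd ls s3://cboxback-rvalverd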

Configuration

The basic configuration of the backup system is done via config files managed by Puppet through Hiera. These config files reside in /etc/cback/cback-<type-of-agent>-config.json.

The available configuration parameters are documented in each hostgroup's Hiera data file.
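For example, on a backup agent you can inspect the effective configuration directly (a sketch; "backup" is substituted for <type-of-agent> following the naming pattern above):

cat /etc/cback/cback-backup-config.json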

Command Line Interface (cback)

The system is operated with a command line tool called cback, available on any of the backup, prune or restore agents. The tool is still in development, so always check cback --help for the current set of commands.

Operating Backup

  • Check backup status
cback backup status 

These are the possible backup statuses:

  • Enabled Only enabled jobs will be taken into account by the backup agents.
  • Pending The job is ready to be backed up. Any available agent will pick it up whenever it is free, unless the job is disabled.
  • Running The job is running at that moment. Check cback backup status to see which agent is taking care of the job.
  • Failed There was a problem with that backup. Check cback backup status <user_name | job_id> to see what went wrong.
  • Completed The last backup was successful. This is not a permanent state: after the default 24 hours, the status is changed back to Pending.

Only jobs that are Enabled + Pending, and whose prune status is not Running, will be processed by the backup agents. The command cback backup reset <id> sets the status to Pending.
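For example, a typical sequence to investigate and requeue a failed job, using only the commands described in this section, could be:

cback backup ls failed
cback backup status <backup_id>
cback backup reset <backup_id>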

  • Check the status of a particular user or backup id:
cback backup status rvalverd
  • List all backups
cback backup ls
  • List all backups by status:
cback backup ls [failed|completed|running|pending|disabled]
  • Enable / disable a backup job
cback backup enable|disable <backup_id>

NOTE: This command does not stop a running backup. If the backup is currently running, it will run to completion, but it will not be picked up for subsequent backups.

  • Reset a backup job (changes the status to Pending)
cback backup reset <backup_id>
  • Add a new backup job
cback backup add <user_name> <instance> <path> [--bucket-prefix=<prefix>] [--bucket-name=<name>] [--enable]

Example:

cback backup add rvalverd cvmfs /cephfs-flax/volumes/_nogroup/234234 --enable

This will add a new backup job that stores the specified path in a bucket called cephback-rvalverd. The bucket will be created automatically on the first run of the backup.

NOTE 1: By default, the <user_name> argument is used to generate the name of the bucket by concatenating it with the bucket prefix (cephback- by default). If user_name is rvalverd, the bucket will be named cephback-rvalverd.

NOTE 2: It's possible to add more than one backup per user as long as the path is different.

NOTE 3: If the instance does not exist, it will be created automatically. This field is only used for categorizing the jobs, so it does not need to match an existing Ceph instance and is not used in the actual backup logic.

NOTE 4: All backup jobs are added as Pending+Disabled by default unless the --enable flag is set, in which case the backup is added as Pending+Enabled. The --enable flag will also set the prune status to Enabled+Pending.

NOTE 5: If --bucket-prefix is not specified, the default will be used: cephback-. This is configurable through Puppet.

NOTE 6: If --bucket-name is specified, its value will be used instead of any other combination.

NOTE 7: The S3 repository will be created automatically by the backup agent on the first run of the backup.
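As another sketch, combining NOTE 6 with the project bucket naming from the S3 Storage section (the service account, path and bucket name here are made up for illustration):

cback backup add svc_myproject cephfs /cephfs/volumes/myproject --bucket-name=cboxbackproj-svc_myproject --enable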

  • Delete a backup job. An interactive shell will be presented to delete the backup metadata and, if desired, the S3 bucket contents as well. Use it with care: no recovery is possible. It is not possible to delete backups in Running status.
cback backup delete <backup_id> 

Restoring a backup

Currently, refer to the restic documentation in order to recover the data.

To operate on the repository using restic you need to:

  • Source the environment configuration:
source /etc/cback/restic_env
  • Get the url of the backup to operate:
cback backup status <user_name | backup_id>
  • Run normal restic commands:
restic -r s3:s3.cern.ch/cephback-rvalverd snapshots|restore|find ...

Refer to restic help for all available options.
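Putting these steps together, restoring the latest snapshot of a user's backup into a scratch directory could look like this (the user and target path are only examples):

source /etc/cback/restic_env
cback backup status rvalverd
restic -r s3:s3.cern.ch/cephback-rvalverd snapshots
restic -r s3:s3.cern.ch/cephback-rvalverd restore latest --target /tmp/restore-rvalverd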

Scaling the System

Vertically:

  • You can run as many agent processes as you wish on any node by spawning a new instance, e.g. systemctl start cback-<type_of_agent>@<new_agent_id>

For example, if we have only one agent on cephrestic-backup-01, we can do the following to have two:

[rvalverd@cephrestic-backup-01]$ systemctl start cback-backup@2

The number of agents running on each machine is not (currently) managed by Puppet, so such changes are persistent. If an agent crashes, it will not be restarted by Puppet. This will be addressed in future versions of the system.
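To see how many backup agent instances are currently running on a node, you can list the instantiated units with standard systemd tooling (a sketch):

[rvalverd@cephrestic-backup-01]$ systemctl list-units 'cback-backup@*'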

Horizontally:

You need to spawn a new machine in the required hostgroup:

  • backup agent: ceph/restic/agent/backup
  • prune agent: ceph/restic/agent/prune
  • restore agent: ceph/restic/agent/restore

For example, to add a new backup agent N (assuming we currently have N-1):

[rvalverd@aiadm09 ~]$ eval `ai-rc "IT Ceph Storage Service"`
ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins --nova-flavor m2.large --cc7 -g ceph/restic/agent/backup --foreman-environment qa cephrestic-backup-N.cern.ch 

Add the node to the load-balanced alias:

openstack server set --property landb-alias=cephrestic-backup--load-N- cephrestic-backup-N

Once the installation is complete and Puppet has run, you need to log in to the machine and start the daemon (this will be done automatically in a future version of the system):

[rvalverd@cephrestic-backup-N]$ systemctl start cback-backup@1

After that, the agent should start pulling jobs.
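To confirm that the new agent is actually picking up work, you can follow its log (see the log section below):

[rvalverd@cephrestic-backup-N]$ tail -f /var/log/cback/cback-backup.log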

Using the log system

The log of any agent can be found at /var/log/cback/cback-<type_agent>.log. For convenience, you can grep for the job_id, for example:

cat /var/log/cback/cback-backup.log | grep 3452

Operating with the backup repository using upstream Restic

As the system uses the upstream version of restic, the backup repository can be managed directly. Restic is installed on all backup agents.

  • First, you need to source the configuration:
source /etc/cback/restic_env

NOTE: If that file is not available, you can export the contents of /etc/sysconfig/restic_env

Then you can refer to the restic documentation for how to use the tool.

  • Here is an example of how to list the available snapshots of one backup:
restic -r s3:s3.cern.ch/cephback-rvalverd snapshots

For convenience, or for long debugging sessions, you can also set the repository location as an environment variable:

export RESTIC_REPOSITORY=s3:s3.cern.ch/cephback-rvalverd

This way you don't need to specify the -r flag every time.

  • Here is another example of how to mount the repository as a (read-only) filesystem:
restic -r s3:s3.cern.ch/cephback-rvalverd mount /mnt
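The mount command blocks while the filesystem is mounted; in a second terminal you can then browse the snapshots, copy files out, and unmount when done. A sketch (paths are examples; the latest symlink under snapshots/ points to the most recent snapshot):

ls /mnt/snapshots/
cp -a /mnt/snapshots/latest/some/dir /tmp/restore/
fusermount -u /mnt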

Data backup with Restic (manual)

This section describes how to back up your block storage or CephFS with restic. Here we describe backing up to S3, but the tool supports several other backends as well.

Restic/S3 Setup

export RESTIC_REPOSITORY=s3:s3.cern.ch/<my_backup_repo>
export RESTIC_PASSWORD_FILE=<secret_path_of_a_file_with_the_repo_pass_inside>

export AWS_ACCESS_KEY_ID=<s3_access_key>
export AWS_SECRET_ACCESS_KEY=<s3_secret_access_key>

Restic Download / Install

Restic Install

Initialize Backup Repository

restic init

Backup

restic backup <my_share>

NOTE: By default, restic places its cache files in $HOME/.cache; if you want to specify another path for the cache, you can use the --cache-dir <dir> flag.
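For example, a backup of a hypothetical share using a dedicated cache directory could be run as:

restic backup /cephfs/myshare --cache-dir /var/cache/restic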

Restore

There are two options: using the restic restore command directly, or mounting the backup repository and copying the files from it.

Directly

  • List backup snapshots
restic snapshots
  • Restore the selected snapshot
restic restore <snapshot_id> --target <target_path>

NOTE: you can use restic find to look for specific files inside a snapshot.

Using the mount option

  • You can browse your backup repository using fuse
restic mount /mnt/<my_repo>

NOTE: You can run restic snapshots to see the correlation between the snapshot id and the folder.

Delete a snapshot

  • List snapshots
restic snapshots
  • Forget a snapshot
restic forget <snapshot_id>

Interesting flags for restic forget

  -l, --keep-last n         keep the last n snapshots
  -H, --keep-hourly n       keep the last n hourly snapshots
  -d, --keep-daily n        keep the last n daily snapshots
  -w, --keep-weekly n       keep the last n weekly snapshots
  -m, --keep-monthly n      keep the last n monthly snapshots
  -y, --keep-yearly n       keep the last n yearly snapshots
      --keep-tag taglist    keep snapshots with this taglist (can be specified multiple times) (default [])
  • Clean the repository (this will delete the data of all forgotten snapshots)
restic prune
  • All-in-one
restic forget <snapshot_id> --prune
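Instead of forgetting individual snapshots, you can also apply a retention policy using the flags above; for example, a sketch that keeps 7 daily, 5 weekly and 12 monthly snapshots and prunes the rest:

restic forget --keep-daily 7 --keep-weekly 5 --keep-monthly 12 --prune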

Check the repository for inconsistencies

restic check

Crontab job setup

mm hh dom m dow restic backup <my_share> 
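As a concrete sketch (the schedule, environment file, share and log path are only examples), a nightly backup at 02:00 that loads the repository environment variables from a file before running restic could look like:

0 2 * * * . /root/restic_env && restic backup /cephfs/myshare >> /var/log/restic-backup.log 2>&1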

More info

restic --help

Official restic Documentation
