Restic CephFS Backup (Automatic)
This guide covers the basic architecture and operation of the distributed backup system based on restic.
All the puppet configuration is under the following hostgroup structure:
ceph/restic/
ceph/restic/agent
ceph/restic/agent/backup
The code of the different scripts resides in the following git repository:
Architecture
These are the components of the current system and their roles:
cephrestic-backup-NN (cephrestic-backup.cern.ch)
Stateless nodes and the actual workers of the system. Each of these nodes runs a restic agent, which is always running and checks for new backup jobs every 5 seconds. When a job is found, the agent handles the backup, copying files from cephfs to s3.
cback-switch
This daemon runs every hour, at a random minute, on every agent. It changes the status of Completed backups after 24 hours so they become Pending again (see the Operating section). It does the same for the prune mechanism, setting to Pending all jobs that have not been pruned in the last week.
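As a rough illustration of the switching logic, here is a conceptual sketch only (the real daemon is internal to cback; the id extraction and the omitted 24-hour age check are assumptions, not the actual implementation):
#!/bin/bash
# Conceptual sketch of cback-switch, NOT the real implementation.
# Assumes the first column of 'cback backup ls completed' is the job id.
for id in $(cback backup ls completed | awk '{print $1}'); do
    # The real daemon only resets jobs whose last backup is older than
    # 24 hours; that age check is omitted in this sketch.
    cback backup reset "$id"
done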
S3 Storage
This is where we store the backups. Each user has their own bucket, named like cboxback-<user_name> (cboxbackproj-<svc_account> for projects). Every restic agent has the utility s3cmd installed and configured, so we can list the existing buckets:
s3cmd ls
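For example, to inspect the contents of a single user's bucket (the bucket name follows the convention above):
s3cmd ls s3://cboxback-rvalverd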
CAUTION: Needless to say, deleting an S3 bucket deletes all backup data and snapshot information. The backup job itself won't fail; instead, a fresh backup will be triggered on the next run. So take care when operating on buckets directly, and if needed disable the related backup job first with cback backup disable <id>.
Configuration
The basic configuration of the backup system is done through config files managed by puppet via hiera. These config files reside in /etc/cback/cback-<type-of-agent>-config.json.
The available configuration parameters are explained in each data hostgroup (hiera) file.
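For example, to inspect the effective configuration of a backup agent on a node (the file name is derived from the pattern above):
cat /etc/cback/cback-backup-config.json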
Command Line Interface (cback)
For operating the system there is a command line tool called cback. This tool is available on any of the backup, prune or restore agents. It is still in development, so always check cback --help to see the current set of commands.
Operating Backup
- Check backup status
cback backup status
These are the possible backup statuses:
- Enabled: Only enabled jobs will be taken into account by the backup agents.
- Pending: The backup is ready to be backed up. Any available agent will pick up this job whenever it is free, unless the job is disabled.
- Running: The job is running at this moment. Check cback backup status to see which agent is taking care of the job.
- Failed: There was a problem with that backup. Check cback backup status <user_name | job_id> to find out what went wrong.
- Completed: The last backup was successful. This is not a permanent state: after the default 24 hours, the status is changed back to Pending.
Only jobs that are Enabled + Pending, and whose prune status is not Running, will be processed by the backup agents. The command cback backup reset <id> will set the status to Pending.
- Check the status of a particular user or backup id:
cback backup status rvalverd
- List all backups
cback backup ls
- List all backups by status:
cback backup ls [failed|completed|running|pending|disabled]
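For example, to list only the failed jobs:
cback backup ls failed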
- Enable / disable a backup job
cback backup enable|disable <backup_id>
NOTE: This command does not stop a running backup. If the backup is currently running, it will run to completion, but it won't be picked up for subsequent backups.
- Reset a backup job (changes the status to Pending)
cback backup reset <backup_id>
- Add a new backup job
cback backup add <user_name> <instance> <path> [--bucket-prefix=<prefix>] [--bucket-name=<name>] [--enable]
Example:
cback backup add rvalverd cvmfs /cephfs-flax/volumes/_nogroup/234234 --enable
This will add a new backup job that stores the specified path in a bucket called cephback-rvalverd. The bucket will be created automatically on the first run of the backup.
NOTE 1: By default, the <user_name> argument is used to generate the name of the bucket by concatenating it with the bucket prefix (cephback- by default). If user_name is rvalverd, the bucket will be named cephback-rvalverd.
NOTE 2: It's possible to add more than one backup per user, as long as the path is different.
NOTE 3: If the instance does not exist, it will be created automatically. This field is only used for categorizing jobs; it does not need to match an existing ceph instance and is not used in the actual backup logic.
NOTE 4: All backup jobs are added as Pending + Disabled by default, unless the --enable flag is set, which adds the backup as Pending + Enabled. The --enable flag will also set prune to Enabled + Pending.
NOTE 5: If --bucket-prefix is not specified, the default is used: cephback-. This is configurable through Puppet.
NOTE 6: If --bucket-name is specified, its value is used as the bucket name, overriding the prefix logic above.
NOTE 7: The S3 repository will be created automatically by the backup agent on the first run of the backup.
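For example (hypothetical service account and path, for illustration only), a project backup with an explicit bucket name following the cboxbackproj- convention mentioned above could be added as:
# hypothetical names: svc_myproj and the volume path are placeholders
cback backup add svc_myproj cvmfs /cephfs-flax/volumes/_nogroup/abcd --bucket-name=cboxbackproj-svc_myproj --enable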
- Delete a backup job. An interactive prompt will be presented to delete the backup metadata and, if needed, the S3 bucket contents as well. Use it with care, as no recovery is possible. It is not possible to delete backups in Running status.
cback backup delete <backup_id>
Restoring a backup
Currently, you should refer to the restic documentation in order to recover data.
To operate on the repository using restic you need to:
- Source the environment configuration:
source /etc/cback/restic_env
- Get the URL of the backup to operate on:
cback backup status <user_name | backup_id>
- Run normal restic commands:
restic -r s3:s3.cern.ch/cephback-rvalverd snapshots|restore|find ...
Refer to restic help for all available options.
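For instance, a minimal restore of the most recent snapshot into a scratch directory (the bucket and target path are just examples):
restic -r s3:s3.cern.ch/cephback-rvalverd restore latest --target /tmp/restore-rvalverd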
Scaling the System
Vertically:
- You can run as many agent processes as you wish on any node by starting a new systemd instance:
systemctl start cback-<type_of_agent>@<new_agent_id>
For example, if we have only one agent on cephrestic-backup-01, we can do the following to get two:
[rvalverd@cephrestic-backup-01]$ systemctl start cback-backup@2
The number of agents running on each machine is not (currently) managed by puppet, so manual changes persist. If an agent crashes, it won't be restarted by puppet. This will be addressed in future versions of the system.
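To see which backup agent instances are currently active on a node:
systemctl list-units 'cback-backup@*'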
Horizontally:
You need to spawn a new machine in the required hostgroup:
- backup agent:
ceph/restic/agent/backup
- prune agent:
ceph/restic/agent/prune
- restore agent:
ceph/restic/agent/restore
For example, to add a new backup agent N (assuming we currently have N-1):
[rvalverd@aiadm09 ~]$ eval `ai-rc "IT Ceph Storage Service"`
ai-bs --landb-mainuser ceph-admins --landb-responsible ceph-admins --nova-flavor m2.large --cc7 -g ceph/restic/agent/backup --foreman-environment qa cephrestic-backup-N.cern.ch
Add the node to the load-balanced alias:
openstack server set --property landb-alias=cephrestic-backup--load-N- cephrestic-backup-N
Once the installation has finished and puppet has run, you need to log in to the machine and start the daemon (this will be done automatically in a future version of the system):
[rvalverd@cephrestic-backup-N]$ systemctl start cback-backup@1
After that, the agent should start pulling jobs.
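You can verify that the new node is picking up work, for example by listing the running jobs and checking which agent is handling them:
cback backup ls running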
Using the log system
The log of any agent can be found at /var/log/cback/cback-<type_agent>.log. You can grep for the job_id for convenience, for example:
cat /var/log/cback/cback-backup.log | grep 3452
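To follow a job live instead:
tail -f /var/log/cback/cback-backup.log | grep --line-buffered 3452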
Operating with the backup repository using upstream Restic
As the system uses the upstream version of restic, the backup repository can be managed directly with it. Restic is installed on all backup agents.
- First, you need to source the configuration:
source /etc/cback/restic_env
NOTE: If that file is not available, you can export the contents of /etc/sysconfig/restic_env
Then, refer to the restic documentation on how to use the tool.
- Here is an example of how to list the available snapshots of one backup:
restic -r s3:s3.cern.ch/cephback-rvalverd snapshots
For convenience, or for long debugging sessions, you can also set the repository location as an environment variable:
export RESTIC_REPOSITORY=s3:s3.cern.ch/cephback-rvalverd
This way you don't need to specify the -r flag every time.
- Here is another example, showing how to mount the repository as a read-only filesystem:
restic -r s3:s3.cern.ch/cephback-rvalverd mount /mnt
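The mount is served by the running restic process; when you are done, stop it with Ctrl-C or unmount it from another shell:
fusermount -u /mnt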
Data backup with Restic (manual)
This document describes how to back up your block storage or CephFS with restic. Here we describe backing up to S3, but the tool supports several other backends as well.
Restic/S3 Setup
export RESTIC_REPOSITORY=s3:s3.cern.ch/<my_backup_repo>
export RESTIC_PASSWORD_FILE=<secret_path_of_a_file_with_the_repo_pass_inside>
export AWS_ACCESS_KEY_ID=<s3_access_key>
export AWS_SECRET_ACCESS_KEY=<s3_secret_access_key>
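For example, one way to create the password file referenced above (the path and generation method are just one option):
openssl rand -base64 32 > ~/.restic_repo_pass
chmod 600 ~/.restic_repo_pass
export RESTIC_PASSWORD_FILE=~/.restic_repo_pass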
Restic Download / Install
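Restic ships as a single static binary: you can download a release from https://github.com/restic/restic/releases, or install it from your distribution's repositories (it is packaged in EPEL, for example):
yum install epel-release
yum install restic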
Initialize Backup Repository
restic init
Backup
restic backup <my_share>
NOTE: By default, restic places its cache files in $HOME/.cache; if you want to use another path for the cache, use the --cache-dir <dir> flag.
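For example:
restic backup <my_share> --cache-dir /var/cache/restic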
Restore
There are two options: directly using the restic restore command, or mounting the backup repository and copying the files from it.
Directly
- List backup snapshots
restic snapshots
- Restore the selected snapshot
restic restore <snapshot_id> --target <target_path>
NOTE: You can use restic find to look for specific files inside a snapshot.
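For example, to locate a file across all snapshots before restoring it:
restic find '*.bashrc'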
Using the mount option
- You can browse your backup repository using fuse
restic mount /mnt/<my_repo>
NOTE: You can run restic snapshots to see the correlation between the snapshot id and the folder.
Delete a snapshot
- List snapshots
restic snapshots
- Forget snapshot
restic forget <snapshot_id>
Interesting flags for restic forget
-l, --keep-last n keep the last n snapshots
-H, --keep-hourly n keep the last n hourly snapshots
-d, --keep-daily n keep the last n daily snapshots
-w, --keep-weekly n keep the last n weekly snapshots
-m, --keep-monthly n keep the last n monthly snapshots
-y, --keep-yearly n keep the last n yearly snapshots
--keep-tag taglist keep snapshots with this taglist (can be specified multiple times) (default [])
- Clean the repo (this will delete all forgotten snapshots)
restic prune
- All-in-one
restic forget <snapshot_id> --prune
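Instead of forgetting individual snapshots, you will usually apply a retention policy, for example keeping 7 daily, 5 weekly and 12 monthly snapshots and pruning the rest:
restic forget --keep-daily 7 --keep-weekly 5 --keep-monthly 12 --prune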
Check the repository for inconsistencies
restic check
Crontab job setup
mm hh dom m dow restic backup <my_share>
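For example, a hypothetical daily run at 02:00 (assuming a file such as ~/.restic_env that exports the variables shown above):
# hypothetical: .restic_env exports the RESTIC_* and AWS_* variables
0 2 * * * . $HOME/.restic_env && restic backup <my_share> >> /var/log/restic-backup.log 2>&1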
More info
restic --help