Kubernetes backup
SUSE Observability Self-hosted
Overview
The Kubernetes setup for SUSE Observability has a built-in backup and restore mechanism that can be configured to store backups in the local cluster, in AWS S3 or in Azure Blob Storage.
Backup scope
The following data can be automatically backed up:
Configuration and topology data stored in StackGraph is backed up when the Helm value backup.stackGraph.enabled is set to true.
Metrics stored in SUSE Observability's Victoria Metrics instance(s) are backed up when the Helm values victoria-metrics-0.backup.enabled and victoria-metrics-1.backup.enabled are set to true.
Telemetry data stored in SUSE Observability's Elasticsearch instance is backed up when the Helm value backup.elasticsearch.enabled is set to true.
OpenTelemetry data stored in SUSE Observability's ClickHouse instance is backed up when the Helm value clickhouse.backup.enabled is set to true.
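The Helm values listed above could be combined in a values.yaml fragment along the following lines. This is a minimal sketch, assuming that the dotted Helm value paths map directly onto nested YAML keys. It only covers the enabled flags named in this section, plus the global backup.enabled switch referenced in the sections below, and leaves out all storage and credential settings.

```yaml
# Sketch only: switches on backups per data store using the Helm values named above.
backup:
  enabled: true          # global switch (backup.enabled)
  stackGraph:
    enabled: true        # configuration and topology data
  elasticsearch:
    enabled: true        # telemetry data
victoria-metrics-0:
  backup:
    enabled: true        # metrics, first instance
victoria-metrics-1:
  backup:
    enabled: true        # metrics, second instance
clickhouse:
  backup:
    enabled: true        # OpenTelemetry data
```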
The following data will not be backed up:
In-transit topology and telemetry updates stored in Kafka - these only have temporary value and would be of no use when a backup is restored.
Master node negotiation state stored in ZooKeeper - this runtime state would be incorrect when restored and is determined automatically at runtime.
Kubernetes configuration state and raw persistent volume state - this state can be rebuilt by re-installing SUSE Observability and restoring the backups.
Kubernetes logs - these are ephemeral.
Storage options
Backups are sent to an instance of MinIO (min.io), which is automatically started by the stackstate Helm chart when automatic backups are enabled. MinIO is an object storage system with the same API as AWS S3. It can store its data locally or act as a gateway to AWS S3 (min.io), Azure Blob Storage (min.io) and other systems.
The built-in MinIO instance can be configured to store the backups in one of three locations: AWS S3, Azure Blob Storage or Kubernetes storage.
Enable backups
AWS S3
Encryption
Amazon S3-managed keys (SSE-S3) should be used when encrypting S3 buckets that store the backups.
⚠️ Encryption with AWS KMS keys stored in AWS Key Management Service (SSE-KMS) isn't supported. This will result in errors such as this one in the Elasticsearch logs:
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: sdk_client_exception: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: ZX4D/ZDUzZWRhNDUyZTI1MTc= in base 64) didn't match hash (etag: c75faa31280154027542f6530c9e543e in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@5481a656, bucketName: stackstate-elasticsearch-backup, key: tests-UG34QIV9s32tTzQWdPsZL/master.dat)",
To enable scheduled backups to AWS S3 buckets, add the following YAML fragment to the Helm values.yaml file used to install SUSE Observability:
Replace the following values:
YOUR_ACCESS_KEY and YOUR_SECRET_KEY are the credentials that will be used to secure the MinIO system. These credentials are set on the MinIO system and used by the automatic backup jobs and the restore jobs. They're also required if you want to manually access the MinIO system.
YOUR_ACCESS_KEY should contain 5 to 20 alphanumeric characters.
YOUR_SECRET_KEY should contain 8 to 40 alphanumeric characters.
AWS_ACCESS_KEY and AWS_SECRET_KEY are the AWS credentials for the IAM user that has access to the S3 buckets where the backups will be stored. See below for the permission policy that needs to be attached to that user.
AWS_STACKGRAPH_BUCKET, AWS_ELASTICSEARCH_BUCKET, AWS_VICTORIA_METRICS_BUCKET and AWS_CLICKHOUSE_BUCKET are the names of the S3 buckets where the backups should be stored. Note: The names of AWS S3 buckets are global across the whole of AWS, therefore the S3 buckets with the default names (sts-elasticsearch-backup, sts-stackgraph-backup, sts-victoria-metrics-backup and sts-clickhouse-backup) will probably not be available.
The IAM user identified by AWS_ACCESS_KEY and AWS_SECRET_KEY must be configured with the following permission policy to access the S3 buckets:
Azure Blob Storage
To enable backups to an Azure Blob Storage account, add the following YAML fragment to the Helm values.yaml file used to install SUSE Observability:
Replace the following values:
AZURE_STORAGE_ACCOUNT_NAME - the Azure storage account name (learn.microsoft.com)
AZURE_STORAGE_ACCOUNT_KEY - the Azure storage account key (learn.microsoft.com) where the backups should be stored.
The StackGraph, Elasticsearch, Victoria Metrics and ClickHouse backups are stored in blob containers called sts-stackgraph-backup, sts-elasticsearch-backup, sts-victoria-metrics-backup and sts-clickhouse-backup respectively. These names can be changed by setting the Helm values backup.stackGraph.bucketName, backup.elasticsearch.bucketName, victoria-metrics-0.backup.bucketName, victoria-metrics-1.backup.bucketName and clickhouse.backup.bucketName respectively.
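As an illustration, the bucket name overrides could be written as the following values.yaml fragment. This is a sketch that assumes the dotted Helm value paths map onto nested YAML keys. The names starting with my- are hypothetical placeholders, and both Victoria Metrics instances are pointed at the same bucket here purely as an example.

```yaml
# Sketch only: the my-* names are hypothetical placeholders, replace them with your own.
backup:
  stackGraph:
    bucketName: my-stackgraph-backup
  elasticsearch:
    bucketName: my-elasticsearch-backup
victoria-metrics-0:
  backup:
    bucketName: my-victoria-metrics-backup
victoria-metrics-1:
  backup:
    bucketName: my-victoria-metrics-backup
clickhouse:
  backup:
    bucketName: my-clickhouse-backup
```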
Kubernetes storage
If MinIO is configured to store its data in Kubernetes storage, a PersistentVolumeClaim (PVC) is used to request storage from the Kubernetes cluster. The kind of storage allocated depends on the configuration of the cluster.
It's advised to use AWS S3 for clusters running on AWS and Azure Blob Storage for clusters running on Azure, for the following reasons:
Kubernetes clusters running in a cloud provider usually map PVCs to block storage, such as Amazon Elastic Block Store (EBS) on AWS or Azure Disk Storage on Azure. Block storage is expensive, especially for large data volumes.
Persistent Volumes are destroyed when the cluster that created them is destroyed. That means an (accidental) deletion of your cluster will also destroy all backups stored in Persistent Volumes.
Persistent Volumes can't be accessed from another cluster. That means that it isn't possible to restore SUSE Observability from a backup taken on another cluster.
To enable backups to cluster-local storage, enable MinIO by adding the following YAML fragment to the Helm values.yaml file used to install SUSE Observability:
Replace the following values:
YOUR_ACCESS_KEY and YOUR_SECRET_KEY - the credentials that will be used to secure the MinIO system. The automatic backup jobs and the restore jobs will use them. They're also required to manually access the MinIO storage. YOUR_ACCESS_KEY should contain 5 to 20 alphanumeric characters and YOUR_SECRET_KEY should contain 8 to 40 alphanumeric characters.
Configuration and topology data (StackGraph)
Configuration and topology data (StackGraph) backups are full backups, stored in a single file with the extension .graph. Each file contains a full backup and can be moved, copied or deleted as required.
Disable scheduled backups
When backup.enabled is set to true, scheduled StackGraph backups are enabled by default. To disable scheduled StackGraph backups only, set the Helm value backup.stackGraph.scheduled.enabled to false.
Disable restores
When backup.enabled is set to true, StackGraph restores are enabled by default. To disable StackGraph restore functionality only, set the Helm value backup.stackGraph.restore.enabled to false.
Backup schedule
By default, the StackGraph backups are created daily at 03:00 AM server time.
The backup schedule can be configured using the Helm value backup.stackGraph.scheduled.schedule, specified in Kubernetes cron schedule syntax (kubernetes.io).
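For illustration, a schedule that runs daily at 03:00 could be expressed as shown below. This is a sketch: the expression is written to match the documented default time and is not necessarily the literal default string shipped in the chart.

```yaml
# Sketch only: Kubernetes cron fields are minute, hour, day of month, month, day of week.
backup:
  stackGraph:
    scheduled:
      schedule: "0 3 * * *"   # every day at 03:00
```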
Backup retention
By default, the StackGraph backups are kept for 30 days. As StackGraph backups are full backups, this can require a lot of storage.
The backup retention delta can be configured using the Helm value backup.stackGraph.scheduled.backupRetentionTimeDelta, specified in Python timedelta format (python.org).
Metrics (Victoria Metrics)
Victoria Metrics uses incremental backups without bucket versioning, which means that each new backup completely replaces the previous one.
The exception is when one of the following happens:
An empty volume is mounted to the /storage directory of a Victoria Metrics instance.
The /storage directory, or files inside it, are deleted from a Victoria Metrics instance.
In these cases, the next (empty) backup is labeled with a new version, and the previous backup, taken before the volume was emptied, is preserved. From that moment on, both backups are listed as available for restore.
Victoria Metrics uses instant snapshots to store data in incremental backups. Multiple Victoria Metrics instances can store backups in the same bucket; each instance's backup is stored in a separate directory. All files located in the directory should be treated as a single whole and can only be moved, copied or deleted as a whole.
High availability deployments should be deployed with two instances of Victoria Metrics. Backups are enabled and configured independently for each of them.
The following code snippets and commands are provided for the first Victoria Metrics instance, victoria-metrics-0. To back up or configure the second instance, use victoria-metrics-1 instead.
Enable scheduled backups
Backups of Victoria Metrics are disabled by default. To enable scheduled Victoria Metrics backups, set the Helm value victoria-metrics-0.backup.enabled to true.
Victoria Metrics backups require backups to be enabled (see the section Enable backups).
Enable restores
Restore functionality for Victoria Metrics is disabled by default. To enable it, set the Helm value victoria-metrics-0.restore.enabled to true.
Victoria Metrics restore functionality requires backups to be enabled (see the section Enable backups).
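In a high availability deployment, the backup and restore switches described above would be set for both instances. A minimal sketch, assuming the dotted Helm value paths map onto nested YAML keys:

```yaml
# Sketch only: enables backup and restore for both Victoria Metrics instances.
victoria-metrics-0:
  backup:
    enabled: true
  restore:
    enabled: true
victoria-metrics-1:
  backup:
    enabled: true
  restore:
    enabled: true
```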
Backup schedule
By default, the Victoria Metrics backups are created every hour:
victoria-metrics-0 - 25 minutes past the hour
victoria-metrics-1 - 35 minutes past the hour
The backup schedule can be configured using the Helm value victoria-metrics-0.backup.scheduled.schedule, specified in cronexpr format.
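The staggered hourly schedules described above could be written as follows. This is a sketch that assumes the five-field form of cronexpr (minute, hour, day of month, month, day of week) and that victoria-metrics-1 uses the same value path as victoria-metrics-0. The expressions match the documented default times, not necessarily the literal defaults in the chart.

```yaml
# Sketch only: offsets the two instances so their backups do not start at the same time.
victoria-metrics-0:
  backup:
    scheduled:
      schedule: "25 * * * *"   # 25 minutes past every hour
victoria-metrics-1:
  backup:
    scheduled:
      schedule: "35 * * * *"   # 35 minutes past every hour
```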
OpenTelemetry (ClickHouse)
ClickHouse uses both incremental and full backups. By default, full backups are executed daily at 00:45, and incremental backups are performed every hour. Each backup creates a new directory; old backups (directories) are deleted automatically. All files located in a backup directory should be treated as a single whole and can only be moved, copied or deleted as a whole. We recommend using the clickhouse-backup tool to manage backups. The tool is available on the stackstate-clickhouse-shard0-0 Pod.
Enable scheduled backups
Backups of ClickHouse are disabled by default. To enable scheduled ClickHouse backups, set the Helm value clickhouse.backup.enabled to true.
ClickHouse backups require backups to be enabled (see the section Enable backups).
Enable restores
Restore functionality for ClickHouse is disabled by default. To enable it, set the Helm value clickhouse.restore.enabled to true.
ClickHouse restore functionality requires backups to be enabled (see the section Enable backups).
Backup schedule
By default, the ClickHouse backups are created:
Full Backup - at 00:45 every day
Incremental Backup - 45 minutes past the hour (from 3 am to 12 am)
ClickHouse backups must not run in parallel. If a second backup starts before the first one completes, it disrupts the first backup. Schedule backups so that they never overlap; for instance, by default the first incremental backup runs three hours after the full one.
The backup schedule can be configured using the Helm values clickhouse.backup.scheduled.full_schedule and clickhouse.backup.scheduled.incremental_schedule, specified in cronexpr format.
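A non-overlapping pair of schedules matching the defaults described above could look like this. It is a sketch under two assumptions: cronexpr accepts the five-field form used here, and "from 3 am to 12 am" means hourly runs from 03:45 through 23:45.

```yaml
# Sketch only: the full backup runs first, incremental backups start three hours later.
clickhouse:
  backup:
    scheduled:
      full_schedule: "45 0 * * *"            # daily at 00:45
      incremental_schedule: "45 3-23 * * *"  # hourly at :45, from 03:45 to 23:45
```

If a full backup regularly takes longer than three hours, the hour range in the incremental schedule would need to start later to keep the two from overlapping.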
Backup retention
By default, the tooling keeps the last 308 backups (full and incremental), which corresponds to approximately 14 days of backups.
The backup retention can be configured using the Helm value clickhouse.backup.config.keep_remote.
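Because retention is expressed as a number of backups rather than a duration, the value depends on the schedule. As a worked example, assume the default schedule produces 22 backups per day (1 full plus 21 incremental, if incremental backups run hourly from 03:45 through 23:45). That assumption is consistent with the default: 22 × 14 = 308. Keeping roughly 7 days of backups would then be:

```yaml
# Sketch only: 22 backups per day (assumed) x 7 days = 154 backups.
clickhouse:
  backup:
    config:
      keep_remote: 154
```

If the backup schedules are changed, recompute this value from the new number of backups per day.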
Telemetry data (Elasticsearch)
The telemetry data (Elasticsearch) snapshots are incremental and stored in files with the extension .dat. The files in the Elasticsearch backup storage location should be treated as a single whole and can only be moved, copied or deleted as a whole.
The configuration snippets provided in the section Enable backups will enable daily Elasticsearch snapshots.
Disable scheduled snapshots
When backup.enabled is set to true, scheduled Elasticsearch snapshots are enabled by default. To disable scheduled Elasticsearch snapshots only, set the Helm value backup.elasticsearch.scheduled.enabled to false.
Disable restores
When backup.enabled is set to true, Elasticsearch restores are enabled by default. To disable Elasticsearch restore functionality only, set the Helm value backup.elasticsearch.restore.enabled to false.
Snapshot schedule
By default, Elasticsearch snapshots are created daily at 03:00 AM server time.
The snapshot schedule can be configured using the Helm value backup.elasticsearch.scheduled.schedule, specified in Elasticsearch cron schedule syntax (elastic.co).
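Elasticsearch cron expressions differ from Kubernetes cron expressions: they start with a seconds field and use ? for an unspecified day field. A daily 03:00 schedule could be sketched as follows. The expression matches the documented default time and is not necessarily the literal default string in the chart.

```yaml
# Sketch only: fields are seconds, minutes, hours, day of month, month, day of week.
backup:
  elasticsearch:
    scheduled:
      schedule: "0 0 3 * * ?"   # every day at 03:00:00
```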
Snapshot retention
By default, Elasticsearch snapshots are kept for 30 days, with a minimum of 5 snapshots and a maximum of 30 snapshots.
The retention time and number of snapshots kept can be configured using the following Helm values:
backup.elasticsearch.scheduled.snapshotRetentionExpireAfter, specified in Elasticsearch time units (elastic.co).
backup.elasticsearch.scheduled.snapshotRetentionMinCount
backup.elasticsearch.scheduled.snapshotRetentionMaxCount
By default, the retention task itself runs daily at 1:30 AM UTC (elastic.co). If you set snapshots to expire in less than a day, for example for testing purposes, you will need to change the schedule for the retention task.
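The retention settings could be spelled out in values.yaml as shown below. This sketch uses the documented defaults (30 days, minimum 5, maximum 30) as example values and assumes that the counts are given as plain integers.

```yaml
# Sketch only: writes the documented default retention out explicitly.
backup:
  elasticsearch:
    scheduled:
      snapshotRetentionExpireAfter: 30d   # Elasticsearch time unit: 30 days
      snapshotRetentionMinCount: 5        # always keep at least 5 snapshots
      snapshotRetentionMaxCount: 30       # never keep more than 30 snapshots
```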
Snapshot indices
By default, a snapshot is created for Elasticsearch indices with names that start with sts.
The indices for which a snapshot is created can be configured using the Helm value backup.elasticsearch.scheduled.indices, specified in JSON array format (w3schools.com).
Restore backups and snapshots
Scripts to list and restore backups and snapshots can be found in the restore directory of the SUSE Observability Helm chart repository (github.com). To use the scripts, download them from GitHub or check out the repository.
Before you use the scripts, ensure that:
The kubectl binary is installed and configured to connect to:
The Kubernetes cluster where SUSE Observability has been installed.
The namespace within that cluster where SUSE Observability has been installed.
The following Helm values have been correctly set:
backup.enabled is set to true.
backup.stackGraph.restore.enabled isn't set to false (to access StackGraph backups).
backup.elasticsearch.restore.enabled isn't set to false (to access Elasticsearch snapshots).
victoria-metrics-0.restore.enabled or victoria-metrics-1.restore.enabled is set to true (to access Victoria Metrics snapshots). Unlike the StackGraph and Elasticsearch restores, Victoria Metrics restore functionality is disabled by default.
List StackGraph backups
To list the StackGraph backups, execute the following command:
The output should look like this:
The timestamp when the backup was taken is part of the backup name.
Lines in the output that start with Error from server (BadRequest): are expected. They appear when the script is waiting for the pod to start.
Restore a StackGraph backup
To avoid the unexpected loss of existing data, a backup can only be restored on a clean environment by default. If you are completely sure that any existing data can be overwritten, you can override this safety feature by passing the -force flag. Only execute the restore command when you are sure that you want to restore the backup.
To restore a StackGraph backup on a clean environment, select a backup name and pass it as the first parameter in the following command:
To restore a StackGraph backup on an environment with existing data, select a backup name and pass it as the first parameter in the following command, followed by -force as the second parameter:
Note that existing data will be overwritten when the backup is restored.
Only do this if you are completely sure that any existing data can be overwritten.
The output should look like this:
If you run a restore command without the -force flag on a non-empty database, the output will contain an error like this:
Lines that start with WARNING: are expected. They're generated by Groovy running in JDK 11 and can be ignored.
List Victoria Metrics backups
To list the Victoria Metrics backups, execute the following command:
The output should look like this:
The output shows the Victoria Metrics instance, the specific backup version and the last time a backup was completed.
Restore a Victoria Metrics backup
Restore functionality always overwrites data. Be careful to avoid the unexpected loss of existing data.
The Victoria Metrics instance must be stopped for the duration of the restore process.
All new metrics are cached by vmagent during the restore process. Ensure that vmagent has enough memory to cache these metrics.
To restore a Victoria Metrics backup, select an instance name and a backup version and pass them as parameters in the following command:
The output should look like this:
Then follow the logs to check the job status:
After the job completes, verify that the backup has been restored successfully, then run the commands printed by the earlier command to:
delete the restore job
scale up the Victoria Metrics instance
List ClickHouse backups
The following script needs permission to execute the kubectl exec command.
To list ClickHouse backups, execute the following command:
The output should look like this:
The output lists the following for each backup:
name - a name starting with full_ is a full backup, a name starting with incremental_ is an incremental backup
size
creation date
remote - the backup has been uploaded to remote storage, such as S3
parent backup - used by incremental backups
format and compression
Restore a ClickHouse backup
Restore functionality always overwrites data (all tables in the otel database). Be careful to avoid the unexpected loss of existing data.
The following script needs permission to execute the kubectl exec command.
Restore functionality requires stopping all producers (like OpenTelemetry exporters). The script scales the StatefulSet down and then back up afterward.
To restore a ClickHouse backup, select a backup version and pass it as a parameter in the following command:
The output should look like this:
Error: error can't create table …. code: 57, message: Directory for table data store/…/ already exists
ClickHouse does not permanently delete tables when the DROP DATABASE/TABLE ... command is executed. Instead, the database is marked as deleted and is permanently removed after 8 minutes. This delay provides additional time to undo the operation. More details can be found at UNDROP TABLE. If you attempt to restore data during this period, the restore fails with the error shown above. Available solutions:
Wait 8 minutes (check the table with select * from system.dropped_tables;).
Configure database_atomic_delay_before_drop_table_sec.
List Elasticsearch snapshots
To list the Elasticsearch snapshots, execute the following command:
The output should look like this:
The timestamp when the backup was taken is part of the backup name.
Delete Elasticsearch indices
To delete existing Elasticsearch indices so that a snapshot can be restored, follow these steps.
Stop indexing - scale down all *2es deployments to 0:
Open a port-forward to the Elasticsearch master:
Get a list of all indices:
The output should look like this:
Delete an index with the following command:
Replace INDEX_NAME with the name of the index to delete, for example:
The output should be:
Restore an Elasticsearch snapshot
When a snapshot is restored, existing indices won't be overwritten.
To restore an Elasticsearch snapshot, select a snapshot name and pass it as the first parameter in the following command line. You can optionally specify a second parameter with a comma-separated list of the indices that should be restored. If not specified, all indices that match the Helm value backup.elasticsearch.scheduled.indices will be restored (default "sts*"):
The output should look like this:
The indices restored are listed in the output, as well as the number of failed and successful restore actions.
After the indices have been restored, scale up all *2es deployments: