Kubernetes backup

This page describes StackState v4.4.x.

The StackState 4.4 version range is End of Life (EOL) and no longer supported. We encourage customers still running the 4.4 version range to upgrade to a more recent release.

Go to the documentation for the latest StackState release.

Overview

The Kubernetes setup for StackState has a built-in backup and restore mechanism that can be configured to store backups in the local cluster, in AWS S3 or in Azure Blob Storage.

Backup scope

The following data can be automatically backed up:

  • Configuration and topology data stored in StackGraph is backed up when the Helm value backup.stackGraph.enabled is set to true.

  • Telemetry data stored in StackState's Elasticsearch instance is backed up when the Helm value backup.elasticsearch.enabled is set to true.

The following data will NOT be backed up:

  • In-transit topology and telemetry updates stored in Kafka - these only have temporary value and would be of no use when a backup is restored.

  • Master node negotiation state stored in ZooKeeper - this runtime state would be incorrect when restored and is determined automatically at runtime.

  • Kubernetes configuration state and raw persistent volume state - this state can be rebuilt by re-installing StackState and restoring the backups.

  • Kubernetes logs - these are ephemeral.

Storage options

StackGraph and Elasticsearch backups are sent to an instance of MinIO (min.io), which is automatically started by the stackstate Helm chart when automatic backups are enabled. MinIO is an object storage system with the same API as AWS S3. It can store its data locally or act as a gateway to AWS S3 (min.io), Azure Blob Storage (min.io) and other systems.

The built-in MinIO instance can be configured to store the backups in three locations:

  • AWS S3

  • Azure Blob Storage

  • Kubernetes storage

Enable backups

Backup to AWS S3

To enable scheduled backups to AWS S3 buckets, add the following YAML fragment to the Helm values.yaml file used to install StackState:

backup:
  enabled: true
  stackGraph:
    bucketName: AWS_STACKGRAPH_BUCKET
  elasticsearch:
    bucketName: AWS_ELASTICSEARCH_BUCKET
minio:
  accessKey: YOUR_ACCESS_KEY
  secretKey: YOUR_SECRET_KEY
  s3gateway:
    enabled: true
    accessKey: AWS_ACCESS_KEY
    secretKey: AWS_SECRET_KEY

Replace the following values:

  • YOUR_ACCESS_KEY and YOUR_SECRET_KEY are the credentials that will be used to secure the MinIO system. They are used by the automatic backup jobs and the restore jobs, and are also required to access the MinIO storage manually. YOUR_ACCESS_KEY should contain 5 to 20 alphanumeric characters and YOUR_SECRET_KEY should contain 8 to 40 alphanumeric characters.

  • AWS_ACCESS_KEY and AWS_SECRET_KEY are the AWS credentials for the IAM user that has access to the S3 buckets where the backups will be stored. See below for the permission policy that needs to be attached to that user.

  • AWS_STACKGRAPH_BUCKET and AWS_ELASTICSEARCH_BUCKET are the names of the S3 buckets where the backups should be stored. Note: AWS S3 bucket names are globally unique across all of AWS, so buckets with the default names (sts-elasticsearch-backup and sts-stackgraph-backup) will probably not be available.
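The MinIO credential rules above can be checked before installing. The following is a minimal Python sketch, assuming that "alphanumeric" means ASCII letters and digits only; check_minio_credentials is a hypothetical helper, not part of StackState:

```python
import re

# Length rules as stated on this page; "alphanumeric" is assumed to mean ASCII letters and digits.
ACCESS_KEY_RE = re.compile(r"[A-Za-z0-9]{5,20}")
SECRET_KEY_RE = re.compile(r"[A-Za-z0-9]{8,40}")


def check_minio_credentials(access_key: str, secret_key: str) -> list[str]:
    """Return a list of problems; an empty list means both values fit the rules."""
    problems = []
    if not ACCESS_KEY_RE.fullmatch(access_key):
        problems.append("accessKey must be 5 to 20 alphanumeric characters")
    if not SECRET_KEY_RE.fullmatch(secret_key):
        problems.append("secretKey must be 8 to 40 alphanumeric characters")
    return problems


print(check_minio_credentials("admin1", "supersecret1"))  # []
```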

The IAM user identified by AWS_ACCESS_KEY and AWS_SECRET_KEY must be configured with the following permission policy to access the S3 buckets:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListMinioBackupBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::AWS_STACKGRAPH_BUCKET",
                "arn:aws:s3:::AWS_ELASTICSEARCH_BUCKET"
            ]
        },
        {
            "Sid": "AllowWriteMinioBackupBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::AWS_STACKGRAPH_BUCKET/*",
                "arn:aws:s3:::AWS_ELASTICSEARCH_BUCKET/*"
            ]
        }
    ]
}
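When bucket names differ between environments, the policy above can be generated instead of edited by hand. This Python sketch reproduces the same two statements for any bucket names; the function name and the example bucket names are hypothetical. The list actions target the bucket ARN itself, while the object actions target the objects inside it (bucket/*):

```python
import json


def minio_backup_policy(*buckets: str) -> dict:
    """Build the IAM permission policy shown above for the given bucket names."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListMinioBackupBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                # Bucket-level actions apply to the bucket ARN...
                "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
            },
            {
                "Sid": "AllowWriteMinioBackupBuckets",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                # ...while object-level actions apply to every object in the bucket.
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
        ],
    }


# Hypothetical bucket names, for illustration only.
print(json.dumps(minio_backup_policy("my-sts-stackgraph-backup", "my-sts-elasticsearch-backup"), indent=4))
```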

Backup to Azure Blob Storage

To enable backups to an Azure Blob Storage account, add the following YAML fragment to the Helm values.yaml file used to install StackState:

backup:
  enabled: true
minio:
  accessKey: AZURE_STORAGE_ACCOUNT_NAME
  secretKey: AZURE_STORAGE_ACCOUNT_KEY
  azuregateway:
    enabled: true

Replace AZURE_STORAGE_ACCOUNT_NAME with the Azure storage account name (microsoft.com) and replace AZURE_STORAGE_ACCOUNT_KEY with the Azure storage account key (microsoft.com) where the backups should be stored.

The StackGraph and Elasticsearch backups are stored in blob containers called sts-stackgraph-backup and sts-elasticsearch-backup respectively. These names can be changed by setting the Helm values backup.stackGraph.bucketName and backup.elasticsearch.bucketName respectively.
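When overriding the container names, keep in mind that Azure applies its own naming rules to blob containers: 3 to 63 characters, only lowercase letters, digits and hyphens, starting with a letter or digit, and every hyphen immediately followed by a letter or digit. The Python sketch below checks a name against that summary of the rules; verify the rules against Azure's documentation, and note that is_valid_container_name is a hypothetical helper:

```python
import re

# Summary of Azure blob container naming rules: 3-63 characters, lowercase letters,
# digits and hyphens, starting with a letter or digit, and each hyphen followed by a
# letter or digit (so no consecutive hyphens and no trailing hyphen).
CONTAINER_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9]|-(?=[a-z0-9])){2,62}")


def is_valid_container_name(name: str) -> bool:
    return CONTAINER_NAME_RE.fullmatch(name) is not None


print(is_valid_container_name("sts-stackgraph-backup"))  # True
```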

Backup to Kubernetes storage

To enable backups to cluster-local storage, enable MinIO by adding the following YAML fragment to the Helm values.yaml file used to install StackState:

backup:
  enabled: true
minio:
  accessKey: YOUR_ACCESS_KEY
  secretKey: YOUR_SECRET_KEY
  persistence:
    enabled: true

Replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with the credentials that will be used to secure the MinIO system. They are used by the automatic backup jobs and the restore jobs, and are also required to access the MinIO storage manually. YOUR_ACCESS_KEY should contain 5 to 20 alphanumeric characters and YOUR_SECRET_KEY should contain 8 to 40 alphanumeric characters.
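One way to produce credentials that satisfy these length rules is Python's secrets module, which is intended for generating secret values. A minimal sketch; generate_minio_credentials is a hypothetical helper, and its defaults use the maximum lengths allowed above:

```python
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits


def generate_minio_credentials(access_len: int = 20, secret_len: int = 40) -> tuple[str, str]:
    """Return a random (access key, secret key) pair made of ASCII letters and digits."""
    if not (5 <= access_len <= 20 and 8 <= secret_len <= 40):
        raise ValueError("lengths must be 5-20 (access key) and 8-40 (secret key)")
    access_key = "".join(secrets.choice(ALPHANUMERIC) for _ in range(access_len))
    secret_key = "".join(secrets.choice(ALPHANUMERIC) for _ in range(secret_len))
    return access_key, secret_key


access_key, secret_key = generate_minio_credentials()
```

Store the generated values somewhere safe: the same credentials are needed later by the restore jobs and for manual access to MinIO.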

Configuration and topology data (StackGraph)

Configuration and topology data (StackGraph) backups are full backups, each stored in a single file with the extension .graph. Each file is self-contained and can be moved, copied or deleted as required.

Disable scheduled backups

When backup.enabled is set to true, scheduled StackGraph backups are enabled by default. To disable scheduled StackGraph backups only, set the Helm value backup.stackGraph.scheduled.enabled to false.

Disable restores

When backup.enabled is set to true, StackGraph restores are enabled by default. To disable StackGraph restore functionality only, set the Helm value backup.stackGraph.restore.enabled to false.

Backup schedule

By default, the StackGraph backups are created daily at 03:00 AM server time.

The backup schedule can be configured using the Helm value backup.stackGraph.scheduled.schedule, specified in Kubernetes cron schedule syntax (kubernetes.io).
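A Kubernetes cron schedule consists of five space-separated fields. For example, a daily 03:00 schedule is written 0 3 * * *. The hypothetical helper below only labels each field for illustration; it does not validate value ranges or handle macro forms such as @daily:

```python
CRON_FIELDS = ("minute", "hour", "day of month", "month", "day of week")


def describe_k8s_cron(expression: str) -> dict[str, str]:
    """Map each field of a five-field Kubernetes cron expression to its meaning."""
    fields = expression.split()
    if len(fields) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return dict(zip(CRON_FIELDS, fields))


print(describe_k8s_cron("0 3 * * *"))
# {'minute': '0', 'hour': '3', 'day of month': '*', 'month': '*', 'day of week': '*'}
```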

Backup retention

By default, the StackGraph backups are kept for 30 days. As StackGraph backups are full backups, this can require a lot of storage.

The backup retention delta can be configured using the Helm value backup.stackGraph.scheduled.backupRetentionTimeDelta, specified in Python timedelta format (python.org).
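To illustrate how a timedelta retention window works, the sketch below treats every backup taken before now minus the retention period as expired. It models the idea only and is not StackState's cleanup job; check the exact string format the Helm value accepts against the chart's values.yaml. The function name and the example timestamps are illustrative:

```python
from datetime import datetime, timedelta


def expired_backups(taken_at: dict[str, datetime], now: datetime, retention: timedelta) -> list[str]:
    """Return the names of backups taken before (now - retention), oldest first."""
    cutoff = now - retention
    expired = [name for name, ts in taken_at.items() if ts < cutoff]
    return sorted(expired, key=taken_at.get)


backups = {
    "sts-backup-20210101-0300.graph": datetime(2021, 1, 1, 3, 0),
    "sts-backup-20210222-0300.graph": datetime(2021, 2, 22, 3, 0),
}
print(expired_backups(backups, now=datetime(2021, 2, 22, 12, 0), retention=timedelta(days=30)))
# ['sts-backup-20210101-0300.graph']
```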

Telemetry data (Elasticsearch)

The telemetry data (Elasticsearch) snapshots are incremental and stored in files with the extension .dat. The files in the Elasticsearch backup storage location should be treated as a single unit: they can only be moved, copied or deleted together.

The configuration snippets provided in the section Enable backups also enable daily Elasticsearch snapshots.

Disable scheduled snapshots

When backup.enabled is set to true, scheduled Elasticsearch snapshots are enabled by default. To disable scheduled Elasticsearch snapshots only, set the Helm value backup.elasticsearch.scheduled.enabled to false.

Disable restores

When backup.enabled is set to true, Elasticsearch restores are enabled by default. To disable Elasticsearch restore functionality only, set the Helm value backup.elasticsearch.restore.enabled to false.

Snapshot schedule

By default, Elasticsearch snapshots are created daily at 03:00 AM server time.

The snapshot schedule can be configured using the Helm value backup.elasticsearch.scheduled.schedule, specified in Elasticsearch cron schedule syntax (elastic.co).
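The StackGraph and Elasticsearch schedules use different cron dialects: Elasticsearch cron expressions start with a seconds field and use ? for "no specific value". This hypothetical helper renders the same daily time in both forms:

```python
def daily_schedules(hour: int, minute: int = 0) -> dict[str, str]:
    """Express one daily time in the two cron dialects used on this page."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return {
        # Kubernetes cron: minute hour day-of-month month day-of-week
        "backup.stackGraph.scheduled.schedule": f"{minute} {hour} * * *",
        # Elasticsearch cron: seconds minutes hours day-of-month month day-of-week
        "backup.elasticsearch.scheduled.schedule": f"0 {minute} {hour} * * ?",
    }


print(daily_schedules(3))
```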

Snapshot retention

By default, Elasticsearch snapshots are kept for 30 days, with a minimum of 5 snapshots and a maximum of 30 snapshots.

The retention time and number of snapshots kept can be configured using the following Helm values:

  • backup.elasticsearch.scheduled.snapshotRetentionExpireAfter, specified in Elasticsearch time units (elastic.co).

  • backup.elasticsearch.scheduled.snapshotRetentionMinCount

  • backup.elasticsearch.scheduled.snapshotRetentionMaxCount

By default, the retention task itself runs daily at 1:30 AM UTC (elastic.co). If you configure snapshots to expire in less than a day, for example for testing purposes, you will need to change the schedule of the retention task.
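The interplay of the three retention settings can be modelled as follows: walking from the newest snapshot to the oldest, a snapshot is deleted if max_count snapshots are already kept, or if it is older than expire_after and min_count snapshots are already kept. This is a simplified model of Elasticsearch snapshot lifecycle retention for reasoning about settings, not Elasticsearch's implementation; the names are hypothetical:

```python
from datetime import datetime, timedelta


def snapshots_to_delete(taken_at, now, expire_after, min_count, max_count):
    """Return the names a simplified retention run would delete, oldest first.

    taken_at maps snapshot name -> creation time (datetime).
    """
    kept, deleted = [], []
    # Walk from newest to oldest so min_count and max_count keep the most recent snapshots.
    for name in sorted(taken_at, key=taken_at.get, reverse=True):
        expired = now - taken_at[name] > expire_after
        if len(kept) >= max_count:
            deleted.append(name)  # over the maximum: deleted even if not expired
        elif expired and len(kept) >= min_count:
            deleted.append(name)  # expired, and the minimum is already satisfied
        else:
            kept.append(name)
    return deleted[::-1]


# 40 daily snapshots, evaluated one hour after the newest one, with this page's defaults.
base = datetime(2021, 1, 1, 3, 0)
snapshots = {f"snap-{i:02d}": base + timedelta(days=i) for i in range(40)}
now = base + timedelta(days=39, hours=1)
print(snapshots_to_delete(snapshots, now, timedelta(days=30), min_count=5, max_count=30))
# snap-00 to snap-09 are deleted
```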

Snapshot indices

By default, a snapshot is created for all Elasticsearch indices.

The indices for which a snapshot is created can be configured using the Helm value backup.elasticsearch.scheduled.indices, specified in JSON array format (w3schools.com).
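Because the value is a JSON array, generating it with a JSON serializer avoids quoting mistakes. A minimal Python sketch, assuming the chart takes the array as a single string (for example indices: '["sts_internal_events-*"]' in values.yaml); verify this against the chart before use. The helper name and the example pattern are hypothetical:

```python
import json


def indices_helm_value(patterns):
    """Serialize index names or patterns into a JSON array string."""
    patterns = list(patterns)
    if not patterns or not all(isinstance(p, str) and p for p in patterns):
        raise ValueError("expected a non-empty list of non-empty strings")
    return json.dumps(patterns)


print(indices_helm_value(["sts_internal_events-*"]))  # ["sts_internal_events-*"]
```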

Restore backups and snapshots

Scripts to list and restore backups and snapshots can be found in the restore directory of the StackState Helm chart repository (github.com). To use the scripts, download them from the GitHub site or check out the repository.

Before you use the scripts, ensure that:

  1. The kubectl binary has been installed.

  2. The kubectl binary is configured to connect to the Kubernetes cluster and the namespace within that cluster that runs StackState.

  3. The Helm value backup.enabled is set to true.

  4. The Helm value backup.stackGraph.restore.enabled is not set to false (to access StackGraph backups).

  5. The Helm value backup.elasticsearch.restore.enabled is not set to false (to access Elasticsearch snapshots).
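The prerequisites that concern the kubectl binary (item 1) and the Helm values (items 3 to 5) can be checked from a script. A minimal sketch, assuming values.yaml has already been loaded into nested Python dicts (for example with a YAML library, omitted to keep the block self-contained); item 2 is not checked, and the function names are hypothetical:

```python
import shutil


def kubectl_available():
    """Prerequisite 1: is a kubectl binary on the PATH?"""
    return shutil.which("kubectl") is not None


def restore_prerequisite_problems(values):
    """Prerequisites 3 to 5, checked against values.yaml loaded as nested dicts."""
    backup = values.get("backup") or {}
    problems = []
    if backup.get("enabled") is not True:
        problems.append("backup.enabled must be set to true")
    for component in ("stackGraph", "elasticsearch"):
        restore = (backup.get(component) or {}).get("restore") or {}
        # Only an explicit false disables restores; a missing value keeps the default.
        if restore.get("enabled") is False:
            problems.append(f"backup.{component}.restore.enabled is set to false")
    return problems


print(restore_prerequisite_problems({"backup": {"enabled": True}}))  # []
```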

List StackGraph backups

To list the StackGraph backups, execute the following command:

./restore/list-stackgraph-backups.sh

The output should look like this:

job.batch/stackgraph-list-backups-20210222t111942 created
Waiting for job to start...
=== Listing StackGraph backups in bucket "sts-stackgraph-backup"...
sts-backup-20210215-0300.graph
sts-backup-20210216-0300.graph
sts-backup-20210217-0300.graph
sts-backup-20210218-0300.graph
sts-backup-20210219-0300.graph
sts-backup-20210220-0300.graph
sts-backup-20210221-0300.graph
sts-backup-20210222-0300.graph
===
job.batch "stackgraph-list-backups-20210222t111942" deleted

The timestamp when the backup was taken is part of the backup name.
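Given the naming pattern in the listing above (sts-backup-YYYYMMDD-HHMM.graph), the timestamp can be recovered from a backup name. A minimal sketch assuming that pattern; backup_timestamp is a hypothetical helper, and it returns a naive datetime because the name carries no time zone:

```python
import re
from datetime import datetime

# Pattern assumed from the example listing: sts-backup-YYYYMMDD-HHMM.graph
BACKUP_NAME_RE = re.compile(r"sts-backup-(\d{8}-\d{4})\.graph")


def backup_timestamp(name: str) -> datetime:
    """Parse the timestamp embedded in a StackGraph backup name."""
    match = BACKUP_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised backup name: {name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d-%H%M")


print(backup_timestamp("sts-backup-20210216-0300.graph"))  # 2021-02-16 03:00:00
```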

Lines in the output that start with Error from server (BadRequest): are expected. They appear when the script is waiting for the pod to start.

Restore a StackGraph backup

When a backup is restored, the existing data in the StackGraph database will be overwritten.

Only execute the restore command when you are sure that you want to restore the backup.

To restore a StackGraph backup, select a backup name and pass it as the first parameter in the following command:

./restore/restore-stackgraph-backup.sh sts-backup-20210216-0300.graph

The output should look like this:

job.batch/stackgraph-restore-20210222t112142 created
Waiting for job to start...
=== Downloading StackGraph backup "sts-backup-20210216-0300.graph" from bucket "sts-stackgraph-backup"...
download: s3://sts-stackgraph-backup/sts-backup-20210216-0300.graph to ../../tmp/sts-backup-20210216-0300.graph
=== Importing StackGraph data from "sts-backup-20210216-0300.graph"...
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.vmplugin.v7.Java7$1 (file:/opt/docker/lib/org.codehaus.groovy.groovy-2.5.4.jar) to constructor java.lang.invoke.MethodHandles$Lookup(java.lang.Class,int)
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.vmplugin.v7.Java7$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
===
job.batch "stackgraph-restore-20210222t112142" deleted

Lines that start with WARNING: are expected. They are generated by Groovy running on JDK 11 and can be ignored.

List Elasticsearch snapshots

To list the Elasticsearch snapshots, execute the following command:

./restore/list-elasticsearch-snapshots.sh

The output should look like this:

job.batch/elasticsearch-list-snapshots-20210224t133115 created
Waiting for job to start...
Waiting for job to start...
=== Listing Elasticsearch snapshots in snapshot repository "sts-backup" in bucket "sts-elasticsearch-backup"...
sts-backup-20210219-0300-mref7yrvrswxa02aqq213w
sts-backup-20210220-0300-yrn6qexkrdgh3pummsrj7e
sts-backup-20210221-0300-p481sih8s5jhre9zy4yw2o
sts-backup-20210222-0300-611kxendsvh4hhkoosr4b7
sts-backup-20210223-0300-ppss8nx40ykppss8nx40yk
===
job.batch "elasticsearch-list-snapshots-20210224t133115" deleted

The timestamp when the snapshot was taken is part of the snapshot name.
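To pick the most recent snapshot programmatically, the names can be read from between the === marker lines of the listing output. A minimal sketch, assuming the output format shown above and that all names share the sts-backup- prefix; because the timestamp in each name is fixed width, lexicographic order matches chronological order. The helpers are hypothetical:

```python
def snapshot_names(listing_output: str) -> list[str]:
    """Return the names printed between the two '===' marker lines."""
    names = []
    inside = False
    for line in listing_output.splitlines():
        line = line.strip()
        if line.startswith("==="):
            inside = not inside  # the first marker opens the list, the second closes it
            continue
        if inside and line:
            names.append(line)
    return names


def latest_snapshot(listing_output: str) -> str:
    # Fixed-width YYYYMMDD-HHMM timestamps make the largest string the newest snapshot.
    return max(snapshot_names(listing_output))
```

The selected name can then be passed as the first parameter of the restore script described below.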

Restore an Elasticsearch snapshot

When a snapshot is restored, existing indices will NOT be overwritten. Use Elasticsearch's Delete index API (elastic.co) to remove them first. See Delete Elasticsearch indices, below.

To restore an Elasticsearch snapshot, select a snapshot name and pass it as the first parameter in the following command:

./restore/restore-elasticsearch-snapshot.sh sts-backup-20210223-0300-ppss8nx40ykppss8nx40yk

The output should look like this:

job.batch/elasticsearch-restore-20210229t152530 created
Waiting for job to start...
Waiting for job to start...
=== Restoring Elasticsearch snapshot "sts-backup-20210223-0300-ppss8nx40ykppss8nx40yk" from snapshot repository "sts-backup" in bucket "sts-elasticsearch-backup"...
{
  "snapshot" : {
    "snapshot" : "sts-backup-20210223-0300-ppss8nx40ykppss8nx40yk",
    "indices" : [
      ".slm-history-1-000001",
      "ilm-history-1-000001",
      "sts_internal_events-2021.02.19"
    ],
    "shards" : {
      "total" : 3,
      "failed" : 0,
      "successful" : 3
    }
  }
}
===
job.batch "elasticsearch-restore-20210229t152530" deleted

The restored indices are listed in the output, along with the total number of shards and the number of shards that were restored successfully or failed.
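The restore result can also be checked from a script. This sketch extracts the JSON object from the restore output shown above, assuming it is the only brace-delimited text in the output, and reports the restored indices and whether every shard succeeded. summarize_restore is a hypothetical helper:

```python
import json


def summarize_restore(output: str):
    """Return (restored indices, True if every shard was restored) from the restore output."""
    # Assumes the JSON object spans from the first "{" to the last "}" in the output.
    result = json.loads(output[output.index("{"): output.rindex("}") + 1])
    snapshot = result["snapshot"]
    shards = snapshot["shards"]
    ok = shards["failed"] == 0 and shards["successful"] == shards["total"]
    return snapshot["indices"], ok
```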

Delete Elasticsearch indices

To delete existing Elasticsearch indices so that a snapshot can be restored, follow these steps.

  1. Open a port-forward to the Elasticsearch master:

    kubectl port-forward service/stackstate-elasticsearch-master 9200:9200
  2. Delete an index with the following command:

    curl -X DELETE "http://localhost:9200/INDEX_NAME?pretty"

    Replace INDEX_NAME with the name of the index to delete, for example:

    curl -X DELETE "http://localhost:9200/sts_internal_events-2021.02.19?pretty"
  3. The output should be:

    {
    "acknowledged" : true
    }
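The same DELETE request can be sent with Python's standard library, for example when several indices need to be removed. A minimal sketch, assuming the port-forward from step 1 is running; the helper names are hypothetical, and the index name is URL-quoted so that it cannot alter the request path:

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumes the port-forward from step 1 is active


def build_delete_request(index_name: str, base_url: str = ES_URL) -> urllib.request.Request:
    """Build a DELETE request for a single index; plain index names pass through unchanged."""
    path = urllib.parse.quote(index_name, safe="")
    return urllib.request.Request(f"{base_url}/{path}?pretty", method="DELETE")


def delete_index(index_name: str) -> bool:
    """Send the request and return the "acknowledged" flag from the response."""
    with urllib.request.urlopen(build_delete_request(index_name)) as response:
        return json.load(response).get("acknowledged", False)


# Example (requires the port-forward): delete_index("sts_internal_events-2021.02.19")
```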
