LogoLogo
StackState.comDownloadSupportExplore playground
SUSE Observability
SUSE Observability
  • SUSE Observability docs!
  • Docs for all SUSE Observability products
  • 🚀Get started
    • Quick start guide
    • SUSE Observability walk-through
    • SUSE Rancher Prime
      • Air-gapped
      • Agent Air-gapped
    • SUSE Cloud Observability
  • 🦮Guided troubleshooting
    • What is guided troubleshooting?
    • YAML Configuration
    • Changes
    • Logs
  • 🚨Monitors and alerts
    • Monitors
    • Out of the box monitors for Kubernetes
    • Notifications
      • Configure notifications
      • Notification channels
        • Slack
        • Teams
        • Webhook
        • Opsgenie
      • Troubleshooting
    • Customize
      • Add a monitor using the CLI
      • Derived State monitor
      • Dynamic Threshold monitor
      • Override monitor arguments
      • Write a remediation guide
  • 📈Metrics
    • Explore Metrics
    • Custom charts
      • Adding custom charts to components
      • Writing PromQL queries for representative charts
      • Troubleshooting custom charts
    • Advanced Metrics
      • Grafana Datasource
      • Prometheus remote_write
      • OpenMetrics
  • 📑Logs
    • Explore Logs
    • Log Shipping
  • 🔭Traces
    • Explore Traces
  • 📖Health
    • Health synchronization
    • Send health data over HTTP
      • Send health data
      • Repeat Snapshots JSON
      • Transactional Increments JSON
    • Debug health synchronization
  • 🔍Views
    • Kubernetes views
    • Custom views
    • Component views
    • Explore views
    • View structure
      • Overview perspective
      • Highlights perspective
      • Topology perspective
      • Events perspective
      • Metrics perspective
      • Traces perspective
      • Filters
      • Keyboard shortcuts
    • Timeline and time travel
  • 🕵️Agent
    • Network configuration
      • Proxy Configuration
    • Using a custom registry
    • Custom Secret Management
      • Custom Secret Management (Deprecated)
    • Request tracing
      • Certificates for sidecar injection
  • 🔭Open Telemetry
    • Overview
    • Getting started
      • Concepts
      • Kubernetes
      • Kubernetes Operator
      • Linux
      • AWS Lambda
    • Open telemetry collector
      • Sampling
      • SUSE Observability OTLP APIs
    • Instrumentation
      • Java
      • Node.js
        • Auto-instrumentation of Lambdas
      • .NET
      • SDK Exporter configuration
    • Troubleshooting
  • CLI
    • SUSE Observability CLI
  • 🚀Self-hosted setup
    • Install SUSE Observability
      • Requirements
      • Kubernetes / OpenShift
        • Kubernetes install
        • OpenShift install
        • Alibaba Cloud ACK install
        • Required Permissions
        • Override default configuration
        • Configure storage
        • Exposing SUSE Observability outside of the cluster
      • Initial run guide
      • Troubleshooting
        • Advanced Troubleshooting
        • Support Package (Logs)
    • Configure SUSE Observability
      • Slack notifications
      • E-mail notifications
      • Stackpacks
      • Advanced
        • Analytics
    • Release Notes
      • v2.0.0 - 11/Sep/2024
      • v2.0.1 - 18/Sep/2024
      • v2.0.2 - 01/Oct/2024
      • v2.1.0 - 29/Oct/2024
      • v2.2.0 - 09/Dec/2024
      • v2.2.1 - 10/Dec/2024
      • v2.3.0 - 30/Jan/2025
      • v2.3.1 - 17/Mar/2025
      • v2.3.2 - 22/Apr/2025
      • v2.3.3 - 07/May/2025
    • Upgrade SUSE Observability
      • Migration from StackState
      • Steps to upgrade
      • Version-specific upgrade instructions
    • Uninstall SUSE Observability
    • Air-gapped
      • SUSE Observability air-gapped
      • SUSE Observability Kubernetes Agent air-gapped
    • Data management
      • Backup and Restore
        • Kubernetes backup
        • Configuration backup
      • Data retention
      • Clear stored data
    • Security
      • Authentication
        • Authentication options
        • Single password
        • File-based
        • LDAP
        • Open ID Connect (OIDC)
          • Microsoft Entra ID
        • KeyCloak
        • Service tokens
        • Troubleshooting
      • RBAC
        • Role-based Access Control
        • Permissions
        • Roles
        • Scopes
      • Self-signed certificates
      • External secrets
  • 🔐Security
    • Service Tokens
    • API Keys
  • ☁️SaaS
    • User Management
  • Reference
    • SUSE Observability Query Language (STQL)
    • Chart units
    • Topology Identifiers
Powered by GitBook
LogoLogo

Legal notices

  • Privacy
  • Cookies
  • Responsible disclosure
  • SOC 2/SOC 3
On this page
  • Overview
  • Guidelines
  • Remediation guide example
  • Inserting links
  • Deep links
  • In-page links
  1. Monitors and alerts
  2. Customize

Write a remediation guide

SUSE Observability

PreviousOverride monitor argumentsNextExplore Metrics

Last updated 8 months ago

Overview

SUSE Observability provides , which provide monitoring on common issues that can occur in a Kubernetes cluster. These monitors also contain out of the box remediation guides which are meant to guide users in accurately troubleshooting issues. They are created using the best practices and community knowledge. Follow the indications on this page to learn how to write an effective remediation guide yourself.

Guidelines

  • Provide step by step instructions to guide a user into solving the issue detected by the monitor;

  • Make sure the instructions are ordered by the most likely cause(s).

  • If possible, include links to relevant data and/or resources to speed up the investigation.

  • Keep it short and on point:

    • Avoid over explaining - add links to supporting documentation if that's the case;

    • Avoid using a table of contents and similar content blocks;

    • Avoid having a summary of the same content;

  • Try to format the guide in a structured way. Use:

    • bullet points

    • numbering

    • short sentences

    • paragraphs

    • inline formatted examples

  • If there are open ends (there might be different causes which are still unknown), provide guidance for escalating the issue. E.g. provide the user with a support link/ number, etc.

Remediation guide example

When a Kubernetes container has errors, it can enter into a state called CrashLoopBackOff, where Kubernetes attempts to restart the container to resolve the issue. The container will continue to restart until the problem is resolved.Take the following steps to diagnose the problem:

### Pod Events

Check the pod events to identify any explicit errors or warnings.
1. Go to the "Events" section in the middle of the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}})
2. Check if there is are events like "BackOff", "FailedScheduling", "FailedAttachVolume" or "OOMKilled" in the Alert Category by clicking on 'Alerts'.
3. You can see the details of the event (click on the event) to give more information about the issue.
4. If the 'Show related event' option is enabled all events of resources related to this resource like a deployment will also show up and can give you a clue if any change on them is causing this issue. You can see this by checking if there is a correlation between the time of a deployment and a change of behaviour seen by the metrics and events of this pod.
For easy correlation you can use 'shift'-'click' to add markers to the different graph, log and event widgets.
    
### Container Logs
Check the container logs for any explicit errors or warnings
Inspect the [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) of all the containers in this pod.
Search for hints in the logs by:
1.  Looking for changes in logging pattern, by looking at the number of logs per time unit (The histogram bars).
    In many cases the change in pattern will indicate what is going on.
    You can click-drag on the histogram bars to narrow the logs displayed to that time-frame.
2.  Searching for "Error" or "Fatal" in the search bar.
3.  Looking at the logs around the time that the monitor triggered
    
### Recent Changes
Look at the pod age in the "About" section on the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}}) to identify any recent deployments that might have caused the issue
1. The "Age" is shown in the "About" section on the left side of the screen
2. If the "Age" and the time that the monitor was triggered are in close proximity then take a look at the most recent deployment by clicking on [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange).

Inserting links

The syntax we use is different for "deep links" vs "in-page links". The "deep links" will redirect the user from the current page, whilst the "in-page links" will keep the user on the same page.

Deep links

To link to any perspective (e.g. "hightlights", "topology", "events", "metrics") of the current resource, use the following syntax:

[highlight page](/#/components/\{{ componentUrnForUrl \}})
[topology](/#/components/{{ componentUrnForUrl }}/topology)
[events](/#/components/{{ componentUrnForUrl }}/events)
[metrics](/#/components/{{ componentUrnForUrl }}/metrics)

In-page links

To link to any additional data (e.g. "show logs", "show last change", "show status", "show configuration") on the current resource, use the following syntax:

[logs](/#/components/\{{ componentUrnForUrl \}}#logs)
[last change](/#/components/\{{ componentUrnForUrl \}}#lastChange)
[status](/#/components/\{{ componentUrnForUrl \}}#status)
[configuration](/#/components/\{{ componentUrnForUrl \}}#configuration)
🚨
monitors out of the box