Advanced Troubleshooting

SUSE Observability Self-hosted


Last updated 4 months ago

If you are a Prime customer, reach out to SUSE Observability support at https://scc.suse.com/ for help setting up SUSE Observability in your local cluster. Use the Support Package (Logs) to collect information about your instance for the support team.

This page describes the subsystems of the SUSE Observability platform in detail, to help troubleshoot deployment and operational issues. Consult it only when the steps on the Troubleshooting page do not yield a solution.

General troubleshooting approach

The general approach to troubleshooting operational issues of the SUSE Observability platform is the following:

  • Get an overview of how the pods are behaving with kubectl get pods

  • Use the detailed subsystem information in this document, together with the symptoms of the problem, to determine which pods/subsystems might be the root cause

  • Inspect the logs/metadata of the suspected pods through:

    • kubectl logs <pod-name> --all-containers=true

    • kubectl describe pod <pod-name>

    • A quick way to get all logs and descriptions related to SUSE Observability is through the Support Package (Logs).

  • The logs might point to a misbehaving dependency. In that case, investigate the dependency.
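
The first and third steps above can be sketched as a small helper. This is a minimal sketch, not an official tool: it assumes `kubectl` is on the PATH and that SUSE Observability runs in a namespace called `suse-observability` (an assumption, adjust to your install). The function names are made up for illustration. It reads the standard `kubectl get pods -o json` output and prints the follow-up commands for pods that are not ready or have restarted.

```python
import json
import subprocess


def suspicious_pods(pod_list: dict) -> list:
    """Return names of pods that are not fully ready or have restarted."""
    suspects = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed jobs are not a problem by themselves
        containers = status.get("containerStatuses", [])
        not_ready = phase != "Running" or any(
            not c.get("ready", False) for c in containers
        )
        restarted = any(c.get("restartCount", 0) > 0 for c in containers)
        if not_ready or restarted:
            suspects.append(pod["metadata"]["name"])
    return suspects


def print_followup_commands(namespace: str = "suse-observability") -> None:
    """Print the log/describe commands for every suspicious pod.

    The default namespace is an assumption, not taken from this page.
    """
    raw = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for name in suspicious_pods(json.loads(raw)):
        print(f"kubectl logs {name} -n {namespace} --all-containers=true")
        print(f"kubectl describe pod {name} -n {namespace}")
```

Calling `print_followup_commands()` against a live cluster prints one pair of commands per suspicious pod. Which pod to inspect first still depends on the subsystem information below.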

Overview of subsystems

Databases

SUSE Observability is powered by various databases. Whenever a database is misbehaving, investigate it first, because all other services depend on it.

  • Zookeeper: Zookeeper is used for service discovery, orchestration and failover. Zookeeper is deployed using 1 or more pods with the name:

    • suse-observability-zookeeper-<n>

  • Kafka: Kafka is used for message passing between almost all services. Kafka is deployed by the following pods:

    • suse-observability-kafka-<n>: Main Kafka deployment

    • <release-name>-kafkaup-operator-kafkaup-*: Helper operator performing Kafka upgrades

  • StackGraph: StackGraph stores (user) settings and the topology. StackGraph is built out of multiple components and has two deployment modes: HA and non-HA.

    • Tephra: Manages database transaction starts, commits and conflicts. Served by the following pod:

      • <release-name>-hbase-tephra-<n>: Tephra transaction server pod. Keeps track of transactions and conflicts.

    • HBase-HA: Stores the StackGraph data, spread over multiple pods with different responsibilities:

      • <release-name>-hbase-hdfs-nn-0: Name-node for HDFS, keeps track of file index

      • <release-name>-hbase-hdfs-snn-0: Secondary name-node, does cleanup work after the name-node

      • <release-name>-hbase-hdfs-dn-<n>: HDFS Datanode, stores the actual data

      • <release-name>-hbase-hbase-master-<n>: HBase Master, coordinates tables and regions

      • <release-name>-hbase-hbase-rs-<n>: HBase Region Server, serves tables and regions, stores its data on HDFS

    • HBase-non-HA:

      • <release-name>-hbase-stackgraph-0: All StackGraph components deployed as a single pod in non-HA setup. This also includes its own zookeeper instance.

  • VictoriaMetrics: Stores metric data. Deployed by the following pods:

    • suse-observability-victoria-metrics-<n>-0: Main VictoriaMetrics data store/query node

    • suse-observability-vmagent-0: Ingestion agent for VictoriaMetrics. Data is pushed to vmagent before being forwarded and stored.

  • ClickHouse: Stores trace data. Deployed by the following pod(s):

    • suse-observability-clickhouse-shard0-<n>: Main ClickHouse store

  • Elasticsearch: Stores events and logs. Deployed by the following pods:

    • suse-observability-elasticsearch-master-<n>: Main Elasticsearch store

    • <release-name>-prometheus-elasticsearch-exporter-*: Exports performance metrics of the Elasticsearch instances
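
To check quickly whether a misbehaving pod belongs to one of these databases, the naming patterns above can be turned into a lookup. The following is a minimal sketch: the regular expressions are transcribed from the pod names listed on this page, and the function name `database_for` is made up for illustration.

```python
import re

# (database, pattern) pairs transcribed from the pod names listed above.
DATABASE_PATTERNS = [
    ("Zookeeper", r"^suse-observability-zookeeper-\d+$"),
    ("Kafka", r"^suse-observability-kafka-\d+$|-kafkaup-operator-kafkaup-"),
    ("StackGraph",
     r"-hbase-(tephra|hdfs-nn|hdfs-snn|hdfs-dn|hbase-master|hbase-rs|stackgraph)-\d+$"),
    ("VictoriaMetrics",
     r"^suse-observability-victoria-metrics-\d+-0$|^suse-observability-vmagent-0$"),
    ("ClickHouse", r"^suse-observability-clickhouse-shard0-\d+$"),
    ("Elasticsearch",
     r"^suse-observability-elasticsearch-master-\d+$|-prometheus-elasticsearch-exporter-"),
]


def database_for(pod_name: str):
    """Return the database a pod belongs to, or None if it is not a database pod."""
    for database, pattern in DATABASE_PATTERNS:
        if re.search(pattern, pod_name):
            return database
    return None
```

For example, `database_for("myrel-hbase-hdfs-dn-2")` returns `"StackGraph"`, where `myrel` stands in for a hypothetical release name.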

Ingestion services

The SUSE Observability platform receives data pushed by the SUSE Observability Agent and by OpenTelemetry (OTEL) agents. The ingestion services perform initial processing and bring the data to storage.

  • Receiver: The receiver implements the collection-side API for the SUSE Observability agent. It accepts and authorizes telemetry data (logs, events, metrics or topology) and forwards it to the corresponding datastore or Kafka. It can be deployed in single or split mode:

    • Receiver-Split:

      • <release-name>-suse-observability-receiver-logs-*: Receives logs and puts them into Elasticsearch

      • <release-name>-suse-observability-receiver-process-agent-*: Receives process and network connectivity information and forwards it to Kafka topics

      • <release-name>-suse-observability-receiver-base-*: All other SUSE Observability Agent data comes through here.

    • Receiver-NonSplit:

      • <release-name>-suse-observability-receiver-*: All SUSE Observability Agent data comes through here.

  • OpenTelemetry Collector: Provides an endpoint that OpenTelemetry agents can push data to, and produces traces, metrics and topology based on the pushed data.

    • suse-observability-otel-collector-0: Single pod implementing the OTEL collector
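
Before digging into receiver logs it helps to know whether the receiver runs in split or non-split mode. A minimal sketch that infers this from pod names, assuming they follow the patterns listed above (`receiver_mode` is a hypothetical helper name):

```python
import re

# Pods that only exist in split mode, per the list above.
SPLIT_RECEIVER = re.compile(r"(^|-)receiver-(logs|process-agent|base)-")
# Any receiver pod, split or not.
ANY_RECEIVER = re.compile(r"(^|-)receiver-")


def receiver_mode(pod_names):
    """Return "split", "non-split", or None when no receiver pod is present."""
    names = list(pod_names)
    if any(SPLIT_RECEIVER.search(name) for name in names):
        return "split"
    if any(ANY_RECEIVER.search(name) for name in names):
        return "non-split"
    return None
```

In split mode, a problem with only logs or only process data points at the corresponding receiver pod rather than at the receiver as a whole.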

Processing and serving

The SUSE Observability platform performs correlation and monitoring on the telemetry data it receives. The results are served to the customer on demand through the API. The core platform can be run in distributed or non-distributed mode. Distributed mode allows for higher throughput.

  • Correlator: Correlates TCP connection information to turn it into topology. Implemented by pod:

    • <release-name>-suse-observability-correlate-*

  • Events2Elasticsearch: Processes events and stores them in Elasticsearch. Implemented by pod:

    • <release-name>-suse-observability-e2es-*

  • Anomaly Detection: The SUSE Observability platform does anomaly detection (disabled by default) on metrics, producing health violations:

    • <release-name>-anomaly-detection-spotlight-manager-*: Distributes anomaly detection work

    • <release-name>-anomaly-detection-spotlight-worker-*: Performs anomaly detection on metric streams

  • Platform-Distributed: The platform contains the main processing components and the serving API. In distributed mode, the functional units are split out into the following pods:

    • <release-name>-suse-observability-api-*: Serves all data to the user and manages StackPack installation and uninstallation.

    • <release-name>-suse-observability-checks-*: Runs the monitors

    • <release-name>-suse-observability-health-sync-*: Processes health (violation) information from monitors and the SUSE Observability Agent and attaches it to topology.

    • <release-name>-suse-observability-initializer-*: Coordinates initialization of the datastores and migrations

    • <release-name>-suse-observability-notification-*: Forwards notifications based on health violations and user settings to downstream systems like Slack/Opsgenie.

    • <release-name>-suse-observability-slicing-*: Continuously optimizes the topology history for quick retrieval

    • <release-name>-suse-observability-state-*: Processes health violations and aggregates them into component health

    • <release-name>-suse-observability-sync-*: Processes topology data combined with user settings and turns it into the topology graph.

  • Platform-Mono:

    • <release-name>-suse-observability-server-*: Contains all functionality of the Platform-Distributed setup but in a single pod.
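
The same kind of check tells you whether the platform runs in distributed or mono mode, and which functional units currently have a pod. This is a minimal sketch based only on the pod names listed above; `platform_units` is a hypothetical helper, and a unit without a pod is not necessarily a fault (for example, the page does not say whether the initializer runs continuously).

```python
import re

DISTRIBUTED_UNITS = (
    "api", "checks", "health-sync", "initializer",
    "notification", "slicing", "state", "sync",
)

# "health-sync" is listed before "sync", so it is matched as one unit.
UNIT_PATTERN = re.compile(
    r"suse-observability-(" + "|".join(DISTRIBUTED_UNITS + ("server",)) + r")-"
)


def platform_units(pod_names):
    """Return (mode, sorted units found); mode is "distributed", "mono" or None."""
    found = set()
    for name in pod_names:
        match = UNIT_PATTERN.search(name)
        if match:
            found.add(match.group(1))
    if "server" in found:
        return "mono", sorted(found)
    if found:
        return "distributed", sorted(found)
    return None, []
```

In distributed mode, comparing the returned units against `DISTRIBUTED_UNITS` shows which functional units have no pod at the moment.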

Miscellaneous

  • Routing: Accept connections and route to the right backend service:

    • <release-name>-suse-observability-router-*: Router based on Envoy

  • UI: React-based UI

    • <release-name>-suse-observability-ui: Serves just the static UI code and assets; all dynamic behavior is handled by the API

  • Backup/Restore: Periodically runs jobs to back up the various data stores. Has one continuously running pod:

    • suse-observability-minio-*: Provides an abstract interface for interacting with backup storage.

Relations between subsystems

To find the root cause of a problem effectively, it is important to understand which pods depend on others when deployed. The following diagram (Pod TCP Dependencies) shows an overview of the pods and the TCP connections that can exist between them. When looking for a root cause, start with the pod that is 'lowest' in this dependency chain.

The pod names in this diagram are abbreviated.
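
The "lowest in the dependency chain" rule can be sketched in code. This is a minimal sketch under stated assumptions: the `EXAMPLE_DEPENDENCIES` map is purely illustrative (it is not a transcription of the diagram), the graph is assumed to be acyclic, and `lowest_unhealthy` is a hypothetical helper name. Given the set of unhealthy subsystems, it keeps only those that do not depend, directly or indirectly, on another unhealthy subsystem.

```python
def transitive_dependencies(depends_on, node):
    """Return everything `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(node, ()))
    while stack:
        dependency = stack.pop()
        if dependency not in seen:
            seen.add(dependency)
            stack.extend(depends_on.get(dependency, ()))
    return seen


def lowest_unhealthy(depends_on, unhealthy):
    """Return the unhealthy subsystems with no unhealthy dependency below them."""
    unhealthy = set(unhealthy)
    return sorted(
        node for node in unhealthy
        if not (transitive_dependencies(depends_on, node) & unhealthy)
    )


# Illustrative only: replace with the edges from the diagram for your setup.
EXAMPLE_DEPENDENCIES = {
    "receiver": ["kafka", "elasticsearch"],
    "e2es": ["kafka", "elasticsearch"],
    "sync": ["kafka", "stackgraph"],
    "api": ["stackgraph", "elasticsearch"],
}
```

With this example map, if `api`, `e2es` and `elasticsearch` all look unhealthy, `lowest_unhealthy` returns only `elasticsearch`, which is where the investigation should start.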
