Out of the box monitors for Kubernetes


Overview

This section describes the out-of-the-box monitors delivered with SUSE Observability. New monitors are added to the product regularly; have a look under the Monitors section in the main menu for the full list.

Out of the box Kubernetes monitors

Available service endpoints

It is important to ensure that your services are available and accessible to users. To monitor this, SUSE Observability has set up a check that verifies whether a service has at least one endpoint available. Endpoints are network addresses that enable communication between different components in a distributed system, and they need to be available for the service to function properly. If zero endpoints were available at any point within the last 10 minutes, the monitor remains in a DEVIATING state, indicating that there may be an issue with the service that needs to be addressed.
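
The monitor's exact definition ships with the product; as a rough sketch of the same condition in PromQL, assuming kube-state-metrics is scraped and exposes the kube_endpoint_address_available metric (older metric name, renamed in newer kube-state-metrics releases), it could look like:

```
# Services that had zero ready endpoint addresses at some point
# in the last 10 minutes (metric name depends on the
# kube-state-metrics version).
min_over_time(kube_endpoint_address_available[10m]) == 0
```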

CPU limits resourcequota

Users create resources (pods, services, etc.) in a namespace, and the quota system tracks usage to ensure it does not exceed the hard CPU limits defined in a ResourceQuota. The monitor alerts when the total CPU limits in the namespace reach 90% or more of the hard limit established by the quota. Each ResourceQuota in the namespace produces a monitor health state.

CPU requests resourcequota

Users create resources (pods, services, etc.) in a namespace, and the quota system tracks usage to ensure it does not exceed the hard CPU requests defined in a ResourceQuota. The monitor alerts when the total CPU requests in the namespace reach 90% or more of the hard limit established by the quota. Each ResourceQuota in the namespace produces a monitor health state.
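
A minimal PromQL sketch of this 90% condition, assuming kube-state-metrics is scraped (its kube_resourcequota metric exposes the used and hard values per quota). The same pattern with resource="limits.cpu", resource="requests.memory" or resource="limits.memory" covers the CPU limits and memory ResourceQuota monitors described on this page:

```
# ResourceQuotas where used CPU requests have reached 90% of the hard value
  kube_resourcequota{type="used", resource="requests.cpu"}
/ ignoring(type)
  kube_resourcequota{type="hard", resource="requests.cpu"}
>= 0.9
```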

Daemonset desired replicas

It is important that the desired number of replicas for a DaemonSet is being met. DaemonSets are used to manage a set of pods that need to run on all or a subset of nodes in a cluster, ensuring that a copy of the pod is running on each node that meets the specified criteria. This is useful for tasks such as logging, monitoring, and other cluster-level tasks that need to be executed on every node in the cluster.

To monitor this, SUSE Observability has set up a check that verifies whether the available replicas match the desired number of replicas. This check is only applied to DaemonSets that have a desired number of replicas greater than zero.

  • If the number of available replicas is less than the desired number, the monitor will signal a DEVIATING health state, indicating that there may be an issue with the DaemonSet.

  • If the number of available replicas is zero, the monitor will signal a CRITICAL health state, indicating that the DaemonSet is not functioning at all.

To understand the full monitor definition, check the details.
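
The packaged monitor definition is the source of truth; a rough PromQL equivalent over kube-state-metrics metrics could be sketched as:

```
# DEVIATING: fewer available pods than desired (only for desired > 0)
kube_daemonset_status_number_available
  < on(namespace, daemonset) kube_daemonset_status_desired_number_scheduled
and on(namespace, daemonset)
  kube_daemonset_status_desired_number_scheduled > 0

# CRITICAL: no pods available at all
kube_daemonset_status_number_available == 0
and on(namespace, daemonset)
  kube_daemonset_status_desired_number_scheduled > 0
```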

Deployment desired replicas

It is important that the desired number of replicas for a Deployment is being met. Deployments are used to manage the deployment and scaling of a set of identical pods in a Kubernetes cluster. By ensuring that the desired number of replicas is running and available, Deployments help maintain the availability and reliability of a Kubernetes application or service.

To monitor this, SUSE Observability has set up a check that verifies whether the available replicas match the desired number of replicas. This check is only applied to Deployments that have a desired number of replicas greater than zero.

  • If the number of available replicas is less than the desired number, the monitor will signal a DEVIATING health state, indicating that there may be an issue with the Deployment.

  • If the number of available replicas is zero, the monitor will signal a CRITICAL health state, indicating that the Deployment is not functioning at all.

To understand the full monitor definition, check the details.
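
As a sketch of the same logic in PromQL over kube-state-metrics metrics (not the packaged definition itself):

```
# DEVIATING: fewer available replicas than desired (only for desired > 0)
kube_deployment_status_replicas_available
  < on(namespace, deployment) kube_deployment_spec_replicas
and on(namespace, deployment) kube_deployment_spec_replicas > 0

# CRITICAL: no replicas available at all
kube_deployment_status_replicas_available == 0
and on(namespace, deployment) kube_deployment_spec_replicas > 0
```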

HTTP - 5xx error ratio

HTTP responses with a status code in the 5xx range indicate server-side errors such as misconfiguration, overload, or internal server errors. To ensure a good user experience, the percentage of 5xx responses should be less than 5% of the total HTTP responses for a Kubernetes service.
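
The metric the packaged monitor evaluates is product-specific; as an illustrative sketch only, the ratio could be expressed over a hypothetical http_requests_total counter with a code label:

```
# Fraction of 5xx responses per service over the last 5 minutes
  sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
/
  sum by (service) (rate(http_requests_total[5m]))
> 0.05
```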

HTTP - response time - Q95 is above 3 seconds

It is important to keep an eye on the HTTP response times for a Kubernetes service. SUSE Observability monitors the 95th percentile response time, or Q95, which is a statistical measure indicating that 95% of the responses are faster than this time. This is a useful measure for identifying outliers and slow requests that may negatively impact the user experience. If the Q95 response time is above 3 seconds during a specified time window, the monitor will produce a notification indicating a deviating state.
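
With a hypothetical http_request_duration_seconds histogram (again, not the packaged monitor's actual metric), the Q95 condition could be sketched as:

```
# 95th percentile latency per service above 3 seconds
histogram_quantile(
  0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
) > 3
```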

Kubernetes volume usage trend over 12 hours

It is important to monitor the usage of Persistent Volume Claims (PVCs) in your Kubernetes cluster. PVCs are used to store data that needs to persist beyond the lifetime of a container, and it's crucial to ensure that they have enough space to store the data. To track this, SUSE Observability has set up a check that uses linear prediction to forecast the Kubernetes volume usage trend over a 12-hour period. If the trend indicates that the PVCs will run out of space within this time frame, you will receive a notification, allowing you to take action to prevent data loss or downtime.

Kubernetes volume usage trend over 4 days

It is important to monitor the usage of Persistent Volume Claims (PVCs) in your Kubernetes cluster over time. PVCs are used to store data that needs to persist beyond the lifetime of a container, and it's crucial to ensure that they have enough space to store the data. To track this, SUSE Observability has set up a check that uses linear prediction to forecast the Kubernetes volume usage trend over a 4-day period. If the trend indicates that the PVCs will run out of space within this time frame, you will receive a notification, allowing you to take action to prevent data loss or downtime.
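
Both trend monitors apply the same linear prediction; a sketch using the kubelet volume metrics (assuming they are scraped) is shown below for the 12-hour window, with 4 * 24 * 3600 substituted for the 4-day variant:

```
# PVCs predicted to run out of space within the next 12 hours,
# based on a linear fit over the last 6 hours of samples
predict_linear(kubelet_volume_stats_available_bytes[6h], 12 * 3600) <= 0
```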

Memory limits resourcequota

Users create resources (pods, services, etc.) in the namespace, and the quota system tracks usage to ensure it does not exceed hard resource limits for memory defined in a ResourceQuota. The monitor will alert when the total memory limits in the namespace gets to 90% or more of the established by the quota. Each resourcequota in the namespace produces a monitor health state.

Memory requests resourcequota

Users create resources (pods, services, etc.) in the namespace, and the quota system tracks usage to ensure it does not exceed hard resource requests for memory defined in a ResourceQuota. The monitor will alert when the total memory requests in the namespace gets to 90% or more of the established by the quota. Each resourcequota in the namespace produces a monitor health state.

Node Disk Pressure

Node disk pressure refers to a situation where the disks connected to a node experience excessive strain. While encountering node disk pressure is unlikely due to Kubernetes' built-in preventive measures, it can still occur sporadically. There are two primary reasons why node disk pressure may arise. The first relates to Kubernetes failing to clean up unused images: under normal circumstances, Kubernetes regularly checks for and deletes any images that are not in use, so this is an uncommon cause of node disk pressure, but it should be acknowledged. The more probable cause is the accumulation of logs. In Kubernetes, logs are typically saved in two scenarios: while containers are running, and for the most recently exited container, whose logs are retained for troubleshooting purposes. This approach aims to strike a balance between preserving important logs and discarding unnecessary ones over time. However, if a long-running container generates an extensive volume of logs, they may accumulate to the point where they exhaust the node disk's capacity. To understand the full monitor definition, check the details.

Node Memory Pressure

Node memory pressure refers to a situation where the memory resources on a Kubernetes node are excessively strained. While encountering node memory pressure is uncommon due to Kubernetes' built-in resource management mechanisms, it can still occur under specific circumstances. There are two primary reasons why node memory pressure may arise. The first is misconfigured or insufficient resource requests and limits for containers running on the node. Kubernetes relies on resource requests and limits to allocate and manage resources effectively; if containers are not accurately configured with their memory requirements, they may consume more memory than expected, leading to node memory pressure. The second is the presence of memory-intensive applications or processes. Certain workloads may have higher memory demands, resulting in increased memory utilization on the node. If multiple pods or containers with substantial memory requirements are scheduled on the same node without proper resource allocation, memory pressure can occur.

To mitigate node memory pressure, review and adjust resource requests and limits for containers, ensuring they align with the actual memory needs of the applications. Monitoring and optimizing memory usage within the applications themselves can also help reduce memory consumption. Additionally, consider horizontal pod autoscaling to dynamically scale the number of pods based on memory utilization. Regular monitoring, analysis of memory-related metrics, and proactive allocation of memory resources help maintain a healthy memory state on Kubernetes nodes. Understand the specific requirements of your workloads and adjust resource allocation accordingly to prevent memory pressure and ensure optimal performance.

Node PID Pressure

Node PID pressure occurs when the available process identification (PID) resources on a Kubernetes node are excessively strained. There are two primary reasons why node PID pressure may arise. The first is misconfigured or insufficient resource requests and limits for containers running on the node. Kubernetes relies on accurate resource requests and limits to effectively allocate and manage resources; if containers are not configured correctly with their PID requirements, they may consume more PIDs than expected, resulting in node PID pressure. The second is the presence of PID-intensive applications or processes. Some workloads have higher demands for process identification, leading to increased PID utilization on the node. If multiple pods or containers with significant PID requirements are scheduled on the same node without proper resource allocation, PID pressure can occur.

To address node PID pressure, review and adjust resource requests and limits for containers to ensure they align with the actual PID needs of the applications. Monitoring and optimizing PID usage within the applications themselves can also help reduce PID consumption. Additionally, horizontal pod autoscaling can dynamically scale the number of pods based on PID utilization. Regular monitoring, analysis of PID-related metrics, and proactive allocation of PID resources are crucial for maintaining a healthy state of PID usage on Kubernetes nodes. Understand the specific requirements of your workloads and adjust resource allocation accordingly to prevent PID pressure and ensure optimal performance.

Node Readiness

Checks whether the node is up and running as expected.
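
The node pressure and readiness monitors above all evaluate node conditions; with kube-state-metrics scraped, comparable conditions can be sketched in PromQL as:

```
# Nodes reporting a pressure condition
kube_node_status_condition{condition=~"DiskPressure|MemoryPressure|PIDPressure", status="true"} == 1

# Nodes whose Ready condition is not true
kube_node_status_condition{condition="Ready", status="true"} == 0
```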

Out of memory for containers

It is important to ensure that the containers running in your Kubernetes cluster have enough memory to function properly. Out-of-memory (OOM) conditions can cause containers to crash or become unresponsive, leading to restarts and potential data loss. To monitor for these conditions, SUSE Observability has set up a check that detects and reports OOM events for the containers running in the cluster. This check helps you identify containers that are running out of memory and allows you to take action before issues occur.
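
A minimal sketch of OOM detection via kube-state-metrics (the last-terminated reason only reflects the most recent restart, so the packaged monitor may use additional signals):

```
# Containers whose last termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```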

Pod Ready State

Checks if a Pod that has been scheduled is running and ready to receive traffic within the expected amount of time.
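
As a sketch over kube-state-metrics, pods that have not reached the Ready condition can be selected with:

```
# Pods whose Ready condition is false
kube_pod_status_ready{condition="false"} == 1
```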

Pod span duration

Monitors the duration of the server and consumer spans of a pod. When the 95th percentile of the duration is greater than the threshold (default 5000ms), the monitor has a DEVIATING state. This monitor supports overriding settings via monitor argument overrides.

Pod span error ratio

Monitors the percentage of the server and consumer spans of a pod that have an error status. If the percentage of error spans exceeds the threshold (default 5), the monitor has a DEVIATING state. This monitor supports overriding settings via monitor argument overrides.

Pods in Waiting State

If a pod is in a waiting state with a reason of CreateContainerConfigError, CreateContainerError, CrashLoopBackOff, or ImagePullBackOff, it is reported as DEVIATING.
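
With kube-state-metrics scraped, the same waiting reasons can be selected in PromQL as a sketch:

```
# Containers waiting for one of the problematic reasons
kube_pod_container_status_waiting_reason{
  reason=~"CreateContainerConfigError|CreateContainerError|CrashLoopBackOff|ImagePullBackOff"
} == 1
```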

Replicaset desired replicas

It is important to ensure that the desired number of replicas for your ReplicaSet (and Deployment) is being met. ReplicaSets and Deployments are used to manage the number of replicas of a particular pod in a Kubernetes cluster.

To monitor this, SUSE Observability has set up a check that verifies if the available replicas match the desired number of replicas. This check will only be applied to ReplicaSets and Deployments that have a desired number of replicas greater than zero.

  • If the number of available replicas is less than the desired number, the monitor will signal a DEVIATING health state, indicating that there may be an issue with the ReplicaSet or Deployment.

  • If the number of available replicas is zero, the monitor will signal a CRITICAL health state, indicating that the ReplicaSet or Deployment is not functioning at all.

Additionally, the health state of the ReplicaSet will propagate to the Deployment for more comprehensive monitoring.
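
The same replica pattern as for Deployments can be sketched for ReplicaSets over kube-state-metrics metrics (illustrative only, not the packaged definition):

```
# DEVIATING: fewer ready replicas than desired (only for desired > 0)
kube_replicaset_status_ready_replicas
  < on(namespace, replicaset) kube_replicaset_spec_replicas
and on(namespace, replicaset) kube_replicaset_spec_replicas > 0
```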

Restarts for containers

It is important to monitor the restarts for each container in a Kubernetes cluster. Containers can restart for a variety of reasons, including issues with the application or the infrastructure. To ensure that the application is running smoothly, SUSE Observability has set up a monitor that tracks the number of container restarts over a 10-minute period. If there are more than 3 restarts during this time frame, the container's health state will be set to DEVIATING, indicating that there may be an issue that needs to be investigated.
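
As an illustrative PromQL sketch of this threshold, assuming kube-state-metrics is scraped:

```
# Containers with more than 3 restarts in the last 10 minutes
increase(kube_pod_container_status_restarts_total[10m]) > 3
```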

Service span duration

Monitors the duration of the server and consumer spans of a Kubernetes service. When the 95th percentile of the duration is greater than the threshold (default 5000ms), the monitor has a DEVIATING state. This monitor supports overriding settings via monitor argument overrides.

Service span error ratio

Monitors the percentage of server and consumer spans that are in an error state for a Kubernetes service. This monitor supports overriding settings via monitor argument overrides.

Statefulset desired replicas

It is important that the desired number of replicas for a StatefulSet is being met. StatefulSets are used to manage stateful applications and require a specific number of replicas to function properly.

To monitor this, SUSE Observability has set up a check that verifies if the available replicas match the desired number of replicas. This check will only be applied to StatefulSets that have a desired number of replicas greater than zero.

  • If the number of available replicas is less than the desired number, the monitor will signal a DEVIATING health state, indicating that there may be an issue with the StatefulSet.

  • If the number of available replicas is zero, the monitor will signal a CRITICAL health state, indicating that the StatefulSet is not functioning at all.
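
The same pattern can be sketched for StatefulSets over kube-state-metrics metrics (illustrative only, not the packaged definition):

```
# DEVIATING: fewer ready replicas than desired (only for desired > 0)
kube_statefulset_status_replicas_ready
  < on(namespace, statefulset) kube_statefulset_replicas
and on(namespace, statefulset) kube_statefulset_replicas > 0
```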

Unschedulable Node

If you encounter a "NodeNotSchedulable" event in Kubernetes, it means that the Kubernetes scheduler was unable to place a pod on a specific node due to some constraints or issues with the node. This event occurs when the scheduler cannot find a suitable node to run the pod according to its resource requirements and other constraints.
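
With kube-state-metrics scraped, unschedulable nodes can be selected with a one-line PromQL sketch:

```
# Nodes marked unschedulable (for example, cordoned nodes)
kube_node_spec_unschedulable == 1
```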

Aggregated health state of a Cluster

A cluster does not have a health state of its own, but it is built from components, some of which are critical to normal operation. The monitor aggregates the states of these components and takes the most critical health state:

  • all pods in the kube-system namespace

  • all nodes

Derived Workloads health state (Deployment, DaemonSet, ReplicaSet, StatefulSet)

The monitor aggregates the states of all top-most dependencies and then returns the most critical health state based on direct observations (e.g., from metrics). This approach ensures that health signals propagate from low-level technical components (like pods) to higher-level logical components, but only when the component itself lacks an observed health state. To use this monitor effectively, make sure that some or all of the following health checks are disabled:

  • Deployment desired replicas

  • DaemonSet desired replicas

  • ReplicaSet desired replicas

  • StatefulSet desired replicas

If you have a use case where logical components have no direct monitors, you can use the Derived State Monitor to infer their health based on the technical components they depend on.

See also

  • Override monitor arguments

  • Derived State monitor

  • Monitors

