Writing PromQL queries for representative charts

Guidelines

When SUSE Observability shows data in a chart, it almost always needs to change the resolution of the stored data to make it fit into the space available for the chart. To get the most representative charts possible, follow these guidelines:

  • Don't query for the raw metric but always aggregate over time (using *_over_time or rate functions).

  • Use the ${__interval} parameter as the range for aggregations over time; it automatically adjusts to the resolution of the chart.

  • Use the ${__rate_interval} parameter as the range for rate aggregations; it also adjusts automatically to the resolution of the chart, but takes the specific behavior of rate into account (see the example queries below).
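
In practice both patterns look like the queries below. This is a minimal sketch: the counter metric container_network_receive_bytes_total and the simplified label set are illustrative assumptions, so substitute metrics and labels that exist in your own environment.

# Gauge: aggregate over time with the dynamic interval instead of querying the raw metric
sum(avg_over_time(container_cpu_usage{pod_name="${name}"}[${__interval}])) by (pod_name) /1000000000

# Counter: calculate a rate using the dedicated rate interval
sum(rate(container_network_receive_bytes_total{pod_name="${name}"}[${__rate_interval}])) by (pod_name)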

Applying an aggregation often means making a trade-off that emphasizes certain patterns in the metric more than others. For example, for large time windows max_over_time will show all peaks but not all troughs, min_over_time does the exact opposite, and avg_over_time smooths out both peaks and troughs. To show this behavior, here is an example metric binding using the CPU usage of pods. To try it yourself, copy it to a YAML file and use the CLI to apply it in your own SUSE Observability (you can remove it later).

nodes:
- _type: MetricBinding
  chartType: line
  enabled: true
  tags: {}
  unit: short
  name: CPU Usage (different aggregations and intervals)
  priority: HIGH
  identifier: urn:custom:metric-binding:pod-cpu-usage-a
  queries:
    - expression: sum(max_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000
      alias: max_over_time dynamic interval
    - expression: sum(min_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000
      alias: min_over_time dynamic interval
    - expression: sum(avg_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000
      alias: avg_over_time dynamic interval
    - expression: sum(last_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000
      alias: last_over_time dynamic interval
    - expression: sum(max_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000
      alias: max_over_time 1m interval
    - expression: sum(min_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000
      alias: min_over_time 1m interval
    - expression: sum(avg_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000
      alias: avg_over_time 1m interval
    - expression: sum(last_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000
      alias: last_over_time 1m interval
  scope: (label = "stackpack:kubernetes" and type = "pod")

After applying it, open the metrics perspective for a pod in SUSE Observability (preferably a pod with some spikes and troughs in CPU usage). Enlarge the chart using the icon in its top-right corner to get a better view. Now you can also change the time window to see the effects of the different aggregations (for example, 30 minutes vs 24 hours).

Figure: The chart for this metric binding for the last 30 minutes; only a few lines are visible because most time series are on top of each other.

Figure: The same chart, same component and same end time, but now for the last 24 hours. It shows sometimes completely different results for the different aggregations.

Why is this necessary?

First of all, why should you use an aggregation? It doesn't make sense to retrieve more data points from the metric store than fit in the chart, so SUSE Observability automatically determines the step needed between two data points to get a good result. For short time windows (for example a chart showing only 1 hour of data) this results in a small step (around 10 seconds). Metrics are often only collected every 30 seconds, so with 10-second steps the same value repeats for 3 steps before changing to the next value. Zooming out to a 1-week time window requires a much bigger step (around 1 hour, depending on the exact size of the chart on screen).

When the step becomes larger than the resolution of the collected data points, a decision needs to be made on how to summarize all the data points within, for example, a 1-hour step into a single value. When an aggregation over time is already specified in the query, it is used to do exactly that. However, if no aggregation is specified, or when the aggregation interval is smaller than the step, the last_over_time aggregation is used with the step size as the interval. The result is that only the last data point in each hour is used to "summarize" all the data points in that hour.

To summarize: when executing a PromQL query for a time range of 1 week with a step of 1 hour, this query:

container_cpu_usage /1000000000

is automatically converted to:

last_over_time(container_cpu_usage[1h]) /1000000000

Often this behavior isn't intended and it's better to decide for yourself what kind of aggregation is needed. Using different aggregation functions makes it possible to emphasize certain behavior (at the cost of hiding other behavior): is it more important to see peaks, troughs, or a smooth chart? Then use the ${__interval} parameter for the range, as it's automatically replaced with the step size used for the query. The result is that all data points in the step are used.
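
For example, if the peaks are what matter most, the query above could specify the aggregation explicitly and use the dynamic interval instead of relying on the implicit last_over_time. This is a sketch based on the same CPU metric used earlier on this page:

# Explicitly keep the highest value in every step, whatever the step size is
max_over_time(container_cpu_usage[${__interval}]) /1000000000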

The ${__interval} parameter also prevents another issue: if the step size, and therefore the ${__interval} value, were to shrink below the resolution of the stored metric data, the chart would show gaps.

Therefore ${__interval} never shrinks below 2 times the default scrape interval of the SUSE Observability agent (the default scrape interval is 30 seconds).

Finally, the rate() function requires at least 2 data points in the interval to calculate a rate at all; with fewer than 2 data points the rate has no value. Therefore ${__rate_interval} is guaranteed to always be at least 4 times the scrape interval. This prevents unexpected gaps or other strange behavior in rate charts, unless data is actually missing.
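
As an illustration, compare a fixed range that is shorter than the data resolution with the automatic range. The counter metric http_requests_total is a hypothetical example; the pattern applies to any counter:

# A fixed 30s range holds at most one sample per 30-second scrape interval, so the rate often has no value and the chart shows gaps
rate(http_requests_total{pod_name="${name}"}[30s])
# ${__rate_interval} is based on the step but never shrinks below 4 times the scrape interval, so there are always enough samples
rate(http_requests_total{pod_name="${name}"}[${__rate_interval}])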

There are some excellent blog posts on the internet that explain this in more detail:

  • Step and query range
  • What range should I use with rate()?
  • Introduction of __rate_interval in Grafana

See also

Some more resources on understanding PromQL queries:

  • Anatomy of a PromQL Query
  • Selecting Data in PromQL
  • How to join multiple metrics

When the metric binding doesn't specify an aggregation, SUSE Observability automatically uses the last_over_time aggregation to reduce the number of data points for a chart. See "Why is this necessary?" above for an explanation.

Try it for yourself on the SUSE Observability playground.
