Why bring Kasten metrics into Splunk?
Kasten K10 exposes rich Prometheus metrics across its services (catalog, jobs, etc.). Many teams already standardize on Splunk for operational insights and alerting. Bridging these worlds lets you:
- Centralize Kasten observability alongside app, infra, and security data
- Reuse existing Splunk practices, RBAC, and alerting
- Build durable dashboards mirroring your Prometheus views
- Simplify triage: failed backups, job latencies, capacity signals — all in one place
This post explains the approach I use in the kasten-splunk project to collect Kasten metrics via Prometheus’ federate endpoint, ingest them into a Splunk metrics index with OpenTelemetry, and reconstruct key PromQL logic with SPL.
Project: https://github.com/michaelcourcy/kasten-splunk
Architecture at a glance
- Every Kasten component emits Prometheus metrics
- A Prometheus server scrapes these metrics every few seconds
- The Prometheus /federate endpoint exposes the scraped series
- An OpenTelemetry Collector reads from /federate and writes to a Splunk metrics index (for example, k8s_metrics)
- Splunk users query and visualize Kasten metrics with SPL
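Before wiring up the collector, you can sanity-check what federation returns by querying the endpoint directly. A minimal sketch (the Prometheus URL is a placeholder; point it at K10’s built-in Prometheus service):
curl -sG 'http://<k10-prometheus-service>/federate' \
  --data-urlencode 'match[]={__name__=~"catalog_actions_count|jobs_duration_.*"}'
The match[] parameter is standard Prometheus federation syntax; it restricts the response to the series you actually want to ship.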
Setup
You’ll find working manifests in the project repository. The steps below are for OpenShift, using the oc CLI.
1) Deploy a Splunk instance with a metrics index and HEC secrets
oc create -f https://raw.githubusercontent.com/michaelcourcy/kasten-splunk/main/splunk.yaml
This creates a Splunk instance (in its own namespace) and sets up a metrics index (e.g., k8s_metrics) and HEC credentials for ingestion.
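Once the manifest is applied, a quick check that everything came up (the namespace name here is an assumption; use the one defined in splunk.yaml):
oc get pods,secrets -n splunk
You should see the Splunk pod running and the HEC token secret that the collector will reference.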
2) Deploy an OpenTelemetry Collector in kasten-io
oc create -f https://raw.githubusercontent.com/michaelcourcy/kasten-splunk/main/otel-collector.yaml
The collector scrapes K10’s built-in Prometheus via /federate and sends metrics to Splunk. No sidecars are required on Kasten pods.
Security note: keep the HEC token secret and scope the metrics index narrowly for least privilege.
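To confirm the collector is actually scraping and exporting, check its pod and logs (the resource names below are assumptions; check otel-collector.yaml for the real ones):
oc get pods -n kasten-io | grep -i otel
oc logs -n kasten-io deploy/otel-collector --tail=50
Exporter errors (authentication, TLS, unreachable HEC endpoint) typically show up here.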
From PromQL to SPL: model differences
Prometheus uses labels; Splunk uses dimensions. For example, a histogram bucket like:
jobs_duration_bucket{application="k10", service="jobs", status="succeeded", le="60.0"}
becomes, in Splunk, a metric whose labels are exposed as dimensions (surfaced via _dims). You can list them with:
| mcatalog values(_dims) WHERE metric_name=jobs_duration_bucket AND index=k8s_metrics
Focus on the dimensions that actually vary (e.g., status, le) when porting queries.
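If you are not sure which Kasten metrics landed in the index in the first place, the same catalog can enumerate metric names (a generic discovery query, nothing project-specific):
| mcatalog values(metric_name) WHERE index=k8s_metrics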
Counters: don’t sum across bins
Prometheus counters are monotonic, and functions like increase() account for resets. In Splunk, avoid summing the same counter multiple times across time bins. Instead, pick the latest value in each bin, then compute deltas.
Example: number of jobs with duration under 60s by status (failed/succeeded):
| mstats latest(_value) as num_jobs_under_60s
WHERE index=k8s_metrics metric_name="jobs_duration_bucket" le=60 span=24h earliest=-1h latest=now
BY status
Use a fixed time window (earliest/latest) larger than your span to ensure you get the latest samples per bin.
Key queries for a practical dashboard
Successful backups over the last 24h
PromQL reference:
increase(catalog_actions_count{liveness="live", status="complete", type="backup"}[24h]) > 0
SPL equivalent using deltas on the counter per namespace/policy:
| mstats latest(_value) AS cum
WHERE index=k8s_metrics metric_name=catalog_actions_count status=complete type=backup liveness=live
earliest=-24h latest=now span=1m
BY namespace, policy
| sort 0 _time namespace, policy
| streamstats current=f last(cum) AS prev BY namespace, policy
| eval inc = if(isnull(prev), 0, cum - prev)
| eval inc = if(inc < 0, 0, inc)
| where inc > 0
This captures only the positive deltas (successful backup events). The timing can be off by up to a couple of minutes due to scrape frequency, bin size, and stream processing.
Optionally, add a rollup:
| stats sum(inc) AS complete_24h
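If you want a per-policy breakdown for a dashboard table instead of a single number, swap the final stats (the rest of the pipeline stays the same):
| stats sum(inc) AS complete_24h BY namespace, policy
| sort - complete_24h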
Failed backups over the last 24h
| mstats latest(_value) AS cum
WHERE index=k8s_metrics metric_name=catalog_actions_count status=failed type=backup liveness=live
earliest=-24h latest=now span=1m
BY namespace, policy
| sort 0 _time namespace, policy
| streamstats current=f last(cum) AS prev BY namespace, policy
| eval inc = if(isnull(prev), 0, cum - prev)
| eval inc = if(inc < 0, 0, inc)
| where inc > 0
| stats sum(inc) AS failed_24h
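To put a success rate on the dashboard, you can push both statuses through the same delta pipeline and pivot at the end. A sketch, assuming complete and failed are the only outcomes you care about:
| mstats latest(_value) AS cum
WHERE index=k8s_metrics metric_name=catalog_actions_count (status=complete OR status=failed) type=backup liveness=live
earliest=-24h latest=now span=1m
BY namespace, policy, status
| sort 0 _time namespace, policy, status
| streamstats current=f last(cum) AS prev BY namespace, policy, status
| eval inc = if(isnull(prev), 0, cum - prev)
| eval inc = if(inc < 0, 0, inc)
| stats sum(eval(if(status="complete", inc, 0))) AS complete_24h, sum(eval(if(status="failed", inc, 0))) AS failed_24h
| eval total = complete_24h + failed_24h
| eval success_rate_pct = if(total > 0, round(100 * complete_24h / total, 1), null())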
Job duration buckets (≤ 60s) by status
| mstats latest(_value) as num_jobs_under_60s
WHERE index=k8s_metrics metric_name="jobs_duration_bucket" le=60 span=24h earliest=-1h latest=now
BY status
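Because Prometheus histogram buckets are cumulative, you can also derive the share of jobs finishing under 60s by comparing the le=60 bucket with the +Inf bucket. A sketch, assuming the +Inf bucket is ingested with le="+Inf":
| mstats latest(_value) AS jobs
WHERE index=k8s_metrics metric_name="jobs_duration_bucket" (le=60 OR le="+Inf") span=24h earliest=-1h latest=now
BY status, le
| eval bucket = if(le="+Inf", "total", "under_60s")
| xyseries status bucket jobs
| eval pct_under_60s = if(total > 0, round(100 * under_60s / total, 1), null())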
Build the dashboard
With the queries above, you can create a Splunk dashboard that surfaces at-a-glance backup outcomes and trends. For inspiration, see the example dashboard in the project repository.
If you need broader coverage (e.g., exportactions, runactions, retireactions), translate the corresponding PromQL rules you use today into SPL. A comprehensive set of Kasten rules is documented here:
https://github.com/michaelcourcy/kasten-prometheus-stack/blob/main/kasten-rules.yaml
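The translation pattern is always the same: filter the counter by its dimensions, take the latest value per bin, then diff. For example, failed exports over the last 24h look just like the failed-backups query above, with only the WHERE filters changed (type=export is an assumption; confirm the actual value with the mcatalog query shown earlier):
| mstats latest(_value) AS cum
WHERE index=k8s_metrics metric_name=catalog_actions_count status=failed type=export liveness=live
earliest=-24h latest=now span=1m
BY namespace, policy
| sort 0 _time namespace, policy
| streamstats current=f last(cum) AS prev BY namespace, policy
| eval inc = if(isnull(prev), 0, cum - prev)
| eval inc = if(inc < 0, 0, inc)
| stats sum(inc) AS failed_exports_24h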
Troubleshooting tips
- No data in Splunk? Verify the HEC token, index name, and network egress from the collector (see the HEC checks below)
- Sparse points or off-by-minutes timestamps? Increase scrape frequency and/or reduce span; keep earliest/latest fixed
- Spiky counters? Remember counters can reset; guard with inc = if(inc < 0, 0, inc)
- Validate against Prometheus graphs for a sanity check (for example, compare deltas with Prometheus counter graphs)
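For the first item, you can exercise HEC directly; host, port, and token below are placeholders:
curl -k https://<splunk-host>:8088/services/collector/health
curl -k https://<splunk-host>:8088/services/collector \
  -H "Authorization: Splunk <hec-token>" \
  -d '{"event": "metric", "source": "hec-test", "index": "k8s_metrics", "fields": {"metric_name": "hec.test", "_value": 1}}'
The first call confirms HEC is up; the second pushes a test metric you can then look for with mstats in the k8s_metrics index.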
Counter reference examples are available in the project repository.
What you get operationally
- Single-pane view of Kasten health within your Splunk workflows
- Faster incident response for failed backups and slow jobs
- Shared language for SRE and SecOps (SPL), while staying faithful to PromQL semantics
- A lightweight, vendor-neutral collection path via Prometheus federate + OpenTelemetry
Resources
- Project: https://github.com/michaelcourcy/kasten-splunk
- Architecture image and dashboard examples are reused from the project repository
- Related Kasten Prometheus rules: https://github.com/michaelcourcy/kasten-prometheus-stack/blob/main/kasten-rules.yaml