grafana-dashboards

Verified

by wshobson · Updated Apr 16, 2026

opus
1.6K in · 31.5K out
$1.34 · 8m 3s

Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

grafana · dashboards · monitoring · observability · prometheus · sre

See It In Action

1 turn · claude
The project codebase is in your current working directory. Explore the files to understand the code before making changes. need a prod grafana dashboard for the payments api before the 4pm incident review — checkout reliability + host health, vars for env/service/instance and alerts too

Let me explore the codebase first, then build the dashboard.

Good overview. The monitoring stack is set up but the dashboard is completely missing — that's what caused blind spots in the two recent incidents. Let me build it now.

Let me read the existing files, then build both the dashboard and updated alerts.

Now I have full context. Let me create the dashboard JSON and update the alerts in parallel.

Now let me validate the JSON syntax and do a quick sanity check.

Both files validated. Here's what's ready for the 4pm review:

Dashboard (dashboards/payments-api.json) — 23 panels across 5 rows:

| Row | Panels |
| --- | --- |
| Checkout Reliability Overview | Request Rate, Error Rate %, P95 Latency, In-Flight Requests (stat panels with threshold coloring) |
| Request & Error Detail | Rate by endpoint, error rate by endpoint (with 5%/10% threshold lines), HTTP status distribution (stacked bars, color-coded 2xx/4xx/5xx) |
| Latency Detail | P50/P95/P99 overlay with 500ms SLO line, P95 by endpoint, request duration heatmap |
| Payment Processing | Success/failed/timeout rates, payment success ratio (stat), payment amount percentiles |
| Host Health | CPU (70%/90% thresholds), memory (80%/95% thresholds), disk I/O, network rx/tx, filesystem usage (85% threshold) |

Template variables: env (default: production), service (default: payments-api), instance (multi, includes All).

Alerts (prometheus/alerts.yml) — 11 rules across 3 groups:

  • payments_basic: PaymentsAPIDown (existing)
  • payments_reliability: error rate >5% warn / >10% crit, P95 >500ms warn / >1s crit, payment failure >5%
  • host_health: CPU >70% warn / >90% crit, memory >80% warn / >95% crit, disk <15% free

The error rate and latency alerts directly address the blind spots from the April 8-9 incidents. The memory/CPU critical alerts would have caught the GC pause root cause on payments-prod-03.

Claude Sonnet 4.6

Permissions

Scope

  • filesystem:read
  • filesystem:write

SKILL.md

Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

When to Use

  • Visualize Prometheus metrics
  • Create custom dashboards
  • Implement SLO dashboards
  • Monitor infrastructure
  • Track business KPIs

Dashboard Design Principles

1. Hierarchy of Information

┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘

2. RED Method (Services)

  • Rate - Requests per second
  • Errors - Error rate
  • Duration - Latency/response time
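For a service instrumented with the standard `http_requests_total` counter and `http_request_duration_seconds` histogram (the same metrics used in the dashboard examples below), the three RED signals map to PromQL roughly as:

```promql
# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: P95 latency from the histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```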

3. USE Method (Resources)

  • Utilization - % time resource is busy
  • Saturation - Queue length/wait time
  • Errors - Error count
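Assuming node_exporter metrics, USE for a CPU might be sketched like this (error counters are rare for CPUs, so a NIC error rate stands in as the example):

```promql
# Utilization: % of time the CPU is not idle
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Saturation: 1-minute load average relative to core count
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: network receive errors on the same host
rate(node_network_receive_errs_total[5m])
```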

Dashboard Structure

API Monitoring Dashboard

{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}

Reference: See assets/api-dashboard.json

Panel Types

1. Stat Panel (Single Value)

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}

2. Time Series Graph

{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}

3. Table Panel

{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}

4. Heatmap

{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}

Variables

Query Variables

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}

Use Variables in Queries

sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))

Alerts in Dashboards

{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}
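The panel-level alert format above is Grafana's legacy alerting; the same check is often kept in Prometheus itself, as with the `prometheus/alerts.yml` from the transcript. A minimal rule sketch (the rule name and labels are illustrative):

```yaml
groups:
  - name: payments_reliability
    rules:
      - alert: HighErrorRate        # illustrative name
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate is above 5%"
```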

Dashboard Provisioning

dashboards.yml:

apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
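File-provisioned dashboards still need a datasource, which Grafana provisions the same way. A minimal `datasources.yml` sketch (the URL is a placeholder for your Prometheus address):

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # placeholder address
    isDefault: true
```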

Common Dashboard Patterns

Infrastructure Dashboard

Key Panels:

  • CPU utilization per node
  • Memory usage per node
  • Disk I/O
  • Network traffic
  • Pod count by namespace
  • Node status

Reference: See assets/infrastructure-dashboard.json

Database Dashboard

Key Panels:

  • Queries per second
  • Connection pool usage
  • Query latency (P50, P95, P99)
  • Active connections
  • Database size
  • Replication lag
  • Slow queries

Reference: See assets/database-dashboard.json

Application Dashboard

Key Panels:

  • Request rate
  • Error rate
  • Response time (percentiles)
  • Active users/sessions
  • Cache hit rate
  • Queue length

Best Practices

  1. Start with templates (Grafana community dashboards)
  2. Use consistent naming for panels and variables
  3. Group related metrics in rows
  4. Set appropriate time ranges (default: Last 6 hours)
  5. Use variables for flexibility
  6. Add panel descriptions for context
  7. Configure units correctly
  8. Set meaningful thresholds for colors
  9. Use consistent colors across dashboards
  10. Test with different time ranges

Dashboard as Code

Terraform Provisioning

resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}

Ansible Provisioning

- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
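Whether deployed via Terraform or Ansible, it helps to sanity-check dashboard JSON before it ships. A minimal Python sketch; the required-keys list here is a loose assumption, not Grafana's full schema:

```python
import json
from pathlib import Path

def validate_dashboard(path: str) -> list[str]:
    """Return a list of problems found in a dashboard JSON file."""
    problems = []
    data = json.loads(Path(path).read_text())
    # Provisioned files may wrap the dashboard in a top-level "dashboard" key
    dash = data.get("dashboard", data)
    if not dash.get("title"):
        problems.append("missing title")
    if not dash.get("panels"):
        problems.append("no panels defined")
    for panel in dash.get("panels", []):
        # Row panels legitimately have no queries; everything else should
        if not panel.get("targets") and panel.get("type") != "row":
            problems.append(f"panel {panel.get('title', '?')!r} has no targets")
    return problems
```

Running this across `dashboards/*.json` in CI catches empty or truncated files before the provisioner picks them up.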

Related Skills

  • prometheus-configuration - For metric collection
  • slo-implementation - For SLO dashboards

FAQ

What does grafana-dashboards do?

Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.

When should I use grafana-dashboards?

Use it when you need a repeatable workflow that produces a text report and a code diff.

What does grafana-dashboards output?

In the evaluated run it produced a text report and a code diff.

How do I install or invoke grafana-dashboards?

Ask the agent to use this skill when the task matches its documented workflow.

Which agents does grafana-dashboards support?

Agent support is inferred from the source, but not explicitly declared.

What tools, channels, or permissions does grafana-dashboards need?

It uses no extra tools; channels commonly include text and diff; permissions include filesystem:read and filesystem:write.

Is grafana-dashboards safe to install?

Static analysis marked this skill as medium risk; review side effects and permissions before enabling it.

How is grafana-dashboards different from an MCP or plugin?

A skill packages instructions and workflow conventions; tools, MCP servers, and plugins are dependencies the skill may call during execution.

About grafana-dashboards

When to use grafana-dashboards

  • When you need to create Grafana dashboard JSON for Prometheus-backed service or infrastructure metrics.
  • When you want reusable dashboard templates for API, database, or infrastructure monitoring.
  • When you are provisioning dashboards as code using files checked into a repository.

When grafana-dashboards is not the right choice

  • When you need the agent to directly operate a live Grafana instance via API or UI, since no Grafana connector is specified.
  • When your main task is collecting or scraping metrics rather than building dashboards.

What it produces

Produces a text report and a code diff.