Metrics
Metrics are numerical measurements collected over time. Unlike logs (which record discrete events), metrics capture the ongoing state and performance of your application. The Ybor Platform uses Prometheus for metrics collection.
What You'll Learn
- The four Prometheus metric types and when to use each
- Naming conventions and labeling best practices
- What to measure (Golden Signals, RED, USE methods)
- How to expose a /metrics endpoint
How Metrics Work
┌─────────────────┐     scrape /metrics     ┌────────────────┐
│  Your Service   │◄────────────────────────│   Prometheus   │
│                 │                         │                │
│    /metrics     │────────────────────────►│  Time Series   │
│    endpoint     │      metrics data       │    Database    │
└─────────────────┘                         └────────────────┘
- Your application exposes a /metrics endpoint
- Prometheus scrapes this endpoint at regular intervals (typically 15-30 seconds)
- Metrics are stored as time series data
- Grafana queries Prometheus to visualize metrics
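To make the scrape cycle concrete, here is a minimal sketch of what happens on the collecting side: each scrape is parsed into samples, and each sample is appended to its own time series. The names `parse_samples` and `record_scrape` and the in-memory `store` are illustrative assumptions, not part of Prometheus, and the parser ignores the optional per-sample timestamps that the real format allows.

```python
import time


def parse_samples(text):
    """Parse exposition-format text into {series: value}, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def record_scrape(store, text, timestamp=None):
    """Append one scrape's samples to an in-memory time series store."""
    ts = time.time() if timestamp is None else timestamp
    for series, value in parse_samples(text).items():
        store.setdefault(series, []).append((ts, value))
    return store


# Two scrapes taken 15 seconds apart (hypothetical payloads).
store = {}
record_scrape(store, 'http_requests_total{method="GET"} 1542\nactive_connections 42', timestamp=0)
record_scrape(store, 'http_requests_total{method="GET"} 1560\nactive_connections 40', timestamp=15)
```

Each key in `store` is one time series, and each scrape adds one (timestamp, value) point to it. This is the shape of data that a dashboard later queries.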
Metric Types
Prometheus defines four core metric types:
Counter
A value that only increases (or resets to zero on restart).
Use for: Counting events that happen over time.
# Total HTTP requests processed
http_requests_total{method="GET", path="/api/users", status="200"} 1542
http_requests_total{method="POST", path="/api/users", status="201"} 89
http_requests_total{method="GET", path="/api/users", status="500"} 3
Common counters:
- requests_total - Total requests processed
- errors_total - Total errors encountered
- items_processed_total - Total items processed
- bytes_transferred_total - Total bytes sent/received
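Because a counter only goes up, the useful number is usually how fast it grows, and a restart must not show up as a negative change. The sketch below shows one simplified way to compute an increase and a per-second rate from a list of scraped values. The function names are assumptions for illustration, and Prometheus's own rate calculation additionally extrapolates to the window boundaries.

```python
def counter_increase(samples):
    """Total increase across ordered counter samples, tolerating resets."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The value dropped, so the process restarted and counted up from zero.
            increase += curr
    return increase


def per_second_rate(samples, window_seconds):
    """Average per-second growth of the counter over the window."""
    return counter_increase(samples) / window_seconds


# Hypothetical scrapes; the service restarted between the third and fourth.
values = [1500, 1520, 1542, 8, 30]
increase = counter_increase(values)
rate = per_second_rate(values, window_seconds=60)
```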
Gauge
A value that can go up or down.
Use for: Current state or measurements.
# Current number of active connections
active_connections 42
# Current queue depth
queue_size{queue="orders"} 156
# Current memory usage in bytes
memory_usage_bytes 1073741824
Common gauges:
- active_connections - Current connections
- queue_size - Items in queue
- cache_size - Cache entries
- temperature - Current temperature
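A gauge is just a current value that can be set, raised, or lowered. The following stand-in class sketches that behavior, including a helper that tracks how many operations are in flight. It is a hypothetical illustration written with the standard library only; the client libraries listed under Languages provide production-ready equivalents.

```python
import threading
from contextlib import contextmanager


class Gauge:
    """A thread-safe value that can go up or down (illustrative stand-in)."""

    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()

    def set(self, value):
        with self._lock:
            self._value = float(value)

    def inc(self, amount=1.0):
        with self._lock:
            self._value += amount

    def dec(self, amount=1.0):
        with self._lock:
            self._value -= amount

    @property
    def value(self):
        with self._lock:
            return self._value

    @contextmanager
    def track_inprogress(self):
        """Raise the gauge while the wrapped block runs, then lower it again."""
        self.inc()
        try:
            yield
        finally:
            self.dec()


active_connections = Gauge()
active_connections.set(42)
with active_connections.track_inprogress():
    during = active_connections.value  # one more connection while the block runs
after = active_connections.value
```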
Histogram
Samples observations and counts them in configurable buckets.
Use for: Measuring distributions of values, especially latency.
# Request duration histogram
http_request_duration_seconds_bucket{le="0.005"} 24054
http_request_duration_seconds_bucket{le="0.01"} 33444
http_request_duration_seconds_bucket{le="0.025"} 100392
http_request_duration_seconds_bucket{le="0.05"} 129389
http_request_duration_seconds_bucket{le="0.1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53.112
http_request_duration_seconds_count 144320
Common histograms:
- request_duration_seconds - Request latency
- response_size_bytes - Response sizes
- db_query_duration_seconds - Database query time
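The bucket lines in the example above are cumulative: each `le` bucket counts every observation less than or equal to its bound, so the `+Inf` bucket always equals the total count. A minimal sketch of that bookkeeping, using a hypothetical `Histogram` class rather than any real client library:

```python
import bisect


class Histogram:
    """Counts observations into buckets with inclusive upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per bound, plus a final slot for the +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound >= value, matching "le" semantics.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs as they would be exposed."""
        running = 0
        result = []
        for bound, bucket_count in zip(self.bounds + [float("inf")], self.counts):
            running += bucket_count
            result.append((bound, running))
        return result


latency = Histogram([0.005, 0.01, 0.025])
for seconds in (0.003, 0.01, 0.02, 0.5):
    latency.observe(seconds)
buckets = latency.cumulative()
```

Choosing bucket bounds close to the latencies you care about matters, because anything that falls between two bounds can only be estimated later.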
Summary
Similar to histogram but calculates quantiles on the client side.
Use for: When you need pre-calculated percentiles.
# Request duration summary
http_request_duration_seconds{quantile="0.5"} 0.023
http_request_duration_seconds{quantile="0.9"} 0.042
http_request_duration_seconds{quantile="0.99"} 0.087
Prefer histograms in most cases. Unlike summary quantiles, histogram buckets can be aggregated across instances, and any percentile can be calculated at query time in Prometheus.
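The sketch below illustrates why buckets aggregate well: bucket counts from several instances can simply be added, and a quantile can then be estimated from the merged buckets by linear interpolation inside the bucket that contains the target rank. The function names are assumptions for illustration, and the interpolation is a simplified version of what a query-time quantile function does. The sample data is the histogram shown earlier in this section.

```python
def merge_buckets(*instances):
    """Sum cumulative bucket counts from instances that share the same bounds."""
    merged = {}
    for buckets in instances:
        for le, count in buckets:
            merged[le] = merged.get(le, 0) + count
    return sorted(merged.items())


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (le, count) pairs sorted by le."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                # The open-ended bucket has no upper bound to interpolate toward.
                return prev_le
            if count == prev_count:
                return le
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


INF = float("inf")
instance = [(0.005, 24054), (0.01, 33444), (0.025, 100392),
            (0.05, 129389), (0.1, 133988), (INF, 144320)]

p50 = estimate_quantile(0.5, instance)
p50_two_instances = estimate_quantile(0.5, merge_buckets(instance, instance))
```

Summary quantiles cannot be combined this way: averaging the 0.99 quantile of two instances does not give the 0.99 quantile of their combined traffic.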
Naming Conventions
Follow Prometheus naming conventions for consistency:
| Pattern | Example | Description |
|---|---|---|
| <namespace>_<name>_<unit> | http_request_duration_seconds | Include unit suffix |
| _total suffix for counters | http_requests_total | Indicates counter type |
| _info suffix for metadata | app_info | Build/version information |
Good names:
http_requests_total
http_request_duration_seconds
db_connections_active
cache_hits_total
queue_messages_pending
Bad names:
requests # Missing namespace and unit
httpRequestsTotal # CamelCase (use snake_case)
request_time_ms # Use seconds, not milliseconds
num_errors # Use _total suffix for counters
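Conventions like these are easy to check automatically, for example in a unit test over your registered metric names. The helper below is a hypothetical sketch covering only the rules listed above; the suffix list is an assumption and far from exhaustive.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Assumed examples of non-base unit suffixes; extend to suit your codebase.
NON_BASE_UNITS = ("_ms", "_milliseconds", "_us", "_microseconds",
                  "_minutes", "_kb", "_mb", "_kilobytes", "_megabytes")


def lint_metric_name(name, metric_type):
    """Return a list of convention problems for a metric name (empty if none)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("use snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total is reserved for counters")
    if name.endswith(NON_BASE_UNITS):
        problems.append("use base units such as seconds or bytes")
    return problems


good = lint_metric_name("http_requests_total", "counter")
bad_unit = lint_metric_name("request_time_ms", "histogram")
bad_counter = lint_metric_name("num_errors", "counter")
```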
Labels
Labels add dimensions to metrics, allowing filtering and aggregation.
http_requests_total{method="GET", path="/api/users", status="200"}
http_requests_total{method="POST", path="/api/users", status="201"}
http_requests_total{method="GET", path="/api/orders", status="200"}
Label Best Practices
DO use labels for:
- HTTP method, path (the route template, not the raw URL), status code
- Database operation type
- Queue or topic name
- Error type or code
DO NOT use labels for:
- High cardinality values (user IDs, request IDs)
- Unbounded sets (arbitrary user input)
- Values that change frequently
Every unique combination of labels creates a new time series. Avoid labels with many possible values (user IDs, timestamps) as they can overwhelm Prometheus.
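A quick way to see the problem is to multiply the number of possible values per label, which gives the worst-case number of time series for a single metric. The label sets below are made-up examples, and `series_count` is an illustrative helper.

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series: the product of values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "path": ["/api/users", "/api/orders"],
    "status": ["200", "201", "400", "404", "500"],
}

# Adding a user_id label with 100,000 users multiplies every existing series.
with_user_id = dict(bounded, user_id=range(100_000))

small = series_count(bounded)       # 4 * 2 * 5
large = series_count(with_user_id)  # 4 * 2 * 5 * 100,000
```

In practice not every combination occurs, but the product shows how a single unbounded label dominates the total.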
What to Measure
The Four Golden Signals
Google's Site Reliability Engineering book recommends monitoring these four signals:
| Signal | Description | Metric Type |
|---|---|---|
| Latency | Time to service a request | Histogram |
| Traffic | Requests per second | Counter |
| Errors | Rate of failed requests | Counter |
| Saturation | How "full" your service is | Gauge |
RED Method
For request-driven services:
- Rate - Requests per second
- Errors - Failed requests per second
- Duration - Time per request
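All three RED values can be derived from counters and a histogram's sum by comparing two scrapes. The sketch below assumes two hypothetical snapshots taken 30 seconds apart, with illustrative key names, and that no counter reset happened in between.

```python
def red_summary(before, after, interval_seconds):
    """Derive Rate, Errors, and mean Duration from two counter snapshots."""
    requests = after["requests_total"] - before["requests_total"]
    errors = after["errors_total"] - before["errors_total"]
    duration_sum = after["duration_seconds_sum"] - before["duration_seconds_sum"]
    return {
        "rate_per_second": requests / interval_seconds,
        "errors_per_second": errors / interval_seconds,
        # Mean latency over the interval; percentiles need histogram buckets.
        "mean_duration_seconds": duration_sum / requests if requests else 0.0,
    }


before = {"requests_total": 1000, "errors_total": 10, "duration_seconds_sum": 50.0}
after = {"requests_total": 1300, "errors_total": 13, "duration_seconds_sum": 65.0}
summary = red_summary(before, after, interval_seconds=30)
```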
USE Method
For resources (CPU, memory, disk):
- Utilization - Percentage of resource used
- Saturation - Amount of work queued
- Errors - Error events
The /metrics Endpoint
Your application should expose a /metrics endpoint that returns metrics in the Prometheus text exposition format:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/users",status="200"} 1542
http_requests_total{method="POST",path="/api/users",status="201"} 89
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 24054
http_request_duration_seconds_bucket{le="0.01"} 33444
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53.112
http_request_duration_seconds_count 144320
# HELP active_connections Current number of active connections
# TYPE active_connections gauge
active_connections 42
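In practice you would use one of the client libraries listed below, but the endpoint itself is simple enough to sketch with the Python standard library. The example renders a counter and a gauge in the text format and serves them at /metrics. The metric state, the `render_metrics` and `serve` names, and port 8000 are all assumptions for illustration, and label values are not escaped as a real library would do.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state; a real service updates these as it handles work.
REQUESTS = {("GET", "/api/users", "200"): 1542, ("POST", "/api/users", "201"): 89}
ACTIVE_CONNECTIONS = 42


def render_metrics():
    """Render the current metric state in the Prometheus text format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, path, status), value in sorted(REQUESTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",status="{status}"}} {value}'
        )
    lines += [
        "# HELP active_connections Current number of active connections",
        "# TYPE active_connections gauge",
        f"active_connections {ACTIVE_CONNECTIONS}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the server; requesting /metrics on that port should then return the rendered text.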
Languages
For language-specific metrics implementations:
- Python - prometheus-client
- .NET - prometheus-net, OpenTelemetry
- Java - Micrometer, Spring Boot Actuator
- Rust - prometheus crate
- JavaScript - prom-client
Related
- Logging — Discrete events for debugging
- Tracing — Request flow for latency analysis
- Observability Overview — The four pillars and platform integration