Distributed Tracing
Distributed tracing tracks requests as they flow through multiple services. When a user makes a request, tracing shows exactly which services were called, how long each took, and where failures occurred.
What You'll Learn
- Core tracing concepts: traces, spans, and context propagation
- How OpenTelemetry works on the Ybor Platform
- Auto-instrumentation vs. manual instrumentation
- How to connect traces to logs
Why Tracing Matters
In a microservices architecture, a single user request may touch many services:
 User Request
       │
       ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Gateway   │───►│   Orders    │───►│  Inventory  │
│   Service   │    │   Service   │    │   Service   │
└─────────────┘    └─────────────┘    └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │   Payment   │
                   │   Service   │
                   └─────────────┘
Without tracing, questions like these are hard to answer:
- Which service caused the 500 error?
- Why did this request take 3 seconds?
- Which database query is the bottleneck?
Tracing answers these questions by connecting all the pieces.
Concepts
Trace
A trace represents the entire journey of a request through your system. Each trace has a unique trace ID that connects all related operations.
Span
A span represents a single unit of work within a trace. Each span has:
- Name: What operation this represents (e.g., "GET /api/users")
- Start time and duration
- Parent span: The span that created this one
- Attributes: Key-value pairs with additional context
- Status: Success, error, or unset
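To make the relationship between traces and spans concrete, here is a minimal sketch in plain Python. The `Span` dataclass and `render_trace` helper are illustrative names invented for this example; they are not the OpenTelemetry SDK types, and the span IDs are shortened for readability. The sketch shows how a shared trace ID and parent span IDs let a backend rebuild the request tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Illustrative span record (not the OpenTelemetry SDK class)."""
    trace_id: str
    span_id: str
    name: str
    start_ms: int
    duration_ms: int
    parent_id: Optional[str] = None  # None marks the root span
    attributes: Dict[str, str] = field(default_factory=dict)
    status: str = "unset"  # "ok", "error", or "unset"


def render_trace(spans: List[Span]) -> List[str]:
    """Return one line per span, with children indented under their parent."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Visit siblings in start-time order, then recurse into each one.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            indent = "  " * depth
            lines.append(f"{indent}{span.name} ({span.duration_ms} ms, {span.status})")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TRACE_ID, "a1", "GET /api/orders", start_ms=0, duration_ms=120),
    Span(TRACE_ID, "b2", "SELECT orders", start_ms=10, duration_ms=40,
         parent_id="a1", attributes={"db.system": "postgresql"}),
    Span(TRACE_ID, "c3", "POST /payments", start_ms=55, duration_ms=60,
         parent_id="a1", status="error"),
]
trace_lines = render_trace(spans)
print("\n".join(trace_lines))
```

Every span carries the same trace ID, and the parent IDs alone are enough to recover the call hierarchy, which is how a tracing backend can show where in the request the error occurred.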
Context Propagation
Trace context (trace ID and span ID) must be passed from each service to the next one it calls. This is called context propagation. The standard approach is the W3C Trace Context format, which carries the context in HTTP headers:
GET /api/orders HTTP/1.1
Host: orders-service
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value
The traceparent header contains:
- Version (00)
- Trace ID (4bf92f3577b34da6a3ce929d0e0e4736)
- Parent span ID (00f067aa0ba902b7)
- Flags (01 = sampled)
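The header layout can be checked with a short parser. This is a minimal sketch for version 00 headers only; `parse_traceparent`, `format_traceparent`, and the `TraceParent` tuple are hypothetical helper names for this example, not OpenTelemetry APIs. In practice the OpenTelemetry propagators handle this for you.

```python
import re
from typing import NamedTuple, Optional

# Four lowercase-hex fields separated by dashes: version, trace ID, parent ID, flags.
_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent value; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    # All-zero IDs and version "ff" are invalid in the W3C format.
    if match["version"] == "ff":
        return None
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)  # lowest bit is the sampled flag
    return TraceParent(match["version"], match["trace_id"], match["parent_id"], sampled)


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header a service sends on its own outgoing requests."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# The trace ID is forwarded unchanged; the calling service's own span ID
# (an arbitrary example value here) becomes the parent ID for the next hop.
outgoing = format_traceparent(incoming.trace_id, "b7ad6b7169203331", incoming.sampled)
print(outgoing)
```

Keeping the trace ID fixed while replacing the parent ID at each hop is what lets the backend stitch spans from different services into one trace.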
OpenTelemetry
The Ybor Platform uses OpenTelemetry (OTel) as the standard for distributed tracing. OpenTelemetry provides:
- Vendor-neutral instrumentation: Write once, export to any backend
- Auto-instrumentation: Automatic tracing for common frameworks
- Manual instrumentation: APIs for custom spans
- Unified API: Consistent approach across languages
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        Your Application                         │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐     │
│  │  Auto-Instru-  │  │  Manual Spans  │  │    OTel SDK    │     │
│  │   mentation    │  │  (your code)   │  │                │     │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘     │
│          └───────────────────┼───────────────────┘              │
│                              │                                  │
│                     ┌────────▼────────┐                         │
│                     │  OTel Exporter  │                         │
│                     └────────┬────────┘                         │
└──────────────────────────────┼──────────────────────────────────┘
                               │ OTLP (OpenTelemetry Protocol)
                               ▼
                      ┌─────────────────┐
                      │ OTel Collector  │
                      └────────┬────────┘
                               │
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
          ┌─────────┐     ┌─────────┐     ┌─────────┐
          │  Tempo  │     │ Jaeger  │     │ Zipkin  │
          └─────────┘     └─────────┘     └─────────┘
Auto-Instrumentation
OpenTelemetry provides automatic instrumentation for common libraries:
| Category | Libraries |
|---|---|
| HTTP Servers | Express, FastAPI, ASP.NET Core, Spring |
| HTTP Clients | requests, HttpClient, RestTemplate |
| Databases | PostgreSQL, MySQL, Redis, MongoDB |
| Messaging | Kafka, RabbitMQ, SQS |
| gRPC | All major gRPC libraries |
Auto-instrumentation captures:
- Incoming HTTP requests
- Outgoing HTTP requests
- Database queries
- Message queue operations
Manual Instrumentation
For custom business logic, create spans manually:
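# Python sketch - see the language reference for other languages.
# `order` and the three helper functions are placeholders for your own code.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# start_as_current_span makes this span the parent of any spans created inside the block
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", order.id)
    span.set_attribute("customer_id", order.customer_id)
    validate_order(order)
    process_payment(order)
    ship_order(order)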
Span Attributes
Add context to spans with attributes:
| Attribute | Example | Description |
|---|---|---|
| http.method | GET | HTTP method |
| http.url | /api/users | Request URL |
| http.status_code | 200 | Response status |
| db.system | postgresql | Database type |
| db.statement | SELECT * FROM users | Query (sanitized) |
| messaging.system | kafka | Message broker |
| messaging.destination | orders | Topic/queue name |
OpenTelemetry defines semantic conventions for standard attributes.
Sampling
In high-traffic systems, tracing every request is expensive. Sampling reduces volume while maintaining visibility:
| Strategy | Description | Use Case |
|---|---|---|
| Always On | Trace every request | Development, low traffic |
| Probabilistic | Trace X% of requests | General production use |
| Rate Limiting | Trace N requests/second | High traffic services |
| Tail-based | Decide after the trace completes; keep traces with errors or high latency | Error/latency analysis |
In development and early production, trace everything. Add sampling only when volume becomes a concern.
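Probabilistic sampling is usually made deterministic by deriving the decision from the trace ID, so that every service reaches the same verdict for the same request. The sketch below illustrates that idea in simplified form; `should_sample` is a hypothetical helper, not the OpenTelemetry sampler API, and the SDK's built-in trace-ID-ratio samplers may differ in detail.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide whether to keep a trace, using only its trace ID.

    The lower 64 bits of the ID are compared against a threshold
    proportional to `ratio`, so roughly `ratio` of all uniformly random
    trace IDs fall below it, and the result is identical in every
    service that sees the same ID.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for ratio in (0.0, 0.5, 0.75, 1.0):
    print(ratio, should_sample(trace_id, ratio))
```

Because the decision depends only on the trace ID and the ratio, no coordination between services is needed. Tail-based sampling is different: it needs the finished trace, so it typically runs in the collector rather than in the application.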
Connecting Logs and Traces
Include the trace ID and span ID in every log entry so that logs can be correlated with traces:
{
"timestamp": "2025-01-15T14:32:01.234Z",
"level": "INFO",
"message": "Order created",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"orderId": "ord_12345"
}
This allows jumping from a log entry directly to its trace in your observability platform.
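One way to produce entries like this is a log formatter that reads the active trace context and adds it to each record. The sketch below uses only the Python standard library; the `current_trace` context variable and `TraceJsonFormatter` class are assumptions made for this example. A real service would read the IDs from the OpenTelemetry SDK's current span instead of setting them by hand.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Hypothetical holder for the active (trace_id, span_id) pair.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceJsonFormatter(logging.Formatter):
    """Format records as JSON, adding trace context when one is active."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        context = current_trace.get()
        if context is not None:
            entry["traceId"], entry["spanId"] = context
        return json.dumps(entry)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(TraceJsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("Order created")
entry = json.loads(stream.getvalue())
print(entry["traceId"], entry["spanId"])
```

Putting the IDs in the formatter rather than in each log call means application code does not have to pass them around, and entries logged outside a request simply omit the two fields.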
Best Practices
- Use auto-instrumentation first - It covers most common operations
- Add manual spans for business logic - Capture what auto-instrumentation misses
- Include meaningful attributes - Add context that helps debugging
- Propagate context everywhere - HTTP, messaging, async jobs
- Connect to logs - Include trace/span IDs in log entries
- Use semantic conventions - Follow OpenTelemetry naming standards
Languages
For language-specific tracing implementations:
- Python - opentelemetry-python
- .NET - OpenTelemetry.NET
- Java - OpenTelemetry Java
- Rust - tracing, opentelemetry-rust
- JavaScript - @opentelemetry/sdk-node
Related
- Logging — Connect logs to traces with trace IDs
- Metrics — Aggregate performance measurements
- Observability Overview — The four pillars and platform integration