
Distributed Tracing

Distributed tracing tracks requests as they flow through multiple services. When a user makes a request, tracing shows exactly which services were called, how long each took, and where failures occurred.

What You'll Learn

  • Core tracing concepts: traces, spans, and context propagation
  • How OpenTelemetry works on the Ybor Platform
  • Auto-instrumentation vs. manual instrumentation
  • How to connect traces to logs

Why Tracing Matters

In a microservices architecture, a single user request may touch many services:

User Request
       │
       ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Gateway   │───►│   Orders    │───►│  Inventory  │
│   Service   │    │   Service   │    │   Service   │
└─────────────┘    └─────────────┘    └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │   Payment   │
                   │   Service   │
                   └─────────────┘

Without tracing, debugging is difficult:

  • Which service caused the 500 error?
  • Why did this request take 3 seconds?
  • Which database query is the bottleneck?

Tracing answers these questions by connecting all the pieces.

Concepts

Trace

A trace represents the entire journey of a request through your system. Each trace has a unique trace ID that connects all related operations.

Span

A span represents a single unit of work within a trace. Each span has:

  • Name: What operation this represents (e.g., "GET /api/users")
  • Start time and duration
  • Parent span: The span that created this one (the root span of a trace has no parent)
  • Attributes: Key-value pairs with additional context
  • Status: Success, error, or unset
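
To make the relationship between a trace and its spans concrete, here is a minimal sketch in plain Python. The `Span` dataclass and the `render_trace` helper are hypothetical names used only for illustration; they model the fields listed above and are not the OpenTelemetry SDK classes. The span IDs and timings are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical model of a span; not the OpenTelemetry SDK class."""
    trace_id: str
    span_id: str
    name: str
    start_ms: int
    duration_ms: int
    parent_id: Optional[str] = None      # None marks the root span
    attributes: dict = field(default_factory=dict)
    status: str = "unset"                # "ok", "error", or "unset"


def render_trace(spans: list[Span]) -> list[str]:
    """Return one indented line per span, with children nested under parents."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Visit siblings in start-time order, then recurse into their children
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms, {span.status})")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# All spans share one trace ID; parent_id links them into a tree.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TRACE_ID, "a1", "GET /api/orders", 0, 120, status="ok"),
    Span(TRACE_ID, "b2", "orders.create", 5, 100, parent_id="a1", status="ok"),
    Span(TRACE_ID, "c3", "inventory.reserve", 10, 30, parent_id="b2", status="ok"),
    Span(TRACE_ID, "d4", "payment.charge", 45, 55, parent_id="b2", status="error"),
]

for line in render_trace(spans):
    print(line)
```

The indented output mirrors the waterfall view most tracing backends show: the failing `payment.charge` span sits under the operation that called it.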

Context Propagation

Trace context (trace ID and span ID) must be passed from each service to the next one it calls. This is called context propagation. For HTTP, the standard approach is the W3C Trace Context headers, traceparent and tracestate:

GET /api/orders HTTP/1.1
Host: orders-service
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value

The traceparent header contains:

  • Version (00)
  • Trace ID (4bf92f3577b34da6a3ce929d0e0e4736)
  • Parent span ID (00f067aa0ba902b7)
  • Flags (01 = sampled)
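
The header layout above can be sketched in a few lines of standard-library Python. This is an illustration of the format only; in practice the OpenTelemetry propagators do this for you. The names `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical helpers introduced for this example.

```python
import re
import secrets
from typing import NamedTuple

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte is the "sampled" flag
        return (int(self.flags, 16) & 0x01) == 0x01


def parse_traceparent(header: str) -> TraceParent:
    """Split a traceparent header into its four fields, rejecting malformed input."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    parsed = TraceParent(*match.groups())
    # All-zero IDs are treated as invalid
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        raise ValueError("trace ID and parent ID must not be all zeros")
    return parsed


def child_traceparent(incoming: TraceParent) -> str:
    """Header for an outgoing call: same trace ID, a fresh span ID as the new parent."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{incoming.version}-{incoming.trace_id}-{new_span_id}-{incoming.flags}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.parent_id, ctx.sampled)
print(child_traceparent(ctx))
```

The trace ID stays constant across every hop, while each service substitutes its own span ID before calling the next one. That is what lets the backend stitch spans from different services into a single trace.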

OpenTelemetry

The Ybor Platform uses OpenTelemetry (OTel) as the standard for distributed tracing. OpenTelemetry provides:

  • Vendor-neutral instrumentation: Write once, export to any backend
  • Auto-instrumentation: Automatic tracing for common frameworks
  • Manual instrumentation: APIs for custom spans
  • Unified API: Consistent approach across languages

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Your Application                         │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐     │
│  │  Auto-Instru-  │  │  Manual Spans  │  │    OTel SDK    │     │
│  │   mentation    │  │  (your code)   │  │                │     │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘     │
│          └───────────────────┼───────────────────┘              │
│                              │                                  │
│                     ┌────────▼────────┐                         │
│                     │  OTel Exporter  │                         │
│                     └────────┬────────┘                         │
└──────────────────────────────┼──────────────────────────────────┘
                               │ OTLP (OpenTelemetry Protocol)
                               ▼
                      ┌─────────────────┐
                      │ OTel Collector  │
                      └────────┬────────┘
                               │
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
          ┌─────────┐     ┌─────────┐     ┌─────────┐
          │  Tempo  │     │ Jaeger  │     │ Zipkin  │
          └─────────┘     └─────────┘     └─────────┘

Auto-Instrumentation

OpenTelemetry provides automatic instrumentation for common libraries:

| Category     | Libraries                              |
|--------------|----------------------------------------|
| HTTP Servers | Express, FastAPI, ASP.NET Core, Spring |
| HTTP Clients | requests, HttpClient, RestTemplate     |
| Databases    | PostgreSQL, MySQL, Redis, MongoDB      |
| Messaging    | Kafka, RabbitMQ, SQS                   |
| gRPC         | All major gRPC libraries               |

Auto-instrumentation captures:

  • Incoming HTTP requests
  • Outgoing HTTP requests
  • Database queries
  • Message queue operations

Manual Instrumentation

For custom business logic, create spans manually:

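# Python sketch using the OpenTelemetry API; see the language reference for other languages
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_order(order):
    # start_as_current_span makes this span the parent of any spans created inside the block
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order_id", order.id)
        span.set_attribute("customer_id", order.customer_id)

        # validate_order, process_payment, and ship_order are your own application functions
        validate_order(order)
        process_payment(order)
        ship_order(order)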

Span Attributes

Add context to spans with attributes:

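| Attribute             | Example             | Description       |
|-----------------------|---------------------|-------------------|
| http.method           | GET                 | HTTP method       |
| http.url              | /api/users          | Request URL       |
| http.status_code      | 200                 | Response status   |
| db.system             | postgresql          | Database type     |
| db.statement          | SELECT * FROM users | Query (sanitized) |
| messaging.system      | kafka               | Message broker    |
| messaging.destination | orders              | Topic/queue name  |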

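OpenTelemetry defines semantic conventions for standard attributes. The names change between convention versions (for example, the stable HTTP conventions use http.request.method and http.response.status_code in place of http.method and http.status_code), so check which version your SDK emits.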

Sampling

In high-traffic systems, tracing every request is expensive. Sampling reduces volume while maintaining visibility:

| Strategy      | Description              | Use Case                 |
|---------------|--------------------------|--------------------------|
| Always On     | Trace every request      | Development, low traffic |
| Probabilistic | Trace X% of requests     | General production use   |
| Rate Limiting | Trace N requests/second  | High traffic services    |
| Tail-based    | Keep interesting traces  | Error/latency analysis   |

Start with Always On

In development and early production, trace everything. Add sampling only when volume becomes a concern.
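
When sampling does become necessary, the strategies in the table can be sketched as follows. This is a simplified illustration in standard-library Python, not the OpenTelemetry sampler API: `probabilistic_sample`, `RateLimitingSampler`, and `tail_keep` are hypothetical names, and the span dictionaries with `status`, `start_ms`, and `end_ms` keys are an assumed shape chosen for the example.

```python
import time


def probabilistic_sample(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding from the trace ID itself.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is either kept whole or dropped whole.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < int(ratio * 2**64)


class RateLimitingSampler:
    """Token bucket: allow at most `per_second` sampled traces per second."""

    def __init__(self, per_second: float, clock=time.monotonic):
        self.per_second = per_second
        self.clock = clock
        self.tokens = per_second   # start with a full bucket
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size
        self.tokens = min(self.per_second,
                          self.tokens + (now - self.last) * self.per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def tail_keep(spans: list[dict], latency_threshold_ms: float) -> bool:
    """Tail-based decision, made after the trace completes: keep errors and slow traces."""
    has_error = any(s["status"] == "error" for s in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms > latency_threshold_ms


print(probabilistic_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
```

The first two strategies decide when a trace starts, so they are cheap but blind to the outcome. The tail-based check needs every span of the trace in one place before it can decide, which is why it is usually run in a collector rather than in the application.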

Connecting Logs and Traces

Include trace context in your logs to connect them:

{
"timestamp": "2025-01-15T14:32:01.234Z",
"level": "INFO",
"message": "Order created",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"orderId": "ord_12345"
}

This allows jumping from a log entry directly to its trace in your observability platform.
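
One way to produce log entries of this shape is sketched below with Python's standard `logging` module. It assumes the active trace and span IDs are available from a context variable; `current_trace`, `TraceContextFilter`, `JsonFormatter`, `build_logger`, and the `extra={"fields": ...}` convention are all hypothetical names for this example. With the OpenTelemetry SDK you would read the IDs from the current span instead.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical holder for the active trace context (an assumption of this sketch)
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get() or {}
        record.traceId = ctx.get("traceId")
        record.spanId = ctx.get("spanId")
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
        }
        # Extra business fields passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str, handler: logging.Handler) -> logging.Logger:
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger("orders", logging.StreamHandler())
current_trace.set({
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "spanId": "00f067aa0ba902b7",
})
logger.info("Order created", extra={"fields": {"orderId": "ord_12345"}})
```

Attaching the IDs in a filter rather than at each call site means existing log statements pick up trace context without being edited.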

Best Practices

  1. Use auto-instrumentation first - It covers most common operations
  2. Add manual spans for business logic - Capture what auto-instrumentation misses
  3. Include meaningful attributes - Add context that helps debugging
  4. Propagate context everywhere - HTTP, messaging, async jobs
  5. Connect to logs - Include trace/span IDs in log entries
  6. Use semantic conventions - Follow OpenTelemetry naming standards

Languages

For language-specific tracing implementations:

  • Python - opentelemetry-python
  • .NET - OpenTelemetry.NET
  • Java - OpenTelemetry Java
  • Rust - tracing, opentelemetry-rust
  • JavaScript - @opentelemetry/sdk-node