Distributed Tracing

Distributed tracing tracks requests as they flow through multiple services. When a user makes a request, tracing shows exactly which services were called, how long each took, and where failures occurred.

What You'll Learn

Core tracing concepts: traces, spans, and context propagation
How OpenTelemetry works on the Ybor Platform
Auto-instrumentation vs. manual instrumentation
How to connect traces to logs

Why Tracing Matters

In a microservices architecture, a single user request may touch many services:

User Request
    │
    ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Gateway   │───►│   Orders    │───►│  Inventory  │
│   Service   │    │   Service   │    │   Service   │
└─────────────┘    └─────────────┘    └─────────────┘
                         │
                         ▼
                   ┌─────────────┐
                   │   Payment   │
                   │   Service   │
                   └─────────────┘

Without tracing, debugging is difficult:

Which service caused the 500 error?
Why did this request take 3 seconds?
Which database query is the bottleneck?

Tracing answers these questions by connecting all the pieces.

Concepts

Trace

A trace represents the entire journey of a request through your system. Each trace has a unique trace ID that connects all related operations.

Span

A span represents a single unit of work within a trace. Each span has:

Name: What operation this represents (e.g., "GET /api/users")
Start time and duration
Parent span: The span that created this one (the root span of a trace has no parent)
Attributes: Key-value pairs with additional context
Status: Success, error, or unset
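
To make the relationship between a trace and its spans concrete, here is a minimal sketch that models spans as plain records and uses them to answer "which operation was slow?" and "where did it fail?". The `Span` class, the helper functions, the short span IDs, and the timings are all made up for illustration; they are not the OpenTelemetry SDK types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span record; fields mirror the list above, not the OTel SDK."""
    trace_id: str
    span_id: str
    name: str
    start_ms: int
    duration_ms: int
    parent_id: Optional[str] = None   # None marks the root span
    attributes: dict = field(default_factory=dict)
    status: str = "unset"             # "unset", "ok", or "error"


def self_time_ms(span: Span, spans: list[Span]) -> int:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another; overlapping children would
    need interval merging instead of a plain sum.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def first_error(spans: list[Span]) -> Optional[Span]:
    """Earliest span marked as an error, or None if nothing failed."""
    errors = [s for s in spans if s.status == "error"]
    return min(errors, key=lambda s: s.start_ms) if errors else None


# All spans share one trace ID; parent_id links them into a tree.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(trace_id, "a1", "GET /checkout", 0, 3000),
    Span(trace_id, "b2", "orders.create", 20, 2900, parent_id="a1"),
    Span(trace_id, "c3", "inventory.reserve", 40, 150, parent_id="b2"),
    Span(trace_id, "d4", "payment.charge", 200, 2600, parent_id="b2",
         status="error"),
]

slowest = max(spans, key=lambda s: self_time_ms(s, spans))
print(slowest.name, first_error(spans).name)
```

In this invented trace, both questions point at the same span, `payment.charge`, which is the kind of answer a tracing backend surfaces visually.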

Context Propagation

Trace context (trace ID and span ID) must be passed between services so that each service can attach its spans to the same trace. This is called context propagation. For HTTP, the standard approach is the W3C Trace Context headers:

GET /api/orders HTTP/1.1
Host: orders-service
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value

The traceparent header contains:

Version (00)
Trace ID (4bf92f3577b34da6a3ce929d0e0e4736)
Parent span ID (00f067aa0ba902b7)
Flags (01 = sampled)
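
The four fields can be split apart with a few lines of code. The sketch below is a simplified parser for the four-field layout shown above, plus a helper that builds the header a service would send on its own outgoing call. The names `TraceParent`, `parse_traceparent`, and `child_traceparent` are illustrative; in practice the OpenTelemetry propagators handle this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace ID (32 hex) - parent span ID (16 hex) - flags, lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        # The lowest flag bit marks the trace as sampled.
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed.

    Simplification: only the four-field layout is accepted.
    """
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, parent_id, int(flags, 16))


def child_traceparent(incoming: TraceParent) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"00-{incoming.trace_id}-{new_span_id}-{incoming.flags:02x}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = child_traceparent(incoming)
```

The trace ID and the sampled flag are carried over unchanged, while the span ID is replaced, so the downstream service records the caller's span as its parent.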

OpenTelemetry

The Ybor Platform uses OpenTelemetry (OTel) as the standard for distributed tracing. OpenTelemetry provides:

Vendor-neutral instrumentation: Write once, export to any backend
Auto-instrumentation: Automatic tracing for common frameworks
Manual instrumentation: APIs for custom spans
Unified API: Consistent approach across languages

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Your Application                          │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐     │
│  │ Auto-Instru-   │  │ Manual Spans   │  │ OTel SDK       │     │
│  │ mentation      │  │ (your code)    │  │                │     │
│  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘     │
│          └───────────────────┼───────────────────┘              │
│                              │                                   │
│                     ┌────────▼────────┐                         │
│                     │  OTel Exporter  │                         │
│                     └────────┬────────┘                         │
└──────────────────────────────┼──────────────────────────────────┘
                               │ OTLP (OpenTelemetry Protocol)
                               ▼
                     ┌─────────────────┐
                     │  OTel Collector │
                     └────────┬────────┘
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
        ┌─────────┐     ┌─────────┐     ┌─────────┐
        │  Tempo  │     │ Jaeger  │     │ Zipkin  │
        └─────────┘     └─────────┘     └─────────┘

Auto-Instrumentation

OpenTelemetry provides automatic instrumentation for common libraries:

Category	Libraries
HTTP Servers	Express, FastAPI, ASP.NET Core, Spring
HTTP Clients	requests, HttpClient, RestTemplate
Databases	PostgreSQL, MySQL, Redis, MongoDB
Messaging	Kafka, RabbitMQ, SQS
gRPC	All major gRPC libraries

Auto-instrumentation captures:

Incoming HTTP requests
Outgoing HTTP requests
Database queries
Message queue operations
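
Conceptually, auto-instrumentation wraps a library's entry points so that every call records a span without changes at the call site. The toy sketch below shows only that wrapping idea; `instrument`, `recorded_spans`, and the `fake_http` client are hypothetical stand-ins, and real instrumentation packages are considerably more involved.

```python
import functools
import time
import types

recorded_spans = []  # stand-in for an exporter


def instrument(module, func_name: str, span_name: str) -> None:
    """Replace module.func_name with a wrapper that records a span per call."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return original(*args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            # Runs on success and on failure, so every call is recorded.
            recorded_spans.append({
                "name": span_name,
                "duration_s": time.perf_counter() - start,
                "status": status,
            })

    setattr(module, func_name, wrapper)


# Hypothetical HTTP client standing in for a real library.
fake_http = types.SimpleNamespace(get=lambda url: f"200 OK from {url}")

instrument(fake_http, "get", "HTTP GET")

# The call site is unchanged, yet a span is recorded.
response = fake_http.get("http://orders-service/api/orders")
```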

Manual Instrumentation

For custom business logic, create spans manually:

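# Python sketch using the OpenTelemetry API (opentelemetry-api package);
# see the language references for other languages
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_order(order):
    # Starts a span and makes it current, so spans created inside
    # the block (including auto-instrumented ones) become its children
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order_id", order.id)
        span.set_attribute("customer_id", order.customer_id)

        validate_order(order)    # application functions, defined elsewhere
        process_payment(order)
        ship_order(order)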

Span Attributes

Add context to spans with attributes:

Attribute	Example	Description
`http.method`	`GET`	HTTP method
`http.url`	`/api/users`	Request URL
`http.status_code`	`200`	Response status
`db.system`	`postgresql`	Database type
`db.statement`	`SELECT * FROM users`	Query (sanitized)
`messaging.system`	`kafka`	Message broker
`messaging.destination`	`orders`	Topic/queue name

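OpenTelemetry defines semantic conventions for standard attribute names. The names above follow earlier versions of the conventions; newer releases rename several of them (for example, `http.method` became `http.request.method`), so check which version your SDK and instrumentation libraries emit.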

Sampling

In high-traffic systems, tracing every request is expensive. Sampling reduces volume while maintaining visibility:

Strategy	Description	Use Case
Always On	Trace every request	Development, low traffic
Probabilistic	Trace X% of requests	General production use
Rate Limiting	Trace at most N requests/second	High traffic services
Tail-based	Decide after the trace completes; keep errors and slow requests	Error/latency analysis
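
The probabilistic and tail-based strategies can be sketched in a few lines. The code below is an illustration under stated assumptions, not a production sampler: `ratio_sampled` is similar in spirit to the trace-ID-ratio samplers shipped with OpenTelemetry SDKs (whose exact comparison may differ), and `keep_after_completion`, its span dictionaries, and the 1000 ms threshold are hypothetical.

```python
def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Probabilistic sampling: keep roughly `ratio` of all traces.

    The decision is derived from the trace ID rather than a fresh random
    number, so every service that sees the same trace decides the same way.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_64_bits < int(ratio * 2**64)


def keep_after_completion(spans: list[dict], slow_ms: int = 1000) -> bool:
    """Tail-based decision: keep traces with an error or a slow root span.

    Assumes `spans` holds the complete trace, including its root span.
    """
    has_error = any(s["status"] == "error" for s in spans)
    root = next(s for s in spans if s.get("parent_id") is None)
    return has_error or root["duration_ms"] >= slow_ms


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(ratio_sampled(trace_id, 1.0))   # a ratio of 1.0 keeps every trace

fast_ok = [{"parent_id": None, "duration_ms": 120, "status": "ok"}]
print(keep_after_completion(fast_ok))
```

The trade-off is where the decision happens: the ratio check runs when the trace starts and costs almost nothing, while the tail-based check needs every span of the trace collected in one place before it can decide.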

Start with Always On

In development and early production, trace everything. Add sampling only when volume becomes a concern.

Connecting Logs and Traces

Include trace context in your logs to connect them:

{
  "timestamp": "2025-01-15T14:32:01.234Z",
  "level": "INFO",
  "message": "Order created",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "orderId": "ord_12345"
}

With these fields in place, you can jump from a log entry directly to its trace in your observability platform.
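
One way to get these fields into every log entry is a logging filter that copies the active trace context onto each record. The sketch below uses only the Python standard library; `current_trace` is a hypothetical holder for the active context (with the OpenTelemetry SDK this information would come from the current span), and `TraceContextFilter` and `JsonFormatter` are illustrative names.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace context.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = current_trace.get() or {}
        record.traceId = ctx.get("traceId")
        record.spanId = ctx.get("spanId")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "traceId": record.traceId,
            "spanId": record.spanId,
        })


stream = io.StringIO()  # captured here; use sys.stdout in a real service
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Normally set by the tracing layer when a span starts.
current_trace.set({
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "spanId": "00f067aa0ba902b7",
})
logger.info("Order created")

entry = json.loads(stream.getvalue())
```

Attaching the filter to the handler keeps application code unchanged: call sites log as usual and the IDs are added on the way out.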

Best Practices

Use auto-instrumentation first - It covers most common operations
Add manual spans for business logic - Capture what auto-instrumentation misses
Include meaningful attributes - Add context that helps debugging
Propagate context everywhere - HTTP, messaging, async jobs
Connect to logs - Include trace/span IDs in log entries
Use semantic conventions - Follow OpenTelemetry naming standards

Languages

For language-specific tracing implementations:

Python - opentelemetry-python
.NET - OpenTelemetry.NET
Java - OpenTelemetry Java
Rust - tracing, opentelemetry-rust
JavaScript - @opentelemetry/sdk-node

Related

Logging - Connect logs to traces with trace IDs
Metrics - Aggregate performance measurements
Observability Overview - The four pillars and platform integration
