
Observability

This guide covers adding observability to your migrated application. Unlike builds and deployments, observability can be adopted incrementally—start with the basics and add more sophisticated instrumentation over time.

What You'll Learn

  • The observability phases and what to prioritize
  • Minimum requirements for platform integration
  • How to add each observability pillar progressively
  • What to skip initially and what to add later

Requirements Overview

Incremental Adoption

Full observability is not required before you deploy, but Phase 1 health checks (readiness and liveness endpoints) are the minimum expectation for stable deployments. Start there and add additional capabilities progressively based on your operational needs.

| Phase | Components | Priority | Reference |
| --- | --- | --- | --- |
| Phase 1 | Health checks | High — Required for reliable deployments | Health Checks |
| Phase 2 | Structured logging | High — Essential for debugging | Logging |
| Phase 3 | Metrics | Medium — Important for monitoring | Metrics |
| Phase 4 | Tracing | Lower — Valuable for distributed systems | Tracing |

Phase 1: Health Checks

Health checks are the most important observability component for platform deployment. Kubernetes uses them to route traffic and restart unhealthy containers.

Minimum Viable Health Checks

Implement these two endpoints:

| Endpoint | Purpose | Returns |
| --- | --- | --- |
| /health/live | "Is the process running?" | 200 if alive |
| /health/ready | "Can it handle traffic?" | 200 if ready |

Quick Start Pattern

For most legacy applications, start simple:

```python
# Minimal FastAPI-style sketch; adapt the decorators to your framework.
from fastapi import FastAPI

app = FastAPI()

# The liveness endpoint - keep it simple
@app.get("/health/live")
def liveness():
    return {"status": "ok"}

# The readiness endpoint - can check dependencies
@app.get("/health/ready")
def readiness():
    return {"status": "ok"}
```

Keep Liveness Simple

Never check external dependencies in the liveness probe. If your database is down, restarting your application won't fix it.
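
If readiness needs to verify a dependency, keep that logic out of the liveness handler entirely. A minimal sketch of a dependency-aware readiness endpoint, assuming the FastAPI-style app above and a hypothetical check_database() helper that returns True when the database is reachable:

```python
from fastapi import Response

# Readiness may verify dependencies; liveness never should.
@app.get("/health/ready")
def readiness(response: Response):
    if not check_database():  # hypothetical helper - replace with your own check
        response.status_code = 503  # not ready: the platform stops routing traffic here
        return {"status": "unavailable", "reason": "database unreachable"}
    return {"status": "ok"}
```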

Update Your Manifest

```yaml
spec:
  deployment:
    readinessProbe:
      port: 8080
      path: /health/ready
    livenessProbe:
      port: 8080
      path: /health/live
```

For complete patterns including dependency checks, timing configuration, and language-specific implementations, see Health Checks.


Phase 2: Structured Logging

The platform collects logs from stdout/stderr. Structured (JSON) logs are much more useful than plain text.

Why Structured Logging?

| Plain Text | Structured JSON |
| --- | --- |
| Error connecting to database: timeout | {"level":"error","message":"Error connecting to database","error":"timeout","timestamp":"..."} |
| Hard to filter and search | Easy to filter by level, component, error type |

Migration Approach

If your application already logs to stdout, it works with the platform. To improve:

  1. Configure JSON output — Most logging libraries support this (see the sketch after this list)
  2. Add standard fields — level, timestamp, message at minimum
  3. Include context — request IDs, user IDs where appropriate
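
As a concrete starting point, here is one way to emit JSON logs on stdout with structlog (one of the Python libraries listed below); the event and field names are illustrative:

```python
import structlog

# Render every log call as one JSON line on stdout, with the level and
# an ISO-8601 timestamp added automatically.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()
log.info("order processed", order_id="A-123", duration_ms=42)
# emits something like {"event": "order processed", "level": "info", "timestamp": "...", "order_id": "A-123", ...}
```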

Language-Specific Libraries

| Language | Recommended Library |
| --- | --- |
| Python | structlog, python-json-logger |
| Node.js | pino, winston |
| Java | Logback with JSON encoder |
| .NET | Serilog with JSON formatter |
| Rust | tracing with JSON subscriber |

For configuration patterns and examples, see Logging and the Language Reference.


Phase 3: Metrics

Metrics provide numerical measurements for monitoring and alerting. The platform scrapes Prometheus-format metrics from your /metrics endpoint.

When to Add Metrics

Add metrics when you need:

  • Dashboards showing application performance
  • Alerts based on error rates or latency
  • Capacity planning data
  • SLI/SLO tracking

What Metrics to Start With

| Metric Type | Examples |
| --- | --- |
| RED metrics | Request rate, Error rate, Duration |
| USE metrics | Utilization, Saturation, Errors |
| Business metrics | Orders processed, Users active |

Exposing Metrics

Your application exposes metrics by implementing a /metrics endpoint that returns Prometheus format. The platform automatically scrapes this endpoint from any HTTP port.

```yaml
spec:
  deployment:
    ports:
      - port: 8080
        protocol: http # Platform scrapes /metrics on HTTP ports
```
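
On the application side, a minimal sketch of a /metrics endpoint with RED-style instruments, assuming the Python prometheus_client library and a FastAPI-style app (metric and route names are illustrative):

```python
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()

# RED-style instruments: request count (rate and errors) plus duration.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration", ["path"])

@app.get("/orders")
def list_orders():
    with LATENCY.labels(path="/orders").time():
        REQUESTS.labels(path="/orders", status="200").inc()
        return {"orders": []}

# Expose the Prometheus text format at /metrics for the platform to scrape.
app.mount("/metrics", make_asgi_app())
```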

For implementation patterns including auto-instrumentation options, see Metrics.


Phase 4: Tracing

Distributed tracing shows how requests flow through your services. It's most valuable when you have multiple services communicating.

When to Add Tracing

Add tracing when you need:

  • End-to-end request visibility across services
  • Latency breakdown by service
  • Dependency mapping
  • Root cause analysis for distributed failures

Migration Approach

Tracing requires more instrumentation than other pillars. For legacy apps:

  1. Start with auto-instrumentation — OpenTelemetry provides automatic instrumentation for common frameworks (see the sketch after this list)
  2. Add manual spans later — Instrument custom business logic after auto-instrumentation works
  3. Propagate context — Ensure trace context passes between services
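
A minimal sketch of that first step, assuming a Python service built on FastAPI and the requests library, reusing the app object from the earlier examples, with spans exported over OTLP (the collector address is a placeholder):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Export spans in batches to an OTLP endpoint (address is an assumption).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument incoming requests (FastAPI) and outgoing calls (requests);
# both also propagate trace context between services.
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()
```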

For implementation details including auto-instrumentation setup, see Tracing.


Prioritization by Application Type

Different application types benefit from different observability investments:

| Application Type | Recommended Priority |
| --- | --- |
| API/Web Service | Health checks → Logging → Metrics → Tracing |
| Background Worker | Health checks → Logging → Metrics |
| CLI/Batch Job | Logging → Metrics (if long-running) |
| Event Consumer | Health checks → Logging → Metrics → Tracing |

High-Traffic Services

Focus on:

  • Accurate metrics for capacity planning
  • Low-overhead tracing (use sampling; see the sketch below)
  • Efficient structured logging
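
For the sampling point, one way to configure a ratio-based sampler with the OpenTelemetry SDK from Phase 4 (the 10% ratio is an arbitrary example):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample roughly 10% of new traces; follow the parent's decision for propagated ones.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```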

Critical Business Services

Focus on:

  • Comprehensive health checks including dependencies
  • Detailed business metrics
  • Full tracing for debugging

Internal Tools

Focus on:

  • Basic health checks
  • Structured logging for debugging
  • Skip metrics and tracing initially

Common Migration Challenges

Logging Library Conflicts

Problem: Application uses multiple logging libraries with different configurations.

Solution: Consolidate to one library or configure all to output JSON to stdout.
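
For the second option, a minimal sketch using python-json-logger to route everything that goes through the standard library's root logger to JSON on stdout, so libraries that log via the logging module inherit the format (the field mapping is illustrative):

```python
import logging
import sys
from pythonjsonlogger import jsonlogger

# One JSON handler on stdout for all stdlib-based loggers.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

root = logging.getLogger()
root.handlers = [handler]   # replace handlers other libraries may have installed
root.setLevel(logging.INFO)

logging.getLogger("some.library").info("now emitted as JSON")
```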

Performance Concerns

Problem: Concern that instrumentation will add unacceptable latency or resource overhead.

Solution:

  • Start with low-overhead options (structured logging, basic metrics)
  • Use sampling for tracing in high-traffic services
  • Profile before and after to quantify impact

Missing Request Context

Problem: Logs don't include request IDs or correlation IDs.

Solution: Add middleware to generate and propagate request IDs. See your language's reference page for patterns.
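
A minimal sketch of such middleware, assuming FastAPI and the structlog setup from Phase 2 (header and field names are illustrative):

```python
import uuid
import structlog
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def request_id_middleware(request: Request, call_next):
    # Reuse an incoming X-Request-ID header, or generate a new one.
    request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
    # Bind it to the logging context so every log line in this request carries it
    # (requires structlog.contextvars.merge_contextvars in the processor chain).
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(request_id=request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id  # echo back for correlation
    return response
```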


Next Steps

  1. Implement Phase 1 — Add health check endpoints and update your manifest
  2. Deploy and verify — Ensure health checks work in the platform
  3. Add Phase 2 — Configure structured logging
  4. Iterate — Add metrics and tracing as operational needs arise