# Distributed Tracing with OpenTelemetry
This guide explains how to configure and use distributed tracing in vLLM Semantic Router for enhanced observability and debugging capabilities.
## Overview
vLLM Semantic Router implements comprehensive distributed tracing using OpenTelemetry, providing fine-grained visibility into the request processing pipeline. Tracing helps you:
- **Debug Production Issues**: Trace individual requests through the entire routing pipeline
- **Optimize Performance**: Identify bottlenecks in classification, caching, and routing
- **Monitor Security**: Track PII detection and jailbreak prevention operations
- **Analyze Decisions**: Understand routing logic and reasoning mode selection
- **Correlate Services**: Connect traces across the router and vLLM backends
## Architecture
### Trace Hierarchy
A typical request trace follows this structure:
```text
semantic_router.request.received [root span]
├─ semantic_router.classification
├─ semantic_router.security.pii_detection
├─ semantic_router.security.jailbreak_detection
├─ semantic_router.cache.lookup
├─ semantic_router.routing.decision
├─ semantic_router.backend.selection
├─ semantic_router.system_prompt.injection
└─ semantic_router.upstream.request
```
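Each child span is started from the context of its parent, so custom instrumentation produces the same structure naturally. Here is a minimal sketch with the OpenTelemetry Go SDK; the span names come from the hierarchy above, but the `handleRequest` function and its surroundings are hypothetical, not the router's actual code:

```go
package router

import (
	"context"

	"go.opentelemetry.io/otel"
)

// handleRequest is a hypothetical handler showing how the span
// hierarchy above could be produced: each child span is started
// from the root span's context.
func handleRequest(ctx context.Context) {
	tracer := otel.Tracer("semantic-router")

	// Root span for the whole request.
	ctx, root := tracer.Start(ctx, "semantic_router.request.received")
	defer root.End()

	// Child span: classification.
	_, classification := tracer.Start(ctx, "semantic_router.classification")
	// ... run classification ...
	classification.End()

	// Child span: semantic cache lookup.
	_, cache := tracer.Start(ctx, "semantic_router.cache.lookup")
	// ... look up the cache ...
	cache.End()
}
```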
### Span Attributes
Each span includes rich attributes following OpenInference conventions for LLM observability:
**Request Metadata:**

- `request.id` - Unique request identifier
- `user.id` - User identifier (if available)
- `http.method` - HTTP method
- `http.path` - Request path

**Model Information:**

- `model.name` - Selected model name
- `routing.original_model` - Original requested model
- `routing.selected_model` - Model selected by the router

**Classification:**

- `category.name` - Classified category
- `classifier.type` - Classifier implementation
- `classification.time_ms` - Classification duration

**Security:**

- `pii.detected` - Whether PII was found
- `pii.types` - Types of PII detected
- `jailbreak.detected` - Whether a jailbreak attempt was detected
- `security.action` - Action taken (blocked, allowed)

**Routing:**

- `routing.strategy` - Routing strategy (auto, specified)
- `routing.reason` - Reason for the routing decision
- `reasoning.enabled` - Whether reasoning mode is enabled
- `reasoning.effort` - Reasoning effort level

**Performance:**

- `cache.hit` - Cache hit/miss status
- `cache.lookup_time_ms` - Cache lookup duration
- `processing.time_ms` - Total processing time
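As a hedged illustration of how such attributes are attached, the Go SDK's `attribute` package provides typed key/value constructors. The `annotate` function and the concrete values below are hypothetical:

```go
package router

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// annotate attaches routing and performance metadata to a span.
// The values shown are illustrative only.
func annotate(span trace.Span) {
	span.SetAttributes(
		attribute.String("category.name", "math"),
		attribute.String("routing.selected_model", "gpt-4"),
		attribute.Bool("cache.hit", false),
		attribute.Int64("classification.time_ms", 45),
	)
}
```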
## Configuration
### Basic Configuration
Add the `observability.tracing` section to your `config.yaml`:
```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout"  # or "otlp"
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "always_on"  # or "probabilistic"
      rate: 1.0
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
```
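To make the options concrete, here is a rough sketch of how this configuration could map onto OpenTelemetry Go SDK initialization. This is an assumption about the wiring for illustration (the `tracing` package and `initTracing` function are hypothetical), not the router's actual implementation:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracing mirrors the YAML above: an OTLP gRPC exporter,
// an always-on sampler, and resource attributes.
func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"), // exporter.endpoint
		otlptracegrpc.WithInsecure(),                 // exporter.insecure: true
	)
	if err != nil {
		return nil, err
	}

	res := resource.NewWithAttributes(semconv.SchemaURL,
		semconv.ServiceNameKey.String("vllm-semantic-router"), // resource.service_name
		semconv.ServiceVersionKey.String("v0.1.0"),            // resource.service_version
		attribute.String("deployment.environment", "production"),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // spans are batch-exported asynchronously
		sdktrace.WithSampler(sdktrace.AlwaysSample()), // sampling.type: "always_on"
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```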
### Configuration Options
#### Exporter Types
**stdout** - Print traces to the console (development):

```yaml
exporter:
  type: "stdout"
```

**otlp** - Export to an OTLP-compatible backend (production):

```yaml
exporter:
  type: "otlp"
  endpoint: "jaeger:4317"  # Jaeger, Tempo, Datadog, etc.
  insecure: true  # Use false with TLS in production
```
#### Sampling Strategies
**always_on** - Sample all requests (development/debugging):

```yaml
sampling:
  type: "always_on"
```

**always_off** - Disable sampling entirely (an emergency lever when performance is critical):

```yaml
sampling:
  type: "always_off"
```

**probabilistic** - Sample a percentage of requests (production):

```yaml
sampling:
  type: "probabilistic"
  rate: 0.1  # Sample 10% of requests
```
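In SDK terms, these strategies correspond to the standard OpenTelemetry samplers. The mapping below is a sketch of how the config plausibly translates, again with the Go SDK; `samplerFromConfig` is a hypothetical helper, and wrapping the ratio sampler in `ParentBased` is a common choice assumed here, not confirmed router behavior:

```go
package router

import sdktrace "go.opentelemetry.io/otel/sdk/trace"

// samplerFromConfig maps the YAML sampling options to
// OpenTelemetry samplers (hypothetical helper).
func samplerFromConfig(samplerType string, rate float64) sdktrace.Sampler {
	switch samplerType {
	case "always_on":
		return sdktrace.AlwaysSample()
	case "always_off":
		return sdktrace.NeverSample()
	case "probabilistic":
		// ParentBased keeps sampling decisions consistent across
		// services that propagate trace context.
		return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(rate))
	default:
		return sdktrace.AlwaysSample()
	}
}
```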
### Environment-Specific Configurations
#### Development
```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout"
    sampling:
      type: "always_on"
    resource:
      service_name: "vllm-semantic-router-dev"
      deployment_environment: "development"
```
#### Production
```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "otlp"
      endpoint: "tempo:4317"
      insecure: false  # Use TLS
    sampling:
      type: "probabilistic"
      rate: 0.1  # 10% sampling
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
```
## Deployment
### With Jaeger
1. **Start Jaeger** (all-in-one, for testing):

   ```bash
   docker run -d --name jaeger \
     -p 4317:4317 \
     -p 16686:16686 \
     jaegertracing/all-in-one:latest
   ```

2. **Configure the router**:

   ```yaml
   observability:
     tracing:
       enabled: true
       exporter:
         type: "otlp"
         endpoint: "localhost:4317"
         insecure: true
       sampling:
         type: "probabilistic"
         rate: 0.1
   ```

3. **Access the Jaeger UI** at http://localhost:16686.
### With Grafana Tempo
1. **Configure Tempo** (`tempo.yaml`):

   ```yaml
   server:
     http_listen_port: 3200
   distributor:
     receivers:
       otlp:
         protocols:
           grpc:
             endpoint: 0.0.0.0:4317
   storage:
     trace:
       backend: local
       local:
         path: /tmp/tempo/traces
   ```

2. **Start Tempo**:

   ```bash
   docker run -d --name tempo \
     -p 4317:4317 \
     -p 3200:3200 \
     -v $(pwd)/tempo.yaml:/etc/tempo.yaml \
     grafana/tempo:latest \
     -config.file=/etc/tempo.yaml
   ```

3. **Configure the router**:

   ```yaml
   observability:
     tracing:
       enabled: true
       exporter:
         type: "otlp"
         endpoint: "tempo:4317"
         insecure: true
   ```
### Kubernetes Deployment
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: router-config
data:
  config.yaml: |
    observability:
      tracing:
        enabled: true
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector.observability.svc:4317"
          insecure: false
        sampling:
          type: "probabilistic"
          rate: 0.1
        resource:
          service_name: "vllm-semantic-router"
          deployment_environment: "production"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: semantic-router
spec:
  template:
    spec:
      containers:
      - name: router
        image: vllm-semantic-router:latest
        env:
        - name: CONFIG_PATH
          value: /config/config.yaml
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: router-config
```
## Usage Examples
### Viewing Traces
#### Console Output (stdout exporter)
```json
{
  "Name": "semantic_router.classification",
  "SpanContext": {
    "TraceID": "abc123...",
    "SpanID": "def456..."
  },
  "Attributes": [
    {
      "Key": "category.name",
      "Value": "math"
    },
    {
      "Key": "classification.time_ms",
      "Value": 45
    }
  ],
  "Duration": 45000000
}
```
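Note that `Duration` here is in nanoseconds: `45000000` ns is the same 45 ms reported by the `classification.time_ms` attribute.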
#### Jaeger UI
1. Navigate to http://localhost:16686
2. Select the service: `vllm-semantic-router`
3. Click "Find Traces"
4. View trace details and the timeline
### Analyzing Performance
**Find slow requests:**

```text
Service: vllm-semantic-router
Min Duration: 1s
Limit: 20
```

**Analyze classification bottlenecks:**

```text
Filter by operation: semantic_router.classification
Sort by duration (descending)
```

**Track cache effectiveness:**

```text
Filter by tag: cache.hit = true
Compare durations with cache misses
```
### Debugging Issues
**Find failed requests:**

```text
Filter by tag: error = true
```

**Trace a specific request:**

```text
Filter by tag: request.id = req-abc-123
```

**Find PII violations:**

```text
Filter by tag: security.action = blocked
```
## Trace Context Propagation
The router automatically propagates trace context using W3C Trace Context headers:
**Request headers** (extracted by the router):

```text
traceparent: 00-abc123-def456-01
tracestate: vendor=value
```

**Upstream headers** (injected by the router):

```text
traceparent: 00-abc123-ghi789-01
x-gateway-destination-endpoint: endpoint1
x-selected-model: gpt-4
```
This enables end-to-end tracing from client → router → vLLM backend.
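Under the hood, this kind of propagation is what the OpenTelemetry propagation API provides. A hedged Go sketch follows; the `forward` function, its arguments, and the inclusion of baggage alongside W3C Trace Context are assumptions for illustration:

```go
package router

import (
	"net/http"

	"go.opentelemetry.io/otel/propagation"
)

// W3C Trace Context (traceparent/tracestate) plus baggage.
var propagator = propagation.NewCompositeTextMapPropagator(
	propagation.TraceContext{},
	propagation.Baggage{},
)

// forward extracts the caller's trace context from the incoming
// request and injects it into the upstream request's headers.
func forward(incoming *http.Request, upstream *http.Request) {
	ctx := propagator.Extract(incoming.Context(), propagation.HeaderCarrier(incoming.Header))
	propagator.Inject(ctx, propagation.HeaderCarrier(upstream.Header))
	// ... send upstream with the router's HTTP client ...
}
```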
## Performance Considerations
### Overhead
Tracing adds minimal overhead when properly configured:
- Always-on sampling: ~1-2% latency increase
 - 10% probabilistic: ~0.1-0.2% latency increase
 - Async export: No blocking on span export
 
### Optimization Tips
1. **Use probabilistic sampling in production:**

   ```yaml
   sampling:
     type: "probabilistic"
     rate: 0.1  # Adjust based on traffic
   ```

2. **Adjust the sampling rate to your traffic level:**
   - High traffic: 0.01-0.1 (1-10%)
   - Medium traffic: 0.1-0.5 (10-50%)
   - Low traffic: 0.5-1.0 (50-100%)

3. **Use batch exporters** (the default; see the sketch after this list):
   - Spans are batched before export
   - Reduces network overhead

4. **Monitor exporter health:**
   - Watch for export failures in logs
   - Configure retry policies
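For the batch exporter specifically, buffering can be tuned when the tracer provider is constructed. A sketch with the Go SDK's batch span processor options; `newProvider` is a hypothetical helper and the values shown are common defaults, not the router's settings:

```go
package router

import (
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newProvider shows batch-export tuning knobs; exporter
// construction is omitted for brevity.
func newProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter,
			sdktrace.WithBatchTimeout(5*time.Second), // flush interval
			sdktrace.WithMaxQueueSize(2048),          // buffered spans before drops
			sdktrace.WithMaxExportBatchSize(512),     // spans per export call
		),
	)
}
```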
## Troubleshooting
### Traces Not Appearing
1. **Check that tracing is enabled:**

   ```yaml
   observability:
     tracing:
       enabled: true
   ```

2. **Verify the exporter endpoint is reachable:**

   ```bash
   # Test OTLP endpoint connectivity
   telnet jaeger 4317
   ```

3. **Check the logs for export errors**, for example:

   ```text
   Failed to export spans: connection refused
   ```
### Missing Spans
1. **Check the sampling rate:**

   ```yaml
   sampling:
     type: "probabilistic"
     rate: 1.0  # Increase to see more traces
   ```

2. **Verify span creation in code:**
   - Spans are created at key processing points
   - Check for a nil or dropped context (see the sketch below)
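As a hedged illustration of that last point: with the Go SDK, starting a span from a context that carries no recorded parent silently yields a non-recording span. A check like the hypothetical `debugSpan` helper below can surface this:

```go
package router

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/trace"
)

// debugSpan warns when no valid span is active on ctx, which usually
// means the context was dropped or tracing is disabled.
func debugSpan(ctx context.Context) {
	span := trace.SpanFromContext(ctx)
	if !span.SpanContext().IsValid() {
		log.Println("no active span on context; spans created from it will not be exported")
	}
}
```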
### High Memory Usage
1. **Reduce the sampling rate:**

   ```yaml
   sampling:
     rate: 0.01  # 1% sampling
   ```

2. **Verify the batch exporter is working:**
   - Check the export interval
   - Monitor the queue length
## Best Practices
1. **Start with stdout in development**
   - Easy to verify tracing works
   - No external dependencies

2. **Use probabilistic sampling in production**
   - Balances visibility and performance
   - Start with 10% and adjust

3. **Set meaningful service names**
   - Use environment-specific names
   - Include version information

4. **Add custom attributes for your use case**
   - Customer IDs
   - Deployment region
   - Feature flags

5. **Monitor exporter health**
   - Track export success rate
   - Alert on high failure rates

6. **Correlate with metrics**
   - Use the same service name
   - Cross-reference trace IDs in logs
## Integration with vLLM Stack
### Future Enhancements
The tracing implementation is designed to support future integration with vLLM backends:
- Trace context propagation to vLLM
- Correlated spans across the router and engine
- End-to-end latency analysis
- Token-level timing from vLLM
Stay tuned for updates on vLLM integration!