Scaling Distributed Tracing with OpenTelemetry and Adaptive Sampling
Written by PPIL Intelligence Brief
"An exploration of how modern services can collect, process, and analyze trace data without overwhelming storage or network budgets. The piece covers OpenTelemetry’s architecture, adaptive sampling algorithms, and practical patterns for end‑to‑end observability in large‑scale deployments."
The moment a user’s click turned into a 500 ms latency spike, the tracing system logged 1,274 spans across 23 microservices. Within seconds the backend received a burst of 45 MB of JSON payloads, pushing the ingestion pipeline to 90 % of its capacity. That single request illustrates the paradox at the heart of distributed tracing: the richer the data, the harder it is to keep the pipeline flowing.
“If you cannot afford to store every trace, you cannot afford to lose the ones that matter.”
OpenTelemetry, now the de‑facto standard for instrumenting cloud‑native applications, promises vendor‑neutral data collection. Yet the specification alone does not solve the scaling problem. The real challenge lies in deciding which spans to keep, how to route them, and how to turn raw identifiers into actionable insights. This article walks through the technical decisions that let organizations reap the benefits of end‑to‑end visibility while keeping cost and latency in check.
The Anatomy of a Trace
A trace is a directed acyclic graph (DAG) of spans, each representing a timed operation such as an HTTP request, a database query, or a background job. Every span carries the identifier of the trace it belongs to, its own unique span identifier, the identifier of its parent span (empty for the root span), timestamps, and a set of attributes (key‑value pairs). In a typical e‑commerce platform, a single user journey may touch authentication, product catalog, pricing, inventory, and payment services, generating dozens of spans.
When instrumentation follows the OpenTelemetry API, the SDK hands finished spans to an exporter. With the OpenTelemetry Protocol (OTLP), spans are serialized as Protocol Buffers (protobuf) messages and optionally compressed. Exporters can push to collectors, write to files, or stream directly to a backend such as Jaeger, Zipkin, or a commercial SaaS offering.
Why “Collect Everything” Fails
A naïve approach records every span from every request. Consider a service that handles 10 k requests per second, each producing an average of 12 spans, each span averaging 200 bytes after compression. The raw ingest rate is:
10,000 req/s × 12 spans × 200 B ≈ 24 MB/s ≈ 2 TB per day
Storing 2 TB of trace data daily quickly becomes prohibitive. Moreover, downstream analytics, such as latency heat maps or root‑cause graphs, must process this volume in near real time, stretching compute resources.
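The arithmetic above can be reproduced with a short Go sketch. The function name and parameters are illustrative rather than part of any OpenTelemetry API, and the estimate assumes a constant request rate and average span size over the whole day.

```go
package main

const secondsPerDay = 24 * 60 * 60

// DailyIngestBytes estimates the bytes of trace data produced per day,
// assuming a constant request rate and a fixed average span size.
func DailyIngestBytes(reqPerSec, spansPerReq, bytesPerSpan int64) int64 {
	perSecond := reqPerSec * spansPerReq * bytesPerSpan
	return perSecond * secondsPerDay
}
```

With the figures from the text, DailyIngestBytes(10000, 12, 200) yields 2,073,600,000,000 bytes, which is the "about 2 TB per day" quoted above.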
The industry response has been twofold:
(1) sampling, where only a subset of traces is retained, and
(2) aggregation, where detailed spans are collapsed into statistical summaries.
Sampling, however, is not a simple “pick 1 % at random” operation. Random sampling discards rare but critical failures, while over‑sampling high‑traffic endpoints can drown out low‑frequency services that matter most during incidents.
Adaptive Sampling: The Core Idea
Adaptive sampling adjusts the probability of retaining a trace based on observable signals. The goal is to keep a steady volume of trace data while preserving coverage of edge cases. Several signals can drive the decision:
| Signal | Typical Source | How It Influences Sampling |
|---|---|---|
| Error rate | Status codes, exception counters | Increase probability when errors exceed a threshold |
| Latency percentile | Histograms from metrics pipelines | Boost sampling for requests in the 99th percentile |
| Service criticality | Configuration metadata | Assign higher baseline rates to core services |
| User segment | Authentication token attributes | Prioritize traces from premium users or high‑value accounts |
| Traffic spikes | Request rate monitors | Temporarily raise sampling during surges to capture anomalies |
A common algorithm is probability‑based adaptive sampling. For each incoming trace, the sampler computes a sampling score s:
s = base_rate × (1 + α × error_factor + β × latency_factor)
- base_rate is the minimum sampling probability (e.g., 0.01).
- α and β are tunable coefficients.
- error_factor is 1 if the trace contains an error status, otherwise 0.
- latency_factor could be the normalized deviation from the median latency, floored at 0 so that s never falls below base_rate.
The final probability is capped at 1.0.
If a uniformly drawn random number in [0,1) is less than s, the trace is kept; otherwise it is dropped. This method automatically surfaces problematic requests. Keeping the overall ingest rate near a target additionally requires tuning base_rate against the observed volume.
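The formula and the keep/drop rule can be sketched as two small pure functions in Go. The type and function names are hypothetical, and clamping the result at 0 is an added assumption to cover a negative latency_factor; the random draw is passed in so the decision is easy to test.

```go
package main

// Coefficients holds the tunable inputs of the adaptive formula.
type Coefficients struct {
	BaseRate float64 // minimum sampling probability, e.g. 0.01
	Alpha    float64 // weight of the error signal
	Beta     float64 // weight of the latency signal
}

// Probability evaluates
//   s = base_rate × (1 + α × error_factor + β × latency_factor)
// and clamps the result to the range [0, 1].
func Probability(c Coefficients, hasError bool, latencyFactor float64) float64 {
	errorFactor := 0.0
	if hasError {
		errorFactor = 1.0
	}
	s := c.BaseRate * (1 + c.Alpha*errorFactor + c.Beta*latencyFactor)
	if s > 1 {
		return 1
	}
	if s < 0 {
		return 0
	}
	return s
}

// Keep reports whether a trace is retained, given a draw from a uniform
// distribution on [0, 1), for example rand.Float64().
func Keep(s, draw float64) bool {
	return draw < s
}
```

With base_rate = 0.01 and α = 4, a trace without errors and with no latency deviation is kept with probability 0.01, while an error trace is kept with probability 0.05.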
Implementing Adaptive Sampling with OpenTelemetry
OpenTelemetry’s SDKs expose a Sampler interface that decides, when a span is created, whether to record it. The ParentBased sampler respects the decision of the parent span (the SDK default is ParentBased wrapping an always‑on root sampler), while the TraceIdRatioBased sampler implements simple probabilistic sampling. To achieve adaptive behavior, developers can compose custom samplers.
Step‑by‑step Integration
- Collect Runtime Metrics – Deploy a lightweight metrics exporter (e.g., Prometheus) that tracks error counts and latency histograms per service.
- Expose a Configuration Service – Provide an HTTP endpoint that returns JSON with the latest sampling coefficients (base_rate, α, β).
- Create a Dynamic Sampler – Implement a class that, on each trace start, fetches the current coefficients (cached for 30 seconds) and evaluates the adaptive formula.
- Register the Sampler – In each service’s OpenTelemetry initialization code, pass the custom sampler to the tracer provider (in Go, via sdktrace.WithSampler), ideally wrapped in ParentBased so that child spans follow the decision made at the root.
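// Go example using the OpenTelemetry SDK
package sampling

import (
	"math/rand"
	"sync"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// AdaptiveSampler implements sdktrace.Sampler with coefficients that are
// refreshed periodically from cfgURL.
type AdaptiveSampler struct {
	mu          sync.RWMutex // guards every field below
	baseRate    float64
	alpha, beta float64
	cfgURL      string
	lastFetch   time.Time
}

func (as *AdaptiveSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	// Refresh config if older than 30s. refreshConfig (not shown) is expected
	// to take the write lock and update the coefficients and lastFetch.
	as.mu.RLock()
	stale := time.Since(as.lastFetch) > 30*time.Second
	as.mu.RUnlock()
	if stale {
		as.refreshConfig()
	}
	// Extract error and latency signals from the attributes supplied at span start
	hasError := false
	latencyMs := 0.0
	for _, attr := range p.Attributes {
		if attr.Key == "http.status_code" && attr.Value.AsInt64() >= 500 {
			hasError = true
		}
		if attr.Key == "http.duration_ms" {
			latencyMs = attr.Value.AsFloat64()
		}
	}
	// Compute factors
	errorFactor := 0.0
	if hasError {
		errorFactor = 1.0
	}
	latencyFactor := latencyMs / 1000.0 // normalize to seconds
	// Adaptive probability, read under the lock so the coefficients are consistent
	as.mu.RLock()
	prob := as.baseRate * (1 + as.alpha*errorFactor + as.beta*latencyFactor)
	as.mu.RUnlock()
	if prob > 1.0 {
		prob = 1.0
	}
	// Pass the parent's tracestate through unchanged
	ts := trace.SpanContextFromContext(p.ParentContext).TraceState()
	if rand.Float64() < prob {
		return sdktrace.SamplingResult{Decision: sdktrace.RecordAndSample, Tracestate: ts}
	}
	return sdktrace.SamplingResult{Decision: sdktrace.Drop, Tracestate: ts}
}

// Description is the second method required by the sdktrace.Sampler interface.
func (as *AdaptiveSampler) Description() string {
	return "AdaptiveSampler"
}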
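The snippet shows a minimal adaptive sampler written in Go; the Java, Python, and Rust SDKs expose an equivalent sampler hook. Note that a head sampler sees only the attributes supplied when the span is created, so http.status_code and http.duration_ms are usually not yet known at that point. Outcome‑based signals of this kind are more reliably evaluated by tail‑based sampling in the collector.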
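The cached configuration lookup from the integration steps might look like the following sketch. The JSON field names (base_rate, alpha, beta), the CachedConfig type, and the fetch callback are assumptions made for illustration; in a real service the callback would wrap an HTTP GET against the configuration endpoint.

```go
package main

import (
	"encoding/json"
	"sync"
	"time"
)

// DefaultTTL matches the 30-second cache window mentioned in the steps.
const DefaultTTL = 30 * time.Second

// Coefficients mirrors the JSON document served by the hypothetical
// configuration endpoint.
type Coefficients struct {
	BaseRate float64 `json:"base_rate"`
	Alpha    float64 `json:"alpha"`
	Beta     float64 `json:"beta"`
}

// ParseCoefficients decodes a configuration response body.
func ParseCoefficients(body []byte) (Coefficients, error) {
	var c Coefficients
	err := json.Unmarshal(body, &c)
	return c, err
}

// CachedConfig serves coefficients from memory and refetches them once the
// TTL has elapsed.
type CachedConfig struct {
	mu        sync.Mutex
	ttl       time.Duration
	fetch     func() ([]byte, error)
	current   Coefficients
	lastFetch time.Time
}

func NewCachedConfig(ttl time.Duration, initial Coefficients, fetch func() ([]byte, error)) *CachedConfig {
	return &CachedConfig{ttl: ttl, fetch: fetch, current: initial}
}

// Get returns the cached coefficients, refreshing them first if they are
// stale. If the fetch or the parse fails, the last known good values are kept.
func (c *CachedConfig) Get() Coefficients {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.lastFetch) < c.ttl {
		return c.current
	}
	c.lastFetch = time.Now() // set before fetching so failures are not retried on every call
	body, err := c.fetch()
	if err != nil {
		return c.current
	}
	if parsed, err := ParseCoefficients(body); err == nil {
		c.current = parsed
	}
	return c.current
}
```

This version fetches while holding the lock, which keeps the sketch short but blocks concurrent callers during a refresh. A sampler on a hot path would more likely refresh in a background goroutine and only read the cached values.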
Deploying Collectors
OpenTelemetry Collector acts as a buffer and processor between instrumented services and back‑ends. It can perform head‑based sampling (deciding at the start of a trace, before its outcome is known) and tail‑based sampling (making decisions after a trace completes). Tail‑based sampling is more precise because it can inspect the final status of the trace, but it requires buffering the trace temporarily, and all spans of a trace must reach the same collector instance.
A typical collector pipeline for adaptive sampling looks like:
receivers -> processors (including sampling) -> exporters
- Receivers: otlp (gRPC), zipkin, jaeger.
- Processors: batch, memory_limiter, attributes (enrich with service name).
- Sampling: in the Collector, sampling runs as a processor (for example probabilistic_sampler or tail_sampling from the contrib distribution); the adaptive algorithm would be implemented as a custom processor.
- Exporters: otlphttp to a SaaS backend, or file for local debugging.
Deploy the collector as a sidecar or as a daemonset in Kubernetes. Use horizontal pod autoscaling based on CPU and memory metrics to keep the pipeline responsive during traffic spikes.
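The tail‑based decision described in this section can be illustrated with a small in‑memory sketch. The Span and TailSampler types are simplified stand‑ins invented for this example, not the Collector's own data model, and the rule (keep a trace if any span failed or was slow) is one possible policy among many.

```go
package main

import "time"

// Span is a simplified stand-in for an exported span.
type Span struct {
	TraceID  string
	Name     string
	Duration time.Duration
	IsError  bool
}

// TailSampler buffers spans per trace and decides once the trace is complete.
// It is not safe for concurrent use; a real implementation would need locking
// and a bound on buffered traces.
type TailSampler struct {
	latencyThreshold time.Duration
	pending          map[string][]Span
}

func NewTailSampler(latencyThreshold time.Duration) *TailSampler {
	return &TailSampler{
		latencyThreshold: latencyThreshold,
		pending:          make(map[string][]Span),
	}
}

// Add buffers a span until its trace is completed.
func (t *TailSampler) Add(s Span) {
	t.pending[s.TraceID] = append(t.pending[s.TraceID], s)
}

// Complete is called once a trace is considered finished, for example after a
// fixed wait following its first span. It returns all buffered spans if any
// span failed or exceeded the latency threshold, and nil otherwise. Either
// way, the buffer for that trace is released.
func (t *TailSampler) Complete(traceID string) []Span {
	spans := t.pending[traceID]
	delete(t.pending, traceID)
	for _, s := range spans {
		if s.IsError || s.Duration > t.latencyThreshold {
			return spans
		}
	}
	return nil
}
```

The memory cost mentioned above is visible in the pending map: every span of every in‑flight trace is held until Complete is called, which is why the wait time and the number of buffered traces need explicit limits in practice.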
Storage Strategies
Even with aggressive adaptive sampling, trace volume can reach hundreds of gigabytes per day in large enterprises. Storage must support fast reads for ad‑hoc queries and efficient compression for long‑term retention.
Columnar vs. Document Stores
- Columnar stores (e.g., ClickHouse, Apache Druid) excel at aggregating numeric fields like latency or error counts. They store each attribute in a separate column, enabling vectorized scans.
- Document stores (e.g., Elasticsearch, MongoDB) preserve the hierarchical nature of spans, making it easier to reconstruct a full trace graph.
A hybrid approach stores raw spans in a document store for a 30‑day retention window, while feeding aggregated metrics into a columnar warehouse for longer periods.
Compression Techniques
OpenTelemetry’s protobuf messages compress well with zstandard (zstd) at level 3, achieving a 2.5× reduction on typical span payloads. For long‑term archives, snappy or lz4 provide faster decompression at the cost of modestly larger files.
Retention Policies
- Hot tier (0‑7 days): full trace data, searchable via UI.
- Warm tier (8‑30 days): only spans with error attributes retained, others dropped.
- Cold tier (>30 days): aggregated latency histograms and error counts, stored in a columnar warehouse.
Automate tier migration with a scheduled job that reads from the document store, applies filter rules, and writes to the appropriate destination.
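The filter rules for such a migration job could be expressed as follows. This is a sketch under the assumption that tier boundaries are counted in whole elapsed days; the Tier type and function names are hypothetical.

```go
package main

import "time"

// Day is the unit used for the tier boundaries.
const Day = 24 * time.Hour

// Tier identifies a retention tier.
type Tier int

const (
	Hot  Tier = iota // 0-7 days: full trace data
	Warm             // 8-30 days: error spans only
	Cold             // >30 days: aggregates only
)

// TierFor maps the age of a span to its retention tier, counting whole
// elapsed days.
func TierFor(age time.Duration) Tier {
	days := int(age / Day)
	switch {
	case days <= 7:
		return Hot
	case days <= 30:
		return Warm
	default:
		return Cold
	}
}

// KeepRawSpan reports whether the raw span should remain in the document
// store at the given age.
func KeepRawSpan(age time.Duration, hasError bool) bool {
	switch TierFor(age) {
	case Hot:
		return true
	case Warm:
		return hasError
	default:
		return false // the cold tier holds aggregates only
	}
}
```

A scheduled job would call KeepRawSpan for each stored span and delete those for which it returns false, after the span's contribution to the cold‑tier aggregates has been written.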
Correlating Traces with Metrics and Logs
Observability reaches its full potential when traces, metrics, and logs share a common identifier. OpenTelemetry defines a trace_id field that can be injected into log statements and metric labels.
- Logs: Use structured logging libraries that accept a context object containing the trace ID. In Java, MDC.put("traceId", span.getSpanContext().getTraceId()).
- Metrics: Export latency histograms with a trace_id reference for high‑resolution debugging, preferably as an exemplar rather than a regular label, since a label with one value per trace creates a new time series for every request. Strip the reference in the warm tier to avoid cardinality explosion.
Correlation enables a workflow where an alert on a latency spike automatically surfaces the related trace in the UI, reducing mean time to resolution (MTTR).
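For Go services, the log side of this correlation can be sketched with the standard library's log/slog package (Go 1.21 or later). The helper names are hypothetical, and the trace ID is passed in as a plain string; in an instrumented service it would typically come from the active span's context.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log/slog"
)

// LogWithTrace writes one JSON log record carrying a trace_id field and
// returns the raw line. A real service would write to stdout or a log
// shipper instead of an in-memory buffer.
func LogWithTrace(traceID, msg string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil)).With("trace_id", traceID)
	logger.Info(msg)
	return buf.String()
}

// TraceIDOf extracts the trace_id field from a JSON log line, which is what
// a log backend does when linking a log record to its trace.
func TraceIDOf(line string) (string, error) {
	var rec struct {
		TraceID string `json:"trace_id"`
	}
	err := json.Unmarshal([]byte(line), &rec)
	return rec.TraceID, err
}
```

Because the identifier is an ordinary structured field, any backend that indexes JSON logs can look up all records belonging to one trace.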
Security and Privacy Considerations
Trace data may contain personally identifiable information (PII) in URL parameters, query strings, or payload attributes.
- Sanitization: Apply a processor in the collector that masks or removes sensitive keys (e.g., user_email, credit_card).
- Access Control: Enforce role‑based policies on the trace backend. Only engineers with “debug” privileges can view full span attributes; others see only high‑level summaries.
- Encryption: Use TLS for all OTLP transports. For storage, enable server‑side encryption with customer‑managed keys (CMK) to satisfy compliance regimes.
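The sanitization step amounts to a key‑based mask over span attributes. The sketch below models attributes as a plain string map, which is a simplification of the real span data model; the function name and the "[REDACTED]" placeholder are assumptions.

```go
package main

const redacted = "[REDACTED]"

// Sanitize returns a copy of attrs in which the value of every sensitive key
// is replaced by a placeholder. The input map is not modified.
func Sanitize(attrs map[string]string, sensitiveKeys []string) map[string]string {
	sensitive := make(map[string]struct{}, len(sensitiveKeys))
	for _, k := range sensitiveKeys {
		sensitive[k] = struct{}{}
	}
	out := make(map[string]string, len(attrs))
	for k, v := range attrs {
		if _, ok := sensitive[k]; ok {
			out[k] = redacted
			continue
		}
		out[k] = v
	}
	return out
}
```

Masking keeps the key visible, so engineers can still see that a field was present; removing the key entirely is the stricter alternative when even its presence is sensitive. Note that this only covers known keys and does not catch PII embedded in URLs or free‑text values.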
Real‑World Example: Scaling Tracing at a Global Retailer
A multinational retailer migrated from a legacy Jaeger deployment to an OpenTelemetry‑based pipeline. Their baseline was 5 k traces per second, each averaging 15 spans. Initial attempts at 100 % capture led to a 12 TB daily ingest rate, saturating network links between data centers.
The engineering team introduced adaptive sampling with the following parameters:
- base_rate = 0.02 (2 % of all traces)
- α = 4.0 (error traces are sampled at five times the base rate, since the multiplier is 1 + α)
- β = 1.5 (increase probability proportionally to latency over 1 s)
Resulting ingest dropped to 1.8 TB per day, an 85 % reduction, while the proportion of error traces rose from 3 % to 12 % of the stored dataset.
Further gains came from tail‑based sampling in the collector, which retained full trace graphs only for requests that exceeded the 99th latency percentile. This cut the number of stored spans by an additional 30 %.
The retailer reported a 40 % reduction in storage costs and a 25 % faster incident investigation time, as engineers could focus on the most relevant traces.
Best Practices Checklist
- Instrument all entry points (HTTP, gRPC, messaging) with OpenTelemetry auto‑instrumentation where possible.
- Deploy collectors as sidecars to keep network hops short and to enforce sampling close to the source.
- Use adaptive sampling formulas that incorporate error rates and latency percentiles.
- Store raw spans for a limited hot window; archive aggregates for long‑term analysis.
- Mask PII early in the pipeline; enforce strict RBAC on trace back‑ends.
- Correlate traces with logs and metrics using a shared trace_id.
- Monitor the ingest rate and adjust base_rate dynamically via a feedback loop.
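The feedback loop in the last item can be as simple as a proportional correction. The sketch below is one possible controller, not a prescribed one; the function name, the clamping bounds, and the choice to leave the rate unchanged when no measurement is available are all assumptions.

```go
package main

// AdjustBaseRate scales the current base rate by the ratio of target to
// observed ingest, then clamps the result to [minRate, maxRate]. Ingest
// values may be in any unit as long as both use the same one.
func AdjustBaseRate(current, observed, target, minRate, maxRate float64) float64 {
	if observed <= 0 || target <= 0 {
		return current // no usable measurement; leave the rate unchanged
	}
	next := current * target / observed
	if next < minRate {
		return minRate
	}
	if next > maxRate {
		return maxRate
	}
	return next
}
```

If the pipeline observes 40 MB/s against a 20 MB/s target while running at base_rate = 0.02, the function returns 0.01. Running it on a fixed interval, and publishing the result through the configuration service, closes the loop; smoothing the observed rate first helps avoid oscillation.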
Looking Ahead: Emerging Trends
- Probabilistic Data Structures – Sketches such as HyperLogLog can estimate the number of unique trace IDs without storing each ID, enabling smarter sampling decisions.
- Edge Sampling – Running lightweight samplers on edge devices (e.g., IoT gateways) reduces upstream traffic before it reaches the cloud.
- AI‑Driven Anomaly Detection – Machine‑learning models trained on historical trace patterns can flag unusual call graphs, prompting on‑demand trace capture.
These innovations promise to push the limits of observability further, making it feasible to monitor billions of requests per day without drowning in data.
“Observability is not about collecting everything; it is about collecting the right thing at the right time.”
PPIL Takeaway: By treating tracing as a dynamic, data‑driven service - rather than a static logging layer - organizations align with PPIL’s philosophy of building systems that adapt, scale, and stay economical as they grow.