The role of sampling in distributed tracing
What is sampling, which common types are out there, and what are their trade-offs?
Distributed tracing is a technique that produces a high-fidelity observability signal: each data point (trace) represents a concrete execution of a code path. In an HTTP-based service, this typically means that each request would generate a trace containing data representing all the operations that were executed as a result of the request: database calls, message queue interactions, calls to downstream microservices, and so on.
As you can imagine, collecting this level of detail for every request a service receives can quickly generate an unmanageable amount of data. Making things even less appealing, the vast majority of that data describes requests that are not particularly interesting, as they represent successful operations. In the end, we may be collecting, transferring, and storing data that is eventually deleted without ever being used.
The holy grail, the ultimate goal of distributed tracing, is to collect only the data that we’ll actually need in the future.
While this goal might be very hard to achieve, it’s certainly possible to get close to it by making use of a technique called sampling.
Types of sampling
In the context of distributed tracing, sampling is the decision to capture or discard a specific trace.
Given that a trace represents a transaction involving potentially hundreds of services, it’s impossible to know up front whether a trace will be interesting. Additionally, each piece of a trace is sent independently and asynchronously to the collection backend, without regard for the outcome of the other operations that compose the same trace. As a result, we have to choose between two basic approaches: a) we decide whether a trace will be sampled when it’s first created (head-based sampling), or b) we collect and transfer all trace data to a central location and make the sampling decision there (tail-based sampling), once all spans for the same trace are expected to have been received.
Head-based sampling
The most prevalent sampling technique out there is head-based sampling: with it, the tracer creating the very first span in the trace will decide whether all other spans should be stored, and this decision is then propagated down the chain via the regular context propagation mechanism.
In this category, we have sampling strategies such as probabilistic, rate-limiting, and constant, among others. With the probabilistic strategy, we specify the probability that a trace will be sampled, such as a “1 in 1000” chance, which yields roughly one sampled trace for every 1,000 traces. With rate-limiting, we set an upper bound on how many traces per second we are willing to accept. The constant strategy either accepts all traces or drops them all.
This type of sampling reduces the network traffic between the tracer and the collection infrastructure, as spans that aren’t sampled are discarded locally. The volume of data becomes manageable, but because the decision is made before the outcome of the request is known, we discard valuable traces along with the ones we won’t need.
The sampling strategies you can configure in distributed tracing clients, such as Jaeger’s and the OpenTelemetry SDKs, can be placed in this category.
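To make this more concrete, here is a minimal sketch of how a probabilistic head-based sampler could be configured with the OpenTelemetry Go SDK; the 1-in-1000 ratio is just an example value, and exporters are omitted.

```go
package main

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Probabilistic head-based sampling: roughly 1 in 1000 root spans start
	// a sampled trace. ParentBased makes non-root spans follow the decision
	// that was propagated from the caller via the trace context.
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.001))

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sampler),
		// span processors, exporters, and resource attributes would be added here
	)
	otel.SetTracerProvider(tp)
}
```

Because the decision is made at the root span and carried along with the context, every downstream service ends up keeping or dropping the same trace consistently.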
Tail-based sampling
This sampling technique is used when a better signal-to-noise ratio is desired. A typical tail-based sampling solution requires all spans for the same trace to be sent to the same collector. The collector then waits for a pre-defined amount of time, under the assumption that by then all spans for the given trace have been received. Once a trace is deemed complete by the collector, a decision is made based on the characteristics of the whole trace: does it contain a span with an error tag? Did the trace take longer than a specific duration? Was a specific service part of the transaction?
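As an illustration of what such a decision could look like once a trace is considered complete, here is a minimal sketch; the span fields, thresholds, and service name are assumptions made for this example, not part of any specific backend.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified stand-in for a real span model; the fields are
// assumptions made for this sketch.
type Span struct {
	Service  string
	Duration time.Duration
	HasError bool
}

// keepTrace inspects a complete trace and applies whole-trace criteria:
// errors, overall latency, or the participation of a specific service.
func keepTrace(spans []Span, maxLatency time.Duration, serviceOfInterest string) bool {
	var longest time.Duration
	for _, s := range spans {
		if s.HasError {
			return true // traces containing errors are always interesting
		}
		if s.Service == serviceOfInterest {
			return true // a service we are watching took part in the transaction
		}
		if s.Duration > longest {
			longest = s.Duration // approximate the trace duration by its longest span
		}
	}
	return longest > maxLatency
}

func main() {
	spans := []Span{{Service: "inventory", Duration: 300 * time.Millisecond}}
	fmt.Println(keepTrace(spans, 250*time.Millisecond, "checkout")) // true: slower than the threshold
}
```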
In the typical scenario, a tail-based sampling strategy will yield a greater signal-to-noise ratio, but the costs and complexity of the solution are higher than with head-based sampling.
First, we have to estimate how long spans have to be kept in memory before a decision is made. Given that there’s no explicit signal that a trace is “complete”, our best bet is to simply wait long enough, under the assumption that the trace will be complete by then. However, we risk losing one specific type of highly interesting trace: the ones that take longer than we are willing to wait. Second, we now have a stateful distributed system to manage, instead of a stateless one as with head-based sampling. With it, we need answers to questions like: what happens to the buffered spans when a collector instance crashes? How can we scale out the collectors while ensuring that all spans belonging to the same trace land at the same collector instance?
And finally: how much memory should be allocated to each instance, if each one is holding seconds’ worth of data?
Tail-based sampling can be done with the OpenTelemetry Collector by combining components such as the “Trace ID aware load-balancing exporter” and the “Group by Trace processor”. Once traces are routed to the same collector instance and the spans are grouped into complete traces, specific sampling processors can be applied to the data in the pipeline. The “Tail Sampling Processor” can be used for that, and custom sampling processors can be easily built.
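The idea behind routing by trace ID is simple enough to sketch: hash the trace ID and use the result to pick a backend, so that every span of a given trace lands on the same collector instance. The following is only an illustration of that idea, not the actual implementation of the load-balancing exporter; the backend addresses are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend deterministically maps a trace ID to one of the available
// collector instances, so that all spans of a trace end up in the same place.
func pickBackend(traceID string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"collector-0:4317", "collector-1:4317", "collector-2:4317"}
	fmt.Println(pickBackend("4bf92f3577b34da6a3ce929d0e0e4736", backends))
}
```

A production setup would typically prefer consistent hashing over a plain modulo, so that scaling the collector pool out or in reassigns as few in-flight traces as possible.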
Adaptive sampling
While not in the same category as head or tail-based sampling, adaptive sampling deserves to be mentioned here: it is the technique of changing the sampling strategy based on the traced application’s current behavior. For instance, if our inventory service is currently experiencing difficulties, we might want to increase the sampling rate or even replace it with a strategy that samples every request. Once we identify which endpoint is responsible for the problem, we can change it back, leaving only the affected endpoint with the special sampling strategy.
Ideally, an adaptive sampling system would autonomously change the sampling strategies, but figuring out when to perform the change, where, and for how long is another art on its own.
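As a rough illustration of what such automation could decide, here is a minimal sketch of a policy that raises the sampling probability for an endpoint when its error rate crosses a threshold; the thresholds and probabilities are arbitrary values chosen for this example.

```go
package main

import "fmt"

// chooseSamplingProbability is a sketch of an adaptive policy: endpoints that
// are currently unhealthy get traced more aggressively, while healthy ones
// stay at a low default. All numbers are arbitrary, for illustration only.
func chooseSamplingProbability(errorRate float64) float64 {
	switch {
	case errorRate > 0.05: // more than 5% of requests failing
		return 1.0 // sample every request while we investigate
	case errorRate > 0.01:
		return 0.1
	default:
		return 0.001 // back to the regular "1 in 1000" probability
	}
}

func main() {
	fmt.Println(chooseSamplingProbability(0.07)) // prints 1
}
```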
At the moment, distributed tracing solutions like Jaeger provide a way for sampling strategies to be defined in an external configuration file, which can then be changed by a human or process when needed. This configuration file is then propagated down from the collector to the agent, and from the agent to the clients.
Conclusion
We are still trying to reach the ultimate goal of distributed tracing: ensuring that the information we capture, transmit, and store is valuable to us, either now or in the future.
Until a breakthrough provides the perfect solution, we have tools at our disposal to make a decision based on our real workload: no sampling, for those who can afford the costs; head-based sampling, for a simpler infrastructure; tail-based sampling, for scenarios where the extra complexity pays off.