Your Telemetry is Garbage – Let’s Talk About Why
You don’t know me, and I don’t know you. Regardless, I’m willing to bet that you—person I’ve never met—have software with garbage, noisy telemetry.
I’m neither a genius nor a clairvoyant. I can lead with such brash confidence for two simple reasons:
- You clicked that link, so you self-identified your way into my trap, and
- Almost everyone’s telemetry is garbage because instrumenting[1] software is still really, really time-consuming and difficult to do well.
It’s not your fault. You’ve been enticed into the warm embrace of a brand-name observability platform that promises to do everything. Everything except help you improve your data quality in a sustainable way.
You’ve been told that OpenTelemetry (OTel) will solve your problems at the far left-hand side of the pipeline (and will inoculate you against vendor lock-in, says the vendor). It hasn’t, unless you designed your services with tracing in mind from the get-go. In the real world, the manual burden of getting tracing set up is large and comes with a steep technical learning curve. For now, eBPF pours learning-curve fuel onto an already raging fire.
Today’s observability stack is akin to a sewer network routed into a solid gold pipe, which in turn flows into a magnificent lake. (A metaphor we all wish were entirely fictional.)
There is hope. Language models understand code remarkably well and can shoulder the burden of instrumenting it. Furthermore, the conversation[2] about telemetry data quality is gathering momentum, and that’s the drum I want to beat in this post. The rallying cry is: What Do We Want? Structured Event Data! (When do we want it? Now, and without having to do a ton of instrumentation by hand!)
[1] A quick note on nomenclature. For many, the word instrumentation implies “manual instrumentation of code by a human developer”. I will use the word auto-instrumentation to cover, e.g., what you get out of the box with OTel, and what you can achieve with the help of AI.
[2] As I was re-drafting this post based on the latest batch of feedback, this excellent post popped up on my LinkedIn feed and made me chuckle. Timely.
Why observe?
This might seem a bit basic. However, the sprawl and cost of observability tooling, and the frustration of buyers, dominate the discussion today, so it’s worth a quick sanity check. In broad strokes:
1. The top-level business objective is to deliver a great software product to customers reliably,
2. To deliver software reliably, we must be able to deeply understand the system and diagnose faults quickly,
3. To understand and diagnose, we must be able to collect, store, query and visualise telemetry data,
4. To collect telemetry data, we must be able to instrument software (either within the code, by tracing API calls, or by listening at the kernel level).
I put 4 on the bottom for a reason. Think of it as the foundation that the rest is built upon. Instrumenting software is still pretty hard in practice, so that foundation is wobbly and the rest of the stack wobbles with it. In short, it’s still hard to achieve high quality, structured event data.
Today’s observability platforms focus on providing the infrastructure to achieve point 3, in order to enable point 2. The hope is that point 1 (think SLOs) is an emergent property. These platforms are typically organised around the idea of the Three Pillars of Observability.
Do I write a Three Pillars bit?
The world really doesn’t need another description of the Three Pillars of Observability. In short:
- Peter Bourgon coined it,
- Ben Sigelman debunked it,
- Charity Majors hammered in some nails, and offered a vision for the future,
- And if you acquired one of each pillar, you continue to write earnestly about it.
I’ll summarise. Each of the pillars (metrics, logs and traces) has some worthwhile uses and some flaws in the pursuit of understanding ‘the system’. As a mighty-sounding set of pillars, they come with a bonus flaw — they’re all disparate, and the user is left to correlate them by, like, squinting (I guess). My co-founders and I met at Palantir, so you can be assured that we consider disparate data sources to be the root of all evil.
In the post above, Charity makes the case for unifying the pillars (and retiring pre-baked metrics as we know them). Separately, Ben S. recently explained to me that tracing was never meant to be a third pillar. Instead, it was intended to be an automation pattern for defining events in complex software (because manual instrumentation of logging statements is time consuming and tedious). The gotcha is that for software that isn’t architected to a Google-standard with a tracing-first approach, a lot of hand-finishing is required. Without the hand-finishing, the data quality simply won’t be there.
As things stand, these three pillars are more like three silos (and are really “just telemetry”). Whichever analogy you prefer, they are built on a foundational assumption that instrumenting software is a solved problem. It’s not, and so the silos are full of garbage.
The future is –
Opinions are divided on the aspirational end state for Observability. A few dimensions of the debate include:
i. Collect and index everything vs. collect everything, filter most (to cheap storage or to the bin) and store the rest vs. collect strictly on demand[3],
ii. Instrument throughout the code vs. instrument around the code (API-level tracing) vs. probe at the kernel level,
iii. Extract structure from noisy data[4] vs. invest in structure up front (in the code).
And, recently:
iv. Teach the LLMs to decipher `internal error` (Agentic SREs) vs. make it possible to auto-instrument software to increase data quality.
We believe that making it possible to auto-instrument software, and in doing so emit high-quality data, is the play across dimensions ii–iv—shoring up the foundations. We love LLMs and use them every day, but much like us human engineers, they aren’t miracle workers. They benefit from high-quality inputs as much as we do.
The future is: high quality telemetry data. To get there requires:
- A robust definition of what constitutes an event AND the ability to apply that definition to imperfect (aka typical) software automatically, as a foundation for:
  - Structured data,
  - Clear, actionable log messages (a concrete sketch of such an event follows below).
[3] In this dimension, the trade-offs are difficult. Phillip Carter did a great write-up on this.
[4] Security engineers and SREs do not in fact love writing regex parsers for your logs all day.
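To make “structured event data” concrete, here is a hypothetical example of the kind of record we mean: one event per unit of work, with a clear, actionable message and every useful attribute attached as a queryable field. (The field names and values are purely illustrative, not a schema we prescribe.)

```python
import json

# A hypothetical structured event for a single unit of work. One record,
# a clear message, and the context you would otherwise grep out of five
# unstructured log lines. Field names are illustrative only.
event = {
    "event": "payment.declined",
    "message": "payment declined by gateway; customer can retry with another card",
    "level": "warning",
    "service": "checkout",
    "order_id": "ord_8821",
    "customer_id": "cus_1043",
    "gateway": "stripe",
    "decline_code": "insufficient_funds",
    "retryable": True,
    "duration_ms": 412,
}

# Emitted as a single JSON line, it can be filtered and aggregated downstream
# without anyone writing regex parsers.
print(json.dumps(event))
```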
Today’s problem and a shared goal
The shared goal is high quality, structured event data, available for quick analysis, when you need it. The rest is up for grabs, in our opinion.
For now we are starting by tackling a humble problem — making it easy for engineering teams to achieve structured logging within their application code, new and old. Our chosen path there is auto-instrumentation of logging statements, augmented by language models. In practice, this means solving for the following (a sketch of the end state follows the list):
- A robust definition of an event (answering the question of where to log),
- Logger usage (how to log):
  - Consistent use of a structured logger class (not a fruit salad of default, imported, custom and printf),
  - Appropriate logging levels (e.g. rather than error for everything),
- Logging statement contents (what to log):
  - A structured payload that contains all useful context and avoids sensitive data,
  - Clear, actionable log messages,
  - Inclusion of a context object, where desired.
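Here is a minimal sketch of that end state in Python, using the structlog library. The library choice, event names and fields are assumptions for illustration only, not a prescription.

```python
import structlog

# A minimal sketch of the target end state using structlog. The library
# choice, event names and fields are assumptions for illustration only.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,           # explicit, appropriate levels
        structlog.processors.TimeStamper(fmt="iso"),  # machine-readable timestamps
        structlog.processors.JSONRenderer(),          # one structured event per line
    ]
)

log = structlog.get_logger()


def refund_order(order_id: str, amount_cents: int, request_id: str) -> None:
    # Context object: bind request-scoped fields once; every event below carries them.
    bound = log.bind(request_id=request_id, order_id=order_id)

    # Where to log: one event per meaningful state change, not per line of code.
    bound.info("refund.requested", amount_cents=amount_cents)

    try:
        raise TimeoutError("gateway timed out")  # stand-in for a real gateway call
    except TimeoutError as exc:
        # What to log: a clear, actionable message and a structured payload with
        # the useful context (and no sensitive data such as card numbers).
        bound.warning("refund.gateway_timeout", retryable=True, error=str(exc))


refund_order("ord_8821", 1299, "req_f00d")
```

The point is not this particular library; it is that every statement sits at a meaningful event boundary, carries its context as fields, and uses a level that reflects what an operator should actually do about it.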
Whether you plan to store the emitted log messages in the conventional event-centric way or burn them into overarching request-centric traces using OpenTelemetry, the argument for, and the need for, high quality data remain unchanged.
This overlaps naturally with the challenges faced when instrumenting the average, imperfect piece of software for tracing, for example using OTel. That is, the need to manually (re-)define events (add or remove spans), add attributes, and perform other hand-finishing tasks. More on that to come.
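For a sense of what that hand-finishing looks like in practice, here is a rough sketch using the OpenTelemetry Python API. The span name, attributes and event are illustrative, and without an SDK and exporter configured these calls are no-ops.

```python
from opentelemetry import trace

# A rough sketch of the hand-finishing described above, using the OpenTelemetry
# Python API. The span name, attributes and event are illustrative; without an
# SDK and exporter configured, these calls are no-ops.
tracer = trace.get_tracer("checkout")


def apply_discount(order_id: str, code: str) -> None:
    # A manually defined span: an event boundary that out-of-the-box
    # auto-instrumentation cannot infer on its own.
    with tracer.start_as_current_span("apply_discount") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("discount.code", code)

        # Record a domain-level event on the span instead of an unstructured log line.
        span.add_event("discount.rejected", {"reason": "code_expired"})


apply_discount("ord_8821", "SUMMER24")
```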
If you’d like to learn how Invaria (formerly Patchwork) could help you reach this goal, get in touch!
Contact
Fill out the contact form or email us at: info@invaria.io