Telemetry as a First‑Class Design Decision
Why timing and boundaries matter more than dashboards
Recently, I’ve been spending a lot of time trying to reason about systems without the data required to do so.
The data we have can answer questions, but not the ones we need to answer. It gives the appearance of understanding without the substance. That experience has forced me to return to a well‑worn path of advocating for when telemetry is designed, not just whether it exists.
Nearly every system I’ve worked on agrees telemetry matters. But most systems begin with only minimal signals, trusting they can add more later. As real debugging starts, the system grows in ways no one originally intended.
Early on, the work is about making something exist at all. Shipping a capability. Proving it works. Telemetry still feels abstract at that stage. Speculative. Easy to defer. When it does show up, it often arrives as a scatter of ideas, half‑owned and loosely connected. That chaos is treated as temporary.
It rarely is.
Over time, I’ve learned that telemetry has a temporal obligation. If it is not designed early, it will still arrive later. What changes is the shape it arrives in, and the kinds of questions it can support. By then, it is much harder to reason about. Much harder to trust.
Why telemetry gets deferred, even when we know better
Early system design optimizes for forward motion. It rewards momentum, not restraint. Telemetry work asks something different of us. It asks us to anticipate questions we have not yet earned. To choose constraints before the pressure to measure feels urgent. To accept that early answers will be incomplete and sometimes uncomfortable.
That combination makes telemetry easy to postpone. Ownership stays fuzzy. Design conversations remain high‑level. Ideas get tossed around without structure. Everyone agrees telemetry matters, but no one is quite sure what “good” looks like yet.
So we tell ourselves we’ll add it later.
The myth of “we’ll add it later”
By the time teams realize which questions they actually need to answer, the system has already taught itself what kind of truth it can tell.
Interfaces have hardened. Data shapes have become sticky. Early shortcuts now define what is observable and what is not. When telemetry is added at this stage, it inherits those assumptions. New questions are forced through old forms. Gaps get papered over. Signals are joined in ways no one originally intended.
This is not just technical debt.
It’s epistemic debt.
Epistemic debt is what accrues when a system looks measurable, but the measurements do not support reliable interpretation. The system begins answering questions it was never designed to answer, with a confidence it has not earned.
Chaos is normal. Lack of boundaries is not.
Even when telemetry is pushed for early, the work is rarely clean. Ideas are half‑formed. Coverage may initially be worse than what a legacy system provided. Signal quality fluctuates. This is fairly normal.
The mistake is not chaos.
The mistake is letting chaos harden without containment.
Treating telemetry as a first‑class design concern does not mean demanding perfect schemas on day one. It does not mean complete coverage or instant clarity. It means something more modest, and more important.
It means someone is accountable for intent.
It means there is a shared understanding of what the system is allowed to observe.
And, critically, what it is not.
Boundaries matter more than coverage
Not all telemetry serves the same purpose. Some signals exist to understand system health and behavior. Others exist to support learning or evaluation in the system itself, where context and consent are part of the design. When these signals are casually collapsed, things go wrong quietly.
Metrics get reused outside their design context. Signals acquire authority they were never meant to carry. Teams begin optimizing for what is visible rather than what is true.
Once these boundaries are crossed under pressure, they are extremely difficult to restore. Trust erodes. Interpretation becomes unstable. The system starts teaching people the wrong lessons.
The absence of boundaries is not neutral.
It actively shapes behavior.
What “first‑class” actually means
Calling telemetry a first‑class design decision is not about elevating dashboards or adding more data. It is about designing telemetry alongside system intent.
It means anticipating categories of questions, even if the answers are not yet known. It means being explicit about which joins are disallowed, not just which ones are convenient. It means treating telemetry as part of the system’s moral contract, not just its exhaust.
First‑class does not mean finished.
It means intentional.
It also does not mean maximal. First‑class telemetry includes deliberate absence, constraints, and boundaries that protect interpretation.
I’ve seen what this makes possible.
In one system, telemetry was treated as a first‑class design concern from the beginning. Not because the team knew exactly what questions they would need to answer later, but because they knew they would need to learn from real behavior. That early intent made it possible to reconstruct executions when something went wrong, test hypotheses against what actually happened, and improve the system without guessing.
The difference wasn’t the tooling.
It was the decision to encode intent early, while the system was still forming.
The cost of getting this wrong
When telemetry is added late and without boundaries, systems start to look more observable than they actually are. Visibility is mistaken for understanding. Decisions get made with misplaced confidence. Teams chase signals that feel authoritative but are poorly grounded.
None of this requires bad intent. It emerges naturally when measurement is treated as an afterthought rather than a design choice.
Designing for the questions you haven’t earned yet
Telemetry asks us to care earlier than we want to. Before success is obvious. Before failure is painful. Before the pressure to explain ourselves arrives.
But the systems we build will only ever tell the kinds of truth we allow them to tell. That decision is made earlier than most teams realize.
Treating telemetry as a first‑class design concern does not eliminate chaos. It gives chaos a container.
And sometimes, that difference is everything.
What changes in practice once you accept this is a separate conversation.
Alison + Wiggins

