Built isn’t done: operability as a first‑class requirement
Shipping code is the easy part. Keeping it running, debuggable, and supportable under real traffic is where projects either mature… or quietly bleed time. Here’s how I think about operability as a requirement, not an afterthought.
Most teams don’t ship “a system”. They ship a thing that works right now, in one environment, with one person holding the context in their head.
And look, that’s normal. It’s how software gets made. But at some point—usually when the stakes go up, or the traffic goes up, or the number of people touching the code goes up—you hit the real dividing line:
Does this thing keep working when nobody’s watching?
That’s operability. And it’s one of those topics that sounds like infrastructure navel-gazing until you’ve lived through a Friday afternoon incident where the only clue you have is a generic 500 and a vague sense of dread.
I’m writing this because I keep seeing the same pattern: teams treat operability as “nice to have”, or as something you do once you’ve shipped the feature work. But in high-stakes systems (commerce is the obvious example, though it applies everywhere), operability is a product requirement.
Not because it’s fashionable. Because it’s cheaper than chaos.
What I mean by “operability”
Operability is the set of properties that lets a team:
- understand what the system is doing (without guessing)
- detect problems early (before customers tell you)
- fix problems quickly (without heroics)
- make changes safely (without fear)
- recover when something breaks (because it will)
It’s not just monitoring. It’s not just logging. And it’s definitely not “we have a dashboard somewhere”.
Operability is about running the system as a service, even if you don’t call it that.
Why “works in staging” doesn’t count
I’ve been on projects where the feature scope was “done”, QA signed off, UAT was green, and we still weren’t ready to go live.
Because nobody could answer basic questions like:
- If orders stop flowing, how will we know?
- If a webhook fails, can we replay it?
- If an integration creates duplicates, can we detect them automatically?
- If a job runs late, who gets alerted?
- If we deploy a change, how do we roll it back?
- If we get a spike in errors, what’s the first place we look?
If the answers are “we’ll check manually” or “we’ll just look in the logs” (where logs are a pile of uncorrelated text), then the system isn’t actually ready. It’s just… hopeful.
Hope is not a plan. You know this.
Operability as a requirement: how I frame it in discovery
When I’m doing discovery (or even just early delivery alignment), I like to add a very explicit “operability” section alongside functional requirements.
Functional requirements are the things the system must do.
Operability requirements are the things the system must make possible for the team running it.
Here’s a simple set of prompts I use to pull this out of stakeholders and the delivery team:
The “how will we know?” prompts
- How will we know this flow is healthy?
- What’s the acceptable delay? (e.g. orders to ERP within 2 minutes vs within 2 hours)
- What does “partially working” look like?
The “how will we fix it?” prompts
- What’s the recovery path when it fails?
- Can we rerun jobs safely?
- Can we reprocess messages/events safely?
- Who can do this? Engineers only, or ops/support too?
The “how will we prove it?” prompts
- What metrics prove this is working?
- What evidence do we need for stakeholders? (often finance/ops)
- What audit trail is required? (especially with payments/refunds)
When you ask these early, two things happen:
- You uncover hidden complexity while you can still design for it.
- You stop the “we’ll add monitoring later” fantasy before it becomes a deadline problem.
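To make that concrete: an “acceptable delay” answer tends to translate almost directly into a check. Here’s a minimal sketch, assuming you can query the timestamp of the oldest item still waiting to sync. The function name, the threshold constant, and the idea of an “oldest unsynced” timestamp are all illustrative assumptions, not a prescribed design.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold, taken from the "orders to ERP within 2 minutes" example.
MAX_ORDER_DELAY = timedelta(minutes=2)


def flow_is_stale(oldest_unsynced_at, now, max_delay=MAX_ORDER_DELAY):
    """Return True when the oldest waiting item has exceeded the agreed delay.

    oldest_unsynced_at is None when nothing is waiting; this sketch treats
    that as healthy.
    """
    if oldest_unsynced_at is None:
        return False
    return (now - oldest_unsynced_at) > max_delay


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
late = flow_is_stale(now - timedelta(minutes=5), now)     # waiting 5 minutes
on_time = flow_is_stale(now - timedelta(seconds=30), now)  # waiting 30 seconds
idle = flow_is_stale(None, now)                            # nothing waiting
```

One caveat worth discussing in discovery: “nothing waiting” can also mean orders have stopped arriving at all, so a check like this usually wants a companion that watches inbound volume.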
The operability stack (what I actually want in place)
This is the practical bit: what I typically want a project to include, at minimum. Not because I love checklists, but because I hate discovering gaps under pressure.
1) Observability: logs, metrics, and traces (with correlation)
At a minimum:
- Structured logs (not just strings)
- Correlation IDs that propagate across boundaries (request → webhook handler → job → downstream API)
- Metrics for key flows (counts, latency, error rates)
- Dashboards that answer real questions (“are orders flowing?” not “CPU is 12%”)
- Alerting that pages the right humans at the right time
A note on correlation IDs: they’re boring until they’re the difference between “we fixed it in 10 minutes” and “we spent 3 hours trying to reproduce it”.
If you do one thing, do that.
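Here’s roughly what that can look like, as a sketch using only the Python standard library. The `X-Correlation-ID` header is a common convention rather than a standard, and the logger name, the `order_id` field, and the function names are assumptions made up for illustration.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the current request, job, or webhook delivery.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be filtered by field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            # Hypothetical business field; callers pass it via `extra=`.
            "order_id": getattr(record, "order_id", None),
        })


def build_logger(stream=None):
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def sync_to_downstream(logger, order_id):
    # The ID is not passed as an argument; the context variable carries it.
    outbound_headers = {"X-Correlation-ID": correlation_id.get()}
    logger.info("calling downstream", extra={"order_id": order_id})
    return outbound_headers


def handle_webhook(logger, headers, order_id):
    # Reuse the caller's ID when present so the trail spans both systems.
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("webhook received", extra={"order_id": order_id})
        sync_to_downstream(logger, order_id)
        return cid
    finally:
        correlation_id.reset(token)


logger = build_logger()
handle_webhook(logger, {"X-Correlation-ID": "abc-123"}, order_id=42)
```

Both log lines come out as JSON carrying the same `correlation_id`, so one search pulls up the whole path. Background jobs need the ID stored alongside the job payload and set again when the worker picks it up, since a context variable does not cross process boundaries.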
2) Control points: feature flags, kill switches, and safe rollbacks
If the system touches money, customer experience, or operational flow, I want:
- feature flags for risky changes (even simple ones)
- a kill switch for known-dangerous flows (e.g. stop a broken integration from duplicating orders)
- a clear rollback plan that somebody has actually rehearsed, at least mentally
This is not pessimism. It’s professionalism.
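A kill switch doesn’t need to be sophisticated. Here’s a minimal sketch, assuming an in-memory flag store standing in for wherever your flags really live (a config service, a database row). The `Flags` class, the `erp_sync_enabled` flag name, and the `send`/`park` callables are all hypothetical.

```python
class Flags:
    """In-memory stand-in for a real flag store."""

    def __init__(self, values=None):
        self._values = dict(values or {})

    def enabled(self, name, default=False):
        return self._values.get(name, default)

    def set(self, name, value):
        self._values[name] = value


def push_order_to_erp(order, flags, send, park):
    # Kill switch: when the flag is off, park the order for later replay
    # instead of sending it downstream.
    if not flags.enabled("erp_sync_enabled", default=True):
        park(order)
        return "parked"
    send(order)
    return "sent"


flags = Flags()
sent, parked = [], []
first = push_order_to_erp({"id": 1}, flags, sent.append, parked.append)

flags.set("erp_sync_enabled", False)  # someone flips the switch mid-incident
second = push_order_to_erp({"id": 2}, flags, sent.append, parked.append)
```

The detail that matters is that switching the flow off parks the work rather than dropping it, so flipping the switch back on (plus a replay) gets you to a clean state. Whether a missing flag should default to on or off is a judgment call per flow.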
3) Idempotency and replay for integration work
In integration-heavy systems, the ability to replay safely is an operability requirement.
That usually means:
- idempotency keys
- dedupe stores
- event logs / message persistence
- dead-letter queues (or at least a “failed events” table)
- admin tooling or scripts to reprocess within defined bounds
If the only way to recover is “ask the third party to resend it”, you’ve built a hostage situation, not an integration.
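To show how those pieces fit together, here’s a small sketch using SQLite as both the dedupe store and the “failed events” table. The table names, function names, and return values are assumptions for illustration. It is single-worker only: the check-then-insert is not safe under concurrent consumers, and a crash between the handler running and the key being recorded would cause a rerun, so the handler’s own side effects still need to tolerate repeats.

```python
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS processed (idempotency_key TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS failed_events "
        "(idempotency_key TEXT PRIMARY KEY, payload TEXT, error TEXT)"
    )
    return db


def process_event(db, key, payload, handler):
    """Skip keys already recorded as processed; park failures for replay."""
    seen = db.execute(
        "SELECT 1 FROM processed WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if seen:
        return "duplicate"
    try:
        handler(payload)
    except Exception as exc:
        # Keep the payload so recovery never depends on the sender resending.
        db.execute(
            "INSERT OR REPLACE INTO failed_events VALUES (?, ?, ?)",
            (key, payload, str(exc)),
        )
        db.commit()
        return "failed"
    db.execute("INSERT INTO processed VALUES (?)", (key,))
    db.execute("DELETE FROM failed_events WHERE idempotency_key = ?", (key,))
    db.commit()
    return "processed"


def replay_failed(db, handler, limit=100):
    """Reprocess parked events, bounded by limit."""
    rows = db.execute(
        "SELECT idempotency_key, payload FROM failed_events LIMIT ?", (limit,)
    ).fetchall()
    return [process_event(db, key, payload, handler) for key, payload in rows]
```

Because replay goes through the same `process_event` path, rerunning it is safe for anything already marked processed, and the `limit` gives you the “within defined bounds” part.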
4) Runbooks: short, specific, usable
A runbook should be something a tired person can follow at 2am.
I like runbooks to be:
- short
- linked to dashboards
- explicit about ownership and escalation
- written like instructions, not essays
Example runbook headings:
- Symptoms
- Likely causes
- Checks (with links)
- Safe actions (what won’t make it worse)
- Escalation path
- How to confirm recovery
5) Data quality and reconciliation
This is where systems quietly rot.
If you have:
- orders, refunds, settlements
- inventory movements
- fulfilments and tracking
- returns and exchanges
…then you need a plan for:
- detecting mismatches
- explaining them
- fixing them
I treat reconciliation like a feature. Not a finance chore.
Because when it’s missing, the team ends up doing manual “daily reconciliation” in spreadsheets. Every day. Forever.
And that is exactly how you lose time “one day at a time”.
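The “detecting mismatches” part is often less code than people expect. Here’s a minimal sketch, assuming both sides can be reduced to a mapping of reference to amount. The references, amounts, and category names are invented for illustration; real reconciliation usually also has to deal with timing windows, currencies, and partial refunds.

```python
from decimal import Decimal


def reconcile(ours, theirs):
    """Compare two {reference: amount} mappings and classify every mismatch."""
    report = {"missing_in_theirs": [], "missing_in_ours": [], "amount_mismatch": []}
    for ref in sorted(ours.keys() | theirs.keys()):
        if ref not in theirs:
            report["missing_in_theirs"].append(ref)
        elif ref not in ours:
            report["missing_in_ours"].append(ref)
        elif ours[ref] != theirs[ref]:
            report["amount_mismatch"].append((ref, ours[ref], theirs[ref]))
    return report


# Hypothetical data: our order totals vs. a payment provider's settlements.
orders = {"A1": Decimal("10.00"), "A2": Decimal("25.50"), "A3": Decimal("7.99")}
settlements = {"A1": Decimal("10.00"), "A2": Decimal("25.00"), "A4": Decimal("3.00")}
report = reconcile(orders, settlements)
```

Here `A3` shows up as missing on their side, `A4` as missing on ours, and `A2` as an amount mismatch. Run something like this on a schedule and alert when the report is non-empty, and the spreadsheet ritual has a chance of going away.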
A simple model: the Operability Loop
This is the mental model I keep coming back to: operability is a loop, not a toolkit. Observe, Detect & Triage, Mitigate, Recover & Validate, Learn & Improve, then back to Observe.
If you’re missing a step, you’ll feel it.
- No Observe? You find out from customers.
- No Detect & Triage? You argue about what’s happening.
- No Mitigate? You can only “fix forward”, which is risky.
- No Recover & Validate? You don’t know if it’s actually resolved.
- No Learn & Improve? Same incidents repeat, just with different names.
The uncomfortable truth: operability is a delivery responsibility
This is where things get slightly political.
Operability often dies in the gaps:
- product thinks it’s “engineering stuff”
- engineering thinks it’s “platform stuff”
- platform thinks it’s “not in scope”
- delivery thinks it’s “post-launch hardening”
And so it becomes nobody’s responsibility… until it’s everyone’s emergency.
My preference is simple:
- Make operability explicit in scope.
- Give it tickets.
- Put it on the project plan.
- Demo it, like a feature.
Because it is a feature. It’s just a feature for the team running the product.
Practical checklist: what I’d ask for on almost any project
If you’re in discovery or early delivery, here’s a starter checklist you can steal. It’s intentionally short.
Health and visibility
- [ ] Do we have a dashboard that answers “is the system healthy?”
- [ ] Do we have alerts for the top 3 revenue/ops-critical flows?
- [ ] Can we trace one transaction end-to-end with a correlation ID?
Recovery
- [ ] Can we replay failed events/jobs safely (idempotent)?
- [ ] Do we have a “stop the bleeding” mechanism (flag/kill switch)?
- [ ] Do we have a rollback plan for the last release?
Ownership and support
- [ ] Who is on point during launch?
- [ ] Who is on point after launch?
- [ ] What’s the escalation path when it breaks?
Evidence and auditability
- [ ] What do we log that proves the flow worked?
- [ ] What do we store that allows reconciliation later?
- [ ] Do we know what “correct” looks like for money and inventory?
If you can’t answer these, that doesn’t mean you’ve failed. It means you’ve found the next set of requirements.
Which is the whole point.
Closing thought
Shipping code is satisfying. But the real win is shipping something that keeps working when conditions are messy: peak traffic, vendor outages, partial failures, human error, rushed changes, all of it.
That’s not magic. It’s design.
And if you treat operability as a first-class requirement from day one, you don’t eliminate incidents, but you dramatically reduce the cost of them.
One day at a time, in the good way.
