Hypothesis-first: the debugging mode that saves real time
When I tell my agent where to look, it should look there first. Not enumerate. The reason it matters: a decade of pattern recognition that can't be reconstructed from first principles every time.
A pattern kept showing up yesterday across separate fixes. Every time my agent widened its investigation before validating the most likely cause, it cost time. Every time it led with the likely cause, things moved fast. Worth writing down.
The haiku problem
I had Laura and Alex configured to default to Haiku for cost reasons. That made sense at the time. Their daily check-in tasks are simple. Haiku handles them at a tenth the cost of Sonnet, no reason to burn budget on routine polling.
Then I changed my mind. The cost difference was negligible compared to the friction of agents being slightly dumber for routine work that occasionally needed real reasoning. So I told Ender to fix it.
When the agent dug in, I had a clear hypothesis: something in the crontab was overriding the model config at runtime. I'd built the cron; I knew the shape of the bug.
The agent's instinct was to spin up two parallel investigation agents and enumerate broadly. Wrong move. I had named the likely cause. The right play was a targeted check or two: read the crontab, read the config file, prove or disprove the hypothesis in five seconds. Instead, the parallel investigation added turns and noise before landing on exactly what a single crontab check would have surfaced immediately.
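The discipline here can be sketched as a tiny procedure. The check names and stub results below are hypothetical illustrations, not the actual agent code: the point is only the ordering.

```python
# Hypothesis-first investigation: run the operator's named check before
# enumerating anything. Checks return a finding, or None if they miss.

def investigate(hypothesis_check, broad_checks):
    """Run the operator's hypothesis first; enumerate only if it fails."""
    name, check = hypothesis_check
    result = check()
    if result is not None:
        return name, result  # hypothesis confirmed: stop here, no fan-out
    # Only after the named candidate fails does the search widen.
    for name, check in broad_checks:
        result = check()
        if result is not None:
            return name, result
    return None

# Illustrative run: the crontab override is the named hypothesis and hits,
# so the broad checks never execute.
hypothesis = ("crontab", lambda: "model override found in cron line")
broad = [("env vars", lambda: None), ("config file", lambda: None)]
print(investigate(hypothesis, broad))
```

The design choice is that broadening is the fallback branch, not a parallel branch: the cheap, pointed check gets a chance to end the investigation before any fan-out spends turns.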
Worth being clear: the system worked the way I'd configured it. Haiku saved me money for weeks of cheap polling. That was the right setup at the time. The thing that wasn't fine was that when I told the agent where to look, it didn't look there first.
The Resend integration
The DKIM setup and Alex's swap to Resend went cleanly. The Resend API turned out to be directly driveable via curl with env creds, which meant the entire DNS-add plus verify loop ran programmatically. No copy-paste of DNS records back and forth. That's the right shape for this kind of work. Find the API surface that eliminates the human-in-the-loop handoff.
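The shape of that loop is worth pinning down. This is a minimal sketch of the programmatic verify step, with a stub standing in for the provider's domain-status call; the actual Resend endpoints and payloads are not reproduced here.

```python
import time

def wait_until_verified(get_status, timeout_s=300, poll_s=5):
    """Poll a domain-status function until it reports 'verified'.

    get_status is any callable returning the current status string,
    e.g. a thin wrapper around a GET to the provider's domains API.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "verified":
            return True
        time.sleep(poll_s)
    return False

# Stub in place of the real API call: DNS propagates after two polls.
statuses = iter(["pending", "pending", "verified"])
print(wait_until_verified(lambda: next(statuses), timeout_s=10, poll_s=0))
```

Because the status source is injected, the same loop drives any provider that exposes verification state over an API, which is exactly what removes the human copy-paste step.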
The one stumble: Resend's webhook API uses endpoint, not endpoint_url. The first POST returned a 422. Trial and error recovered it on the second try.
That should not have happened. The Resend API docs are public. The SDK examples are right there. Reading the schema before posting is the difference between getting it on the first try and wasting a request and a turn on a dumb mistake. For an agent that's supposed to be careful with API surfaces, guessing at field names is the kind of small failure that compounds across a hundred integrations.
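One cheap way to enforce "read the schema before posting" is to check the request body against the documented field names locally. The allowed-field set below is a hand-written stand-in, not Resend's actual published schema; only the endpoint-versus-endpoint_url detail comes from the incident above.

```python
# Validate a request body against documented field names before sending,
# so a guessed key fails locally instead of costing a 422 and a turn.

DOCUMENTED_WEBHOOK_FIELDS = {"endpoint", "events"}  # assumed subset, for illustration

def validate_payload(payload, allowed_fields):
    """Reject any key that does not appear in the documented schema."""
    unknown = set(payload) - allowed_fields
    if unknown:
        raise ValueError(f"undocumented fields: {sorted(unknown)}")
    return payload

# The guessed field name from the incident is caught before the POST:
try:
    validate_payload({"endpoint_url": "https://example.com/hook"},
                     DOCUMENTED_WEBHOOK_FIELDS)
except ValueError as e:
    print(e)  # undocumented fields: ['endpoint_url']
```

Transcribing the docs into a set like this is exactly the kind of artifact a skill file can carry, so the check runs on every integration rather than relying on the agent remembering to read.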
This is the rule going forward at OAL. Before implementing any integration with a third-party service, the relevant API docs get transcribed into the skill file. No guessing. No trial-and-error on field names that are documented in plain English on the API provider's site. The skill audit checks for this. The skill system enforces it on every implementation.
The pattern
Both of these compress to the same thing. My agent moves faster when it trusts what I tell it and reads what's already documented. When debugging, enumerate only after the obvious candidate fails. When integrating a new API, read the schema before posting.
This isn't a novel insight. What makes it worth writing down is that it kept getting violated in distinct contexts on the same day, which usually means it needs to be an explicit step in the algorithm, not just good judgment.
Why this actually works
The deeper reason hypothesis-first matters: I have a decade of pentest mapping in my head. When I look at a system, I see attack surfaces, dependency chains, and likely failure modes that the agent can't reconstruct from first principles every time it debugs. Some of that I've written down. Most of it I haven't. It's ambient knowledge from spending ten years watching systems break for a living.
The agent's job isn't to replicate that knowledge. The agent's job is to take my hypothesis as the starting point and run it down, not to assume parity. When I say "check the crontab," there is information in that pointer that doesn't live in any file the agent can read. That information is the reason hypothesis-first wins.
The same principle applies to API docs. The API provider spent time writing the schema. Reading it isn't extra work. Skipping it isn't speed. It's a tax I pay later.
Hypothesis-first isn't a mindset. It's a procedure. When the operator who built the system points at the likely cause, that's not one hypothesis to weigh against parallel theories. That's the answer most of the time. Try it first. Broaden only if it fails.