Agentic Debt
In my second week at Tano, I tried to add a simple feature. I got stuck because Claude made changes in the wrong place. It turned out there were three different places in the codebase using almost the same frontend code, displaying slightly different UIs. Claude had made the change to a copy I wasn’t aware of, so when I ran it locally, nothing changed where I expected.
I took a step back, refactored, and reconciled the three copies into something I could understand. I deleted about 2,000 lines of code written just a few weeks prior. More importantly, the agent one-shotted the feature afterwards.
This feels like a new kind of problem. It’s not the classic technical debt - no human consciously chose to re-create three different UIs, all almost the same. It’s something more specific to how agents write code.
I’ve been calling it agentic debt.
The feedback loop
Unlike tech debt, agentic debt is self-reinforcing.
An agent writes code. It works. You ship it. The agent writes more code. Also works. You ship that too. But each time, it’s optimising for the task at hand: the PR, the feature, the immediate ask. It doesn’t have a model of “everything else” the way you do.1
Over time, you get locally optimal choices that add up to global architectural drift. Patterns get duplicated. Abstractions get half-implemented. Three different files handle the same concept in three slightly different ways.
And then the next agent comes along and tries to work with this codebase. It gets confused. It makes sub-optimal choices. It picks a wacky connection into the “jungle” side of the codebase - code nobody touches anymore. Because the first agent’s slop makes it harder for the second to reason clearly.
This is the agent slop feedback loop. Decay that feeds itself.
The context window trap
A potential counter-argument: none of this matters once context windows are large enough to fit the entire codebase. An agent could then figure everything out, since code is deterministic.
In a year or two? Maybe. Today, I don’t think so. Even with infinite context, I expect noise and inconsistency to break reasoning. Three duplicated patterns with subtle differences don’t become clearer with more context. They become three times as confusing. The model has to figure out which one is “right”, or whether they’re all slightly different for good reasons, or whether the last agent just copy-pasted without checking.
Human-understandable is agent-actionable
At least for now, the counter-intuitive finding is this: the best way to improve agent performance is to make the code simple enough for humans to model.
When I reconciled those three duplicated frontends into one clean pattern, I wasn’t doing it for the agent. I was doing it because I couldn’t build a clear mental model of the codebase with all that fragmentation. But once the code was clean, the agent could reason about it better too. Future changes became faster. Future refactors became easier.
This isn’t a coincidence. Agents struggle with the same things humans struggle with: inconsistency, implicit assumptions, and duplicated logic with subtle differences. A codebase that’s easy for a human to hold in their head is also easy for an agent to work with. The properties that make code maintainable haven’t changed just because the writer is an LLM.2
Tending the garden
This changes how I think about working fast. Are we trading away later velocity for shipping something new tomorrow?
Especially at startups, this can be the right trade-off. I think it was at Tano too. The point is to make it explicitly, not let it be the default. Sometimes, slower is smoother. And smoother is faster.
Some refactors become much easier with agents. Deduplication, for instance, is cheap labour now. But agility depends on whether you actually do the refactoring, or just keep sprinting.
A clean codebase means new people get onboarded quickly, which lets them steer agents better. We aren’t at a point yet where you can stop steering.
As the codebase grows, and you go from one to many engineers, stewardship becomes more important. Sure, you can hit approve, ship it, move on. But that slows down every future change. The slop shows up in how much attention the steerer pays.
You’re not necessarily writing the code anymore. But you are the maintainer. The gardener, if you will. Weeds will grow. Sometimes they’re needed to enrich the soil. Other times you need to trim them.
This is a different kind of engineering than what I was doing a few years ago. I used to write the code. Since I had to do the grunt work, I’d naturally make it easier for me to continue contributing. A lot of software best practices spun out of this.3
Now I’m the person making sure the code stays coherent. It’s less about “can I implement this feature?” and more about “does this feature fit into the system in a way that won’t confuse the next agent, or the next human?”4
Open questions
Model capabilities are improving pretty quickly. The scaffolding of today might be unnecessary tomorrow. There are a few things I’m not sure about.
- Can a gardener agent work? A daily monitor that audits every PR and flags duplicated logic, unnecessary API endpoints, architectural drift, and so on. Codex reviews Claude pretty well, so I think this is plausible. I don’t know whether it will be good enough, though. It’s one experiment I want to try soon.
- Code reviewer agents? With today’s models, I’m sure this doesn’t work well for stewardship. But in the future, I’m not sure. I think this arrives before agents that need no steering.
- Does gardening scale? At Tano, we’re a small team and the gardening is manageable. But what happens when you have 50 agents writing code across a large codebase? Does the gardening overhead grow linearly, or does it compound? I suspect it compounds, which means the gardener role becomes even more important as we scale up agentic development.
- When does refactoring pay for itself? Taking a step back to reconcile slop feels like lost momentum. But it was necessary preparation for the next burst of agent-driven velocity. I don’t have a good heuristic yet for when to pause and garden versus when to keep sprinting.
Until I figure it out, I garden between the sprints.
1. Even with enormous context windows. More on this later.
2. I suspect this is because the models are trained on human-written code and human explanations. Their “reasoning” mirrors human reasoning patterns. What’s legible to us is legible to them.
3. Which seems to suggest the best practices are changing. I think some core principles will remain, and others will endure because humans steer agents towards them. What about a post-stewarding world, when agents are much better at coding than me? I don’t know.
4. There’s a parallel here to how choosing what to build has become more important than just building. Taste matters more - more on this in a future post!