Anton May, April 2026
It has now gotten to the point where every fifth post I see on my social platforms is someone complaining about Claude usage limits. The second post in every five is then inevitably some idiot's suggestion on how to make your agent more efficient. I'm about to be that idiot, hello!
Let me introduce you to my biggest gripe with all modern LLM-based agent systems: the lack of reuse.
Every time your agent handles a user request, it fires off a fresh round of LLM calls. The same request, the same reasoning, the same API calls, the same output (if you're lucky), and you pay for all of it again. You are re-deriving an answer you already have. Most agent frameworks treat inference as a runtime cost, like electricity, when it should be a BUILD cost: a one-time investment that can be reused. This is the N+1 query problem of the AI era.
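The build-vs-runtime framing can be put in cache form with a toy sketch. Everything here is an invented stand-in: callLLM pretends to be the expensive inference call, and the Map is the artifact store.

```javascript
// The complaint in cache form. callLLM is a stand-in for a real
// (expensive) inference call; the Map is the "build artifact" store.
const artifacts = new Map();
let inferenceCalls = 0;

function callLLM(request) {
  inferenceCalls += 1; // pretend each of these costs real money
  return `answer for: ${request}`;
}

function handle(request) {
  // Derive the answer once, then reuse it: inference as a build cost.
  if (!artifacts.has(request)) artifacts.set(request, callLLM(request));
  return artifacts.get(request);
}
```

Call handle with the same request a hundred times and inferenceCalls stays at one. Real requests are of course never byte-identical, which is exactly why the rest of this post is about matching on intent rather than on the literal string.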
I remember being so disappointed at OpenClaw when I finally looked into its structure. It's an incredibly wasteful tool and doesn't deserve the kind of reverence it's receiving. Making an unsupervised LLM call every 15 minutes to check everything's ok by passing in your entire system prompt and allowing retries? And then folks are SURPRISED to wake up to $1000 bills overnight...
Skills are all the rage. Apparently if you bloat your LLM's context even more, it won't have to go through its reasoning stages again to complete the same task it (or someone else) has done previously. There are now a million skills marketplaces to match the millions of crappy OpenClaw clones. There are even tools like hermes-agent which are reasonably well-formulated and semantically fetch skills as needed, but which also constantly fix those skills so they stay ready for reuse and reflect what the user has already said they want. Great!
However, I firmly believe skills (i.e. paragraphs of instructions) as modules of pre-defined intelligence were a misconception from the start. They were supposed to complement small tools, which benefit from being chainable, and which fail gracefully because the LLM can "improvise" around a broken one. But what is actually stopping us from making bigger tools with the LLM itself, on demand, and recursively correcting them when needed?
The vision is that people can build a shared library of auto-healing and expanding variable-parameter mini-apps (say in QuickJS) which are retrieved, modified, and generated on demand. Having worked with local models, I am positive this is the way things will move. It turns the majority of executions into RAG-based tasks with argument filling, and shares the load for difficult tasks between members of a broader user base.
A user describes what they want in natural language. They interact with a cheap model, e.g. whatever they can run on their local hardware. This local model decides if the user is asking for something actionable, and if so performs an embedding-based retrieval of relevant mini-apps from the library to approach the request (say it is fed the top 100). Its context on each of these only needs to be a set of typed input arguments, a typed output, and a short description of the tool.

The cheap model then only has to decide: a) the correct mini-app exists in the exact form the user requested, in which case it fills in arguments and executes; or b) it doesn't, in which case it defers to some self-supervised agentic programming setup like Claude Code in a Ralph loop, which is far from perfect but works fine for under 1000 lines in a single file. You can even get the coding agent to expand the capabilities of existing mini-apps rather than creating them fresh.

Either way, it spits out a deterministic script: not a prompt or a chain-of-thought, but an actual program with hardcoded values. The script gets tested in a sandbox. From then on it runs without any LLM involvement (unless explicitly outlined as app logic): no tokens, no reasoning, no hallucinations. The LLM was the compiler and the script is the compiled artifact. Nobody re-runs gcc every time they execute a program, but when the source language is English and the target is JavaScript we seem to have forgotten that.
Generation is maybe 20% of the problem though. The interesting part is what comes after: you validate the script in a sandbox before it touches anything real, then you monitor it in production with standard observability. When it fails, you classify why. Only if it's a genuine code bug do you call the LLM back to read the error, patch the script, test the patch, and deploy it. One LLM call fixes every future execution, and if the fix makes things worse you roll back. When the user needs something the script doesn't cover yet, you expand the existing code rather than generating from scratch. The codebase grows organically, each piece tested and deterministic.
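A minimal sketch of that triage, assuming hypothetical hooks for the patch step, retries, and rollback. The classification rules here are placeholders, not a recommendation; the point is only that the LLM sits behind one branch out of three.

```javascript
// Decide why a compiled script failed. Only genuine code bugs should
// wake the LLM; everything else is ordinary ops work.
function classifyFailure(error) {
  if (error.name === "TypeError" || error.name === "RangeError") return "code_bug";
  if (/timeout|ECONNREFUSED/i.test(error.message)) return "transient";
  return "environment";
}

// Hooks (patchWithLLM, rollback, retry) are stand-ins for whatever your
// observability stack and coding agent actually provide.
function handleFailure(script, error, { patchWithLLM, rollback, retry }) {
  switch (classifyFailure(error)) {
    case "transient":
      return retry(script); // no tokens spent
    case "code_bug": {
      const patched = patchWithLLM(script, error); // the one LLM call
      return patched.testsPass ? patched : rollback(script);
    }
    default:
      return rollback(script); // env drift: don't let the LLM "fix" infra
  }
}
```

Note the code_bug branch still gates the patch on its own tests before it replaces anything, and falls back to rollback if the fix makes things worse.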
Out of that full lifecycle (generate, validate, deploy, monitor, self-heal, expand) the LLM participates in three stages and is absent from the other three, which are the ones that run constantly.
AI writes code fast but someone still has to verify it. If you're verifying on every execution you're drowning; if you verify once at compile time and the artifact is deterministic, you're done. Your compiled scripts also don't care when the provider silently downgrades the model or tweaks the safety filters, because they never call the model at all. And as the library of compiled scripts grows, the match rate for new requests climbs, so inference cost amortises over time rather than scaling linearly with usage.
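The amortisation claim is easy to make concrete with a toy cost model. All numbers are invented, in integer cents, and ignore the (real) cost of misses and repairs.

```javascript
// Interpreter-style agent: every request pays full inference.
function interpreterCostCents(requests, centsPerInference) {
  return requests * centsPerInference;
}

// Compiler-style agent: each *distinct* task pays one (pricier) compile;
// every repeat request replays the cached artifact for free.
function compilerCostCents(requests, distinctTasks, centsPerCompile) {
  return Math.min(requests, distinctTasks) * centsPerCompile;
}
```

With 10,000 requests spread over 200 distinct tasks, at 1¢ per inference and a steeper 5¢ per compile, the interpreter pays 10,000¢ and the compiler pays 1,000¢, and the gap widens with every repeat request because compile cost is capped by the number of distinct tasks, not by traffic.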
If you take this seriously, the LLM's role in your architecture shifts. It's not the brain of your system; it's the brain of your build pipeline. Most agent frameworks today are interpreters that parse, reason about, and execute every request from source at full cost with no guarantee of consistency. What you probably want is a compiler that does that work once, emits a deterministic artifact, and gets out of the way.
I'm calling it: significantly higher dependence on much larger tools, with ever-decreasing use of skills, over the course of the next year. Because broken tools can be fixed.
The fastest LLM call is the one you made last month.