I'm personally 100% convinced of the opposite: it's a waste of time to steer them. We know now that agentic loops can converge given the proper framing and self-reflection tools.
Converge towards what though... I think the level of testing/verification you need to have an LLM output a non-trivial feature (e.g. Paxos/anything with concurrency, business logic that isn't just "fetch value from spreadsheet, add to another number and save to the database") is pretty high.
In this new world, why stop there? It would be even better if engineers were also medical doctors and held multiple doctorate degrees in mathematics and physics and also were rockstar sales people.
It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.
There's a balance to strike between micro-management and no steering at all.
Most prompts we give are severely information-deficient. The reason LLMs can still produce acceptable results is because they compensate with their prior training and background knowledge.
The same applies to verification: it's fundamentally an information problem.
You see this exact dynamic when delegating work to humans. That's why good teams rely on extremely detailed specs. It's all a game of information.
Having prompts be information deficient is the whole point of LLMs. The only complete description of a typical programming problem is the final code or an equivalent formal specification.
Does the AI agent know what your company is doing right now, what every coworker is working on, how they are doing it, and how your boss will change priorities next month without being told?
If it really knows better, then fire everyone and let the agent take charge. lol
For me, it still asks for confirmation at every decision when using plans. And when multiple unforeseen options appear, it asks again. I don’t think you’ve used Codex in a while.
A significant portion of engineering time is now spent ensuring that yes, the LLM does know about all of that. This context can be surfaced through skills, MCP, connectors, RAG over your tools, etc. Companies are also starting to reshape their entire processes to ensure this information can be properly and accurately surfaced. Most are still far from completing that transformation, but progress tends to happen slowly, then all at once.
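To make that concrete, here's a minimal sketch of the "RAG over your tools" idea: rank internal docs against the task and prepend the best matches to the agent's prompt. Every name in it (Doc, the sources, the scoring function) is a hypothetical stand-in for a real embedding-based retrieval stack.

```python
# Minimal sketch of "RAG over your tools". The scoring here is naive term
# overlap standing in for real embedding similarity; sources are invented.
from dataclasses import dataclass

@dataclass
class Doc:
    source: str  # e.g. "jira", "confluence", "slack"
    text: str

def score(query: str, doc: Doc) -> float:
    # Stand-in for cosine similarity over embeddings: naive term overlap.
    q = set(query.lower().split())
    d = set(doc.text.lower().split())
    return len(q & d) / (len(q) or 1)

def build_context(query: str, docs: list[Doc], k: int = 3) -> str:
    """Format the k most relevant internal docs for injection into a prompt."""
    top = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    return "\n\n".join(f"[{d.source}] {d.text}" for d in top)
```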
This sounds like never. Most businesses are still shuffling paper and couldn’t give you the requirements for a CRUD app if their lives depended on it.
You’re right, in theory, but it’s like saying you could predict the future if you could just model the universe in perfect detail. But it’s not possible, even in theory.
If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.
If you can’t fully describe the thing, like some general “make more profit” or “lower costs”, you’re in paper clip maximizer territory.
> If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.
Trying to get my company to realize this right now.
Probably the most efficient way to work would be on a video call with the product person/stakeholder, the designer, and me, the one responsible for the actual code, so that we can churn through the now incredibly fast and cheap implementation step together in pure alignment.
You could probably do it async but it’s so much faster to not have to keep waiting for one another.
Maybe some day, but as a Claude Code user I see it make enough pretty serious screw-ups, even with a very clearly defined plan, that I review everything it produces.
You might be able to get away without the review step for a bit, but eventually (and not long) you will be bitten.
I use that to feed back into my spec development and prompting and CI harnesses, not steering in real time.
Every mistake is a chance to fix the system so that mistake is less likely or impossible.
I rarely fix anything in real time - you review, see issues, fix them in the spec, reset the branch back to zero and try again. Generally, the spec is the part I develop interactively, and then set it loose to go crazy.
This feels, initially, incredibly painful. You're no longer developing software, you're doing therapy for robots. But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
> You're no longer developing software, you're doing therapy for robots.
Or, really, hacking in "learning": building your know-how base.
> But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
Strong yes to both, so strong that it's curious Claude Code, Codex, Claude Cowork, etc., don't yet bake in an explicit knowledge evolution agent curating and evolving their markdown knowledge base:
Unlikely to help with benchmarks. Very likely to improve utility ratings (as rated by outcome improvements over time) from teams using the tools together.
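A sketch of what such a curation step could look like, assuming a per-repo markdown file the agent appends to and prunes; the file name and dedupe logic are made up:

```python
# Hypothetical post-session curation pass over a markdown knowledge base:
# append the new lesson, skip exact duplicates, keep the file small enough
# to always fit in context. A real curator would merge and rewrite entries.
from pathlib import Path

KB = Path("LESSONS.md")  # invented file name; could just as well be CLAUDE.md

def add_lesson(lesson: str) -> None:
    lines = KB.read_text().splitlines() if KB.exists() else []
    entry = f"- {lesson.strip()}"
    if entry not in lines:  # crude dedupe
        lines.append(entry)
    KB.write_text("\n".join(lines) + "\n")

add_lesson("Run the linter before declaring a task done.")
```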
For those following along at home:
This is the return of the "expert system", now running on a generalized "expert system machine".
I assumed you'd build such a massive set of rules (that Claude often does not obey) that you'd eat up your context very quickly. I've actually removed all plugins/MCPs because they chewed up way too much context.
It's as much about what to remove as what to add. Curation is the key. Skills also give you some levers to get the kind of context-sensitive instruction you need, though I haven't delved too deeply into them. My current total instruction set is ~2,500 tokens at the moment.
Reviewing what it produces once it thinks it has met the acceptance criteria and the test suite passes is very different from wasting time babysitting every tiny change.
True, and that's usually what I'm doing now, but to be honest I'm also giving all of its code at least a cursory glance.
Some of the things it occasionally does:
- Ignores conventions (even when emphasized in the CLAUDE.md)
- Decides to just not implement tests if it spins out on them too much (it tells you, but only as it happens, and that scrolls by pretty quick)
- Writes badly performing code (N+1 queries; see the sketch after this list)
- Does more than you asked (in a bad way, changing UIs or adding cruft)
- Makes generally bad assumptions
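To illustrate the N+1 item above, here's the pattern in miniature; the schema and data are invented for illustration:

```python
# The N+1 pattern: one query for the parents, then one query per parent
# for the children. The fix is a single joined round trip.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO order_items (order_id) VALUES (1), (1), (2);
""")

# N+1: 1 query for the orders, then N more (one per order) for the items.
for (order_id,) in conn.execute("SELECT id FROM orders").fetchall():
    conn.execute(
        "SELECT * FROM order_items WHERE order_id = ?", (order_id,)
    ).fetchall()

# The fix: one round trip, joined in SQL and grouped in memory if needed.
rows = conn.execute("""
    SELECT o.id, i.id FROM orders o
    JOIN order_items i ON i.order_id = o.id
""").fetchall()
```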
I'm not trying to be overly negative, but in my experience to date, you still need to babysit it. I'm interested though in the idea of using multiple models to have them perform independent reviews to at least flag spots that could use human intervention / review.
Sure, but none of those things requires you to watch it work. They're all easy to pick up on when reviewing a finished change, which ideally should come after its instructions have had it run linters, run sub-agents that verify it has added tests, and run sub-agents doing a code review.
I don't want to waste my time reviewing a change the model can still significantly improve all by itself. My time costs far more than the model's.
You give it tools so it can compile and run the code. Then you give it more tools so it can decide between iterations whether it got closer to the goal or not. Let it evaluate itself. If it can't evaluate something, let it write tests and benchmark itself.
I guarantee that if the criteria are very well defined and benchmarkable, it will do the right thing in X iterations.
(I don't do UI development. I do end-to-end system performance on two very large code bases. My tests can be measured, and the measure is simply binary: better or not. It works.)
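A minimal sketch of that binary better-or-not gate; the command being benchmarked is a placeholder for a real workload:

```python
# Sketch of a binary "better or not" gate an agent can run between
# iterations. The benchmarked command is a placeholder.
import statistics
import subprocess
import time

def bench(cmd: list[str], runs: int = 5) -> float:
    """Median wall-clock seconds for the command over several runs."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def better(candidate_s: float, baseline_s: float, margin: float = 0.02) -> bool:
    # Binary verdict: the candidate must beat the baseline by > 2%.
    return candidate_s < baseline_s * (1 - margin)
```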
There should never have been an "artisan era". We use computers to solve problems. You should always have been getting stuff done instead of bikeshedding over nitty-gritty details, like when people in the office spend weeks optimizing code... just to get the exact same output, in the exact same time, but now "nicer".
> Plenty of people are writing code without being paid for it.
This is rhetorically a non sequitur. As in: if you get paid (X), then you get stuff done (Y). But if you're not paid (~X), then what?
Not being paid doesn't mean one does or doesn't get stuff done; it has no bearing on it. So the parent wasn't saying anything about people who don't get paid (they can do whatever they want), but yes, at a job where you're paid, you'd better get stuff done instead of bikeshedding.
It depends on how much money and energy, in the form of man-hours, went into writing it in an artisan way in the first place. I've been in a lot of PR reviews where it was clear that the amount of back and forth we had was simply not worth it for the code we wrote.
I think you're both right. There's a time and place for beautifully crafted code, but there's also a place for a hot mess that barely passes its own non-existing tests, and for anything in between.
> there's also a place for a hot mess that barely passes its own non-existing tests
For a long time that place has been "the commercial software marketplace". Let's all stop pretending that the code coming out of shops until now has been something you'd find at a guild craft expo. It's always been a ball of spit and duct tape, which is why AI code is often spit and duct tape.
Yeah. Exactly the same as there should never be an “artisan era” for chairs, tables, buildings, etc.
Hell, even art! Why should art even be a thing? We are machines driven by neurons; feelings do not exist.
Might be your life, it ain’t mine. I’m an artisan of code, and I’m proud to be one. I might finally use AI one of these days at work because I’ll have to, but I’ll never stop cherishing doing hand-crafted code.
>> Yeah. Exactly the same as there should never be an “artisan era” for chairs, tables, buildings, etc.
It's funny you bring up those examples, because they have all moved on to the mass-manufacturing era. You can still get artisan-quality stuff, but it typically costs a lot more and there's a lot less of it, which is why mass manufacturing won. The same is going to happen with software. LLMs are just the beginning.
I live in a city where new houses are being built. They are ugly. Meanwhile, the ones that have stood for a long time have charm and feel homely.
I don’t know, I’m probably just a regular old man yelling at clouds, but I still think we’re going in the wrong direction. For pretty much everything. And for what? Money. Yay!
You're continuing to make good arguments for why mass production should exist _alongside_ artisanal craftsmanship. Broad availability of functional housing, albeit of questionable aesthetic appeal, is a good thing for easing housing shortages[0]; and it is also a good thing for (fewer) well-built, charming, individual homes to be available to those who want to spend more and get more.
[0] I'm extremely aware that there are other contributing factors to housing shortages. Tax Billionaires, etc. My metaphor still works despite not being total.
The difference is that end users don't interact with the code that the artisan created, and don't care what it "feels like". One type of code that I do agree should be artisanal is the interface end of libraries.
Yes, it's like artisanal plumbing or electrical wiring... all hidden behind walls. A plumber might take pride in the quality of his soldered joints, but artisanal? Who wants to pay for that?
> just to have the exact same output, exact same time, but now "nicer".
The majority of code work is maintaining someone else's code. That's the reason it is "nicer".
There is also the matter of performance and reducing redundancy.
Two recent pull requests I saw that were AI-generated did neither. Both attempted to recreate functionality from scratch rather than using industry-tested modules. One used csv instead of polars for the intensive work (illustrated in the sketch below).
So while they worked, they became an unmaintainable mess.
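For anyone unfamiliar with the csv-vs-polars gap, an illustrative contrast; the file and column names are invented, and this is obviously not the actual PR code:

```python
# Illustrative contrast, not the actual PR code. The stdlib csv module
# parses row by row in pure Python; polars does the same aggregation in
# native, vectorized code, which matters a lot on large files.
import csv
import polars as pl

# Row-by-row with csv: fine for small files, slow for intensive work.
total = 0.0
with open("sales.csv", newline="") as f:  # invented file/column names
    for row in csv.DictReader(f):
        total += float(row["amount"])

# The same sum in polars: one vectorized expression.
total_pl = pl.read_csv("sales.csv").select(pl.col("amount").sum()).item()
```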
You use computers to solve problems. I use computers to communicate and create art. For me, the code I write is first and foremost a form of self expression. No one paid me to write 99% of the code I've written in my life.
For a long time computers were so expensive they could only be used to do things that generate enough money to justify their purchase. But those days are long gone so computers are for much much more than just solving problems and getting stuff done. Code can be beautiful in its own right.
This exact mindset is what has led to the transition from quality products to commercialized crapware, not just in software but across all industries.
It sounds like you hate your job? To be sure, I've done plenty of grinding over my career as a software engineer, but in fact I coded as a hobby before it turned into a career, continued to code on the side, and now that I'm retired I still code.
I love my job FWIW. I work at performance engineering and we work with the most complex systems in the world (GB200/B300/...). Couldn't be happier.
But I just don't care if I have 5 layers of abstraction and SOLID principles and clean code and.... bah. I get it. I have an MSc in it and I've been doing this as a hobby and then professionally for decades now. It just doesn't matter. At the end of the day, we get paid to ship something that solves a problem.
It might be a novel problem. And it might be at the frontier of what we can do today. But it's still a problem that needs solving and the path we take is irrelevant from a user's perspective as long as it solves the problem.
I don't think they hate their job, just seem to be frustrated at slow bureaucratic processes and long code reviews which I've experienced too. After a while it can get aggravating as to why some people want to nitpick minute details of the code which slows down development overall. I am talking about cases where the initially submitted PR is perfectly fine, not grossly incorrect.
Oh wow, if we're talking about code reviews that's a different topic. I've never, FWIW, encountered "artisans" in code reviews. More like "that's not how I would have coded itsans" and "let me show you some new tricksans".
Yeah, to hell with code reviews. The best years of my career were when I was given carte blanche control over an entire framework, etc. When code reviews came along coding at work sucked.
If anything, the code reviews killed the artisanship.
90% of the CRs I've ever gotten have been "artisanal" just because nitpicking superficial nonsense is easier than meaningful critique, and even when the code is perfectly fine it looks more productive from a manager's perspective to nitpick a function name than to just respond with "lgtm".
Yeah, that's what I understood them to mean from "like when in the office people have been spending weeks on optimizing code... just to have the exact same output, exact same time, but now 'nicer'." Either way, there comes a time when the juice isn't worth the squeeze, so to speak, in terms of code optimization.
It's not that simple. That's how I started as well but now I have hooked up Gemini and GPT 5.2 to review code and plans and then to do consensus on design questions.
And then there's Ralph with cross-LLM consensus in a loop. It's great.
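A sketch of what a cross-model consensus gate can look like; the model IDs are invented and ask_model() is a placeholder for whichever API clients you actually wire up, not Ralph's real implementation:

```python
# Sketch of a cross-model consensus gate: several reviewers vote, and the
# change only proceeds on majority approval. Model IDs are invented and
# ask_model() is a placeholder for a real API client.
REVIEWERS = ["model-a", "model-b", "model-c"]

def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up the real API client here")

def consensus_review(diff: str, threshold: int = 2) -> bool:
    prompt = f"Review this diff. Reply APPROVE or REJECT:\n{diff}"
    approvals = sum(
        "APPROVE" in ask_model(model, prompt).upper() for model in REVIEWERS
    )
    return approvals >= threshold  # majority gates the merge
```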
8x GPUs per box. This has been the data center standard for the last eight-ish years.
Furthermore, they're usually NVLink-connected within the box (SXM modules instead of PCIe cards, although the physical link to the host is still PCIe).
This is important because the daughterboard provides PCIe switches, which usually connect NVMe drives, NICs, and GPUs together such that within that subcomplex there isn't any PCIe oversubscription.
Since last year, I'd argue the standard for a lot of providers is the GB200.
Fascinating! So each GPU is partnered with disks and NICs such that there's no bandwidth oversubscription within its 'slice'? (idk what the word is) And each of these 8 slices wires up via NVLink back to the host?
Feels like there's some amount of (software) orchestration for making data sit on the right drives or traverse the right NICs; guess I never really thought about the complexity of this kind of scale.
I googled GB200; it's cool that Nvidia sells you a unit rather than expecting you to DIY a PC yourself.
Usually it's 2-2-2 (2 GPUs, 2 NICs, and 2 NVMe drives on a PCIe complex). No NVLink here, this is just PCIe: under this PCIe switch chip there is full bandwidth; above it, bandwidth is usually limited. So, for example, going GPU-to-GPU over PCIe will walk:
GPU -> PCIe switch -> PCIe switch (most likely the CPU, with limited bandwidth) -> PCIe switch -> GPU
NVLink comes into the picture as a separate, second link between the GPUs: if you need to go GPU-to-GPU, you can use NVLink.
You never needed to DIY your stuff, at least not for the last 10 years: most hardware vendors (Supermicro, Dell, ...) will sell you a complete system with 8 GPUs.
What's nice about GH200/GBx00/VR systems is that you can use chip-to-chip NVLink between the CPU and GPU, so the CPU can access GPU memory coherently and vice versa.
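You can probe this topology from software: `nvidia-smi topo -m` prints the link matrix directly, and a few lines of PyTorch (assuming it's installed and the GPUs are visible) will tell you which pairs can do peer-to-peer:

```python
# Quick topology probe: which GPU pairs can talk peer-to-peer (NVLink or
# PCIe P2P under the same switch) versus bouncing through host memory?
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            p2p = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: {'P2P' if p2p else 'via host'}")
```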