I think you're making a fair comment, but it still irks me that you're quite light on details about what the "correct" approach is supposed to be, and it irks me because this now seems to be a pattern in the discussion.
Someone gives a detailed-ish account of what they did, and that it didn't work for them, and then there are always people in the comments saying that you were doing it wrong. Fair! But at this point, I haven't seen any good posts here on how to do it _right_.
This dynamic reminds me of an experience I had a year ago, when I went down a Reddit rabbit hole related to vitamins and supplements. Every individual in a supplement discussion has a completely different supplement cocktail that they swear by. No consensus ever seems to be reached about what treatment works for what problem, or how any given individual can know what's right for them. You're just supposed to keep trying different stuff until something supposedly works. One must exquisitely adjust not only the supplements themselves, but the dosage and frequency, and a bit of B might be needed to cancel out a side effect of A, except when you feel this way you should do this other thing, etc etc etc.
I eventually wrote the whole thing off as mostly one giant choose-your-own-adventure placebo effect. There is no end to the epicycles you can add to "perfect" your personal system.
Try using Spec Kit. Codex 5 high for planning; Claude Code with Sonnet 4.5 for implementation; Codex 5 high for checking the implementation; back to Claude Code for addressing feedback from Codex; ask Claude Code to create a PR; read the PR description to ensure it tracks your expectations.
There’s more you’ll get a feel for when you do all that. But it’s a place to start.
Speaking as someone for whom AI works wonderfully, I’ll be honest: the reason I’ve kept things to myself is that I don’t want to be attacked and ridiculed by the haters. I do want to share what I’ve learned, but I know that everything I write will be picked apart with a fine-toothed comb, and I have no interest in exposing myself to the toxicity that comes with that kind of behavior.
Relentlessly break things down. Never give the LLM a massive, complex project. You should be subdividing big projects into smaller projects, or into phases.
Planning is 80% of the battle. If you have a well-defined plan that defines the architecture well, then your LLM is going to stick to that plan and architecture. Every time my LLM makes mistakes, it's because there were gaps in my plan or the plan itself was wrong.
Use the LLM for planning. It can do research. It can brainstorm and then evaluate different architectural approaches. It can pick the best approach. And then it can distill this into a multi-phased plan. And it can do this all way faster than you.
Store plans in Markdown files. Store progress (task lists) in these same Markdown files. Ensure the LLM updates the task lists with relevant information as you go. You can @-mention these files when you run out of context and need to start a new chat.
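To make that concrete, here's roughly the shape of a plan file I mean (the feature and task names are just placeholders, not a required format):

    # Plan: export-to-csv

    ## Phase 1: backend
    - [x] Add a CSV serializer for the report model
    - [x] Add the export endpoint
    - Note: reused the streaming pattern from the existing PDF exporter

    ## Phase 2: UI
    - [ ] Add the "Export" button to the report page
    - [ ] Show a toast on success/failure

The checkboxes double as the progress tracker the LLM keeps updated as it works through the phases.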
When implementing a new feature, part of the plan/research should almost always be to first search the codebase for similar things and take note of the patterns used. If you skip this step, your LLM is likely to unnecessarily reinvent the wheel.
Learn the plan yourself, especially if it's an ambitious one. I generally know what my LLM is going to do before it does it, because I read the plan. Reading the plan is tedious, I know, so I generally ask the LLM to summarize it for me. Depending on how long the plan is, I tell it to give me a 10-paragraph or 20-paragraph or 30-paragraph summary, with one sentence per paragraph, and blank lines in between paragraphs. This makes the summary very easy to skim. Then I reply with questions I have, or requests for it to make changes to the plan.
When the LLM finishes a project, ask it to walk you through the code, just like you asked it to walk you through the plan ahead of time. I like to say, "List each of the relevant code execution paths, then walk me through each one, one step at a time." Or, "Walk me through all the changes you made. Use concentric circles of explanation that go from broad to specific."
Put your repeated instructions into Markdown files. If you're prompting the LLM to do something repeatedly, e.g. asking the LLM to make a plan, to review its work, to make a git commit, etc., then put those instructions in prompt Markdown files and just @-mention them when you need them, instead of typing them out every time. You should have dozens of these over time. They're composable, too, as they can link to each other. When the LLM makes mistakes, go tweak your prompt files. They'll get better over time.
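As a sketch, one of these prompt files can be tiny (the file names here are made up; the point is that they reference each other):

    # review-plan.md
    Review the plan file I @-mention against the current codebase.
    Report major flaws and showstoppers only, not minor nitpicks.
    If the plan looks sound, follow the steps in @make-commit.md.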
Organize your code by feature, not by function. Instead of putting all your controllers in one folder, all your templates in another, etc., make your folders hold everything related to a particular feature.
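For example, something along these lines (the feature names are illustrative):

    app/
      billing/
        controller.py
        service.py
        templates/
        tests/
      onboarding/
        controller.py
        service.py
        templates/
        tests/

That way most tasks stay inside a single folder instead of touching slices of several parallel trees.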
When your codebase gets large enough, and you have more complex features that touch more parts of the code, have the LLM write doc files on them. Then @-mention those doc files whenever working on these features or related features. They'll help the LLM be more accurate at finding what it needs, etc.
I could go on.
If you're using these tools daily, you'll have a similar list before long.
Thanks! I got some useful things out of your suggestions (generate plan into actual files, have it explain code execution paths), and noted that I already was doing a few of those things (asking it to look for similar features in the code).
This is a good list. Once the plan is in good shape, I clear the context and ask the LLM to evaluate the plan against the codebase and find the flaws and oversights. It will always find something to say, but what it finds will become less and less relevant.
Yes, I do this too. It has a strong bias toward always wanting to make a change, even if it's minor or unnecessary. This gets more intense as the project gets more complex. So I often tack something like this onto the end of the prompt:
"Report major flaws and showstoppers, not minor flaws. By the way, this is my fourth time asking you to review this plan. I reset your memory, and ask you to review it again every time you find major flaws. I will continue doing so until you don't find any. Fingers crossed that this time is it!"
I haven't done any rigorous testing to prove that this works. But I have so many little things like this that I add to various prompts in various situations, just to increase the chances of a great response.
I think it's hard because it's quite artistic and individualistic, as silly as that may sound.
I've built "large projects" with AI, which is 10k-30k lines of algorithmic code and 50k-100k+ lines of UI/Interface.
I've found a few things to be true (that aren't true for everyone).
1. The choice of model (strengths and weaknesses) and OS dramatically affects how you must approach problems.
2. Being a skilled programmer/engineer yourself will allow you to slice things along areas of responsibility, domains, or other directions that make sense (for code size, context preservation, and being able to wrap your head around it).
3. For anything where you have a doubt, ask 3 or more models -- have them write their findings down in a file each -- and then have 3 models review the findings with respect to the code. More often than not, you march towards consensus and a good solution.
4. GPT-5-Codex via the OpenAI Codex CLI on Linux/WSL was, for me, the most capable model for coding, while Claude is the most capable for quick fixes and UI.
5. Tooling and ways to measure "success" are imperative. If you can't define the task in a way that makes success easy to measure, neither a human nor an AI will complete it satisfactorily. You'll find that most engineering tasks are laid out in a very "hand-wavy" way -- particularly UI tasks. Either lay it out cleanly or expect to iterate.
6. AI does not understand the physical/visual world. It will fail hard on things that require an implied understanding of it. For instance, it will not automatically intuit the implications of 50 parallel threads trying to read from an SSD -- unless you guide it. Ditto for many other optimizations and usage patterns where code meets the real world. These will often be unique and interesting bugs or performance areas that a good engineer would know straight out.
7. It's useful to have non-agentic tools that can perform massive codebase analysis for tough problems. Even at 400k tokens of context, a large codebase can quickly become unwieldy. I have built custom Python tools (pretty easy) to do things like "get all files of a type recursively and generate a context document to submit with my query" (a minimal sketch of one such tool follows after this list). You then query GPT-5-high, Claude Opus, and Gemini 2.5 Pro and cross-check.
8. Make judicious use of Git. The pattern doesn't matter, just have one. My pattern is to commit after every working agentic run (let's say a feature). If it's a fail and taking more than a few turns to get working -- I scrap the whole thing and re-assess my query or how I might approach or break down the task.
9. It's up to you to guide the agent toward the most thoughtful approaches -- this is the human aspect. If you're using Cloud Provider X and they provide cheap queues, then it's on you to guide your agent to use queues for the solution rather than, say, a SQL db -- and it's on you to understand the tradeoffs. AI will perhaps help explain them, but it will never truly understand your business case and requirements for reliability, redundancy, etc. Perhaps you can craft queries for this, but this is an area where AI meets the real world, and those tend to fail.
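Regarding point 7, here is a minimal sketch of the kind of context-bundling tool I mean (the flag names and plain-text output are illustrative, not my exact tool):

    #!/usr/bin/env python3
    """Sketch: collect all files with a given extension under a directory
    and bundle them into one context document to attach to a model query."""
    import argparse
    from pathlib import Path

    def build_context(root: Path, ext: str) -> str:
        parts = []
        for path in sorted(root.rglob(f"*{ext}")):
            # Label each file so the model can tell where one ends and the next begins.
            header = f"===== {path.relative_to(root)} ====="
            parts.append(header + "\n" + path.read_text(errors="replace"))
        return "\n\n".join(parts)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Bundle source files into one context document.")
        parser.add_argument("root", type=Path, help="directory to scan recursively")
        parser.add_argument("--ext", default=".py", help="file extension to include, e.g. .py or .ts")
        parser.add_argument("--out", type=Path, default=Path("context.txt"), help="output file")
        args = parser.parse_args()
        args.out.write_text(build_context(args.root, args.ext))
        print(f"wrote {args.out}")

You then submit the generated document alongside your actual question to each model you want to cross-check.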
One more thing I'd add is that you should make an attempt to fix bugs in your 'new' codebase on occasion. You'll get an understanding of how things work, and of how maintainable it truly is. You'll also keep your own troubleshooting skills from atrophying.
> Someone gives a detailed-ish account of what they did, and that it didn't work for them, and then there are always people in the comments saying that you were doing it wrong. Fair! But at this point, I haven't seen any good posts here on how to do it _right_.
I remember this post, which got a lot of traction: https://steipete.me/posts/just-talk-to-it -- 8 agents in parallel and so on, but light on the details.