I personally found Gemini 3.0 to step on my toes in agentic coding. I tried it around 10 times, and it quickly became apparent that it was somehow coming to its own conclusions about what needed to be done instead of following instructions.
Things like files I didn't mention being read and edited. Sometimes this is cute when it's fixing typos in docs, but when it's changing things where it clearly doesn't even understand the intentionality behind them, it's annoying.
Gemini 3.1 was clearly much better when I tried it today. It stayed focused and found its way around without getting distracted.
The only cases where I've had Gemini step on my toes like that are when a) I realized my instructions were unclear or missing something, or b) my assumptions/instructions about how/why something needed to be done were flawed.
Instruction following has improved a lot since a few years ago, but let's not pretend these things are perfect, mate.
There's a certain capacity for instructions, albeit quite a high one, past which you will find them skipping points and drifting. It doesn't have to be ambiguity in the instructions.
So strange. I switched from Claude to Gemini 3 a few months ago and didn't look back. Speed is a big one, and the code quality is just vastly better, all while being far cheaper. I do need to try the latest Claude models, though.
Well, the important concept missing there that makes everything sort of make sense is due diligence.
If your company screws up and it is found out that you didn't do your due diligence then the liability does pass through.
We just need to figure out a due diligence framework for running bots that makes sense. But right now that's hard to do, because agentic robots that don't completely suck are only a few months old.
I agree with you, but to be fair, the APIs are proper expensive.
What people probably mistake for the loss leader is the generous usage limits on flat-rate subscriptions.
For example, GitHub Copilot Pro+ comes with 1,500 premium requests a month. That's quite a lot, and it's only $39.00 (requests ≈ prompts).
For some time they were offering Opus 4.6 Fast at 9x billing (now raised to 30x).
That was up to ~167 requests at around 128k context each for just $39. That ridiculous model costs $30/$150 per Mtok, so you can easily imagine the economics of this.
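To put rough numbers on it (the ~2k output tokens per request is my own assumption; the rest comes from the figures above):

    requests = 1500 // 9                  # 1500 premium requests at 9x billing ~ 166
    input_tok = 128_000                   # roughly a full context per request
    output_tok = 2_000                    # assumed; not stated anywhere above

    input_cost = requests * input_tok / 1e6 * 30      # $30 per Mtok input
    output_cost = requests * output_tok / 1e6 * 150   # $150 per Mtok output
    print(f"~${input_cost + output_cost:,.0f} of list-price API usage for $39")
    # -> ~$687

Even ignoring output tokens entirely, that's a $39 subscription covering several hundred dollars of list-price API usage.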
What I really want is to be able to search the training dataset for the n closest hits (by cosine distance or something). I think the illusion would be dispelled very quickly that way.
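Something like this, if such an index existed; the function and its inputs here are hypothetical, since nobody actually exposes an embedded copy of their training set:

    import numpy as np

    # Return the indices and similarities of the n training items closest to a query.
    def top_n_hits(query_vec, corpus_vecs, n=5):
        q = query_vec / np.linalg.norm(query_vec)
        c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
        sims = c @ q                     # cosine similarity via normalized dot products
        idx = np.argsort(-sims)[:n]      # n most similar rows of the corpus
        return idx, sims[idx]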
Answering questions in the positive is a simple kind of bias that basically all LLMs have. Frankly, if you are going to train on human data, you will see this bias, because it's everywhere.
LLMs have another related bias, though, which is a bit more subtle and easy to trip up on: if you give options A or B, and then reorder it so it is B or A, the result may change. And I don't mean change randomly; the distribution of outcomes will likely change significantly.
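It's easy to check on your own prompts. A rough sketch, with a hypothetical ask_model standing in for whatever chat API you actually call:

    from collections import Counter

    # Ask the same two-option question many times in both orderings and compare.
    def order_sensitivity(question, opt1, opt2, ask_model, trials=50):
        forward = Counter(ask_model(f"{question}\nA) {opt1}\nB) {opt2}")
                          for _ in range(trials))
        swapped = Counter(ask_model(f"{question}\nA) {opt2}\nB) {opt1}")
                          for _ in range(trials))
        # If the model were order-invariant, forward["A"] would roughly match
        # swapped["B"] (same underlying option). A large gap is position bias.
        return forward, swapped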
Dependencies aren't free. Pulling one in for a library that has less than a thousand lines of code total is really janky. Sometimes it makes sense, as with PicoHTTPParser, but it often doesn't.
Not saying left-pad is a good idea; I'm not a JavaScript programmer, but my impression has always been that the ecosystem desperately needs something along the lines of Boost or Apache Commons.
EDIT: I do wonder if some of the enthusiastic acceptance of this stuff is down to the extreme terribleness of the JavaScript ecosystem, tbh. LLM output may actually beat left-pad (beyond the security issues and the absurdity of having a library specifically to left-pad things, it at least used to be rather badly implemented), but against a more robust library ecosystem, as exists for pretty much every other language, not so much.
It's possibly label noise, but you can't tell from a single number.
You would need to check whether everyone is making mistakes on the same 20% or a different 20% (a rough way to check is sketched below). If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
It happens. The old non-Pro MMLU had a lot of wrong answers, and simple things like MNIST have digits labeled incorrectly or drawn so badly it's not even a digit anymore.
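Assuming you had per-model sets of missed question IDs (a hypothetical harness output, not any particular benchmark's format), the overlap check is simple:

    # `wrong` maps model name -> set of question IDs that model missed.
    def shared_error_fraction(wrong: dict) -> float:
        sets = list(wrong.values())
        shared = set.intersection(*sets)   # questions everyone gets wrong
        union = set.union(*sets)           # questions anyone gets wrong
        return len(shared) / len(union) if union else 0.0

A value near 1.0 means everyone misses the same items, so suspect the questions or the answer key; a value near 0.0 means the errors are spread out, so suspect the models.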