I personally found Gemini 3.0 to step on my toes in agentic coding. I tried it around 10 times, and it quickly became apparent that it was somehow coming to its own conclusions about what needed to be done instead of following instructions.
Things like files I didn't mention being read and edited. Sometimes this is cute when it's fixing typos in docs, but when it's changing things where it clearly doesn't even understand the intentionality behind them, it's annoying.
Gemini 3.1 was clearly much better when I tried it today. It stayed focused and found its way around without getting distracted.
The only cases where I've had Gemini step on my toes like that are when a) I realized my instructions were unclear or missing something, or b) my assumptions/instructions about how/why something needed to be done were flawed.
Instruction following has improved a lot since a few years ago, but let's not pretend these things are perfect, mate.
There's a certain capacity for instructions, albeit quite a high one, past which you will find them skipping points and drifting. It doesn't have to be ambiguity in the instructions.
So strange. I switched from Claude to Gemini 3 a few months ago and didn't look back. Speed is a big one, and the code quality is just vastly better, all while being far cheaper. I do need to try the latest Claude models, though.
Well, the important concept missing there that makes everything sort of make sense is due diligence.
If your company screws up and it is found out that you didn't do your due diligence then the liability does pass through.
We just need to figure out a due diligence framework for running bots that makes sense. But right now that's hard to do, because agentic robots that don't completely suck are only a few months old.
I agree with you, but to be fair, the APIs are proper expensive.
What people probably mistake for the loss leader is the generous usage limits on flat-rate subscriptions.
For example, GitHub Copilot Pro+ comes with 1,500 premium requests a month. That's quite a lot, and it's only $39.00 (requests ≈ prompts).
For some time they were offering Opus 4.6 Fast at 9x billing (now raised to 30x).
That was up to ~167 requests at around 128k context each for just $39. That ridiculous model costs $30/$150 per Mtok, so you can easily imagine the economics of this.
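To put rough numbers on it (the ~2k output tokens per request is my own assumption; the rest comes from the figures above):

    requests = 1500 // 9                  # 1500 premium requests at 9x billing ~ 166
    input_tok = 128_000                   # roughly a full context per request
    output_tok = 2_000                    # assumed; not stated anywhere above

    input_cost = requests * input_tok / 1e6 * 30      # $30 per Mtok input
    output_cost = requests * output_tok / 1e6 * 150   # $150 per Mtok output
    print(f"~${input_cost + output_cost:,.0f} of list-price API usage for $39")
    # -> ~$687

Even ignoring output tokens entirely, that's a $39 subscription covering several hundred dollars of list-price API usage.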
What I really want is to be able to search the training dataset for the n closest hits (by cosine distance or something). I think the illusion would be dispelled very quickly that way.
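Something like this, if such an index existed; the function and its inputs here are hypothetical, since nobody actually exposes an embedded copy of their training set:

    import numpy as np

    # Return the indices and similarities of the n training items closest to a query.
    def top_n_hits(query_vec, corpus_vecs, n=5):
        q = query_vec / np.linalg.norm(query_vec)
        c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
        sims = c @ q                     # cosine similarity via normalized dot products
        idx = np.argsort(-sims)[:n]      # n most similar rows of the corpus
        return idx, sims[idx]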
Answering questions in the positive is a simple kind of bias that basically all LLMs have. Frankly, if you are going to train on human data, you will see this bias, because it's everywhere.
LLMs have another related bias, though, which is a bit more subtle and easy to trip up on: if you give options A or B, and then reorder it so it is B or A, the result may change. And I don't mean change randomly; the distribution of outcomes will likely change significantly.
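It's easy to check on your own prompts. A rough sketch, with a hypothetical ask_model standing in for whatever chat API you actually call:

    from collections import Counter

    # Ask the same two-option question many times in both orderings and compare.
    def order_sensitivity(question, opt1, opt2, ask_model, trials=50):
        forward = Counter(ask_model(f"{question}\nA) {opt1}\nB) {opt2}")
                          for _ in range(trials))
        swapped = Counter(ask_model(f"{question}\nA) {opt2}\nB) {opt1}")
                          for _ in range(trials))
        # If the model were order-invariant, forward["A"] would roughly match
        # swapped["B"] (same underlying option). A large gap is position bias.
        return forward, swapped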
Dependencies aren't free. Pulling one in for a library that has less than a thousand lines of code total is really janky. Sometimes it makes sense, as with PicoHTTPParser, but it often doesn't.
Not saying left-pad is a good idea; I'm not a JavaScript programmer, but my impression has always been that the ecosystem desperately needs something along the lines of Boost or Apache Commons.
EDIT: I do wonder if some of the enthusiastic acceptance of this stuff is down to the extreme terribleness of the JavaScript ecosystem, tbh. LLM output may actually beat left-pad (beyond the security issues and the absurdity of having a library specifically to left-pad things, it at least used to be rather badly implemented), but against a more robust library ecosystem, as exists for pretty much every other language, not so much.
It's possibly label noise, but you can't tell from a single number.
You would need to check whether everyone is making mistakes on the same 20% or a different 20% (a rough way to check is sketched below). If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
It happens. The old non-Pro MMLU had a lot of wrong answers, and simple things like MNIST have digits labeled incorrectly or drawn so badly it's not even a digit anymore.
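Assuming you had per-model sets of missed question IDs (a hypothetical harness output, not any particular benchmark's format), the overlap check is simple:

    # `wrong` maps model name -> set of question IDs that model missed.
    def shared_error_fraction(wrong: dict) -> float:
        sets = list(wrong.values())
        shared = set.intersection(*sets)   # questions everyone gets wrong
        union = set.union(*sets)           # questions anyone gets wrong
        return len(shared) / len(union) if union else 0.0

A value near 1.0 means everyone misses the same items, so suspect the questions or the answer key; a value near 0.0 means the errors are spread out, so suspect the models.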