Hacker News | lukasb's comments

Wish I'd read this before I started at Google.


I'm playing around with it, and it's very cool! One issue is that fingerprint expansion doesn't always work, e.g. I have a memory "Going to Albania in January for a month-long stay in Tirana" and asking "Do I need a visa for my trip?" didn't turn up anything, using expansion "visa requirements trip destination travel documents..."

What would you think about adding another column that is used for matching that is a superset of the actual memory, basically reusing the fingerprint expansion prompt?
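Roughly what I have in mind, as a sketch only (SQLite FTS5 used purely for illustration; the schema, column names, and expansion text are made up):

  import sqlite3

  # Hypothetical two-column store: the raw memory plus an LLM-generated
  # "expansion" superset of it, both indexed for full-text matching.
  db = sqlite3.connect(":memory:")
  db.execute("CREATE VIRTUAL TABLE memories USING fts5(text, expansion)")

  memory = "Going to Albania in January for a month-long stay in Tirana"
  # In practice this would come from the same fingerprint expansion prompt;
  # hard-coded here to keep the sketch self-contained.
  expansion = ("Albania Tirana travel trip vacation visa requirements "
               "passport documents January winter month-long stay")
  db.execute("INSERT INTO memories VALUES (?, ?)", (memory, expansion))

  # A query like "visa requirements trip" now matches through the expansion
  # column even though none of those words appear in the memory itself.
  rows = db.execute(
      "SELECT text FROM memories WHERE memories MATCH ?",
      ("visa requirements trip",),
  ).fetchall()
  print(rows)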


One difference is that clichéd prose is bad and clichéd code is generally good.


Depends on what your prose is for. If it's for documentation, then prose that matches the expected tone and form of other similar docs counts as clichéd from that perspective, and that's a good thing. I think this is a really good use of LLMs - making docs consistent across a large library / codebase.


A problem I’ve found with LLMs for docs is that they are like ten times too wordy. They want to document every path and edge case rather than focusing on what really matters.

It can be addressed with prompting, but you have to fight this constantly.


> A problem I’ve found with LLMs for docs is that they are like ten times too wordy

This is one of the problems I feel with LLM-generated code, as well. It's almost always between 5x and 20x (!) as long as it needs to be. Though in the case of code verbosity, it's usually not because of thoroughness so much as extremely bad style.


I think probably my most common prompt is "Make it shorter. No more than ($x) (words|sentences|paragraphs)."


I've never been able to get that to work. LLMs can't count; they don't actually know how long their output is.


I have been testing agentic coding with Claude 4.5 Opus and the problem is that it's too good at documentation and test cases. It's thorough in a way that it goes out of scope, so I have to edit it down to increase the signal-to-noise.


The “change capture”/straitjacket-style tests LLMs like to output drive me nuts. But humans write those all the time too, so I shouldn’t be that surprised either!


What do these look like?


  1. Take every single function, even private ones.
  2. Mock every argument and collaborator.
  3. Call the function.
  4. Assert the mocks were called in the expected way.
These tests help you find inadvertent changes, yes, but they also create constant noise about changes you intend.
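To make it concrete, a minimal Python sketch of the pattern (all names invented):

  from unittest import mock

  def send_welcome_email(user, mailer, template_store):
      body = template_store.render("welcome", name=user.name)
      mailer.send(to=user.email, body=body)

  def test_send_welcome_email():
      user = mock.Mock()
      user.name = "Ada"
      user.email = "ada@example.com"
      mailer = mock.Mock()
      template_store = mock.Mock()
      template_store.render.return_value = "hi"

      send_welcome_email(user, mailer, template_store)

      # Asserts the exact call shape rather than any observable behaviour,
      # so refactoring the internals breaks the test even when nothing is wrong.
      template_store.render.assert_called_once_with("welcome", name="Ada")
      mailer.send.assert_called_once_with(to="ada@example.com", body="hi")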


These tests also break encapsulation in many cases because they're not testing the interface contract, they're testing the implementation.


Juniors on one of the teams I work with only write this kind of test. It’s tiring, and I have to tell them to test the behaviour, not the implementation. And yet every time they do the same thing. Or rather, their AI IDE spits these out.
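For contrast, a behavioural version of that kind of test swaps the mocks for one small fake and asserts on the outcome (again, hypothetical names):

  class FakeMailer:
      # Records outgoing mail instead of sending it.
      def __init__(self):
          self.sent = []

      def send(self, to, body):
          self.sent.append((to, body))

  def send_welcome_email(name, email, mailer):
      mailer.send(to=email, body=f"Welcome, {name}!")

  def test_welcomes_new_user():
      mailer = FakeMailer()
      send_welcome_email("Ada", "ada@example.com", mailer)

      # Assert the observable result, not how the function got there.
      assert mailer.sent == [("ada@example.com", "Welcome, Ada!")]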


You beat me to it, and yep these are exactly it.

“Mock the world, then test your mocks.” After nearly two decades of doing this professionally, I’m simply not convinced these have any value at all.


If the goal is to document the code and it gets sidetracked and focuses on only certain parts, it failed the test. It just further proves LLMs are incapable of grasping meaning and context.


Docs also often don’t have anyone’s name on them, in which case they’re already attributed to an unknown composite author.


Yeah I'm terrified that TPUs will get cheaper, that would be awful.


I asked about the Peninsula Campaign during the Civil War and it gave me an overview, a map, profiles (with photos) of the main military commanders, a relevant YouTube video ... rough edges but overall love the format.

Rough edges:

- aspect ratios on photos (maybe because I was on mobile, cropping was weird)
- map was very hard to read (again, mobile)
- some formatting problems with tables
- it tried to show an embedded Gmap for one location but must have gotten the location wrong, it was just ocean


Thanks for the feedback, this is helpful!


"Critically, the language one speaks or signs can have downstream effects on ostensibly nonlinguistic cognitive domains, ranging from memory, to social cognition, perception, decision-making, and more."

Can they really distinguish between the impact of language on these domains and the impact of culture? It could be the language you speak, or it could be that you're surrounded exclusively by other people who operate this way.


French, Spanish, and Portuguese are spoken across multiple cultures. So there should be enough data to test the theory.

French is a second language in many countries, so that may provide data as well.


I mean I'm not really qualified or rigorous enough to prove this, but if you have learned Chinese and English it should be pretty damn obvious that it is linguistic. But in any case, human language and culture are inextricable once you start trying to speak idiomatically.

Sure maybe you could isolate a bunch of scholars and give them a specification of Chinese and ask them to go at it, which is maybe what we do with Latin and Greek.

I would struggle to see how someone could earnestly argue the opposite, that language doesn't shape thought, when Chinese doesn't use conjugation, has looser notions of tense, has no definite/indefinite articles, uses glyphs instead of an alphabet, can be read top to bottom, right to left, or left to right, and doesn't use spaces to delimit words. That's even before we talk about how tones or the highly monosyllabic nature of the language alter things like memorisation. (Ever notice how Chinese people are often good at memorising numbers?)


What do you use for evaluation? gemini-2.5-pro is at the top of MMLU and has been best for me but always looking for better.


Recently I've found myself getting the evaluation simultaneously from OpenAI GPT-5, Gemini 2.5 Pro, and Qwen3 VL to give it a kind of "voting system". Purely anecdotal, but I do find that Gemini is the most consistent of the three.
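The aggregation step can stay dumb; a sketch of what I mean, assuming the three judgements have already been collected (the helper names and score format are made up):

  from statistics import median

  def aggregate_scores(scores_by_model):
      # e.g. {"gpt-5": 0.8, "gemini-2.5-pro": 0.7, "qwen3-vl": 0.9}
      # Median is robust to one judge going off the rails.
      return median(scores_by_model.values())

  def majority_vote(verdicts_by_model):
      # e.g. {"gpt-5": True, "gemini-2.5-pro": True, "qwen3-vl": False}
      votes = list(verdicts_by_model.values())
      return sum(votes) > len(votes) / 2

  print(aggregate_scores({"gpt-5": 0.8, "gemini-2.5-pro": 0.7, "qwen3-vl": 0.9}))   # 0.8
  print(majority_vote({"gpt-5": True, "gemini-2.5-pro": True, "qwen3-vl": False}))  # True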


I am running a similar experiment, but so far changing the OpenAI seed seems to give similar results. If that holds up, it's concerning how sensitive it could be.


I found the opposite. GPT-5 is better at judging along a true gradient of scores, while Gemini loves to pick 100%, 20%, 10%, 5%, or 0%. Like, you never get an 87% score.


Interesting, I'll give voting a shot, thanks.


This is funny to me because when someone asked re: Better Auth "better than what?" my off-the-cuff response was "better than Auth.js" and here we are.


I almost never see these. Maybe the issue is your network?


Can it find my files now?


At a minimum, it can not find them faster!

