Why are we using LLMs as calculators? (vickiboykis.com)
20 points by fzliu on Dec 24, 2024 | 12 comments


"We don’t actually want them to do math for the sake of replacing calculators" - couldn't the article just end here? People aren't giving it multiplication problems or asking it to count letters because they want to know the answer. Given that you can "patch" the issue by having it invoke python for computation, the real value is in seeing whether current models can learn to follow a simple step-by-step procedure.

The linked tweet https://twitter.com/yuntiandeng/status/1836114401213989366 is far more interesting to me: GPT models clearly _can_ learn to multiply with intermediate tokens, but even o1 currently doesn't. And yet this would be a case where generating synthetic data is almost trivial. Moreover, being able to perform computations in this fashion would be valuable for many types of benchmarks (e.g. FrontierMath, since I'm sure at the end of the day you'll have to grind through some computation).
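
How trivial? A sketch of generating long-multiplication traces as scratchpad tokens (the step format here is made up, not whatever the tweet's authors trained on):

    import random

    def multiplication_trace(a: int, b: int) -> str:
        """Spell out long multiplication as intermediate steps, the kind
        of scratchpad a model could emit before its final answer."""
        steps, partials = [], []
        for i, digit in enumerate(reversed(str(b))):
            partial = a * int(digit) * 10**i
            partials.append(partial)
            steps.append(f"{a} * {digit} * 10^{i} = {partial}")
        steps.append(" + ".join(map(str, partials)) + f" = {a * b}")
        return "\n".join(steps)

    # Unlimited training examples, each with a verifiable final answer.
    a, b = random.randint(1000, 9999), random.randint(1000, 9999)
    print(f"{a} * {b} = ?")
    print(multiplication_trace(a, b))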

So why hasn't it been a priority? I remember some NeurIPS presentation claiming that heavy training on math in this fashion hurt language scores. But then the obvious follow-up would be to have specialized models for each domain and route between them...


I think the smart people working on them understand that an LLM can fundamentally never be trusted with the math. It does not calculate; it guesses the next token. Mathematically, the linear algebra behind an LLM does not analytically translate to all feasible math, so there's always a chance of error in the output. And the very purpose of a calculator is to be precise. If I need to second-guess the result, I might as well not use the LLM in the first place.


I could see a situation where you want the LLM to take a guess, compare it to a calculation, and if they differ, analyze whether it misinterpreted the question. Obviously, it could misinterpret the question, feed the wrong info to the calculator, and arrive at the same wrong answer anyway, so it's still not foolproof.
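
Something like this loop, where `llm` and `calculate` are hypothetical stand-ins for a chat call and a trusted calculator:

    def checked_answer(question: str, llm, calculate) -> str:
        """Guess, compute independently, reconcile on disagreement."""
        guess = llm(f"Answer with a single number: {question}").strip()
        expr = llm(f"Translate into one arithmetic expression: {question}")
        computed = calculate(expr)  # e.g. the safe_eval sketch upthread
        if str(computed) == guess:
            return guess
        # Disagreement: prompt the model to re-examine its reading.
        return llm(
            f"Your guess was {guess} but the calculator returned "
            f"{computed} for {expr!r}. Did you misinterpret: {question}?"
        )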


The author didn't really cover the reason I use it as a calculator: LLMs are good at translating a request into hard numbers.

Example: "I have a 120 square foot room, and I want to store liquid nitrogen in it. How many liters would it take to displace enough air to be a concern? What kind of CFM should a ventilation system use to clear the room?"

I sanity check the numbers, but it's really nice to have an interdisciplinary calculator like this.
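
The sanity check itself fits in a few lines. Rough numbers, all assumptions mine: an 8 ft ceiling, the usual ~1:694 liquid-to-gas expansion ratio for nitrogen, OSHA's 19.5% oxygen floor, and 6 air changes per hour for the ventilation:

    # Back-of-envelope check for the LN2 prompt above.
    CUFT_TO_L = 28.317           # liters per cubic foot
    EXPANSION = 694              # liters of N2 gas per liter of liquid
    O2_AMBIENT, O2_LIMIT = 20.9, 19.5
    ROOM_CUFT = 120 * 8          # 120 sq ft, assumed 8 ft ceiling

    room_l = ROOM_CUFT * CUFT_TO_L            # ~27,200 L of air
    displaced = 1 - O2_LIMIT / O2_AMBIENT     # ~6.7% of the air
    liquid_l = room_l * displaced / EXPANSION

    ACH = 6                                   # assumed air changes/hour
    cfm = ROOM_CUFT * ACH / 60

    print(f"~{liquid_l:.1f} L of LN2 to hit the 19.5% O2 limit")  # ~2.6 L
    print(f"~{cfm:.0f} CFM for {ACH} air changes per hour")       # ~96 CFM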


I tried it for engineering calculations. It'll give you numbers, with the right units, but oftentimes they're the wrong numbers.


Agreed, but I'm still learning it all, and it does a great job defining my curriculum.


> We don’t actually want them to do math for the sake of replacing calculators, we want to understand if they can reason their way to AGI

Speak for yourself. I’d like them to do math for the sake of replacing calculators. Well, not really. But I’d like them to be a really good natural language interface for a calculator.


I’m starting to get the feeling that we’ve created a human simulator, not a device for artificial reasoning. It’s a highly searchable database of the aggregate of most publicly accessible human knowledge.


It's not even a database. It's exactly what you say, a human simulator. That's a good way to describe an LLM. It's not accurate or predictable enough in its retrieval of training material to be a database.

A reasoning engine it certainly is not, either. Vendors like OpenAI, with their o models, try to shoehorn it into that purpose, but I think that's more because of the popularity of LLMs. Other types of AI models will be better at this.

I think eventually the LLM will just handle the human interaction, and behind it will be a host of more specialised models providing the actual content: reasoning, knowledge, calculation, etc.
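
A toy sketch of that shape; every component here is a hypothetical stub, the point is just the dispatch:

    def calc_backend(q: str) -> str:
        return "calculator result (stub)"

    def knowledge_backend(q: str) -> str:
        return "retrieved fact (stub)"

    def reasoning_backend(q: str) -> str:
        return "symbolic-engine derivation (stub)"

    BACKENDS = {
        "calculation": calc_backend,
        "knowledge": knowledge_backend,
        "reasoning": reasoning_backend,
    }

    def answer(question: str, llm) -> str:
        """`llm` stands in for any chat-completion call."""
        label = llm(f"Classify as one of {sorted(BACKENDS)}: {question}").strip()
        content = BACKENDS.get(label, lambda q: llm(q))(question)
        # The LLM's remaining job: phrase the specialist's output.
        return llm(f"Rephrase conversationally: {content}")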


We already have a device for artificial reasoning; it's called Prolog.


I saw the use case of LLMs for spreadsheets yesterday, and I paused to think: "would I ever be foolish enough to trust the output?" I’d have to check everything myself, so what’s the point?

If o4 can’t go beyond 4x4 multiplication accurately, then using LLMs for business spreadsheets or science is a serious mistake.


Yes, but it gives the answer in fancy LaTeX formatting, and it tells you the formulas and sequence of computations that got you there. You need to double-check its answers, but LLMs are right most of the time.



