Back when GPT-3 was all the rage on the internet, I remember Kevin Lacker's "Giving GPT-3 a Turing Test" left a real impression on me [1]. Short read, but it gets to the crux of the issue. GPT-3 is a (really sophisticated) statistical model, and GPT-4 will be no different. Maybe it will showcase new emergent properties of LLMs, like GPT-3's in-prompt few-shot learning did. But the fundamental constraint of a statistical model optimized for a token-prediction objective remains -- they're really bad at introspection and confidence assessment.
I've been trying out the article's suggestions, and ChatGPT is able to tell me that it does not know today's date: "I'm sorry, but I am not able to access the current date because I am a large language model trained by OpenAI and do not have access to the internet. My knowledge is based on the texts that were used to train me, but I don't have the ability to browse the web or access real-time information. Is there something else I can help you with?"
I have been impressed at how quickly ChatGPT has been adjusted to say "I don't know." You could previously trick it with extremely simple logic problems like:
> Jack is taller than Jim. Jack is taller than James. Is Jim taller than James?
Even if you asked it to explain, it would make up some bs.
Now if you ask, it tells you it doesn't have enough information to answer. Trickier problems have been fixed as well.
ChatGPT is actually quite good at this, and I think it's a hint at the future of alignment. Many (most?) of the responses from ChatGPT are along the lines of: I don't know, I can't know, I can't respond, etc. It's still far from perfect, but I think exploring the results of human feedback for alignment is one of the reasons OpenAI decided to release ChatGPT rather than rush straight to GPT-4 (larger model, more data, retrieval, etc.).
That's a good answer, thank you. A similar answer might be: "Admit it's wrong". Technology really has gone in a strange direction if we end up measuring the performance of software systems in terms of which personality flaws they have.
Oh 100% -- anthropomorphic bias towards AI is about to get really interesting, I'd expect. Especially if prompt engineering comes to be part of the job description for tech, which seems likely.
Humans are bad at introspection and saying “I don’t know”
ChatGPT seems to have another statistical algorithm that assesses uncertainty for the original statistical algorithm and says "I don't know" if it crosses a threshold. GPT-4 may add yet another model for some statistical introspection. At the end of the day I think this is how humans calculate their own uncertainty, so a model doing it this way may well reach the same results.
Exactly this. When it doesn't know, it will ramble, answer an unasked question, or boldly lie. Better to give up in such cases; we didn't need an artificial ego.
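To make that thresholding idea concrete, here's a toy sketch (purely my own illustration, not anything OpenAI has described): score a completion by the average log-probability of its tokens and fall back to a refusal below some cutoff. The function and the threshold value are made up.

    REFUSAL_THRESHOLD = -1.5  # made-up cutoff in average log-prob space

    def answer_with_fallback(token_logprobs: list[float], answer_text: str) -> str:
        """Return the model's answer only if its tokens were, on average, high-confidence."""
        avg_logprob = sum(token_logprobs) / len(token_logprobs)
        if avg_logprob < REFUSAL_THRESHOLD:
            return "I don't know."
        return answer_text

    # A low-confidence completion gets replaced by a refusal:
    print(answer_with_fallback([-0.1, -3.2, -4.0, -2.9], "20748179353842"))  # I don't know.
    print(answer_with_fallback([-0.1, -0.2, -0.05], "111115111100"))         # 111115111100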
In the mathematical operation that you have provided, the number 1111151111 is being multiplied by 100. When you multiply a number by 100, you simply add two zeroes to the end of the number. So in this case, the result of 1111151111 * 100 would be 11111511100.
To understand why this is the case, you can think about it in terms of place value. In the number 1111151111, the "1" in the far right position is worth 1, the "1" in the next position to the left is worth 10, the next "1" is worth 100, and so on. When you multiply a number by 100, you are essentially adding two zeroes to the end of the number, which means that you are effectively multiplying the value of each digit in the number by 100. So the "1" in the far right position would become 100, the "1" in the next position to the left would become 1000, and so on. This is why the result of 1111151111 * 100 is 11111511100.
It explains what to do and yet it still gets it wrong.
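For reference, the correct product really is just the original number with two zeros appended, which a one-line check confirms (the answer ChatGPT gives above has dropped a digit):

    # Multiplying by 100 appends two zeros: the correct result is 111115111100.
    print(1111151111 * 100)                                  # 111115111100
    print(1111151111 * 100 == int(str(1111151111) + "00"))   # True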
>What is 1111151111 * 184582? Explain.
The product of 1111151111 and 184582 is 20748179353842. This can be calculated by performing the multiplication in the standard way, using the rules of arithmetic.
1111151111
x 184582
--------
1111151111
2222302222
--------
20748179353842
To calculate this product, we first multiply 1111151111 by 2 to get 2222302222. Then, we multiply 1111151111 by 82 to get 91678692462. Finally, we add these two products together to get 20748179353842. This is the final result of the multiplication.
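For what it's worth, exact integer arithmetic shows that both the final product and the intermediate step it describes are wrong; only the "multiply by 2" partial product is right:

    # Exact integer arithmetic vs. ChatGPT's claims:
    a, b = 1111151111, 184582
    print(a * b)    # 205098494370602  (ChatGPT claimed 20748179353842)
    print(a * 2)    # 2222302222       (this partial product was correct)
    print(a * 82)   # 91114391102      (ChatGPT claimed 91678692462)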
You can see this more clearly in beta.openai.com/playground, with Show Probabilities set to Full Spectrum. I've bracketed the parts it shows in red (uncertain):
111151111 * 184582 is 20[6035002082]. This is the result [when] one multiplies 1111151111 by 184582.
When asked for an explanation:
In some cases, when multiplying large numbers, the middle digits of the result may be incorrect. This is because the multiplication process involves carrying over digits and performing complex operations on large numbers. These calculations can be difficult even for humans, and they are beyond the capabilities of my underlying technology.
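If you want the same per-token confidences outside the playground, the Show Probabilities coloring is based on the log-probabilities you can request via the logprobs parameter of the completions API. A rough sketch, assuming the openai Python package from that era (the old Completion endpoint) and the text-davinci-003 model; the -1.0 cutoff for flagging tokens is arbitrary:

    import openai  # legacy completions-era client; expects OPENAI_API_KEY in the environment

    # Ask for per-token log-probabilities and flag the low-confidence ones --
    # typically the middle digits of a big multiplication.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="1111151111 * 184582 =",
        max_tokens=20,
        temperature=0,
        logprobs=1,
    )

    lp = response["choices"][0]["logprobs"]
    for token, logprob in zip(lp["tokens"], lp["token_logprobs"]):
        flag = "  <-- uncertain" if logprob is not None and logprob < -1.0 else ""
        print(f"{token!r}: {logprob}{flag}")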
The debate seems to be whether this sort of error is qualitative or quantitative. I vote for the former. There is some separate "secret sauce" or governor that is missing.
But I don't believe it is just a database of rules like Cyc. It would be more general than that.
>Me: Yesterday I walked on four, today I walk on two, tomorrow I walk on
>chatGPT: three.
Why is this impressive? It looks to me like just looking up something in a database. We know it ingested this exact riddle, so it doesn't imply any thought at all.
The article's prediction about multiplying two 5-digit numbers holds for ChatGPT as well - the middle digits come out wrong (non-deterministically).
The example "Question: How many times does the word "all" occur in the following sentence: All the lice and all the mice were all very nice to all the mice and all the lice." results in answers of 5 or 6, about 50/50 split.
To be fair, 99% of humans can't do that unaided (that is, without a calculator, or some way of visually laying out the intermediate steps) either. In fact, I'd say that 90% of humans can't do the necessary long multiplication steps even with a pencil and paper, and maybe 20% of humans couldn't do it even with a calculator (due to being too young; or having a disability; or growing up in an environment where they are not taught how to use a calculator or even the concept of multiplication at all).
I just think it's ironic if people are suggesting that their threshold for finding an AI "impressive" requires it to already be effectively super-human. It's like they're retreating from the bailey of long AI timelines to the motte of: "Sure, I can't think of anything a human can do that an AI can't, but that's not really significant, because humans aren't particularly intelligent".
Perhaps a more relevant test would be to ask ChatGPT something like: "If a human wanted to calculate the product of the two numbers 12345 and 56789, and they had a calculator with them, what sequence of buttons would they press, in order, to find the result?".
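For checking its answer to that prompt: the button sequence is just the digits of each number with a multiply in between, and the expected product is easy to verify:

    # Ground truth for the proposed calculator test: 1 2 3 4 5 [x] 5 6 7 8 9 [=]
    print(12345 * 56789)  # 701060205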
Learning: these models, once trained, are essentially stateless, whereas humans learn. If GPT-3 gets a question wrong, you cannot permanently correct it; even a toddler adapts to its environment.
A model like ChatGPT that learns over time from user inputs (without just re-feeding prior inputs), and that also fine-tunes itself on new data as it goes, would be really impressive.
I think this was actually a really thought-provoking question when it was asked a few months ago, but now I'm wondering whether the guesses given then have already been proven wrong by ChatGPT, or at least whether people would give very different guesses today than they did pre-ChatGPT.
[1] https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.h...