I think the real takeaway is that LLMs are very good at tasks that closely resemble examples in their training data. A lot of what gets written (code, movies/TV shows, etc.) is actually pretty repetitive, so you don't really need superintelligence to summarize it and break it down, just good pattern matching. But this can fall apart pretty wildly when you hand them something genuinely novel...
Is anyone here aware of LLMs demonstrating an original thought? Something truly novel.
My own impression is that they're something more akin to a natural language search system. If I want a snippet of code to do X, it does that pretty well and keeps me from having to search through the poor documentation of many OSS projects. It certainly doesn't produce anything I could not do myself - so far.
Ask it about something that is currently unknown and it lists a bunch of hypotheses that people have already proposed.
Ask it to write a story and you get a story similar to one you already know but with your details inserted.
I can see how this may appear to be intelligent but likely isn't.
If I come up with something novel while using an LLM, which I wouldn't have come up with had I not had the LLM at my bidding, where did the novelty really come from?
If I came up with something novel while watching a sunrise, which I wouldn't have come up with had I not been looking at it, where did the novelty really come from?
Well that's the tricky part: what is novel? There are varying answers. I think we're all pretty unoriginal most of the time, but at the very least we're a bit better than LLMs at mashing together and synthesizing things based on previous knowledge.
But seriously, how would you determine if an LLM's output was novel? The training data set is so enormous for any given LLM that it would be hard to know for sure that any given output isn't just a trivial mix of existing data.
That's because midterms are specifically supposed to assess how well you learned the material presented (or at least directed to), not your overall ability to reason. If you teach a general reasoning class, getting creative with the midterm is one thing. But if you're teaching someone how to solve differential equations, they're learning at the very edge of their ability in a given amount of time, and if you present them with an example outside of what's been covered, it kind of makes sense that they can't just already solve it. I mean, that's kind of the whole premise of education: you can't just present someone with something completely outside their experience and expect them to derive from first principles how it works.
I would argue that on a math midterm it's entirely reasonable to show a problem they've never seen before and test whether they've made the connection between that problem and the problems they've seen before. We did that all the time in upper division Physics.
A problem they've never seen before, of course. A problem that requires a solving strategy or tool they've never seen before (above and beyond synthesis of multiple things they have seen before) is another matter entirely.
It's like the difference between teaching kids rate problems and then putting ones with negative values or nested rates on a test versus giving them a continuous compound interest problem and expecting them to derive e, because it is fundamentally about rates of change, isn't it?
I honestly think that reflects more on the state of education than it does human intelligence.
My primary assertion is that LLMs struggle to generalize concepts and ideas, which is why they need petabytes of text and still often fail basic riddles when you muck with the parameters a little bit. People get stuck on this for two reasons. One, they have to reconcile this with what they can see LLMs are capable of, and it's just difficult to believe that all of this can be accomplished without at least intelligence as we know it; I reckon the trick here is that we simply can't conceive of how utterly massive the training datasets for these models are. We can look at the numbers, but there's no way to fully grasp just how vast they truly are. The second reason is the tendency to anthropomorphize. At first I figured OpenAI was just using this as an excuse to hype their models and to justify never releasing weights anymore; convenient. But you can also see engineers who genuinely understand how LLMs work coming to the conclusion that they've become sentient, even though the models they felt were sentient now feel downright stupid compared to the current state of the art.
Even less sophisticated pattern matching than what humans are able to do is still very powerful, but it's obvious to me that humans are able to generalize better.
And what truly novel things are humans capable of? At least 99% of the stuff we do is just what we were taught by parents, schools, books, friends, influencers, etc.
Remember, humans needed some 100,000 years to figure out that you can hit an animal with a rock, and that's using more or less the same brain capacity we have today. If we had been born in the Stone Age, we'd all be nothing but cavemen.
Look. I get that we can debate about what's truly novel. I never even actually claimed that humans regularly do things that are actually all that novel. That wasn't the point. The point is that LLMs struggle with novelty because they struggle to generalize. Humans clearly are able to generalize vastly better than transformer-based LLMs.
Really? How do I know that with such great certainty?
Well, I don't know how much text I've read in one lifetime, but I can tell you it's less than the literally multiple terabytes of text fed into the training process of modern LLMs.
Yet, LLMs can still be found failing logic puzzles and simple riddles that even children can figure out, just by tweaking some of the parameters slightly, and it seems like the best thing we can do here is just throw more terabytes of data and more reinforcement learning at it, only for it to still fail, even if a little more sparingly each time.
So what novel things do average people do anyway, since beating animals with rocks apparently took 100,000 years to figure out? Hard call. There's no definitive bar for what counts as novel. You could argue almost everything we do is basically just mixing together things we've seen before, yet I'd argue humans are much better at it than LLMs, which need a metric shitload of training data and burn through enormous amounts of energy. In return you get some superhuman abilities, but superhuman doesn't mean smarter or better than people; a sufficiently powerful calculator is superhuman. The breadth of an LLM is much wider than any individual human's, but the breadth of knowledge across humanity is obviously still much wider than any individual LLM's, and there remain things people do well that LLMs definitely still don't, even just in the realm of text.
So if I don't really believe humans are all that novel, why judge LLMs by that criterion? Two reasons, really:
- I think LLMs are significantly worse at it, so allowing your critical thinking abilities to atrophy in favor of using LLMs is really bad. Therefore people need to be very careful about ascribing too much to LLMs.
- Because I think many people want to use LLMs to do truly novel things. Don't get me wrong, a lot of people also just want it to shit out another React Tailwind frontend for a Node.js JSON HTTP CRUD app or something. But a lot of AI skeptics are no longer the type of people who downplay it as a cope or out of fear; they're people who were at least somewhat excited by the capabilities of AI and then let down when they tried to color outside the lines and it failed tremendously.
Likewise, imagine trying to figure out how novel an AI response is; the training data set is so massive that humans can hardly comprehend the scale. Our intuition about what couldn't possibly be in the training data is completely broken. We can only ever easily prove that a given response isn't novel, not that it is.
But honestly, maybe it's just too unconvincing to say all of this in the abstract. Maybe it would be better to at least try to come up with some demonstration of something I think I've come up with that is "novel".
There's this sort-of trick for handling input that I came up with when implementing falling-block puzzle games, and I think it's pretty unique. See, in most implementations, to handle things like auto-repeating movement, you might have a counter that increments and, once it hits the repeat delay, gets reset again. You can get slightly more clever by having it count down and repeat at zero: that makes it easier to, for example, have the repeat delay be longer for only the first repeat. This is how DAS (delayed auto shift) normally works in Tetris and other games, and it more or less mirrors keyboard key-repeat delay. It's easier with the count-down because on the first input you can set the counter to the high initial delay, then whenever it hits zero you reset it to the repeat delay.
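To make that concrete, here's a rough sketch of the count-down approach in C. The timing constants and the move_piece_left function are made up purely for illustration, not taken from any particular game:

```c
#include <stdbool.h>

/* Hypothetical timings, in frames (e.g. at 60 FPS). */
#define DAS_INITIAL_DELAY 16  /* delay before the first auto-repeat */
#define DAS_REPEAT_DELAY   6  /* delay between subsequent repeats   */

static void move_piece_left(void) { /* game-specific; stubbed out here */ }

typedef struct {
    int  das_counter;  /* counts down toward the next auto-shift   */
    bool left_held;    /* was the "move left" key down last frame? */
} InputState;

/* Called once per frame of game logic with the current key state. */
static void update_move_left(InputState *in, bool pressed)
{
    if (pressed && !in->left_held) {
        /* Fresh press: shift immediately and arm the long initial delay. */
        move_piece_left();
        in->das_counter = DAS_INITIAL_DELAY;
    } else if (pressed && --in->das_counter <= 0) {
        /* Held: count down and repeat whenever the counter hits zero. */
        move_piece_left();
        in->das_counter = DAS_REPEAT_DELAY;
    }
    in->left_held = pressed;
}
```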
I didn't like this, though, because I didn't want to deal with a bunch of state; I really wanted the state to be as simple as possible. So instead, for each game input, I allocate a signed integer, all initialized to zero. When a key is pressed down, its integer is set to 1 if it is less than 1. When a key is released, it is set to -1 if it is greater than 0. And at the end of each frame of game logic, each input greater than 0 is incremented and each input less than 0 is decremented. These counters are held in the game state, and while the game logic is paused you simply don't update them.
With this scheme, the following side effects occur:
- Like most other schemes, there's no need to special-case key repeat events, as receiving a second key down doesn't do anything.
- Game logic can now do a bunch of things "statelessly", since the input state encodes a lot of useful information. For example, you can easily trigger an event when an input is pressed by checking n == 1, and when it's released by checking n == -1. You can do something every five frames an input is held by checking n > 0 && n % 5 == 0, or something slightly more involved for a proper input repeat with initial delay. On any given frame of game logic, you always know how long an input has been held down, and after it's released you know how many frames it has been since the release.
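A rough sketch of the whole thing in C, just to make it concrete (the input names and the delay/rate parameters are illustrative, not from any real codebase):

```c
#include <stdbool.h>

/* One signed counter per game input, all starting at zero. */
enum { INPUT_LEFT, INPUT_RIGHT, INPUT_ROTATE, INPUT_DROP, INPUT_COUNT };
static int input[INPUT_COUNT];

/* Key event handlers. A repeated key-down event is a no-op, so OS-level
 * key repeat never needs special casing. */
static void on_key_down(int i) { if (input[i] < 1) input[i] = 1; }
static void on_key_up(int i)   { if (input[i] > 0) input[i] = -1; }

/* Run at the END of each frame of game logic; simply skipped while paused. */
static void tick_inputs(void)
{
    for (int i = 0; i < INPUT_COUNT; i++) {
        if (input[i] > 0)      input[i]++;
        else if (input[i] < 0) input[i]--;
    }
}

/* "Stateless" queries the game logic can make during a frame. */
static bool just_pressed(int i)  { return input[i] == 1; }
static bool just_released(int i) { return input[i] == -1; }
static bool held(int i)          { return input[i] > 0; }

/* DAS-style repeat: fire on the frame of the press, then every `rate`
 * frames once `delay` frames have elapsed since the press. */
static bool repeats(int i, int delay, int rate)
{
    int n = input[i];
    if (n == 1) return true;                           /* initial press */
    return n > delay && (n - delay - 1) % rate == 0;   /* later repeats */
}
```

The positive-only guards (n > 0, n > delay) matter because the counter runs negative after release, so a bare n % 5 == 0 style check would otherwise keep firing after the key is let go.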
Now, I don't talk to tons of other game developers, but I've never seen or heard of anyone doing this, and if someone else did come up with it, then I discovered it independently. It was something I came up with while playing around with deterministic, rewindable game logic. I played around with this a lot in high school (not that many years ago, about 15 now).
I fully admit this is not as useful for the human race as "hitting animals with a rock", but I reckon it's the type of thing that LLMs basically only come up with if they've already been exposed to the idea. If I try to instruct LLMs to implement a system that has what I think is a novel idea, it really seems to rapidly fall apart. If it doesn't fall apart, then I honestly begin to suspect that maybe the idea is less novel than I thought... but it's a whole hell of a lot more common, so far, for it to just completely fall apart.
Still, my point was never that AI is useless; a lot of things humans do aren't very novel, after all. However, I also think it is definitely not the time to let one's critical thinking skills atrophy: today's models have some very bad failure modes, and some of the ways they fail are ways we can't afford in many circumstances. Today the biggest challenge, IMO, is that despite all of that data, the ability to generalize really feels lacking. If that problem gets conquered, I'm sure more problems will rise to the top. Across-the-board superhuman AI still has a long way to go.
I guess disagreement about this question often stems from what we mean by "human", even more than what we mean by "intelligence".
There are at least 3 distinct categories of human intelligence/capability in any given domain:
1) average humans (non-experts) - LLMs are already better (mainly because the average human knows next to nothing about most domains, while LLMs at least have some basic knowledge),
2) domain expert humans - LLMs are far behind, but can sometimes supplement human experts with additional breadth,
3) the collective intelligence of all humans combined - LLMs look like cavemen in comparison.
So when answering if AI has human-level intelligence, it really makes sense to ask what "human-level" means.