I dont think it is equivalent. If you assume it has the same modal properties, s...

FeepingCreature · on April 5, 2022

I mean, how would you discover that you're in world W? If you ask "what do you think about my red shoes?" and I say "I think your red shoes are pretty", then you will say this is just me completing the pattern. But if I have no idea what shoes you're wearing, then even I, surely agreed to be an agent, could not compliment your clothing. So I'm not sure how this distinction works.

> It doesnt come from being an implementation of the (Q, A, W) pattern

Well, isn't this just a (Q, A, W, H) pattern though? You have a hidden state that you draw upon in order to map Qs onto As, in addition to the worldstate that exists outside you. But inasmuch as this hidden state shows itself in your answers, then GPT has to model it in order to efficiently compress your pattern of behavior. And inasmuch as it doesn't ever show itself in your answers, or only very rarely, it's hard to see how it can be vital to implementing agency.

And, of course, teaching GPT this multi-step approach to problem solving is just prompting it to use a "hidden" state, by creating a situation in which the normally hidden state is directly visualized. So the next step would be to allow GPT to actually generate a separate window of reasoning steps that are not directly compared against the context window being learnt, so it can think even when not prompted to. I'm not sure how to train that though.

mjburgess · on April 5, 2022

Sure, GPT has to model H -- that's a way of putting it. However think of how the algorithm producing GPT works (and thereby how GPT models QAWH) -- it produces a set of weights which interpolate between the training data --- even if we give it QAWH as training data, implementing the same QAWH patterns would require more storage capacity than is physically possible.

I think there's a genuine ontological (practical, empirical, also) difference between how a system scales with these "inputs". In otherwords if a machine is a `A = m(Q | World, Hidden)`, and a person is a `A = p(Q | World, Hidden)` then their complexity properties *matter*.

We know that the algorithm which produces `m` does so with exponential complexity; and we know that the algorithm producing `p` doesnt. In otherwords, for a person to answer `A` in the relevant ways, does not require exponential space/time. We know that NNs are already exponential scaling in their parameters in their even fairly radically stupid solutions (ie., ones which are grossly insensitive even to W).

So whilst `m` and `p` are equivalent if all we want is an accurate mapping of `Q`-space to `A`-space, they arent equivalent in their complexity properties. This inequivalence makes `m` physically impossible, but i also think, just not intelligent.

As in, it was intelligent to write the textbook; after its written, the HDD space which stores it isnt "intelligent". Intelligence is that capacity which enables low-complexity systems to do "high-complexity" stuff. In other words, that we can map-out QAWH with physically-possible, indeed, ordinary capacities -- our-doing-that is intelligence.

I think this is a radically empirical question, rather than a merely philosophical one. No algorithm which relies on interpolation of training data will have the right properties; it just wont, as a matter of fact, answer questions correclty.

You cannot encode the whole QAWH-space in parameters. Interpolation, as a strategy, is exponential-scaling; and cannot therefore cover even a tiny fraction of the space.

Ie., if I ask "what did you think of will smith hitting christopher walken?" it is unlikely to reply, "I think you mean Chris Rock" firstly; and then if will does hit walken, to reply, "I think Walken deserved it!".

Interpolation, as a strategy, cannot deal with the infinities that counter-factuals require. We are genuinely able to perform well in an infinite number of worlds. We do that by not modelling QA pairs, at all; nor even the W-infinity.

Rather, we implement "taste, imagination, curiosity" etc. and are able to simulate (and much else) everything we need. We arent an interpolation through relevant hisotry, we are a machine direclty responsible to the local environment in ways that show a genuine deep understanding of the world and abiliyt to similate it.

This ability enables `p` to have a lower complexity than `m`, and thereby be actually intelligent.

As an empirical matter, i think you just can't build a system which actually succeeds in answering the-right-way. It isnt intelligent; but likewise, it also just doesnt work.

FeepingCreature · on April 5, 2022

The notion that GPT "interpolates between the training data" is a widespread misconception. There is no evidence that that's what's going on. GPT seems to be capable of generalizing, in ways that let it mix features of training samples at least, and even generalize to situations that it has never seen.

It seems to me your entire argument derives from this. If GPT is not exponential, then the m/p distinction falls apart. And GPT has way too much world-knowledge, IMO, to be storing things in such a costly fashion.

Neural networks learn features, not samples. Layered networks learn features of features (of features of features...). Intelligence works because for many practical tasks, the feature recursion depth of reality is limited. For instance, we can count sheep by throwing pebbles in a bucket for every sheep that enters the pasture, because the concept of items generalizes both sheep and pebbles, and the algorithm ensures that sheep and pebbles move as one. So to come up with this idea, you only need to have enough layers to recognize sheep as items, pebbles as items, those two conceptual assignments as similar, and to notice that when two things are described by similar conceptual assignments in the counting domain, you can use a manual process that represents a count in one domain to validate the other domain. Now I don't think this is actually what our brain is literally doing when we work out this algorithm, it probably involves more visual imagination and looking at systems coevolve in our worldmodel to convince us that the algorithm works. But I also don't think that working this out on purely conceptual grounds needs all that many levels of abstraction/Transformer layers of feature meta-recognition. And once you have that, you get it.

mjburgess · on April 5, 2022

All GD learners are interpolators (cf. https://arxiv.org/abs/2012.00152) we also know theyre exponential in parameter count ( cf. https://www.researchgate.net/figure/Number-of-parameters-ie-... )

> If GPT is not exponential, then the m/p distinction falls apart.

Yes, I think if you have a system which implements QAWH with a similar compelxity to a known intelligent system -- at that point I have no empirical issues. I think, at that point, you have a workiung system.

We then ask if it is thinking about anything, and I think that'd be an open question as to how its implemented. I dont think the pattern alone would mean the system had intentionality -- but my issue at this stage is the narrower empirical one. Without something like a "tractable complexity class", your system is broken.

> And GPT has way too much world-knowledge, IMO, to be storing things in such a costly fashion.

This is an illusion. Knowledge here is deterministic, to the same question, the same answer. GPT generates answers across runs which are self-contradictory, etc. "the same question" (even literally, or if you'd like, with some rephrasing) is given quite radically different answers.

I think all we have here is evidence of the (already known) tremendous compressibility of text data. We can, in c. 500bn numbers, compress most of the histoy of anything ever said. With such a databank, a machine can appear to do quite a lot.

This isnt world knowledge... it is a symptom of how we, language users, position related words near each other for the sake of easy comprehension. By doing this one can compress our text into brute statstical associations which appear to be meaningful.

As much as Github's AI is basically just copy/pasting code from github repos, GPT is just copy/pasting sentences from books.

All the code in github, compressed into billions of numbers, and decompressed a little -- that's a "statical space of tricks and coincidences" so large we cannot by intution alone fathom it. It's what makes these systems useful, but also easy illusions.

We can, by a scientific investigation of these systems as objects of study, come up with trivial hypothesis that expose their fundamentally dumb coincidental character. There are quite a few papers now which do this, I dont have one to hand.

But you know, investigate a model of this kind yourself: permute the input questions, investigate the answers.. and invalidate your hypothesis (like a scientist might do)... can you invalidate your hypothesis?

I think with only a little thoguh you will find it fairly trivial to do so.

FeepingCreature · on April 6, 2022

> All GD learners are interpolators (cf. https://arxiv.org/abs/2012.00152) ,

If the paper is substantially correct I concede the point. But what I've read of reactions leads me to believe the conclusion is overstated.

Regarding compression vs intelligence, I already believe that intelligence, even human intelligence, is largely a matter of compressing data.

Regarding "knowledge is deterministic", ignoring the fact that it's not even deterministic in humans, so long as GPT can instantiate agents I consider the question of whether it "is" an agent academic. If GPT can operate over W_m and H_n, and I live in W_1 and have H_5, I just need to prompt it with evidence for the world and hidden state. Consider for example, how GAN image generators have a notion of image quality but no inherent desire to "draw good images", so to get quality out you have to give them circumstantial evidence that the artist they are emulating is good, ie. "- Unreal Engine ArtStation Wallpaper HQ 4K."

FeepingCreature · on April 6, 2022

Also, of course, it's hard to see how DALL-E can create "a chair in the shape of an avocado" by interpolating between training samples, none of which were a chair in the shape of an avocado nor anywhere close. The orthodox view of interpolating between a deep hierarchy of extracted features and meta-features readily explains this feat.