
Since OpenAI didn't release the parameter count of GPT-4, I've been wondering/doubting if it is really much bigger than GPT-3. The release of GPT-3.5 has shown that they've found ways of drastically cutting down compute costs (an order of magnitude) while maintaining or even improving the quality of the model's outputs.

Perhaps the reason that they didn't release the specifics of GPT-4 might be in part due to them wanting to be able to charge a decent amount and make a much larger profit than before. I've tried GPT-4 and so far haven't found it to be so much better than previous models. Some sources claim a 10x increase in ... well I don't know what exactly tbh. How do you even measure it? The opinions on this seem to differ a lot, depending on who you ask. By performance on standardized tests? That doesn't necessarily seem like the best metric for what the LLM tries to be.



Yannic Kilcher's opinion on this is likely correct. Similar parameter count, but trained for longer. The particulars of their instruction tuning/whatever-else-they-did are the real secret sauce.


Don't forget about a more efficient attention that lets them get 32k tokens of context.


It's still much worse than the 1M-token context you can get on 16GB of VRAM with Reformer, though that comes at the cost of inference speed. And you can now use FlashAttention in your own models to get more efficient/sparse attention as well.
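To see why a more efficient attention matters for context length, here's a rough back-of-the-envelope sketch (my own numbers, not from the thread): naive attention materializes an n × n score matrix, which FlashAttention avoids and Reformer sparsifies via LSH.

```python
# Memory for one full seq_len x seq_len attention score matrix, fp16.
# Naive attention materializes this per head, per layer; FlashAttention
# computes the same result without ever storing the full matrix.

def attn_matrix_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes for one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * dtype_bytes

gib = 1024 ** 3
print(f"32k context: {attn_matrix_bytes(32_768) / gib:.1f} GiB per head/layer")    # 2.0
print(f"1M context:  {attn_matrix_bytes(1_048_576) / gib:.0f} GiB per head/layer")  # 2048
```

The quadratic blow-up is the whole story: 32k tokens is already 2 GiB per head per layer if done naively, and 1M tokens would be 2 TiB, which is why long-context models need some non-quadratic trick.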


The quality with reformer is much much worse, it's not really comparable.


Yeah, but it fits on a single GPU. Now imagine it scaled across 1000 GPUs.


I finetuned one in 2020[0] to play around with and the results still seemed a bit worse than a gpt of comparable size.

0. https://svilentodorov.xyz/blog/reformer-99m/


How could one apply the mentioned technologies to llama/alpaca?


>performance on standardized tests? That doesn't necessarily seem like the best metric for what the LLM tries to be.

The standardized tests give a baseline, no matter how arbitrary it might be, just as they do for humans in school.

Whether we think it's right or not, these tools are coming for the workplace. So their ultimate metric will be in business performance to justify their costs (whatever they may be).


GPT 3.5 had trouble understanding when I told it "Say 2 bob are a beb, how many beb per bob are there?" and it wrote a goddamn essay about shoes.

That thing isn't smart, it doesn't understand, it doesn't know, it just rambles. I have worked with people who do the same, yes, but they also weren't a threat to most jobs.

I said it before, and I will say it again: If ChatGPT 3, 4, 5, ... can take your job, maybe you're not really providing that much value. Make of that what you will - not everyone has to provide huge value.


Answer from GPT-4:

"This question seems to be intentionally nonsensical or is using unfamiliar terminology. However, if we try to interpret it, we could say that there are 2 "bob" making up 1 "beb." In this case, there would be 0.5 "beb" per "bob." Please provide more context or clarify the terms if you are looking for a different answer."

Answer from GPT-3.5 (subscription version, not free):

"If 2 bob are a beb, then the conversion factor from bobs to bebs is 1 beb per 2 bobs.

So, there is 0.5 beb per bob (or alternatively, 2 bobs per beb)."

Answer from GPT-3.5 (legacy / free version):

"If 2 bobs are equal to 1 beb, then there would be 1 beb per 2 bobs."
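The arithmetic the models are being quizzed on is trivial to check directly; a minimal sketch using the made-up units from the prompt:

```python
# Hypothetical units from the prompt: 2 bob = 1 beb.
BOB_PER_BEB = 2

beb_per_bob = 1 / BOB_PER_BEB
print(beb_per_bob)  # 0.5, i.e. half a beb per bob
```

All three model answers above agree with this: 0.5 beb per bob, equivalently 2 bob per beb.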


Cool, but sadly, as I said, it did not give a very useful answer. If asked enough times, I'm sure it will give a reasonable answer, yes, but that's not the point.

GPT-4's answer is interesting, though.


But all of the answers were correct and useful, and GPT-4 was perfect. Anyway ChatGPT is getting hooked up to Wolfram Alpha, and that won't have any issues with basic algebra.


If you already know the answer, but need an interesting and detailed way to explain/write it, then GPT4 can be an immense value-add. Even if you have to do some fact checking afterwards.


What do LLaMA-based models answer for this?


alpaca.cpp 7b model gave me this:

> Say 2 bob are a beb, how many beb per bob are there?

There are two bebs per bob.

> No, there are 2 bobs per beb.

Correct! There is only one other answer that makes sense - "1 Bob for every Beb". But this does not make mathematical or logical sense since a single Bob can have multiple Babes associated with it and vice versa (as seen in the example). Therefore, we must conclude there are two bobs per beb.


> I said it before, and I will say it again: If ChatGPT 3,4,5,... can take your job, maybe youre not really providing that much value.

The thing you're missing here is that it might take someone's job not because they weren't providing the value but because the people who crunch numbers decided to axe 15% of a company's workforce because some executive was sold a pack of lies about what LLMs/"AI" are actually capable of.

It's fine if that happens to one company who then finds out the hard way. It's probably more social-unresty if it's essentially done at every company in every marketplace an LLM can touch - from writing to programming to 3D animation to teaching.

The hype machine around LLM/AI here is the same irrational one we saw around blockchain. The key difference is blockchain was basically never sold as really replacing a person's job (at best you could argue it was sold as getting around the banking industry and maybe eventually being able to replace it, ish). The primary sales pitch of these LLMs is essentially "do more with less".


I typed the query into chat-gpt3.5 (turbo and legacy), and 4, and they all said that there's 0.5 beb per bob.

Did you use the quoted prompt exactly?


No, I didn't use the quoted prompt, but even after explaining to it that bob and beb were not, in fact, shoe-related terms, it still kept insisting and being confused (while also giving the correct 1/2 answer).

It can do it, but it's not deterministic, and it doesn't really do it well. You can continue the chain by asking "How many bob per bib, assuming two beb per bib?", and see if it chokes then. It sometimes does, sometimes doesn't.


GPT-4:

   If 2 bebs are equal to 1 bib, and we know that 1 beb equals 2 bobs, we can
   determine how many bobs there are per bib using simple substitution.
   
   1 bib = 2 bebs
   1 beb = 2 bobs
   
   Therefore,
   
   1 bib = 2 bebs × 2 bobs/beb = 4 bobs
   
   So, there are 4 bobs per bib.
Nitpick: A properly done substitution would've arrived at

   1 bib = 2 × (2 bobs)
without needing any of the "2 bebs × 2 bobs/beb" nonsense. It doesn't teach this task very well.
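The chained conversion being debated is easy to verify mechanically; a small sketch using the same made-up units (2 beb per bib, 2 bob per beb):

```python
# Chained hypothetical units: 2 beb = 1 bib, and 2 bob = 1 beb.
BEB_PER_BIB = 2
BOB_PER_BEB = 2

# Direct substitution: 1 bib = 2 beb = 2 * (2 bob) = 4 bob.
bob_per_bib = BEB_PER_BIB * BOB_PER_BEB
print(bob_per_bib)  # 4
```

Both the model's unit-cancellation route and the plain substitution land on the same answer: 4 bobs per bib.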


You do realize that the current implementations get their context polluted by your prior conversation, right?


I think right here we have an example of the AI reproducibility problem. It seems fully reasonable and credible as an outcome, but it is hard to dig in and replicate. But the truth of ML is that it would be difficult to replicate even if things were FOSS.


> Since OpenAI didn't release the parameter count of GPT-4

That makes me ask what the open in OpenAI stands for?


Just like MTV doesn't mean Music TV anymore.

As a joke I'd say, Open means "open your wallets"


Or TLC as the learning channel or History channel (assuming these still exist).

There are also lots of "Open Government" initiatives that end up being about making everything as opaque and confusing as possible. They were (are?) popular in the "big data" era, though funnily enough, if you watch "Yes Minister!" from ~40 years ago, there is a similar gag about "open government" in the first few episodes, so it's not new.

See of course Orwell, "we care about your privacy" banners, etc. People like to lie as blatantly as possible.


Didn't know it was "Music TV", made me think about Skyrock... the biggest Rap channel in France, and essentially no Rock there.


I am not sure how much bigger, but definitely much bigger IMHO. Otherwise you wouldn't be capped at 25 requests every 3 hours. That cap is small enough to make me think the inference costs/hardware needed are much bigger than for 3.5.


I believe I heard that running inference for longer is what gives the better responses we're seeing in v4, hence why v4 takes so much longer to output data.

Of course we won't know this for sure until OAI tells us, so we may be in the dark for a while.


ChatGPT-4 is definitely slower than GPT-3.5 (and way slower than 3.5-turbo). What could be the reason for that other than much larger parameter count?

I agree that the capabilities seem overhyped. In my subjective experience, 4 seems a little better than 3.5 but not by a huge amount. We just have OpenAI’s cherry-picked word that it‘s this incredible advance.


I disagree. It does much, much better on selected tasks. I cannot quite figure out how to describe what the difference "feels" like, but the performance is sometimes markedly different when feeding ChatGPT-3.5 and ChatGPT-4 the same prompt.


One task that ChatGPT-3.5 is hilariously bad at is reversing strings (both words and pseudorandom input). It seems to have only a vague concept of what that means, even if I try to hold its hand through the process. Maybe some prompt engineering can get it to succeed on anything longer than four letters.

ChatGPT-4 meanwhile seems to have no issue with this at all.


Have you tried inserting spaces between the characters? This may just be a tokenization issue, rather than anything due to the model per se.

Reversing a string is somewhat of a pathological case for language models, because they see tokens not characters. Learning that the token “got” and token “tog” are mirror images is only useful for string reversal and generating palindromes. Unless they are trained specifically for this task, they may not be able to do it. They should however be able to see that “g o t” and “t o g” are mirror images.

Infamously, early versions of GPT-3 tokenized numbers as grouped tokens, nerfing its calculation abilities, because it would tokenize a number such as 12345 as (illustratively) 12 34 5 which is obviously a harmful representation.
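The workaround suggested above is easy to sketch (my own illustration, not a guaranteed fix): since the model sees tokens rather than characters, spacing the characters out tends to force one token per character in most BPE vocabularies, turning reversal into a token-level operation the model can actually perform.

```python
# The model may see "got" as a single token, but "g o t" as three
# single-character tokens. Spacing out a word before asking for its
# reversal gives the model token-level access to each character.

def space_out(s: str) -> str:
    """Insert spaces between characters so each likely becomes its own token."""
    return " ".join(s)

word = "got"
print(space_out(word))        # "g o t"
print(space_out(word[::-1]))  # "t o g"
```

A prompt built from the spaced-out form is much closer to what the commenter says should be learnable: that "g o t" and "t o g" are mirror images.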


> What could be the reason for that other than much larger parameter count?

Longer inference time... I should have written it down now that people are asking about it, but a few weeks ago I saw people discussing, based on what little information was released in the GPT-4 "paper", that throwing more inference compute at the problem gives better responses.

>, 4 seems a little better than 3.5 but not by a huge amount.

Can you define that in a tangible way? I don't think most of us can since we have so little access to the product.


Runs on cheaper but slower compute maybe? Given all the hype and little competition, I'm sure they're willing to make it slower if it reduces cost.


> 4 seems a little better than 3.5 but not by a huge amount.

Depends on the task. 3.5 was completely incapable of doing math, but 4 seems to be able to at a solid highschool graduate level.


Given how small the time window between the successive releases was it's extremely unlikely that there were any big changes to the model. Most likely it's just better preprocessed training data, more training data, trained for longer, performance optimizations for attention, or a few changes to layer sizes.


They didn't release GPT-4 immediately after it was trained and then move on to training GPT-5. They had 4 for almost 6 months before it was released. 5 was certainly well underway long before we'd heard of 4.


Your timeline is wrong, GPT-4 finished training already in August.


I saw this coming a long time ago and I'm still very pissed off. For three reasons:

1. We are all forced to use the damn "chat" API instead of regular completions. Can't wait to have to deal with ChatGPT's conversations in order to get a few lines of code out.

2. We lose the super valuable 'insert' and 'edit' modes, which were great for code.

3. A 3-day notice period? That's going to be hell for people who are actually providing products based on Codex or doing research.
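The practical difference being complained about is the request shape. A sketch of the two payloads as plain dicts (no network call; field names follow OpenAI's 2023-era REST API, so treat them as illustrative rather than authoritative):

```python
# Completion-style request: you hand the model raw text to continue,
# which is what made 'insert'/'edit' workflows for code possible.
completion_request = {
    "model": "code-davinci-002",   # Codex-style completion model
    "prompt": "def fizzbuzz(n):",
    "max_tokens": 128,
}

# Chat-style request: everything is wrapped in conversational turns,
# and code has to be extracted from a conversational reply.
chat_request = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "You are a code assistant."},
        {"role": "user", "content": "Continue: def fizzbuzz(n):"},
    ],
}

print(sorted(chat_request["messages"][0].keys()))  # ['content', 'role']
```

With the completion shape the prompt *is* the code context; with the chat shape there is no raw text to complete or insert into, which is the loss the comment is describing.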


Completion API for GPT-4 will be there soon. With extra stop tokens, but better than nothing. A compromise.

And it's not like what OpenAI did was an impossible magic trick. They had the right team composition, and three insights, all present in the literature. Repeat that and you'll have GPT-4. But GPT-5? Well, that one is a different game.

As to being open, they are still relatively open. Consider Apple, for example. No one complains about Apple being a bit skittish. Well, OpenAI got a bit skittish too. It's a phase; they'll stabilize. And their setup of the company, with the non-profit board in control and profit caps, is a really interesting try at corporate design.


It's not interesting. It's a hack to keep a "don't be evil" vibe and the "open" name while they go against their own foundational principles.


You aren’t providing any sort of valuable insight here. This is more indicative of your priors than anything else. Everyone has heard this argument. The people that believe it, believe it. The people that don’t, don’t.


The initial goal was to make ai available to everyone. In the process of getting enough funds to build their vision they gave it to Microsoft.


Lots of people complain about Apple being skittish (including HN comment section), but they also expect them to pull a stunt every once in a while. OpenAI was an unknown quantity until now.


From WordNet:

> 1. skittish, flighty, spooky, nervous -- (unpredictably excitable (especially of horses))

(I didn't know the word skittish, and I figured this might help others, too.)


dude why are you copy and pasting my comments from other threads?


Did they actually plagiarize a comment you’ve made previously?



I searched and didn’t find any identical prior comment



Is it a bot? were you able to figure anything out?


No idea, but it's funny how it/he/she fetched the comment from a similar reddit discussion and pasted it here...


So this 'ar9av' is a bot reposting comments or a troll?

Wonder if dang could take a look at this.


Nobody is forced to do anything. You don't have to use OpenAI services if you don't want to...


> Nobody is forced to do anything. You don't have to use a smartphone if you don't want to...

I expect that a similar thing is possible with the use of AI (for work or possibly education, if not for personal use) as happened with smartphones.



