What good bits did you find? (I'm not sure how fruitful the "OpenAI is a Microsoft department" debate is given that they are almost one and everybody knows it, but I am curious if anyone has found anything good in those many pages.)
I think the most interesting thing is their ability to predict performance, both loss and scores on a wide range of tasks, from much smaller models - this lets them tune their architecture and hyperparameters cheaply, then run a single large training run to get the full-scale GPT-4. From the paper it sounds like they only trained the large model once, then did a reinforcement learning from human feedback (RLHF) finetune.
Disclaimer - I work at Microsoft, in AI, and have no internal knowledge about GPT-4.
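Roughly what that looks like, as a toy sketch (the numbers and the exact functional form here are my own illustration, not anything from the report): fit a power law to loss vs. compute on the small runs, then extrapolate several orders of magnitude to the full training budget.

```python
# Hedged sketch of "predictable scaling": fit L(C) = a * C^(-alpha) + L_inf
# on cheap small-scale runs, then extrapolate to the full-scale compute budget.
# All numbers below are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, loss) pairs from small training runs; compute in FLOPs.
compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
loss = np.array([3.10, 2.95, 2.81, 2.69, 2.58])

# Work in compute relative to the smallest run so the fit is numerically tame.
c_rel = compute / compute[0]

def scaling_law(c, a, alpha, l_inf):
    """Power law with an irreducible-loss floor: L(c) = a * c^(-alpha) + l_inf."""
    return a * np.power(c, -alpha) + l_inf

params, _ = curve_fit(scaling_law, c_rel, loss, p0=[1.0, 0.1, 2.0])

# Extrapolate ~4 orders of magnitude beyond the largest small run.
full_scale_compute = 1e25
predicted_loss = scaling_law(full_scale_compute / compute[0], *params)
print(f"Predicted full-scale loss: {predicted_loss:.3f}")
```

The point is that you only ever pay for the small runs while searching over architecture and hyperparameters; the one expensive run is launched when the extrapolated curve already tells you roughly where it will land.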
This isn’t that interesting imo. It’s the basic outcome of the scaling laws from the Kaplan and Chinchilla papers, just pushed across a larger gap between the small models and the final model.
They likely did extensive small-model training on the GPT-4 architecture to establish hyperparameter scaling laws, and then did a single predicted full-scale build in exactly the same way Chinchilla did.
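For flavor, the Chinchilla-style sizing step is basically back-of-the-envelope arithmetic once you accept the commonly cited "~20 training tokens per parameter" heuristic and C ≈ 6·N·D FLOPs (these are the published rules of thumb, not anything specific to GPT-4):

```python
# Toy Chinchilla-style compute-optimal sizing. Uses the rough
# "~20 tokens per parameter" heuristic and C ≈ 6 * N * D FLOPs.
# Budgets below are arbitrary examples, not GPT-4's actual compute.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust the compute budget optimally."""
    # C ≈ 6 * N * D and D ≈ tokens_per_param * N
    #   => N ≈ sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(budget)
    print(f"C={budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```

Plugging Chinchilla's own budget (~6e23 FLOPs) into this recovers roughly 70B parameters and 1.4T tokens, which is why the heuristic gets quoted so often.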