
> Let's look at some of the most important ones that have been developed over the years and try to implement the basic ideas as succinctly as possible.

One big architectural tweak that comes to mind and isn't in the article is QK norm: https://arxiv.org/pdf/2010.04245
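If you haven't seen it, the idea is simply to normalize the queries and keys per head before the dot product so the attention logits can't blow up. A rough sketch of how it looks in PyTorch (my own hypothetical QKNormAttention module, not the paper's code; I use nn.RMSNorm, which needs PyTorch >= 2.4, whereas the original paper uses plain L2 normalization with a learned scale):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QKNormAttention(nn.Module):
        """Self-attention with QK norm: queries and keys are normalized
        per head before the attention logits are computed."""
        def __init__(self, dim, n_heads):
            super().__init__()
            assert dim % n_heads == 0
            self.n_heads = n_heads
            self.head_dim = dim // n_heads
            self.qkv = nn.Linear(dim, 3 * dim, bias=False)
            self.proj = nn.Linear(dim, dim, bias=False)
            # norms act on the head dimension; RMSNorm is one common choice
            self.q_norm = nn.RMSNorm(self.head_dim)
            self.k_norm = nn.RMSNorm(self.head_dim)

        def forward(self, x):
            B, T, C = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # (B, T, C) -> (B, n_heads, T, head_dim)
            q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
            k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
            v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
            # QK norm: bound the scale of the logits to stabilize training
            q = self.q_norm(q)
            k = self.k_norm(k)
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            out = out.transpose(1, 2).reshape(B, T, C)
            return self.proj(out)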

> Cosine Schedule

A lot (most?) of new training runs actually don't use a cosine schedule anymore; instead they keep the learning rate constant and only decay it at the very end (a warmup-stable-decay / trapezoidal schedule), which gives equivalent or better results. See:

https://arxiv.org/pdf/2405.18392

https://arxiv.org/pdf/2404.06395
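Roughly, the schedule looks like this (a made-up sketch via LambdaLR; the warmup/decay fractions are illustrative, not taken from either paper):

    import torch

    def wsd_lambda(total_steps, warmup_frac=0.01, decay_frac=0.1):
        """Warmup-stable-decay: short linear warmup, long constant plateau,
        short linear cooldown to zero at the very end."""
        warmup_steps = int(total_steps * warmup_frac)
        decay_steps = int(total_steps * decay_frac)
        stable_end = total_steps - decay_steps

        def fn(step):
            if step < warmup_steps:
                return step / max(1, warmup_steps)      # linear warmup
            if step < stable_end:
                return 1.0                              # constant plateau
            # linear cooldown over the final decay_frac of training
            return max(0.0, (total_steps - step) / max(1, decay_steps))
        return fn

    model = torch.nn.Linear(16, 16)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=wsd_lambda(100_000))
    # call sched.step() once per optimizer step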

> There is a highly optimized implementation of AdamW in PyTorch.

A fun tidbit - it's actually not that highly optimized, in my experience. Imagine my surprise when I reimplemented it in Triton (because I needed to tweak a few things) and got better performance than the built-in PyTorch implementation.
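For context, this is roughly what a fused AdamW step has to do per parameter tensor. A simplified sketch in Triton (not my actual kernel; it assumes contiguous fp32 tensors and bias corrections precomputed on the host):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def adamw_kernel(p_ptr, g_ptr, m_ptr, v_ptr,
                     lr, beta1, beta2, eps, weight_decay,
                     bias_corr1, bias_corr2,
                     n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offs < n_elements

        p = tl.load(p_ptr + offs, mask=mask)
        g = tl.load(g_ptr + offs, mask=mask)
        m = tl.load(m_ptr + offs, mask=mask)
        v = tl.load(v_ptr + offs, mask=mask)

        # decoupled weight decay (the "W" in AdamW)
        p = p * (1.0 - lr * weight_decay)
        # exponential moving averages of the gradient and its square
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # bias-corrected update (corrections passed in from the host)
        p = p - lr * (m / bias_corr1) / (tl.sqrt(v / bias_corr2) + eps)

        tl.store(p_ptr + offs, p, mask=mask)
        tl.store(m_ptr + offs, m, mask=mask)
        tl.store(v_ptr + offs, v, mask=mask)

    def adamw_step(p, g, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, weight_decay=0.01, BLOCK_SIZE=1024):
        n = p.numel()
        grid = (triton.cdiv(n, BLOCK_SIZE),)
        adamw_kernel[grid](p, g, m, v, lr, betas[0], betas[1], eps, weight_decay,
                           1.0 - betas[0] ** step, 1.0 - betas[1] ** step,
                           n, BLOCK_SIZE=BLOCK_SIZE)

The win over the built-in implementation mostly comes from fusing the whole update into a single kernel pass over the parameters instead of dispatching a chain of elementwise ops.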



RE: optimizer performance - any thoughts on heavyball?


...oh, I didn't know about this library, thanks!

I still probably wouldn't be able to use it because I need a bunch of custom functionality in my optimizers (for example, custom quantization support and incremental gradient accumulation directly in the optimizer state), but I might borrow some of their techniques if they make things even faster.



