> Let's look at some of the most important ones that have been developed over the years and try to implement the basic ideas as succinctly as possible.
> There is a highly optimized implementation of AdamW in PyTorch.
A fun tidbit - it's actually not highly optimized in my experience. Imagine my surprise when I reimplemented it in Triton (because I needed to tweak a few things) and got better performance than the built-in PyTorch implementation.
I still probably wouldn't be able to use it, because I need a bunch of custom functionality for my optimizers (for example, custom quantization support and incremental gradient accumulation directly in the optimizer state), but I might borrow some of their techniques if they make things even faster.
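To give a sense of why a hand-rolled kernel can win: the whole per-parameter AdamW update is a handful of elementwise ops, so you can fuse the moment updates, bias correction, weight decay and the parameter write into a single pass over memory. A simplified sketch of that idea (not my actual kernel; fp32 states only, and the names/block size are just illustrative):

```python
import triton
import triton.language as tl


@triton.jit
def adamw_kernel(p_ptr, g_ptr, m_ptr, v_ptr,
                 lr, beta1, beta2, eps, weight_decay,
                 bias1, bias2, n_elements,
                 BLOCK_SIZE: tl.constexpr):
    # one program instance handles one contiguous block of parameters
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements

    p = tl.load(p_ptr + offs, mask=mask)
    g = tl.load(g_ptr + offs, mask=mask)
    m = tl.load(m_ptr + offs, mask=mask)
    v = tl.load(v_ptr + offs, mask=mask)

    # exponential moving averages of the gradient and its square
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # bias-corrected Adam step plus decoupled weight decay, all in one pass
    update = (m / bias1) / (tl.sqrt(v / bias2) + eps) + weight_decay * p
    p = p - lr * update

    tl.store(p_ptr + offs, p, mask=mask)
    tl.store(m_ptr + offs, m, mask=mask)
    tl.store(v_ptr + offs, v, mask=mask)


def adamw_step(param, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # all tensors: contiguous fp32, same shape, on the same GPU
    n = param.numel()
    grid = (triton.cdiv(n, 1024),)
    adamw_kernel[grid](param, grad, exp_avg, exp_avg_sq,
                       lr, beta1, beta2, eps, weight_decay,
                       1.0 - beta1 ** step, 1.0 - beta2 ** step,
                       n, BLOCK_SIZE=1024)
```

The point of fusing is simply to touch each tensor once per step instead of running a chain of separate elementwise ops over the parameter, gradient and both moment buffers.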
One big architectural tweak that comes to mind and isn't in the article is QK norm: https://arxiv.org/pdf/2010.04245
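For anyone who hasn't seen it: QK norm normalizes the queries and keys before the attention dot product, which keeps the attention logits from growing out of control during training. The paper uses L2 normalization with a learned scale; a lot of newer models apply a LayerNorm/RMSNorm per head instead, which is what this rough PyTorch sketch does (module layout and names are just illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # the QK-norm part: one norm each for queries and keys,
        # applied over the per-head channel dimension
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # normalize queries and keys before the dot product
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(b, t, d))
```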
> Cosine Schedule
A lot (most?) of new training runs actually don't use a cosine schedule anymore; instead they keep the learning rate constant and only decay it at the very end, which gives equivalent or better results. See:
https://arxiv.org/pdf/2405.18392
https://arxiv.org/pdf/2404.06395
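For reference, that "constant, then decay only at the very end" shape (often called warmup-stable-decay) is trivial to write down. A sketch, with made-up warmup length and decay fraction:

```python
def wsd_lr(step, total_steps, base_lr, warmup_steps=1000, decay_frac=0.1):
    """Warmup, then constant LR, then decay only over the last
    `decay_frac` of training. Defaults are illustrative, not prescriptive."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps   # linear warmup
    if step < decay_start:
        return base_lr                         # long constant phase
    # linear cooldown to zero at the very end
    return base_lr * max(0.0, total_steps - step) / (total_steps - decay_start)
```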