Magic methods are not that "hard" to optimize (as long as you don't overload the add, etc. operators of, e.g., the Number class in JS). I'm gonna use numbers and addition as an example here.
First off is the value model: the Python runtime handles ALL values as objects, and that's fine for an initial naive runtime. All fast/modern language runtimes, however, use value models/encodings that fit "fast" values directly into machine registers at the lowest level.
V8 has (had?) "small-ints" and objects (doubles, strings, etc.) by setting the lowest bit in a register for pointers and otherwise dealing with them as numbers. So a+b, when JIT'ed, has a check (or stored knowledge from a previous operation to elide the check) that both a and b are integers; if that is true, then the actual addition is one single machine addition. If that ISN'T true, then more complex machinery is invoked that could use methods like double dispatch to see if more complex processing (like a "magic" method) is needed. This is how a JS engine handles the fact that + behaves differently between numbers, strings, BigInts, Date objects, etc.
(Other JS engines and LuaJIT use something called NaN/NuN tagging that also allows for quick passing of numbers w/o allocations and only a few small extra checks)
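To make the tagging idea concrete, here's a toy sketch in Python (all names like `tag_int` and `fast_add` are made up for illustration; real engines do this on raw machine words in generated code, not on Python objects):

```python
# Toy model of V8-style small-int ("Smi") tagging.
# Convention here: lowest bit 0 = small int shifted left by one,
# lowest bit 1 = heap pointer.

def tag_int(n):
    return n << 1            # lowest bit 0: a small int ("Smi")

def is_smi(v):
    return (v & 1) == 0

def untag_int(v):
    return v >> 1

def fast_add(a, b, slow_path):
    # The JIT'ed guard: if both operands are tagged small ints, the
    # addition is (conceptually) one machine add on the tagged values,
    # since (x << 1) + (y << 1) == (x + y) << 1.
    if is_smi(a) and is_smi(b):
        return a + b
    # Otherwise fall back to the generic dispatch machinery
    # (strings, BigInt, magic methods, ...).
    return slow_path(a, b)
```

The nice property is that tagged small-int addition needs no untagging at all; the guard is the only overhead on the fast path.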
Re-implementing Python, you'd probably choose a small-int optimization for values (to better support Python's seamless bigints), give the number add some runtime-specific magic, and make some kind of hook that detects writes to it from user code. Patching that from user code would trigger de-optimizations, but most applications could continue running with optimized paths.
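A hypothetical sketch of that write-detection hook and the resulting de-optimization, with all names invented for illustration:

```python
# Toy runtime whose fast add path is guarded by a validity flag that a
# user write to the add hook invalidates (the "de-optimization").
class Runtime:
    def __init__(self):
        self.number_add = lambda a, b: a + b   # built-in add behavior
        self.fast_paths_valid = True

    def patch_add(self, fn):
        # A write from user code is detected here and triggers
        # de-optimization of the specialized paths.
        self.fast_paths_valid = False
        self.number_add = fn

    def add(self, a, b):
        if self.fast_paths_valid:
            return a + b                       # optimized fast path
        return self.number_add(a, b)           # generic de-optimized path
```

In a real engine the "flag" would be invalidation of compiled code, not a boolean check, but the control flow is the same.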
And even with larger objects (like heap-allocated BigInts) a JS runtime can use inline caching to direct the runtime to fast direct dispatches, and then teams like the V8 team can detect commonly used objects and create fast paths for them. A list addition, for example, will use the common "slow" paths for dispatch, but that's OK since it's an inherently slow operation that often involves allocations of some sort, so the _relative_ overhead is fairly small in the big picture.
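A minimal monomorphic inline cache for an `a + b` call site could be sketched like this (a toy model; real engines patch this into machine code, and `lookup_add_handler` is a stand-in for the generic dispatch):

```python
# Toy monomorphic inline cache: each call site remembers the last seen
# operand type and its handler, so repeated calls skip the slow lookup.
def lookup_add_handler(t):
    # Stand-in for the slow, generic lookup (magic methods, etc.).
    if t in (int, float, str, list):
        return lambda a, b: a + b
    return lambda a, b: a.__add__(b)

class AddSite:
    def __init__(self):
        self.cached_type = None
        self.cached_handler = None

    def call(self, a, b):
        if type(a) is self.cached_type:
            return self.cached_handler(a, b)   # fast path: direct dispatch
        handler = lookup_add_handler(type(a))  # slow path: generic lookup
        self.cached_type, self.cached_handler = type(a), handler
        return handler(a, b)
```

As long as a given site keeps seeing the same type, every call after the first takes the cached direct dispatch.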
All this naturally assumes that you have the machinery in place; once it is, though, you can make simple code (numeric additions) fast while retaining the magic for more complex objects (bigint, list, string, etc.).
TL;DR: once you have that kind of optimization in place, expensive processing can be allowed as special cases on slow paths thanks to type guards, but 95% of the code will run the fast paths, and making those hot paths fast gives you most of the wins.
Instead of replying to your comments individually, let me answer them all at once here, because while I agree you are correct in principle, I think you are still missing my points.
Modern tracing JIT engines indeed work by (heavy) specialization, often using multiple underlying representations for a single runtime type. I think V8 has at least four Array representations? After enough specializations it is possible to get comparable performance even for Python. The question is how many, however.
For a long time, most dynamically typed languages and implementations didn't even try to do JIT because of its high upfront cost. The cost is much lower today---yet still not so insignificant that it's a no-brainer to do so---but that fact was not at all obvious 20 years ago. Ruby was also one of them, and YJIT was only possible thanks to Shopify's initial work. Given the assumption that JIT is not feasible, both CPython developers and users did a lot of things that further complicate an eventual JIT implementation. The C API is one, which is indeed one of the major concerns for CPython, but highly customized user classes are another. Herein lies the problem:
> Magic methods are not that "hard" to optimize (as long as you don't overload the add, etc. operators of, e.g., the Number class in JS).
Indeed, it is very unusual to subclass `Number` in JS; however, it is less unusual to subclass `int` in Python, because it is allowed and Python made it convenient. I still think the majority of `int` uses will be the built-in class and not subclasses, but if that were the only concern, Psyco [1] should have been much more popular when it came out, because it should have handled such cases perfectly. In reality Psyco was not enough, hence PyPy.
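For the record, this is how trivially `int` can be subclassed with overridden arithmetic, which is exactly what breaks a naive "both operands are ints" guard:

```python
class MyInt(int):
    # Overriding __add__ on an int subclass is perfectly legal in
    # Python, so a type guard has to check the exact built-in class,
    # not just "is this an int".
    def __add__(self, other):
        return MyInt(int(self) + int(other) + 1)  # deliberately weird
```

A `MyInt` passes every `isinstance(x, int)` check while silently computing something different.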
At this point I want to clarify that magic methods in Python are much more than mere operator overloading. For example, properties in JS are more or less direct (`Object.defineProperty` and nowadays native syntax), but in Python they are implemented via descriptors, which are nested objects with yet more dunder methods. For example, this implements the `Foo.bar` property:
    class Foo:
        class Bar:
            def __get__(self, obj, objtype=None): return 42
        bar = Bar()
In reality everyone will use `bar = property(lambda self: 42)` or equivalent instead, but that's how it works underneath. And the nested object can do absolutely anything. You can specialize for well-known descriptor types like `property`, but that wouldn't be enough for complex Python codebases. This is why...
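To make the lookup concrete, this is roughly what attribute access does with that descriptor (repeating the class so the sketch is self-contained; the real `object.__getattribute__` additionally handles data vs. non-data descriptor precedence and instance dicts):

```python
class Foo:
    class Bar:
        def __get__(self, obj, objtype=None):
            return 42
    bar = Bar()

foo = Foo()
# Normal attribute access triggers the descriptor protocol:
assert foo.bar == 42
# ...which is roughly equivalent to this manual lookup: find the
# descriptor on the class and call its __get__ with the instance.
descr = type(foo).__dict__['bar']
assert descr.__get__(foo, type(foo)) == 42
```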
> This is how a JS engine handles the fact that + behaves differently between numbers, strings, BigInts, Date objects, etc.
...is not the only thing JS engines do. They also have hidden classes (aka shapes) that are recognized and created at runtime, and I think it was one of the innovations pioneered by V8---outside of the PL academia, of course. Hidden classes in Python would be more complex than those in JS because of this added flexibility and its resulting uses. And JS hidden classes are not even that simple to implement.
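A stripped-down sketch of the hidden class / shape idea, as a toy in Python (real engines store the shape in the object header and bake property offsets into compiled code): objects built with the same property-insertion history end up sharing a shape, so a cached offset is valid for all of them.

```python
# Toy hidden classes: each shape maps property names to slot offsets,
# and shapes form a transition tree keyed by insertion order.
class Shape:
    def __init__(self, parent=None, key=None):
        self.offsets = dict(parent.offsets) if parent else {}
        if key is not None:
            self.offsets[key] = len(self.offsets)
        self.transitions = {}

    def transition(self, key):
        # Same insertion order -> same (shared) child shape.
        if key not in self.transitions:
            self.transitions[key] = Shape(self, key)
        return self.transitions[key]

ROOT = Shape()

class Obj:
    def __init__(self):
        self.shape = ROOT
        self.slots = []

    def set(self, key, value):
        if key not in self.shape.offsets:
            self.shape = self.shape.transition(key)  # shape transition
            self.slots.append(value)
        else:
            self.slots[self.shape.offsets[key]] = value
```

Two objects that add `x` then `y` share one shape; an object that starts with `y` gets a different one, which is what makes shape checks usable as cheap guards.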
After decades of JIT not being in sight, and a non-trivial amount of work to get a working JIT even after that, it is not unreasonable that CPython didn't try to build a JIT for a long time and that the current JIT work is still quite conservative (it uses copy-and-patch compilation to reduce the upfront cost). CPython did do lots of the optimizations possible in interpreters, though; many of the things mentioned above are internally cached for performance. One can correctly argue that such optimizations were not steady enough---for example, the adaptive opcodes in 3.11 are something Java HotSpot was doing more than 10 years ago.