The problem is unlikely to be around the corner for AMD: they do not have to supply as much voltage to their chips as Intel does to keep up, since they are currently (at least when it comes to x86 chips) leading in performance and power efficiency.
Which will likely affect performance as well, right? As I understood it, they were too ambitious about pushing performance using higher voltages, which now need to be reduced via a microcode update?
Either performance or stability/reliability. Or even both.
At higher frequencies (during a frequency boost phase, for example) signal quality in the digital signals degrades, because the "stable 1" or "stable 0" plateau is shortened while the "maybe 1 or 0" phase in between stays the same. So a signal that is supposed to be as rectangular as possible gets smushed down towards a sine wave, and then smushed even further towards lower amplitudes.
One measure against this is of course better (faster) transistors, such that the "maybe" phase is shorter, but that only works by replacing the hardware. The other measure, which you can do during runtime, is to increase the signal amplitude by increasing the voltage. Then even a degraded signal close to the transistors' maximum switching frequency gets over the "stable 1" and "stable 0" thresholds fast enough.
With a lower supply voltage you thus cannot clock the CPU as high as before, which matters in boost phases under high load. That would decrease peak performance in standard desktop and server workloads, and overall performance in compute-intensive workloads. Or, if you still clock it as high as before, signal quality will be degraded, increasing the probability of bit errors and hurting system stability and the reliability of results.
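Back of the envelope, you can see the voltage/speed trade with a first-order RC model of a logic node: the time for a charging node to cross a fixed switching threshold shrinks as the supply voltage rises. A toy sketch (all numbers illustrative, not from any real process):

```python
import math

def time_to_threshold(vdd, vth, rc):
    """Time for an RC-charged node to cross a fixed logic threshold.
    First-order model: v(t) = vdd * (1 - exp(-t / rc))."""
    return -rc * math.log(1 - vth / vdd)

rc = 1.0    # normalized RC time constant
vth = 0.6   # fixed switching threshold, volts (made-up value)

# higher supply voltage -> threshold crossed sooner -> shorter "maybe" phase
for vdd in (1.0, 1.1, 1.2):
    t = time_to_threshold(vdd, vth, rc)
    print(f"Vdd = {vdd:.1f} V -> threshold crossed at t = {t:.3f} * RC")
```

Raising Vdd from 1.0 V to 1.2 V cuts the crossing time noticeably in this model, which is exactly the extra timing margin that lets the chip hold a higher boost clock.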
Which direction Intel will pick for this firmware update (degraded performance, degraded stability, or both) I don't know; I guess we'll see.
Maybe. I don't think anyone has information yet regarding the details. The extra power may have been pure waste rather than a performance improvement.
Unless I missed a real investigation posted somewhere?
> for AMD since they do not have to supply as much voltage to their chips
... unless you're using EXPO/XMP. I would really recommend that people who run automatically overclocked memory take a closer look at how much voltage the motherboard is pushing into their CPU, especially into the SoC. Some motherboards just push the voltages to the absolute permitted maximum; ASUS is particularly bad about this. I run lower RAM frequencies than my system is capable of because I haven't seen much performance improvement above 5600 "MHz", and it can be made to run at almost stock voltages.
It definitely causes faster silicon degradation; the question is how fast it will kill the processor. Both Intel and AMD have shown us that it can happen very quickly, not in the course of several decades as we've assumed for the longest time.
It seems new to me that the safety margins are so narrow that a microcode bug can result in voltages that physically damage the CPU. Assuming that's mostly driven by the current small feature sizes and high clock rates, that would mean everyone has this risk.
I'd consider that one more driver (voltages now being BIOS/software-selectable rather than set by physical jumpers), in addition to increasingly small features and high clock rates, rather than something that refutes that the situation is new.
It's pretty unlikely that it's around the corner. Intel uses their own foundry and designs and AMD uses their own designs with TSMC as the foundry. If it's a foundry issue then Apple would be in hot water right now since they use the cutting edge. So I think AMD is in the clear.
It’s not a foundry issue, it’s a “how far can we push voltage for better performance to compensate for older architectures” problem. That’s why it can be resolved using a microcode update.
>Intel also confirmed a rumored issue with via oxidation in its 7nm node, but said those issues were corrected in 2023 and didn't contribute to the failures.
Do you trust Intel to not lie about a widespread unfixable problem?
I trust them not to make provably false statements, yeah.
Walking the line of "well, we didn't say..." is one thing, but if they say the oxidation issue ended in May 2023, then I don't see a reason to doubt that.
And the 13th gen CPUs sold between October 2022 and May 2023? "Didn't contribute to the failures" can technically be not a lie, because so much is omitted by not specifying which failures.
What about them? There’s no reason to believe oxidation defects during manufacture is a cause of those failures.
This is a complex problem with several interacting causes and failure modes, and assuming that oxidation is the cause of all of them ("chips from outside the oxidation defect window also failed, therefore the defect window must be larger") is circular reasoning.
How far you can push the voltage is definitely dependent on foundry technology. If Intel's foundry internally specced a voltage that actually damages a chip, I would say it IS a foundry issue.
Silicon chips degrading under high temperature and voltage is very well known. "Overclocking" was (is?) the game of deliberately turning CPU frequency up to get better performance, and turning voltage up to get stable performance, while trying to keep the chip cool enough for this to broadly work out. If you push the envelope a bit hard the stable overclock becomes unstable over time and turning down the voltage and frequency can make it stable again. That behaviour is pretty well documented. Happens to memory too.
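That voltage/temperature dependence of lifetime is often modeled with Black's equation for electromigration, MTTF ∝ J^-n · exp(Ea / kT). A toy sketch with illustrative parameters (the exponent n, activation energy Ea, and the two operating points are made up, not from any datasheet):

```python
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def relative_mttf(j, t_kelvin, n=2.0, e_a=0.7):
    """Black's equation for electromigration lifetime, up to a constant:
    MTTF ∝ J^-n * exp(Ea / (k*T)). n and Ea are illustrative values."""
    return j ** -n * math.exp(e_a / (K_B * t_kelvin))

base = relative_mttf(1.0, 350.0)  # nominal current density, ~77 C
hot  = relative_mttf(1.2, 370.0)  # 20% more current, ~97 C

# modest pushes on current and temperature compound into a large lifetime hit
print(f"lifetime ratio hot/base: {hot / base:.2f}")
```

The point of the model is the exponential temperature term: a seemingly small overclock-plus-overvolt bump can take a large multiplicative bite out of expected lifetime.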
When you draw more current - i.e. make the chip do work - the voltage drops. "Load line calibration" was introduced as a way to crank the applied voltage up as the chip worked harder to offset this. Alternatively phrased, if you set the voltage so that it works well under load, it's kind of set too high for idle, so LLC gives a way to decrease idle voltage.
There was some discussion at the time about the wisdom of this. The voltage rings when you change it (overshoots), and it's not that clear whether high voltage at idle matters very much (since the chip is cooler then). I remember reading a bunch of articles, then some materials science books, then turning off LLC in the bios as probably a misfeature.
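The vdroop behaviour that LLC compensates for can be sketched with a first-order load-line model, where the voltage at the die sags in proportion to the current drawn (all resistances, voltages, and currents here are illustrative, not from any real board):

```python
def die_voltage(v_set, i_load, r_loadline):
    """First-order load-line model: the voltage seen at the die
    sags by i_load * r_loadline below the requested setpoint."""
    return v_set - i_load * r_loadline

v_set   = 1.30     # voltage requested in the BIOS, volts (made-up)
r_stock = 1.6e-3   # stock load-line resistance, ohms (made-up)
r_llc   = 0.4e-3   # aggressive LLC flattens the load line (made-up)

for amps in (10, 100):  # idle vs heavy load
    print(f"{amps:>3} A: stock {die_voltage(v_set, amps, r_stock):.3f} V, "
          f"LLC {die_voltage(v_set, amps, r_llc):.3f} V")
```

With the flatter load line the die holds close to the setpoint under load, which is why LLC lets you dial in a lower setpoint overall; the flip side, as discussed above, is the overshoot when the load steps down.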
The current Intel problem is that they've built a reputation on being the fastest and most reliable CPU. Then they lost "fastest" to chiplets at AMD, so they adjusted to "fastest single core". They really didn't want to lose that too, and CPUs run faster when you push more voltage through them, so that's what they did.
---
"Intel silicon degradation" currently refers to people noticing that desktop processors from the last couple of generations were degrading, despite not being overclocked. Some finger pointing there, Intel encourages motherboards to set fairly aggressive out of the box controls because it looks good under review. Also some more recent reports that mobile chips have the same behaviour. At a guess, the desktop ones have the voltage too high and the mobile ones are running at much higher temperature. Some reports are mentioning LLC which would be personally satisfying if it turns out to be the root cause.
I'm waiting to see if reports of degradation start to emerge on the Xeon line. Intel is claiming there's no risk of that but they're made out of the same magic sand and under really heavy competitive pressure from the Epyc chips. They've also got rather high power consumption relative to previous generations.
As to whether the same thing will hit AMD - probably not. If you take one of their chips and push the voltage up, yeah, it'll have the same degradation over time. But they're not running out of the box at the edge of feasible to try to cope with the competition because chiplets are winning the day. Also AMD doesn't have the gilded reputation of Intel to rely upon if reliability comes under question. The incentives are pointing in the opposite direction.