


The problem description is short and scary:

Problem: Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH) may cause unpredictable system behavior. This can only happen when both logical processors on the same physical processor are active.

I wonder how many users have experienced intermittent crashes etc. and just nonchalantly attributed it to something else like "buggy software" or even "cosmic ray", when it was actually a defect in the hardware. Or more importantly, how many engineers at Intel, working on these processors, saw this happen a few times and did the same.

More interestingly, I would love to read an actual detailed analysis of the problem. Was it a software-like bug in microcode, e.g. neglecting some edge-case, or a hardware-level race condition related to marginal timing (that could be worked around by e.g. delaying one operation by a cycle or two)? It reminds me of bugs like …

This and the other rather scary post at … suggest to me that CPU manufacturers should do more regression testing, and far more of it. I would recommend demoscene productions, cracktros, and even certain malware, since they tend to exercise the hardware in ways that more "mainstream" software wouldn't come close to.

(To those wondering about ARM and other "simpler" SoCs in embedded systems etc.: They have just as many hardware bugs as PCs, if not more. We don't hear about them often, since they are usually worked around in the software, which is usually customised exactly for the application and doesn't change much.)

A few past lives ago, I used to work on the AIX kernel at IBM. I once spent a few weeks poring through trace data trying to investigate a very mysterious cache-aligned memory corruption induced by a memory stress test. Our trace data was quite comprehensive, and is always turned on due to its very low overhead. It was concerning enough (and took me long enough) that it eventually sucked in the rest of my team to aid in the investigation. None of these other guys were noobs: a couple of them had (at the time) built over 20 years of experience in this system, and in diagnosing similar memory corruption bugs beyond any doubt (many were due to errant DMAs from device drivers). I had too, though for much less than 20 years.

After several full days of team-wide debugging, we had no better explanation based on the available evidence than cosmic rays, or a hardware bug.
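On the erratum's condition that both logical processors of a physical processor be active: on a Linux x86 box you can get a rough idea of whether that even applies by comparing the `siblings` and `cpu cores` fields in `/proc/cpuinfo`. The sketch below is only a heuristic under that assumption. The helper names `parse_cpuinfo` and `smt_active` are made up for illustration, and the result says nothing about whether a given CPU model or microcode revision is actually affected.

```python
def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style text into one dict per logical processor."""
    cpus = []
    # Entries for each logical processor are separated by a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus


def smt_active(cpus):
    """Heuristic: True if any package reports more sibling threads than cores."""
    for cpu in cpus:
        try:
            if int(cpu["siblings"]) > int(cpu["cpu cores"]):
                return True
        except (KeyError, ValueError):
            continue  # fields missing or malformed (e.g. non-x86): cannot tell
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            cpus = parse_cpuinfo(f.read())
    except OSError:
        cpus = []  # not Linux, or /proc not mounted
    print("SMT appears active:", smt_active(cpus))
    for cpu in cpus[:1]:
        # Model and microcode revision are what you would compare against
        # the vendor's errata list; this script does not do that for you.
        print("model name:", cpu.get("model name", "?"))
        print("microcode :", cpu.get("microcode", "?"))
```

If `siblings` equals `cpu cores`, hyperthreading is most likely disabled or absent, so the stated precondition would not be met.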
