2023-02-08 - By Robert Elder
Here are some causes of 'bit flips' in computer memory:
- Power Supply & Capacitor Issues
- Bad Motherboard Pin Connections
- Incorrect RAM Timings
- Clock Issues
- Bad CPU Pin Connections
- Signal Integrity Issues
- RAM Design Flaws
- RAM Chip Lithography Process Defects/Variations
- CPU/RAM/Motherboard Integrated Logic Defects
- DRAM Cell Amplification Errors
- Memory Swapping Onto Defective Non-volatile Storage
- Cables & Network Transmissions
- Consumer Vs. Enterprise Grade Differences
- Row-Hammer Exploit
- Multi-Cause Interactions
- Cosmic Rays
- Memory Density & Material
- Inadequate Radiation Shielding
- Orientation Of Chip In Relation To Sky
- Solar Activity
- Coronal Mass Ejections
- Secondary Particles From Cosmic Rays
- Other Exotic Particle Interactions
- Earth's Magnetic Field Lines
- Concrete & Other Building Materials
- Activity Concentration Of Radiation Shielding
- Activity Concentration Of Solder Material
In my most recent article, I documented an experience where bit flips in a computer's memory corrupted a hard drive image that I was creating. In this article, I want to talk more about the root causes of why bit flips happen in a computer's memory. I also want to challenge the commonly held belief that most bit flips are caused by cosmic rays from space. In fact, if you ever do pin down a computer problem to a bit flip in memory, I'd be willing to bet that it's most likely caused by a hardware issue or the way in which the hardware was used.
To explain this, I'll use the made-up concept of an 'error regime':
In the lowest regime, you have the worst possible case scenario, where the computer constantly experiences bit flips from every possible cause imaginable. It has so many problems with bit flips that it's basically unusable. For every level that you go up, you travel into a new 'error regime', where the frequency of bit flips is decreased by an order of magnitude. In the top regime, you have a computer that experiences bit flips as infrequently as possible given the known laws of physics.
Power Supply & Capacitor Issues
To start off, let's consider the worst possible case in the bottom regime. This computer is in really bad shape. It has all of the following problems: First of all, it's overheating and it's way above normal operating temperature. It also has a bad power supply that produces lots of ripple, undervoltage, and overvoltage. It also has problems with the input power that comes from the power grid, which is experiencing voltage sags, voltage spikes, or frequency and phase shifts. All of these glitches in the input power eventually make their way to one of the pins on the RAM stick. The motherboard and power supply also have bad capacitors, likely from the electrolyte inside of them boiling off due to prolonged overheating.
Bad Motherboard Pin Connections
The RAM would also be making bad connections with the motherboard slots. This could be due to dust or oils on the pins that partially prevent a clean connection, but don't prevent it from working at all.
Incorrect RAM Timings
There's also an entire zoo of settings that need to be defined to control how your RAM operates. It's not just voltage and frequency either. Sometimes you can manually control these settings in your BIOS, but other times the values for these settings are automatically determined by your BIOS, and it doesn't always guess correctly. A number of these settings might only matter for very specific RAM designs and implementations. Many of these details touch on aspects of the RAM that are very complex, and sometimes proprietary so it's difficult to find clear documentation on them. Having one or more of these settings wrong might not immediately cause you obvious memory errors, but it could move you down into a new 'error regime'. In the worst case scenario, you'd have all of these timings and voltages set incorrectly.
Clock Issues
Another source of error that you could introduce is from the clock signal. The reference oscillator on the motherboard could be defective or damaged and produce incorrect timings. Usually the clock signal is generated by a piezoelectric crystal, and then multiplied to even higher frequencies using a phase-locked loop. If the phase-locked loop is damaged or incorrectly designed, the resulting clock signals could be incompatible with the RAM even if you configure the correct timings in the BIOS settings.
Bad CPU Pin Connections
Another potential source of error is the CPU itself. After all, in order to put information into memory in the first place, it must go through the CPU. If some of the pins on the CPU are not making a good connection with the motherboard, this could also cause bit flips.
Signal Integrity Issues
Even when all of the pin connections are perfect and secure, the process of sending information through a wire is an extremely difficult engineering problem. This is one of the first transatlantic telegraph messages ever sent:
TO THE PRESIDENT OF THE UNITED STATES, WASHINGTON The Queen desires to congratulate the President upon the successful completion of this great international work, in which The Queen has taken the deepest interest. The Queen is convinced that the President will join her in fervently hoping that the electric cable, which now connects Great Britain with the United States, will prove an additional link between the nations, whose friendship is founded upon their common interest and reciprocal esteem. The Queen has much pleasure in thus communicating with the President, and renewing to him her wishes for the prosperity of the United States.
It was sent on the 16th of August 1858, and it took almost 17 hours to transmit this message from Valentia, Ireland to Newfoundland. The reason it took almost 17 hours to send this message was a number of physical and mathematical errors in the signaling process that was used in the telegraph cable. Many of the precise mathematical tools that were needed to reduce the error rate in the telegraph cables of the time were developed by a man named Oliver Heaviside. This body of work is known as the Telegrapher's Equations. The degree of sophisticated education required to understand and make use of these equations was apparently enough to confuse even the Society of Telegraph Engineers at the time. In 1873, Oliver Heaviside applied to join the Society of Telegraph Engineers. Unfortunately they were total n00bs and they rejected his application with a comment saying 'we don't want telegraph clerks'.
Fortunately, we now understand the value of Heaviside's work, and the Telegrapher's Equations are an important basis for all of the modern transmission line communication that we use today. This includes reading and writing from RAM.
RAM Design Flaws
To make things even worse, the RAM itself could suffer from design flaws. It's possible that the address or data lines of the RAM, motherboard, or CPU are not properly impedance matched, which can cause signal reflections and corrupt the data. There also could be a problem with too much parasitic capacitance in the design. Another common problem in printed circuit board routing is the accidental creation of unwanted antennas. Whenever current flows through a wire, some of the electromagnetic energy will propagate into the surrounding space. Some of this energy can be captured by other wires inside the same device and interfere with its operation.
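To give a rough feel for what 'not properly impedance matched' means, here's a minimal sketch that uses the standard transmission line reflection coefficient, (Z_load - Z_line) / (Z_load + Z_line). The 50 ohm trace and the load values are hypothetical numbers that I picked for illustration. They aren't measurements from any real RAM design.

```python
def reflection_coefficient(z_load: float, z_line: float) -> float:
    """Fraction of the incident voltage wave that bounces back at the load."""
    return (z_load - z_line) / (z_load + z_line)


# Hypothetical example: a 50 ohm trace terminated by three different loads.
Z_LINE_OHMS = 50.0

for z_load in (50.0, 60.0, 100.0):
    gamma = reflection_coefficient(z_load, Z_LINE_OHMS)
    print(f"load {z_load:6.1f} ohm -> reflection coefficient {gamma:+.3f}")
```

A perfectly matched load reflects nothing. The larger the mismatch, the more of the signal bounces back down the trace, where it can collide with the next bit being sent.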
RAM Chip Lithography Process Defects/Variations
We haven't even started talking about the kinds of defects that can occur during the manufacturing process of the silicon die of the RAM chip itself. Silicon wafers are made from what is supposed to be very pure silicon, but variations in the purity do occur and this can affect the quality of the final product. There's also a surprisingly large amount of variation in the quality of the final silicon chips, even for those created in the same batch. There is a multitude of problems that can happen during the lithography process of creating the RAM chip. This can be due to things like optical aberrations in the lens system that was used for the lithography. It could be seismic or environmental vibrations, even the vibrations of someone walking nearby, or a very common and very low magnitude earthquake that's so subtle humans can't notice it. These unwanted vibrations can cause imperfections in the features that are built up on the surface of the chip. This could result in problems like interconnect wires that have the wrong shape, or oxide layers that are too thin or too thick in the wrong places. The lithography process for chips requires chemicals of very precise concentrations. Small variations in heat or exposure time can degrade these chemicals. Keeping the lithography chemicals consistent is a continuous and difficult process. The lithography process is very sensitive, and even a stray speck of dust on a given area of the RAM chip could potentially compromise the bits of memory at that location.
CPU/RAM/Motherboard Integrated Logic Defects
The exact physical mechanism of all manufacturing defects in a chip that could cause bit flips is too broad of a topic to cover in one article. In general, you'd expect to see different categories of manufacturing defects from each of the different logic and integration families. Some of the earliest integrated chips used diode-transistor logic, which was soon replaced by transistor-transistor logic. Then came the multitude of other integrated logic techniques like NMOS, PMOS, CMOS, JFET, FinFET and many others. One specific example of a logic level circuit problem is known as a latch-up event. A latch-up event occurs when a low impedance path is created between the power supply rails of a MOSFET circuit. When the latch-up event occurs, the transistor feature can become locked into a fixed state that can only be resolved by powering off the device. In some cases, the transistor feature can become physically degraded or destroyed. Depending on where this transistor feature is, the result could be perceived by the user as a bit flip. Manufacturing or design flaws can significantly increase the likelihood of latch-up events.
DRAM Cell Amplification Errors
In modern DRAM sticks, bits of information are stored in tiny capacitors. As you might expect, when you have a tiny charge stored in a tiny capacitor, it's easy for the charge state to get corrupted, which corresponds to a bit flip. When information is finally read out from memory, the charges in these tiny capacitors must be amplified by one of the aforementioned logic families. This amplification is yet another potential source of error.
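To show why this amplification step is so delicate, here's a rough sketch of the textbook charge sharing model of a DRAM read. The cell capacitor gets connected to a much larger bitline that was precharged to half the supply voltage, and the sense amplifier only sees the small voltage swing that results. All of the capacitance and voltage values below are hypothetical round numbers that I chose for illustration, not figures from any datasheet.

```python
def bitline_swing(v_cell: float, v_precharge: float,
                  c_cell: float, c_bitline: float) -> float:
    """Voltage change seen on the bitline after the cell shares its charge."""
    return (v_cell - v_precharge) * c_cell / (c_cell + c_bitline)


def sensed_bit(v_cell: float, v_precharge: float,
               c_cell: float, c_bitline: float) -> int:
    """An idealized sense amplifier: a positive swing reads as 1, otherwise 0."""
    return 1 if bitline_swing(v_cell, v_precharge, c_cell, c_bitline) > 0 else 0


# Hypothetical values: 1.2 V supply, 25 fF cell, 100 fF bitline.
V_DD = 1.2
V_PRECHARGE = V_DD / 2
C_CELL = 25e-15
C_BITLINE = 100e-15

# A stored '1' whose charge has leaked away by different amounts.
for v_cell in (1.2, 0.9, 0.65, 0.55):
    swing_mv = bitline_swing(v_cell, V_PRECHARGE, C_CELL, C_BITLINE) * 1000
    bit = sensed_bit(v_cell, V_PRECHARGE, C_CELL, C_BITLINE)
    print(f"cell at {v_cell:.2f} V -> swing {swing_mv:+7.1f} mV -> reads as {bit}")
```

In this model, even a fully charged cell only moves the bitline by a small fraction of the supply voltage. Once the cell leaks below the precharge level, the stored 1 reads back as a 0.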
Memory Swapping Onto Defective Non-volatile Storage
In modern computer systems, you also need to consider the fact that data doesn't always reside in what we call 'main memory'. It can also reside in CPU caches or registers. Additionally, the operating system can and will swap out memory pages onto what we consider 'permanent storage' on a temporary basis. This is usually done when a long-lived program has to allocate more memory than the computer has freely available at that time. For this reason, it's also appropriate to include the logic families for what we typically call 'non-volatile storage'. One of the earliest forms was called PROM. Then came EPROM, then EEPROM. Today we have NAND flash, NOR flash and, of course, magnetic storage. Each type of non-volatile storage has a different limited lifetime, and once it starts to wear out you can expect to see bit flips. Even with NAND flash specifically, there are many different sub-types that can have dramatically different lifetimes. Some of these storage media include their own internal mechanisms to try to correct bit flips, but multi-bit flips can still evade correction. The existence of internal error correction mechanisms like these is not always easily verifiable by the user, and some of them can even contain programming bugs that introduce errors.
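Here's a small sketch of how a multi-bit flip can evade correction. It uses the classic Hamming(7,4) code purely as a stand-in. I'm not claiming that any particular storage device uses this code, and real controllers generally use much stronger ones. The point is only that a code built to fix one flipped bit will confidently 'fix' the wrong bit when two are flipped.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the suspected error
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, *positions):
    """Return a copy of the codeword with the given 0-based positions flipped."""
    c = list(codeword)
    for position in positions:
        c[position] ^= 1
    return c


data = [1, 0, 1, 1]
stored = encode(data)

print("one flip recovered: ", decode(flip(stored, 4)) == data)
print("two flips recovered:", decode(flip(stored, 1, 4)) == data)
```

With one flipped bit the original data comes back. With two flipped bits the decoder 'corrects' a third, unrelated bit and hands back wrong data without any indication that something went wrong.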
Cables & Network Transmissions
You also need to think about the probability of bit flips, not only when the data is stationary in RAM, but also when the data is being transmitted both internally and externally. Typically, when you observe that a bit has been flipped on a piece of data that's in memory, you don't know where the bit flip actually originated. It's possible that the bit was flipped while the data was in memory, but it could just as easily have happened in a piece of networking equipment, inside of a hard drive, or even in a data cable. You may have heard before of SMART statistics for hard drives. In particular, there is one SMART metric that goes by the label 199_CRC_Error_Count. I can tell you that at this moment, I have a solid state drive that's been running in one of my servers for about four years now. As of today, this drive is reporting a CRC error count of three over those four years of operation. If you do a bit of Googling about this CRC error, you'll find a lot of forum posts where people suggest that the root cause of this problem is a loose hard drive cable.
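To illustrate how a CRC catches a bit that was flipped in transit, here's a minimal sketch that uses Python's zlib.crc32 as a stand-in. I'm assuming nothing about the exact CRC parameters that a drive interface really uses, and the payload and the flipped bit position are made up.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical payload standing in for a block of data sent over a cable.
payload = b"sector contents travelling over a cable"
crc_at_sender = zlib.crc32(payload)

# A loose connection flips one bit somewhere along the way.
received = flip_bit(payload, 42)
crc_at_receiver = zlib.crc32(received)

print(f"sender CRC:   {crc_at_sender:#010x}")
print(f"receiver CRC: {crc_at_receiver:#010x}")
print("corruption detected:", crc_at_receiver != crc_at_sender)
```

Each time a mismatch like this is detected, the transfer can be retried and an error counter gets bumped. That's the kind of event that a CRC error count keeps track of.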
Consumer Vs. Enterprise Grade Differences
When it comes to the manufacturing and design of hardware, consumer grade versions of products are generally considered to be less reliable than their corresponding enterprise grade versions. In our consideration of bit flips, this corresponds to moving up or down into a new 'error regime', where the frequency of bit flips changes by a few orders of magnitude.
Row-Hammer Exploit
If all of this is not enough to convince you that cosmic rays are probably not the cause of most of your bit flips, then I'll have to introduce you to the row-hammer exploit. The row-hammer exploit is a security vulnerability that affects certain types of computer RAM. RAM that's vulnerable to the row-hammer exploit can have specific bits flipped through a controlled process that works through software. Apparently, in some cases you can even use the row-hammer exploit to flip multiple bits at once. This could even be used to defeat the error correction mechanisms of ECC memory. The term RAM stands for 'random access memory', and the reasonable assumption is that you should be able to access RAM in any pattern you want. It should not be possible to write a software program that flips bits in neighboring memory cells without explicitly issuing machine instructions to do so. The fact that this is even possible is a clear demonstration of how fickle and easily corruptible computer memory is.
Multi-Cause Interactions
So, now that we've enumerated many of the root causes of bit flips, it's not hard to imagine how eliminating each of these could move you up into a new regime where errors are an order of magnitude less likely. It's also worth noting that many of the potential root causes, like bad connections, incorrect timings, or too much heat, are hard to quantify. For this reason, one or more of these variables can likely interact and move you into a new error regime without your knowledge, but only in certain circumstances. When silicon heats up, its ability to conduct electricity increases. I would speculate that this means the RAM states need to be refreshed more frequently when operating at higher temperatures. If this were the case, then I would expect that certain sets of RAM timings would operate perfectly normally when the computer is cool, but for those same RAM timings, when the computer gets closer to its maximum operating temperature, the refresh times may no longer be adequate.
Cosmic Rays
I think there's a simple explanation for why people focus so much on cosmic rays and radiation as being the cause of bit flips. It's because they're the most difficult type of bit flip to completely eliminate. Most of the root causes described so far have one simple solution, and that's just to design and build the hardware correctly. In practice, this is rarely done because it's not profitable enough. But even when it is done, the smallest possible transistor features will always be vulnerable to radiation, and no practical amount of shielding can prevent this.
Memory Density & Material
There are semiconductor manufacturing techniques that can mitigate the effects of radiation, such as using silicon on sapphire instead of pure silicon wafers, but this is still vulnerable to radiation. So, is there any way to go into an even higher regime with even fewer bit flips in a high radiation environment? Well, yes there is! You could use magnetic core memory. Magnetic core memory was one of the earliest forms of computer memory and it was known for being very radiation resistant. One of the reasons for this is simply that the toroidal cores used to make up the memory are physically very large, so one stray radiation particle is not enough to disrupt the magnetization state of the entire toroid. A similar approach could technically also be used for memory that relies on capacitors to store information. If you find that the radiation particles are changing the charge states in your capacitors, simply use larger capacitors that can store more charge. You can also use higher voltages and more current. Of course, the disadvantage of either of these techniques is that your memory would become huge and take up a lot of space. It would also use a lot of power and be very slow.
Inadequate Radiation Shielding
If you do a search online to try to find out how frequent bit flips are, you'll find quotes like this: "Research has shown that a computer with four gigabytes of memory has a 96% chance of experiencing at least one bit flip every three days". Personally, I think that overly precise statements like this are completely ridiculous, especially when they're given without context. Even if you have a perfectly designed and constructed computer that only experiences bit flips from cosmic rays, the exact frequency of bit flips will depend very heavily on the amount of radiation shielding. The ability of radioactive particles to penetrate into shielding is a very well studied science. Simply having a layer of steel or concrete between the computer and the sky would have a dramatic effect on the rate of bit flips due to cosmic rays. Most of these quoted statistics don't say anything about what floor the computer was on or what the roof was made of! These are very important variables for determining the frequency of bit flips due to cosmic rays.
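To show how fragile a number like '96% every three days' is, here's a rough sketch that assumes bit flips arrive as a Poisson process with a constant average rate. That assumption is mine and isn't part of the quote. Under it, you can back out the implied rate and then see what happens when the environment moves you up or down by an order of magnitude.

```python
import math


def rate_from_probability(p_at_least_one: float, days: float) -> float:
    """Average events per day implied by P(at least one event) over a window."""
    return -math.log(1.0 - p_at_least_one) / days


def probability_of_at_least_one(rate_per_day: float, days: float) -> float:
    """P(at least one event) in a window, assuming a Poisson process."""
    return 1.0 - math.exp(-rate_per_day * days)


# The rate implied by the quoted statistic, under the Poisson assumption.
quoted_rate = rate_from_probability(0.96, 3.0)
print(f"implied rate: {quoted_rate:.2f} bit flips per day")

# Shift the rate by an order of magnitude in each direction.
for factor in (10.0, 1.0, 0.1, 0.01):
    p = probability_of_at_least_one(quoted_rate * factor, 3.0)
    print(f"rate x {factor:<5} -> {p * 100:6.2f}% chance in three days")
```

Under this assumption the quote works out to roughly 1.07 bit flips per day. Cutting the rate by a single order of magnitude drops the three day probability from 96% to about 28%, which is why the shielding, the floor, and the roof matter so much.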
Orientation Of Chip In Relation To Sky
The precise physical mechanism by which radiation can flip bits in memory depends on many factors. In academic literature, bit flips due to radiation are called single event upsets. The physical orientation of the silicon chip in relation to the radiation source would have a dramatic effect on the rate and mechanism of single event upsets. In the case of cosmic rays from space, if the surface of the chip was oriented towards the sky, this would maximize the rate at which single event upsets could occur. However, if the chip was instead oriented on its side with its surface parallel to the path of the radiation source, it would be reasonable to expect to see fewer unique upset events, but each upset event would likely be able to flip more bits in the chip. This is because the path of the particle would be able to penetrate through a larger number of memory features on the chip.
Solar Activity
Some cosmic rays originate from the sun. The total flux of charged particles from the sun is of particular interest to the US government. The National Oceanic and Atmospheric Administration even has a website where you can observe real-time data that shows the total proton flux at various energy levels as measured by an orbiting satellite. You can see from this data that on any given day it's not unusual for the number of observed particles to increase by as much as 10 times. This is yet another reason why anyone providing overly precise predictions about the frequency of bit flips is probably wrong. Unless they're also providing a long list of other environmental variables, and a precise description of the hardware state, they're probably just copying and pasting the first quote that they found on Google.
Coronal Mass Ejections
In extreme cases, events like coronal mass ejections can dramatically increase the number of high energy charged particles that come from the sun. During these events, the number of charged particles that reach the earth's surface is not only enough to flip bits in computers, but also enough to interfere with power and telecommunication systems. In March of 1989, one such event was able to trip the circuit breakers in Quebec's power grid, leading to a nine hour blackout. The storm also briefly disrupted radio telecommunications globally.
Secondary Particles From Cosmic Rays
When a single event upset is caused by a cosmic ray, the particle involved in the interaction is usually not the cosmic ray itself. Instead, it's usually one of the secondary lower energy particles that result from the initial collision in the upper atmosphere. Cosmic rays themselves are technically not rays but rather any kind of exotic combination of particles found in the standard model of physics.
Cosmic rays that come from interstellar space typically have a higher intrinsic energy than those that come from the sun. Determining the exact makeup of all secondary particles that can result from a cosmic ray collision is a complicated topic that's beyond the current limits of human knowledge on particle physics. Despite this, the most important particles to consider are those that have a large mass and a non-zero electric charge. To name a few, this includes protons, alpha particles, and muons. When these energetic charged particles move through the crystal lattice of a silicon wafer, their electric charge is able to disturb and provide energy to the electrons within the lattice itself. This trail of disturbed electrons can create a temporary highly conductive path that did not exist previously. This trail would have an effect similar to poking a very thin wire through the chip in a random direction. If the path of the particle travels through a feature within the chip, such as a floating MOSFET gate or an NMOS DRAM cell, the result could be a flipped bit.
Neutrons can also trigger problems that lead to bit flips even though they have a neutral electric charge. High-energy neutrons can collide with atoms in the chip to create other electrically charged fragments that can lead to the same kinds of problems as cosmic rays in general. Low-energy thermal neutrons can be more gracefully absorbed by atoms in the chip to create various radioactive isotopes with a wide variety of half-lives and potentially long-lived decay products. Thermal neutrons can also change the doping characteristics of the silicon. This can disrupt the functioning of transistor elements. About three percent of natural silicon is silicon-30, which can absorb a thermal neutron to become the unstable isotope silicon-31. Silicon-31 quickly decays to become the stable isotope phosphorus-31. Phosphorus is one of the n-type dopants used to create transistors in the first place. Therefore, bombarding existing transistors in a chip with thermal neutrons is likely to change the transistor characteristics or cause them to stop working entirely. There is even an established industrial process that deliberately uses thermal neutrons as a method of doping silicon. This process is called 'neutron transmutation doping of silicon' and it's primarily used for applications that require a highly uniform doping concentration that cannot be achieved by other means.
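To put a number on 'quickly decays', here's a small sketch of ordinary exponential decay applied to silicon-31. The half-life of roughly 2.62 hours is an approximate figure that I'm supplying as an assumption. It doesn't come from anything measured for this article.

```python
# Approximate half-life of silicon-31 in hours (assumed value, see above).
SI31_HALF_LIFE_HOURS = 2.62


def fraction_remaining(hours: float,
                       half_life_hours: float = SI31_HALF_LIFE_HOURS) -> float:
    """Fraction of the original unstable atoms that have not yet decayed."""
    return 0.5 ** (hours / half_life_hours)


def fraction_converted(hours: float,
                       half_life_hours: float = SI31_HALF_LIFE_HOURS) -> float:
    """Fraction of the silicon-31 atoms that have already become phosphorus-31."""
    return 1.0 - fraction_remaining(hours, half_life_hours)


for hours in (1, 3, 6, 12, 24):
    percent = fraction_converted(hours) * 100
    print(f"after {hours:2d} h: {percent:6.2f}% converted to phosphorus-31")
```

With a half-life on the order of a few hours, nearly all of the silicon-31 created by a burst of thermal neutrons would have turned into phosphorus dopant atoms within about a day.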
High-energy electrons could be problematic if they become embedded and accumulate inside the chip or if they're energetic enough to create secondary particles. Low-energy electrons would be less problematic and their effect would be similar to that of thermal noise.
High-energy photons like gamma rays or x-rays are able to trigger nuclear reactions, so they could be another source of exotic secondary particles. High-energy photons are a form of ionizing radiation, so they would have the ability to strip electrons off of atoms in the chip. This could obviously disrupt electrical charge states within the chip.
Other Exotic Particle Interactions
This isn't even an exhaustive list of particle interactions. There are also antimatter particles, and all sorts of exotic short-lived intermediary particles that I won't even talk about.
Okay, that's enough talk about cosmic rays themselves, but if you really want to be precise in predicting your expected rate of bit flips, you also need to consider what altitude your computer operates at. Obviously, the higher you are, the less of a protective atmosphere you have against high-energy particles from space. Some reports suggest that a Mount Everest climber would receive an expected radiation dose that's five times as much as someone who stayed at sea level. Since higher altitudes also have rarefied atmospheres, cooling systems that rely on air will become less effective. This could lead to unexpected overheating, which also causes bit flips.
Earth's Magnetic Field Lines
Another thing to pay attention to is the location of Earth's magnetic field lines. Since many of the cosmic rays are charged particles, they will be deflected by the magnetic field lines of Earth. After all, this is why the Aurora Borealis is usually only seen in the Arctic and not localized entirely within the kitchen of Principal Seymour Skinner.
Concrete & Other Building Materials
Based on what I've said so far, you might be under the impression that the best place to put your computer is under a 50-foot thick layer of concrete. However, even building materials like the aggregate used in concrete or stone blocks made of granite can be a substantial source of environmental radiation. Even the chip's own packaging material can become a problem. Starting in the late 1970s, chip manufacturers became aware of trace amounts of radioactive particles that were embedded in the ceramic packaging. This trace amount of radioactivity was only noticed after an investigation of unexplained bit flips that were experienced by the chips.
Activity Concentration Of Radiation Shielding
It's also worth noting that even benign things in the environment, such as the radiation shielding itself, could become a source of radioactive particles. In the early 1900s, before nuclear weapons testing, the atmosphere contained far fewer radioactive particles. Some of the radioactive particles released since then have made their way into the steel that we use today. For this reason, there's a highly sought after material called low-background steel that contains a low amount of radioactive particles because it was made before nuclear weapons testing. This steel is used for applications like those found in this article.
Activity Concentration Of Solder Material
In 2004, a white paper was published that describes how the solder materials that are used to connect the chip to the outside world can also become a source of radiation. Natural lead contains trace amounts of radioactivity, and can therefore become another source of alpha particles. For this reason, there is a product category called 'low alpha solder' that addresses this exact issue. The need for low alpha solder is particularly important for the flip chip method, where the solder can make direct contact with the exposed surface of the chip itself.
Bit Flip Experiences That I've Had
Now, just like I described in my previous article, it can be very difficult to root cause a problem down to a bit flip in memory. You may have the opinion that bit flips are extremely rare and that they'll never cause a problem for you. In that case, I'll tell you about one more time when I experienced a software problem that I definitely root caused to a bit flip in memory. On this occasion, I was compiling the GCC compiler with the GCC compiler inside of a virtual machine. For some reason, every time I tried to compile the compiler it would error out with different random errors. Some of the errors didn't make any sense, so eventually I booted into memtest. Sure enough, I was getting bit flips. Furthermore, just like in the previous article, it only seemed to happen when the laptop got very hot. To make matters worse, in this case my bit flips were a lot more rare. The machine would operate for as much as 8 hours without any problems, but after it seemed to heat up to a critical point it started throwing lots of bit flips. I eventually replaced the RAM in this machine and the bit flips did go away.
I also have a desktop server machine that I've been using for a few years. Almost every time I move this machine around and then try to boot it up, it'll start throwing POST codes and beep at me. By now, I've learned that the solution to this problem is to open up the case and gently nudge the RAM stick to push it down a bit. This doesn't always work the first time, but eventually I can get it to boot up. This always makes me wonder: What kind of connection problems still exist even after the machine boots up? If I were to slightly bump this machine, would this loosen the connection and start throwing bit flips?
And finally, I'll conclude with a story that a friend of mine told me about a time when he encountered bit flips. For some reason, one of his corporate clients in particular would experience random corruption with their images. He did some investigation and eventually came to the conclusion that the problem was not on his end. After talking with the client, he decided to make a visit directly to their corporate headquarters. Apparently, this client had a server with a network interface that was flipping the bits. This client was using some sort of fancy system with multiple redundant network interfaces. The idea was that if one internet service provider went down, you could simply route the message to the remaining internet service provider. TCP packets include a checksum as part of the header, so packets with flipped bits should generally get discarded. If I recall correctly, the use of multiple network interfaces was key in allowing this bit error to propagate. Due to some sort of error in the logic that rewrote the TCP packet, the bit flip was allowed to propagate because the TCP packet checksum was recomputed with the bad bit included.
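Here's a minimal sketch of how that kind of failure can hide a bit flip. It computes the 16-bit ones' complement checksum in the style used by TCP, but it leaves out the pseudo-header and everything else about a real TCP segment. The payload and the 'buggy rewrite' step are hypothetical, since I don't know the details of what the client's equipment was really doing.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload and the checksum computed by the original sender.
payload = b"image bytes from the client"
original_checksum = internet_checksum(payload)

# A faulty network interface flips a single bit.
corrupted = bytearray(payload)
corrupted[5] ^= 0x04
corrupted = bytes(corrupted)

# Normal case: the receiver checks against the sender's checksum.
detected = internet_checksum(corrupted) != original_checksum

# Hypothetical buggy case: a device recomputes the checksum after the flip.
rewritten_checksum = internet_checksum(corrupted)
hidden = internet_checksum(corrupted) == rewritten_checksum

print("flip detected with the original checksum:", detected)
print("flip hidden after the checksum rewrite:  ", hidden)
```

Once the checksum has been recomputed over the corrupted bytes, the packet looks perfectly valid to every device downstream, and the bad bit ends up in the final image.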
If you read my previous article and you still weren't convinced that ECC memory is a good idea, hopefully I've convinced you in this one.