2019-04-22 - By Robert Elder
A very common problem experienced by people who use digital cameras, the Raspberry Pi, or USB flash drives is flash memory corruption. There are several contributing factors, but one of the most important is that most consumer flash memory is just poorly designed and manufactured. As of writing this article, you can pick up 16GiB of flash memory at an office supply store for about $10, and it is actually quite common for these cheap consumer SD cards to fail. Industrial SD cards, such as those you can purchase from DigiKey, are more reliable, but they can be extremely expensive, especially in the larger capacities.
Before diving into the details of what can cause 'flash corruption', it's important to distinguish between two different types of 'corruption' that can occur. The following two terms have been coined mainly just for use in this article so they aren't universal, but let's divide up the types of flash 'corruption' you can experience into soft corruption and hard corruption.
Soft Corruption is meant to describe 'software' based corruption of the information stored on the device without any physical issue with the device itself. In this type of corruption, your data may be lost or damaged, but you can likely re-format the device and it will work as good as new again. For example, you can sometimes end up with 'soft corruption' if you copy something to a flash drive and then pull out the device before it has finished writing, or without using the 'safely remove' feature. When you try to read the drive again, it might say you need to format it, or the files may not open. You can often format the device and then use it again to store files just fine. With soft corruption, you may also be able to use various software tools to read all valid and invalid data from the media and then perform manual steps to reconstruct the filesystem structure. This job is often performed by data recovery specialists, and it can be very time consuming or expensive.
Hard Corruption is meant to describe 'hardware' based corruption (permanent physical effects) that prevents the flash device from reliably storing information. With hard corruption, you can expect your data to be lost or damaged, and no kind of formatting or software-based repair tool has any hope of making the device reliable again. Hard corruption of flash devices will occur naturally over time with repeated write cycles as the electrostatic mechanisms used to actually 'store' and represent data in the device become more and more noisy. Other forms of hard corruption could be things like electrical shorts that damage the silicon chip inside the SD card, or physical changes to the memory cells in the chip that result from intense static electricity discharge.
Cheap flash memory has been known to sometimes experience soft corruption when subjected to particular write sequences, such as those you might encounter from heavy Raspberry Pi use. This can be because of poor programming and testing on the part of the SD card manufacturer, who may only test their firmware against the most common and predictable write patterns. They may assume that most people using the memory will be doing things like taking photos or video with a digital camera. These trivial test cases are enough to cover the most common uses of consumer memory and are therefore 'good enough' to be able to sell the product.
Common Causes of Flash Memory Corruption
In addition to cheap memory being to blame, there are a number of other contributing factors, and many of them have to do with not having a stable power supply. To summarize, problems that can lead to corrupted flash memory include:
- Removing the flash media from the Raspberry Pi/Camera/Computer while it is running without first unmounting it or using the 'safely remove' feature.
- Buying the cheapest SD card you can find on eBay.
- Power outages, brownouts or voltage surges.
- Using a poor quality USB wall wart adapter. See YouTube search for 'usb adapter teardown'.
- Using a USB cable with wires that are too thin or don't make proper contact when plugged in (causing high wire resistance and voltage drop).
- Using a good quality USB cable and adapter that has an inadequate current rating for your use case (e.g. using a 1A adapter when 2A is required to run the Pi + camera + HD during 100% CPU usage).
- Using extremely high-density flash which is often more prone to failure than low-density flash.
- Using flash memory which, internally, does not have mechanisms (such as write-ahead logging or transactions) to recover from power loss or power sag.
- Electrostatic discharge (static electricity shocks).
- Writing/rewriting to the flash memory too many times.
Many of the power related failure cases described above are likely to cause soft corruption where the data will be lost or damaged, but the flash media can often be used again after a simple re-formatting. The reason for this is that the 'corruption' can show up in a few bits of information that describe the layout of files on the flash device, rather than in the actual data you intend to store. When a device is 'formatted' with a filesystem, various entries are written to the device which describe things like: how many files there are, the start and end of each file, the length of filenames, directory structure, etc. If you power off a poor-quality flash device that doesn't expect to be interrupted in the middle of changing the filesystem structure, it could end up with an incomplete listing of files that doesn't make any sense. Then, whatever device tries to read it will notice that the filesystem says things that don't make sense, so your OS will say: "I don't know what to do with this directory that claims to have 7573259375438759843758437597435843543 files in it (that's impossible), so I'm just going to suggest that you format the flash device instead and not even bother to try and figure out what is wrong."
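To make the 'impossible file count' idea concrete, here is a minimal sketch of the kind of sanity check a filesystem reader might perform. The header layout, the `TOYF` magic value and the `MAX_FILES` limit are all invented for this illustration and do not correspond to any real filesystem.

```python
import struct

# Hypothetical on-disk layout used only in this sketch: a 4-byte magic value
# followed by an unsigned 32-bit little-endian file count.
HEADER = struct.Struct("<4sI")
MAGIC = b"TOYF"
MAX_FILES = 65_536  # assumed upper bound for this toy format


def read_file_count(raw: bytes) -> int:
    """Return the file count, or raise ValueError if the header is implausible."""
    magic, count = HEADER.unpack_from(raw)
    if magic != MAGIC:
        raise ValueError("unrecognized filesystem header")
    if count > MAX_FILES:
        raise ValueError(f"implausible file count: {count}")
    return count


# A healthy header describing 12 files.
good = HEADER.pack(MAGIC, 12)

# Simulate damage from an interrupted write by flipping the highest bit
# of the count field (byte 7 is the most significant byte of the count).
damaged = bytearray(good)
damaged[7] ^= 0x80
```

Calling `read_file_count(good)` returns the count, while `read_file_count(bytes(damaged))` raises `ValueError`. A single flipped bit is enough to turn a sensible header into one that a reader has to reject, which is roughly the point where an OS gives up and offers to format the device.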
Another effect that can contribute to soft corruption is that your Raspberry Pi, Camera, or other device usually caches writes by holding them in some other part of memory before writing large sections to the flash media all at once. The program that actually writes data on the Pi, camera, etc. might do 'writes' one byte at a time. If the OS actually did writes to the flash media one byte at a time, this would be very slow and would wear out the flash much faster. Instead, your OS will usually cache the writes in memory until there are enough of them to bother writing one big chunk of data at once. The trouble is, if the OS is only half-finished writing changes from the cache to the flash storage when you power off the device, the other half of un-written cached data that was in memory is now lost. Even worse, the half that was written probably doesn't make sense on its own and can contribute to corrupted filesystem structures.
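If you are writing your own program and want to shrink the window during which data only exists in a cache, you can ask the OS to push it out to the device. Here is a minimal sketch in Python; the function name `write_durably` and the demo file name are made up for this example. Keep in mind that this only covers the caches the OS controls. Whether the flash device's own controller has finished committing the data internally is a separate question that this call cannot answer.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the storage device before returning."""
    with open(path, "wb") as f:
        f.write(data)          # lands in Python's own userspace buffer first
        f.flush()              # hands the data to the OS write cache
        os.fsync(f.fileno())   # asks the OS to write its cached data to the device


if __name__ == "__main__":
    # Hypothetical demo path; on a Pi you would point this at your own file.
    demo_path = os.path.join(tempfile.gettempdir(), "fsync_demo.bin")
    write_durably(demo_path, b"sensor reading 42\n")
```

Calling `os.fsync` after every small write defeats the purpose of caching and increases wear, so it makes more sense to call it at natural checkpoints, such as after a complete record or file has been written.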
It's also worth pointing out that you can encounter 'soft' corruption that is technically still just bits flipped in software state, but that most consumers would consider 'hard' corruption. This is because along with your data, the manufacturer needs to program the flash storage with small computer programs and data structures that only they have the expertise to work with (flash memory actually contains a small CPU with its own software!). If you can find out what tools and processes the manufacturer uses internally, you might have a shot, but if a bit gets flipped inside the flash memory's internal firmware, some internal data structure, or the error correction code, then you're probably going to have a tough time fixing this unless you work for the company that makes them. One of the suggestions above about not buying cheap flash memory from eBay is relevant here: some of the super cheap '1TB' memory cards you can buy are actually much smaller cards (512MB for example) where some of the internal data structures have been updated to simply report some incredibly huge partition size instead of the real size. When you plug one into a computer, it will say the size is '1TB' because it gets this information by asking the SD card for it. When the card has been purposefully re-programmed to lie about how large it is, you can make it say any size you want! If you're interested in this topic, I suggest reading On Hacking MicroSD Cards.
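Since the reported size can't be trusted, the only real test is to fill the card with data you can recognize and check that it reads back correctly. Here is a rough sketch of that idea; `make_block`, `count_verified_blocks` and the 1 MiB block size are choices made for this example only. One important caveat: reading back immediately may be served from the OS cache rather than the card, so for a meaningful result you would want to unmount and re-insert the card between the write pass and the read pass.

```python
import hashlib
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB per test block (arbitrary choice for this sketch)


def make_block(index: int) -> bytes:
    """Build a block whose contents depend on its index, so blocks can't be confused."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()  # 32 bytes
    return seed * (BLOCK_SIZE // len(seed))


def count_verified_blocks(path: str, blocks: int) -> int:
    """Write numbered blocks to `path`, read them back, and return how many
    leading blocks match what was written."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(make_block(i))
        f.flush()
        os.fsync(f.fileno())

    verified = 0
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK_SIZE) != make_block(i):
                break  # first block that does not match what we wrote
            verified += 1
    return verified
```

On a card that lies about its size, you would expect the verified count to stop well short of the advertised capacity, since data written past the real end of the memory has nowhere to go.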
Solutions To Flash Memory Corruption
In order to mitigate these problems, you should consider doing the following:
- Make sure you always unmount ('safely remove') your flash media before physically removing it.
- If you have a choice between buying high-density flash (high GB/$) and low-density flash (low GB/$) for the same amount of money, pick the lower density one. You'll get less 'storage' per dollar, but the integrity of the flash per bit is likely to be higher.
- Try not to buy cheap consumer grade flash memory.
- Become aware of approximately how many amps your device will consume with specific consideration to current spikes that can happen when it needs to do a lot of work quickly. For a Raspberry Pi, you should consider how many peripherals are attached to it. Make sure you get a wall wart adapter that is rated for the max amperage you plan to use.
- If you suspect that you're close to the current limit supported by your wall adapter, avoid workloads that max out the CPU at 100% or that activate multiple peripherals at once.
- Minimize the number of writes to your flash to maximize lifetime. For a Raspberry Pi, you should consider doing as much work in memory, without touching disk, as possible.
- If you have cheap consumer flash, try not to create high random read/write workloads, or unusual read/write patterns.
- Avoid static electricity discharges on the pins of the flash memory, especially on the pins that are used to transfer data.
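The point about knowing your current draw comes down to simple addition plus some headroom for spikes. Here is a small sketch of that arithmetic. Every number in it is a placeholder invented for the example, as is the 20% margin, so substitute measured or datasheet values for your own hardware.

```python
# All current figures below are hypothetical placeholders, in amps.
# Replace them with measured or datasheet values for your own hardware.
PEAK_DRAW_AMPS = {
    "board_at_full_cpu": 1.2,
    "camera": 0.25,
    "usb_hard_drive": 0.5,
}


def has_headroom(adapter_rating_amps: float, draws: dict, margin: float = 0.2) -> bool:
    """Return True if the adapter covers the summed peak draw plus a safety margin."""
    required = sum(draws.values()) * (1.0 + margin)
    return adapter_rating_amps >= required


if __name__ == "__main__":
    for rating in (1.0, 2.0, 2.5):
        print(f"{rating}A adapter has headroom: {has_headroom(rating, PEAK_DRAW_AMPS)}")
```

With these placeholder values, the peripherals add up to just under 2A, so a 1A adapter falls well short and only the 2.5A adapter clears the margin. The rating printed on a cheap adapter is also only a claim, which is another reason to leave yourself some room.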
Many of the internal details of the flash memory features and functionality (such as density, power outage recovery, programming quality) are not things you will be able to easily discover for most pieces of cheap consumer flash memory since they are often 'hidden' and proprietary. It is reasonable to suggest that they may even change between different batches of the same product under the same model number from a given manufacturer.
If you have a serious project that requires using flash memory, you should be sure to read up on the difference between SLC, MLC and TLC flash. Keep in mind that the difference between SLC, MLC and TLC between different vendors is unlikely to be an apples to apples comparison. Evaluating flash memory at this level of detail is more about the physics used in the individual manufacturing and programming processes of the chip itself than any kind of well standardized interpretation of what it means for something to be considered 'SLC'.
Another important detail worth considering in how you make use of flash from a programming perspective (and also how the flash may be organized internally) is the concept of write amplification. Write amplification describes how the number of writes required to commit information can 'amplify' because of the need to update different data structures, rearrange data, and also satisfy block-level operation constraints.
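Write amplification is usually expressed as the ratio of bytes physically written to the flash to bytes the host asked to write. The sketch below works through a deliberately simplified case: it assumes every host write forces whole blocks to be rewritten and ignores everything a real controller does to avoid that (remapping, buffering, garbage collection). The 128 KiB block size is an assumption for the example, not a property of any particular card.

```python
import math


def write_amplification(host_bytes: int, block_bytes: int) -> float:
    """Ratio of bytes written to flash to bytes requested by the host,
    under the simplifying assumption that whole blocks are always rewritten."""
    blocks_touched = math.ceil(host_bytes / block_bytes)
    flash_bytes = blocks_touched * block_bytes
    return flash_bytes / host_bytes


if __name__ == "__main__":
    block = 128 * 1024  # assumed block size of 128 KiB
    # A 4 KiB update that forces one whole block to be rewritten.
    print(write_amplification(4 * 1024, block))   # 32.0
    # A write that exactly fills one block has no amplification in this model.
    print(write_amplification(block, block))      # 1.0
```

In this simplified model, many small scattered writes are far more expensive than a few large ones, which is one reason to batch your writes where you can.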
I can personally vouch for one of the Raspberry Pi flash SD cards that came with the 'CanaKit' Raspberry Pi kit I purchased several years ago. This Raspberry Pi has been operating without any issues for several years now. Having said that, I don't know if it has really experienced any power failure or brownout conditions, and these would really be the important stress test to consider. I have also used a slightly more expensive SD card recently (AF8GUD3) for some of my newer Raspberry Pi projects. This SD card was purchased from DigiKey, and I haven't had any problems with it yet.
If you're interested in learning more about the details of flash memory corruption, an advanced overview requires a detailed understanding of the physics involved in the individual manufacturing techniques used for the particular model of memory you are using. Here is an excellent talk on the subject: Tutorial: Why NAND Flash Breaks Down, from The Linux Foundation's YouTube channel.
In this article, we've reviewed some of the most common causes of flash memory corruption. The most common reasons stem from problems that occur with low-quality flash, but these problems can often be triggered by power supply related issues.