Bad Memory


When RAM goes wrong, things go downhill quickly. A bad DIMM means that values written by your system can be different when they are read back. If you are not using ECC memory, the running software will never even know the error occurred. This leads to corrupted data, crashes, and kernel panics. Generally, the rule of thumb is to replace a bad DIMM as soon as it is detected. But maybe there is another option?

My plucky little hackbox has gone through many iterations. I tend to get an idea for something I want to do with it and end up combing eBay for cheap parts. Doing so is a crapshoot, of course, but that is the point of hackbox: to screw around and see what I can make work. For this particular project, I picked up some extra DDR3 RAM and an old Intel motherboard in order to use an i5-2500K that had recently been retired from my desktop. Running through my standard burn-in, I booted up Memtest and left it running overnight. The next morning brought an unpleasant surprise:

Tests Passed: 46/48

Lowest Error Address: 0x362DD3070 (13869 MB)
Highest Error Address: 0x362DD3070 (13869 MB)
Bits in Error Mask: 0000000000000040
Bits in Error - Total: 1 Min: 0 Avg: 1
Max Continuous Errors: 1

Damnit.

Cursing my frupidity, I ran the test a second time. Same error. While frustrating, this case was somewhat novel to me. In my experience, when a DIMM fails it typically does so with many errors. This error was not only confined to a single address, it was also consistent: repeated test runs always returned the same result. Basically, one bit at a single address (the error mask 0x40 has only one bit set) read back as 1 instead of 0, no matter what was written. I decided to table the issue for a while (hackbox, right? this is all part of the adventure).
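The two numbers in the report are easy to decode by hand, but a few lines of Python make the arithmetic explicit. This is only a sketch; the helper names are mine and not part of Memtest. It turns the error mask into bit positions and the address into the megabyte figure printed next to it:

```python
def bad_bits(mask: int) -> list[int]:
    """Return the positions of the bits set in a Memtest-style error mask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def address_in_mib(address: int) -> int:
    """Convert a physical byte address to whole mebibytes, rounding down."""
    return address >> 20


error_mask = 0x0000000000000040   # "Bits in Error Mask" from the report
error_address = 0x362DD3070       # lowest and highest error address

print(bad_bits(error_mask))           # [6]
print(address_in_mib(error_address))  # 13869
```

So the failure is bit 6 of one word, a little under 14 GB into physical memory.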

Kernel, save me!

Several months later I was dealing with some irritating artifacts from the Intel integrated graphics used for driving my display. It occurred to me that the bad address might be affecting memory being utilized by the chip. I really didn’t want to replace the bad DIMM, or even figure out which one it was. After doing some digging, I discovered the Linux memmap kernel argument. With it you can direct Linux to use, protect, or ignore memory in fairly specific ways. In this case, I wanted to tell it to ignore the bad memory address.

It turns out this is easy, or at least easy relative to the normal level of difficulty involved in managing Linux. The syntax is simply memmap=4K$0x362DD3000. The nn$ss form of memmap marks a region as reserved, so here I am telling the kernel: starting at address 0x362DD3000, do not use the next 4 KiB of RAM. That start address is the failing address, 0x362DD3070, rounded down to a 4 KiB page boundary. Given that only a single address was bad, reserving a whole page was probably overkill, but it seemed prudent to avoid the area around the problem as well. At the time I was running Arch Linux with kernel version 5.0.7.
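If you want to derive the argument for a different failing address, the rounding can be sketched like this. The function name and the PAGE_SIZE constant are my own, and 4 KiB pages are an assumption (it is the usual size on x86-64):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual size on x86-64


def memmap_reserve_arg(bad_address: int, pages: int = 1) -> str:
    """Build a memmap=nn$ss argument reserving whole pages around an address."""
    # Clear the low 12 bits to round down to the start of the page.
    start = bad_address & ~(PAGE_SIZE - 1)
    size_kib = pages * PAGE_SIZE // 1024
    return f"memmap={size_kib}K$0x{start:X}"


print(memmap_reserve_arg(0x362DD3070))  # memmap=4K$0x362DD3000
```

Passing pages=2 would reserve the following page as well, which might be worth considering if the bad address sits close to the end of its page.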

To test, I rebooted the host, pressed e at the GRUB menu, and manually added the argument to the “linux” line. Note that you have to escape the $ character with a backslash so that GRUB does not try to expand it as a variable. So:

linux /boot/vmlinuz-linux root=UUID=663c149f-4f28-4167-b8f8-38f23fbe26d1 quiet memmap=4K\$0x362DD3000

Continue to boot and make sure things run normally. You can check the output of free and should see slightly less memory reported there. To make the change permanent, add the argument to the kernel command line in /etc/default/grub. Note that here you need yet another level of escaping so that the $ survives the grub-mkconfig process:

GRUB_CMDLINE_LINUX_DEFAULT="quiet resume=UUID=663c149f-4f28-4167-b8f8-38f23fbe26d1 memmap=4K\\\$0x362DD3000"

With that done, re-generate grub.cfg by running:

sudo grub-mkconfig -o /boot/grub/grub.cfg
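The different amounts of escaping in the two GRUB snippets above are easier to follow with a toy model. The sketch below is a deliberate simplification, not the real shell or GRUB parser: it assumes each layer just removes one level of backslash escaping, and the function name is made up for illustration.

```python
import re


def strip_one_level(text: str) -> str:
    """Remove one level of backslash escaping: a backslash plus X becomes X."""
    return re.sub(r"\\(.)", r"\1", text)


in_default_grub = r"memmap=4K\\\$0x362DD3000"   # as typed in /etc/default/grub
in_grub_cfg = strip_one_level(in_default_grub)  # after grub-mkconfig reads the file
seen_by_kernel = strip_one_level(in_grub_cfg)   # after GRUB parses the linux line

print(in_grub_cfg)     # memmap=4K\$0x362DD3000
print(seen_by_kernel)  # memmap=4K$0x362DD3000
```

The middle form is the one typed by hand at the GRUB menu, which is why that one needs only a single backslash.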

Now reboot. You can verify the argument was used correctly by executing cat /proc/cmdline and verifying it matches your desired memmap parameter.
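Eyeballing /proc/cmdline is enough, but the check can also be scripted. The following is a rough sketch with hypothetical helper names. It only understands the memmap=nn[KMG]$0x... form used in this post, not everything the kernel accepts, and the sample command line is made up from the “linux” line shown earlier:

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
RESERVE = re.compile(r"memmap=(\d+)([KMG]?)\$(0x[0-9A-Fa-f]+)")


def reserved_regions(cmdline: str) -> list[tuple[int, int]]:
    """Return (start, size) pairs for each memmap=nn$ss argument found."""
    regions = []
    for token in cmdline.split():
        match = RESERVE.fullmatch(token)
        if match:
            size = int(match.group(1)) * UNITS[match.group(2)]
            start = int(match.group(3), 16)
            regions.append((start, size))
    return regions


def is_reserved(cmdline: str, address: int) -> bool:
    """True if the address falls inside any reserved region on the command line."""
    return any(start <= address < start + size
               for start, size in reserved_regions(cmdline))


# On the machine itself, read the real value with open("/proc/cmdline").read().
sample = "root=UUID=663c149f-4f28-4167-b8f8-38f23fbe26d1 quiet memmap=4K$0x362DD3000"
print(is_reserved(sample, 0x362DD3070))  # True
```

This only confirms that the argument reached the kernel and covers the bad address; it says nothing about whether the kernel actually honored it.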

I can now use the machine with more confidence, knowing that the known bad address should never cause havoc in my running system. As for my video artifact issue, well, it may have helped. However, around that time I re-installed the whole system, switched from Xfce to GNOME, and confounded a few other variables. Now I have new, different video issues, and I suspect this has more to do with my dodgy eBay motherboard than anything. An adventure for another day.

References

https://wiki.archlinux.org/index.php/Kernel_parameters#GRUB
https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html