Sometimes we hear that someone plans to use a hash function for fault detection instead of a checksum. This is probably not the best idea, because hashes provide weaker fault detection for random independent bit faults than a decent checksum.

A hash function is typically optimized for really good bit mixing, but not for ensuring that small numbers of bit faults are always detected. For a Pud effectiveness metric (random independent bit faults at a particular BER), this means that a hash function is likely to do worse than a good checksum.

To explore this, we evaluated the hash function Murmur3 (32 bits) on a Monte Carlo simulation of random independent bit faults and compared it with a variety of checksums:

The results show that Murmur3 is an excellent approximation of a Simple HD=2 curve, which detects all 1-bit faults, and has a 1/(2**32) probability of undetected fault for all other numbers of bit faults. That is expected of a good hash function. Murmur3 turns out to have Hamming Distance=2, at least for the lengths we investigated.

In comparison, a 32-bit one's complement checksum does worse, but that is not a huge surprise.

All the other checksums did better. Dual-sum and Koopman checksums provided HD=3, although with less mixing. Their curves are below Murmur3 but not all the way down at the Simple HD=3 curve.

DualX-32P and Koopman-32P both have curves above the Simple HD=4 curve. CRC curves were not plotted, but would be at least as good as the Koopman-32P curve, and potentially much better for short data word lengths where a higher HD can be provided.

Summarizing, Murmur3 does a better job of bit mixing than the checksum functions, but suffers in ability to detect random independent bit flips due to being limited to HD=2.

We expect that other good hash functions will trace out either an approximate Simple HD=2 curve, or possibly a Simple HD=1 curve depending on their specifics. A dual-sum checksum, a Koopman checksum, or a CRC will all provide significantly better fault detection.

These curves are for random independent bit faults. For memory arrays sometimes people are concerned with multi-bit single event upsets. The answer to how these will perform will depend on the specifics of the geometry of the memory array (for example, row size and whether columns are interleaved). Checksums and CRCs will generally be good at multi-bit faults in bits that are adjacent in the data word. And the 32-P checksums will detect all 1-, 2-, and 3-bit faults regardless of the bit position. Beyond that is difficult to know without analysis tailored to the specific situation of the memory array you care about as well as the comparative frequency and patterns of different numbers of bit faults.

This post is supplemental to my book Understanding Checksums and Cyclic Redundancy Checks. That book has more information about Simple HD curves, Pud, DualX checksums, and so on.

## No comments:

## Post a Comment