Wednesday, February 28, 2024

Comparative speeds for different Checksum & CRC implementations

Comparative speeds for different Checksum & CRC implementations.

For more information see my new book:  Understanding Checksums and Cyclic Redundancy Checks -- which includes downloadable open source example implementations for the checksums mentioned below.  This is a preview of section 12.1 from that book.

One of the main lessons from doing benchmarking is that the answer to “how fast/good is something” always ends up being “it depends.” We do not attempt to resolve the question of benchmarking algorithm speed in a comprehensive way.

Rather, we present a single relatively simple comparison (figure 12.1) to illustrate possible speed differences among checksum values – along with a strong caution that any particular application on any particular computing hardware with any particular compiler with any particular workload of computational tasks competing for memory hierarchy resources is almost certain to see a different outcome. Nonetheless, we hope this comparison and analysis provides useful insight.

To do a speed comparison, we created optimized implementations of 32-bit checksum algorithms intended to be representative of the type of code one might see in a specific application. This included:

·     Implementation is in the C programming language with optimizations enabled (gcc compiler with -O2 optimization option).

·     Sums use 32-bit integers if possible, but use a 64-bit integer sum if required due to carry-out or shifting within the algorithm. 64-bit unsigned integers are used by Add-32, DualX-32, DualX-32P, Koopman-32, and Koopman-32P.

·     A constant value is used for the modulus.

·     A constant value of zero is used for the initial seed sum initialization.

·     There is no final XOR value applied.

·     No check value adjustment is performed. The check value is not stored in memory as part of the code word, meaning that the time measurement is just of the checksum computation itself.

·     The data word is stored as an array of 32-bit integers to avoid overhead for building 32-bit values from a series of bytes.

·     The data word is initialized with pseudorandom data a single time before repeating the checksum operation many times in a row. This results in a de minimus amortized data word initialization cost. Care is required to prevent the compiler’s optimizer from eliminating repeated calls to the checksum function given that the answer will be the same for repeated calculations. We expect that the data word bytes live comfortably in cache memory for the vast majority of computation loops. This can be expected to accentuate differences between algorithms compared to applications that spend more time waiting for memory accesses to data word content.

·     Bit-shift amounts are defined as constant values that will only work for 32-bit checksums (generally 8 bits or 16 bits depending on the checksum involved). This helps the compiler optimize shifting into byte movements if possible.

·     Modulo operations are deferred until after the block addition loop to the maximum degree practical.

In other words, the implementations do quite a lot to give a favorable implementation for each one.

The algorithms tested had the following implementation and optimization features, yielding corresponding speed results.

·     LRC-32: This performs a 32-bit XOR of 32-bit data elements. This is slightly faster than Add-32 because there is no final modulo operation needed. Also, it only needs a 32-bit sum because there is no carry-out possible from the XOR “sum” operation.

·     Add-32: This performs a 32-bit ones’ complement integer addition using a 64-bit sum variable. A modulo 0xFFFFFFFF operation is performed one time – after all data word blocks have been processed – to achieve a ones’ complement result rather than doing a modulo for every addition. This was the baseline for timing, with other speed results relative to this algorithm’s execution time.

·     DualSum-32: This uses a pair of 32-bit sum variables and 16-bit blocks. Two pairs of additions of 16-bit blocks are done for each 32-bit portion of the data word retrieved (sumA and sumB are first updated with the low 16 bits, then again with the high 16 bits.) A 16-bit modulus of 65533 is applied to sumA and sumB every time the value in sumB gets too large, which is a significant speedup compared to computing a modulo in every loop iteration. This is nonetheless much slower than Add-32, by a factor of about 3, largely because of the need to perform four sums of 16-bit blocks in each loop iteration instead of just one sum of a 32-bit block. Dual-sum checksum with parity can provide HD=4, but would be even slower and would provide only half the data word length support.

·     DualX-32: DualSum-32 modified to have a block size of 32 bits. This uses a pair of 64-bit sums with a 16-bit modulus of 65519. This improves speed compared to DualSum-32 due to doing one pair of additions per 32-bit data word segment retrieved instead of four additions, and also provides twice the HD=3 data word length. It runs about twice as fast as DualSum-32.

·     DualX-32P: This is DualX-32 with 15-bit modulus of 32749 and an additional parity computation. The parity computation boosts fault detection to HD=4 while still being faster than DualSum-32.

·     Koopman-32: This uses a 64-bit sum variable with a modulus of 4294967291. It processes the data word one 32-bit block at a time. A single 64-bit modulus operation is performed inside the inner loop. This runs at approximately one eighth the speed of Add-32 due to the need to perform a modulo operation every loop instead of only occasionally.

·     Koopman-32P: This uses a 64-bit sum variable with a modulus of 0x7FFFFFED, processing the data word one 32-bit block at a time. This is slightly slower than Koopman-32 due to an extra XOR operation with an additional parity register inside the inner loop that is used to compute parity.

·     CRCRT-32: This is a 32-bit CRC using a right-shift 256-entry lookup table that processes the data word a byte at a time. Each inner loop processes a 32-bit data word block by doing four in-line table lookup, shift, and XOR operations, with each of those four operations processing 8 bits of a 32-bit block. Reorganizing the data word into an array of 8-bit blocks likely would not speed up the implementation, because the implementation used can fetch four data word bytes at a time from memory into a 32-bit register. The lookup table is precomputed, so there is zero overhead for table initialization. The modulus does not affect computing speed, but happened to be 0xEDB88320 (IEEE 802.3 CRC-32). This is slower than Koopman checksums largely because it needs to make four lookup table accesses to process four bytes of a 32-bit block inside each inner loop, and that turns out to be slower than performing a division operation. This table-based CRC implementation is just over twice as slow as a Koopman checksum.

·     CRCR-32: This is a bit-at-a-time right-shift CRC implementation that accesses the data word as a sequence of 32-bit blocks. This algorithm is so slow that the graph’s executing time bar did not fit on a reasonable vertical axis scale. This speed issue is what motivates using CRC_R32T instead.

Relative times are given in the chart for the average of 10 runs. Each run was 100 million iterations of computing the check value for a 256-element array of 32-bit integers (1 KB) for the CRC implementations, and one billion iterations for other checksums.[1] The hardware used was one core of an 8-core Intel Xeon Silver CPU running at 3.2 GHz. The array size is small enough to fit comfortably into cache memory, so the run times shown were optimistic compared to a situation in which the checksum is being computed on data that is not resident in cache memory when the computation is started.

If memory access times dominate in an application, the computing speed differences among algorithms will be reduced. If integer division is slower on some processor (e.g., due to no hardware support), it is possible that DualSum-32 will have more speed advantage compared to a Koopman checksum.

Recapping the fault detection vs. speed tradeoff overall: LRC-32 and Add-32 give HD=2 but are fast; DualX-32 and DualX-32P are reasonably fast while giving HD=3 or HD=4 for shorter data word lengths; DualSum-32 is slower and gives HD=3 for a shorter data word length than DualX-32; Koopman-32 gives much longer HD=3 for an additional speed penalty; Koopman-32P gives longer HD=4 for a little more speed penalty; and a CRC can give HD=3, HD=4, or higher HD according to the polynomial selected but is at best half as fast as the other algorithms.

Recommendations based on these results are:

(1) use a ones’ complement addition checksum only if speed is the sole and overwhelming consideration;

(2) use a dual-sum checksum if it provides a long enough HD=3 or HD=4 length for the application, preferring DualX or DualXP rather than a conventional dual-sum checksum;

(3) move to a Koopman checksum if a longer HD=3 or HD=4 length is needed; and

(4) use a CRC (preferably with a 256-entry lookup table) if higher HD is needed than available with other checksum techniques.

[1] The spread between fastest and slowest of the ten runs ranged from 1.7% for CRCRT-32 to 12.8% for DualX-32P. The full-length simulation campaign produced results quite similar to a 1/10th-size pilot campaign. While a much more thorough experimental campaign on a variety of systems might be attempted, it would not change the general trend of relative performance, which is the point for our purposes.

No comments:

Post a Comment

Why To Avoid Hash Algorithms If What You Really Need Is A Checksum

Sometimes we hear that someone plans to use a hash function for fault detection instead of a checksum. This is probably not the best idea, b...