Implementing CRC32 in Python
In this post I’ll explain how cyclic redundancy checks (CRC) works, and implement a basic version of CRC32 in python. In my opinion, the best way to go about learning something is by doing it, so I recommend you to implement this yourself. If you want to do some CRC calculations, here is a good calculator.
Why?
Cyclic redundancy checks are numbers you append to moving data (either on a network, or even transferring from storage to memory) that allows you to confirm that the data has no errors. Moving data around without some sort of error detection would be pretty useless.
How?
The maths involved is interesting but because I’m not an expert on finite fields I’m not going to delve too deeply into things. You can find decent books on the theory behind Galois fields and proofs on how well CRC works on the internet. The main principle of cyclic redundancy checks is to represent that data as a polynomial in GF(2). This means that the data will be a polynomial, where the coefficients are either 1 or 0. Since data is represented in binary in computers, this is perfect.
Here is an example:
We then decide on a generator polynomial. We’ll use this to calculate the number that we append at the end of the message. An example polynomial might be:
We will use this polynomial to divide into our data polynomial. This is the exact same long division algorithm you would have learned in school for dividing polynomial, and finding their remainder. We’re using GF(2) arithmetic. Subtraction and addition are equivalent and are the same as an XOR operation.
The remainder 01
is our redundant information that we will append to our outgoing message 11010
. The sender will transmit 1101001
and the receiver, who also has the generator polynomial will divide 111
into 11010
and will check to see if the remainder is the same as what was received. I left out the quotient because it’s irrelevant to CRC. All we need is the remainder.
Basic Implementation
Our generator polynomial for CRC32 will be 32bits long as the name implies. The polynomial we’ll be using is 0x04C11DB7
. The biggest thing to note here is that in an actual implementation, we don’t need the leading 1, since everytime we XOR 1 with 1 it will be 0.
# Basic implementation if crc32
def crc32(message:bytearray, poly:int):
bitmask = 0xFFFFFFFF
crc = 0
for byte in message:
for _ in range(8):
b = byte & (1<<7) != 0
divide = bitmask if (crc & (1<<31) != 0) else 0
crc = (crc << 1) | b
crc ^= (poly & divide)
byte <<= 1
return (crc & bitmask) # to keep it at 32bits long
Some little details
There are a few things that can be improved upon above, before we go implementing anything.
First, it’s going to be a pain having to wait for the entire message to come in before we get to calculate the remainder. We want this to be really fast, and start calculating the remainder while the message is coming in.
To do this, in our little example above, we will append 2 zeroes to our outgoing message, before we calculate the remainder. Like so
So, our outgoing message will be 1101011.
This allows will allow the receiver to calculate the remainder as the message is coming in. If the remainder is 0, we know (or most likely) have the correct data.
What happens if leading 0’s get added/deleted during transmission? It will still be divisible by the generator, but obviously the wrong data. The solution is to prepend a certain amount of 1’s (in our case 32, since we’re going to implement a 32 bit CRC) to the data.
What happens if ending 0’s get added/deleted during transmission? Well this wouldn’t be a problem if our message (data + remainder) ended with a 1, but if it ends with a 0 this error may go undetected. In this case, we will end the function with an XOR of some value to fix this problem.
The resulting code becomes:
# Better implementation of crc32, leading/trailing zeros
def crc32_improved(message:bytearray, poly:int, init:int=0, final_xor:int=0):
bitmask = 0xFFFFFFFF
crc = init
for byte in message:
for _ in range(8):
b = bitmask if byte & (1<<7) != 0 else 0
divide = bitmask if (crc & (1<<31)) != 0 else 0
crc = (crc << 1) ^ (poly & (b ^ divide))
byte <<= 1
return (crc & bitmask) ^ final_xor
There are some other optimizations you could make. You could make a look up table that does some caching like so:
from functools import lru_cache
@lru_cache
def create_lut(poly): return [crc32_improved(bytearray([x]), poly) for x in range(256)]
# Implementation of crc32 with look up table
def crc32_lut(message:bytearray, poly:int):
"""
Generates a crc with a look up table for improved speed
"""
bitmask = 0xFFFFFFFF
crc = 0
lut = create_lut(poly)
for m in message:
index = (int(m) ^ (crc >> 24)) & 0xFF
crc = (crc << 8) ^ lut[index]
return crc & bitmask