How to handle a data packet while calculating CRC.

In article <p8o14m$1iu$1@dont-email.me>,
David Brown <david.brown@hesbynett.no> wrote:
>>description of CRC tables snipped

Sometimes email/nntp/forums makes these technical dicussions difficult.
As some background - I've written many CRCs designs in hardware for
both ASICs, and FPGAs. Every one of them uses the "plain-vanilla"
method as described in many of the CRC papers. The only twist
is, as I've tried to describe, is putting the single-shift
implementation in a for loop, to allow 'N' bits to be calculated in a
single clock cycle.

This "twist" is shockingly simple to write in HDLs. The core kernel of
a (generic) CRC - shifting through N-bits at a time is really just about
10 (simple) lines of code. No online-generators needed - that spit out gobs
of (unreadable) XOR gates.

The "table" methods we've been discussing, are all solutions aimed at
software. Software where it's difficult to do (many) parallel XORs of
any random bits. But that's something that hardware excels at.

Taking those software solutions, and retargetting them back at hardware,
is at best case, overcomplicating something very simple. At worst case
quite costly.

I've never written a table method for calculating CRCs (because as I've
asserted it's pointless for hardware), but I quite understand the underlying
concepts and what they are doing.

Some numbers to back things up.
I have a generic crc module that's used all over the place. I
synthesized in Vivado various version of a 32-bit CRC. (The one used in
the MPEG-2 spec with POLY=0x4C11DB7)

I targetting something high-speed to make sure the tools are
actually trying to do at least some optimizations. Targetted 2ns clocks
(500 MHZ) in a Virtex7 device

Test 1:
1 bit at a time:
CLB LUTs* | 151
CLB Registers | 64
Worst case timing slack: 0.754ns

You'll first note that I use twice as many registers than seems
neccesary. This is because the way it's been coded it stores both the
"augmented" and "non-augmented" CRC in FFs.

Test 2:
8 bits at a time:
CLB LUTs* | 172
CLB Registers | 64
Worst case timing slack: 0.270ns

Notice, not much delta in LUT usage. One thing to note - the baseline
may seem high. After all for a single bit shift, you only need one XOR
gate per (high POLY bit). The augmentation adds more complexity too.
But, in the end the overhead "housekeeping", if you will, actually is
probably what's dominating the logic here. Things like loading the
INIT value, holding the current value when no new data is available,
etc. Plus a LUT6 is under-utilized to do just 2-bit XOR.

Test 3:
32-bits at a time:
CLB LUTs* | 259
CLB Registers | 64
Worst case timing slack: 0.332ns

A few more LUTs. But paradoxically, the timing's better.
Decided to go for broke, and really up the DATA WIDTH:

Test 4:
512-bits at a time
CLB LUTs* | 1734
CLB Registers | 64
Worst case timing slack: 0.188ns

So even taking a very large datapath, things are fine.

Adding a "software" LUT method at it's very baseline, adds 1 BRAM block
to the solution - 36Kbits - a very large, IMHO resource. I would argue
that for such a software solution aimed at the "Test 3 above (32-bits at a
time)" solution would likely still consume a signficant amount
of CLB LUTs - just to handle the "housekeeping" and further logic behind
the software LUT. I'd also argure that it would very likely run slower
too, as BRAMs are in general slower than logic.

Well, don't use a BRAM you might ask - use the CLB logic. I'd argue
that this solution, to the tools is the same thing as my 10-line
"plain-vanilla", just overly complicated, and unclear, and likely take
the tool longer to achieve the same solution (if at all).

Those software implementations are quite interesting methods, and I
quite like reading up at the clever tricks being performed. But none
of that is neccesary in hardware where you have free-access to any bit,
and can freely perform many things in parallel.

Regards,

Mark
 
On Monday, March 12, 2018 at 4:32:17 PM UTC+5:30, yogesh tripathi wrote:
Hi,

I'm trying to process a Ethernet type package. Suppose if i have detected SFD and now have a <1600Byte data.

I'm extracting different package element(ds_addr,src_addr,etc) concatenating them in a long shift register and at same time passing it to a fifo to buffer and calculating crc32 which will take some clock cycles(xoring and shifting). Now if calculated CRC matched what is received, pass data to nxt stage else rst fifo.


Is there a better technique for it?

Thank-You in advance.

--========================================Thank-you all for you responses.
I'm tried with the online crc calculator to generate a parallel LFSR's:-
"https://leventozturk.com/engineering/crc/".
And its working ,verified the results. Although how the algorithm is generating those xor circuitary is still evade me, will look into it further.

following is the generated rtl code for crc32, 8bit input, data widt progression d(7)->d(0)

ca(0) <= ca(24) xor ca(30) xor d(1) xor d(7);
ca(1) <= ca(24) xor ca(25) xor ca(30) xor ca(31) xor d(0) xor d(1) xor d(6) xor d(7);
ca(2) <= ca(24) xor ca(25) xor ca(26) xor ca(30) xor ca(31) xor d(0) xor d(1) xor d(5) xor d(6) xor d(7);
ca(3) <= ca(25) xor ca(26) xor ca(27) xor ca(31) xor d(0) xor d(4) xor d(5) xor d(6);
ca(4) <= ca(24) xor ca(26) xor ca(27) xor ca(28) xor ca(30) xor d(1) xor d(3) xor d(4) xor d(5) xor d(7);
ca(5) <= ca(24) xor ca(25) xor ca(27) xor ca(28) xor ca(29) xor ca(30) xor ca( 31) xor d(0) xor d(1) xor d(2) xor d(3) xor d(4) xor d(6) xor d(7);
ca(6) <= ca(25) xor ca(26) xor ca(28) xor ca(29) xor ca(30) xor ca(31) xor d(0) xor d(1) xor d(2) xor d(3) xor d(5) xor d(6);
ca(7) <= ca(24) xor ca(26) xor ca(27) xor ca(29) xor ca(31) xor d(0) xor d(2) xor d(4) xor d(5) xor d(7);
ca(8) <= ca(0) xor ca( 24) xor ca(25) xor ca(27) xor ca(28) xor d(3) xor d(4) xor d(6) xor d(7);
ca(9) <= ca(1) xor ca( 25) xor ca(26) xor ca(28) xor ca(29) xor d(2) xor d(3) xor d(5) xor d(6);
ca(10) <= ca(2) xor ca( 24) xor ca(26) xor ca(27) xor ca(29) xor d(2) xor d(4) xor d(5) xor d(7);
ca(11) <= ca(3) xor ca( 24) xor ca(25) xor ca(27) xor ca(28) xor d(3) xor d(4) xor d(6) xor d(7);
ca(12) <= ca(4) xor ca( 24) xor ca(25) xor ca(26) xor ca(28) xor ca(29) xor ca(30) xor d(1) xor d(2) xor d(3) xor d(5) xor d(6) xor d(7);
ca(13) <= ca(5) xor ca( 25) xor ca(26) xor ca(27) xor ca(29) xor ca(30) xor ca(31) xor d(0) xor d(1) xor d(2) xor d(4) xor d(5) xor d(6);
ca(14) <= ca(6) xor ca( 26) xor ca(27) xor ca(28) xor ca(30) xor ca(31) xor d( 0) xor d(1) xor d(3) xor d(4) xor d(5);
ca(15) <= ca(7) xor ca( 27) xor ca(28) xor ca(29) xor ca(31) xor d(0) xor d( 2) xor d( 3) xor d(4);
ca(16) <= ca(8) xor ca( 24) xor ca(28) xor ca(29) xor d(2) xor d(3) xor d(7);
ca(17) <= ca(9) xor ca( 25) xor ca(29) xor ca(30) xor d(1) xor d(2) xor d(6);
ca(18) <= ca(10) xor ca(26) xor ca(30) xor ca(31) xor d(0) xor d(1) xor d(5);
ca(19) <= ca(11) xor ca(27) xor ca(31) xor d( 0) xor d( 4);
ca(20) <= ca(12) xor ca(28) xor d(3);
ca(21) <= ca(13) xor ca(29) xor d(2);
ca(22) <= ca(14) xor ca(24) xor d(7);
ca(23) <= ca(15) xor ca(24) xor ca(25) xor ca(30) xor d(1) xor d( 6) xor d( 7);
ca(24) <= ca(16) xor ca(25) xor ca(26) xor ca(31) xor d(0) xor d( 5) xor d( 6);
ca(25) <= ca(17) xor ca(26) xor ca(27) xor d(4) xor d(5);
ca(26) <= ca(18) xor ca(24) xor ca(27) xor ca(28) xor ca(30) xor d(1) xor d(3) xor d( 4) xor d( 7);
ca(27) <= ca(19) xor ca(25) xor ca(28) xor ca(29) xor ca(31) xor d(0) xor d(2) xor d( 3) xor d( 6);
ca(28) <= ca(20) xor ca(26) xor ca(29) xor ca(30) xor d(1) xor d(2) xor d(5);
ca(29) <= ca(21) xor ca(27) xor ca(30) xor ca(31) xor d(0) xor d(1) xor d(4);
ca(30) <= ca(22) xor ca(28) xor ca(31) xor d(0) xor d(3);
ca(31) <= ca(23) xor ca(29) xor d(2);
 

Welcome to EDABoard.com

Sponsor

Back
Top