Bit error counter - how to make it faster

G

gamer

Guest
My goal is to implement a bit-error counter targeting 1GHz. The
datawidth is parametrizable. I started off this way,

Verilog code:
----------
assign mismatch[datawidth-1:0] = input_data ^ expected_data;
assign matched = ~( | mismatch); // reduction NOR

always @(posedge clk or posedge reset) begin
if (reset)
error_count = 0;
else if (~matched)
for (i=0; i<datawidth; i=i+1)
error_count = error_count + mismatch;
end
---------/////////----------------------
The above meets timing for small datawidths (like 8-bits) and starts
failing to meet timing once the datawidth gets larger. I will be glad
for suggestions of alternate ways to implement this bit-error counter.
In practice our datawidths could go upto 256 bits.

Thanks
 
The above meets timing for small datawidths (like 8-bits) and starts
failing to meet timing once the datawidth gets larger. I will be glad
for suggestions of alternate ways to implement this bit-error counter.
In practice our datawidths could go upto 256 bits.
Spread your algorithm to work over more than one clock cycle. Use a pipeline
stage over eigth clocks. The first stage calculates bit 0 to 7, the second
calculates 8 to 15 and so on. In an additional stage you sum up all results.
The result is always 9 clocks behind.

See http://en.wikipedia.org/wiki/Pipeline_%28computer%29 for a general
explanation of pipelines.

hth
Günther Jehle
 
On Jun 27, 12:34 pm, gamer <csan...@gmail.com> wrote:
My goal is to implement a bit-error counter targeting 1GHz. The
datawidth is parametrizable. I started off this way,

Verilog code:
----------
assign mismatch[datawidth-1:0] = input_data ^ expected_data;
assign matched = ~( | mismatch); // reduction NOR

always @(posedge clk or posedge reset) begin
if (reset)
error_count = 0;
else if (~matched)
for (i=0; i<datawidth; i=i+1)
error_count = error_count + mismatch;
end
---------/////////----------------------
The above meets timing for small datawidths (like 8-bits) and starts
failing to meet timing once the datawidth gets larger. I will be glad
for suggestions of alternate ways to implement this bit-error counter.
In practice our datawidths could go upto 256 bits.

I know no verilog and am not even a C programmer, but ISTM
that the summation could be highly parallel by using a tree
of adders (perhaps using carry save to further accelerate
the summation). Consider it like computing the population
count followed by an ordinary addition: one can add together
the even and odd mismatch bits in parallel, then add pairs
of these sums, etc., finally adding the sum to error_count.

(BTW, if I understand correctly one should not need the
"reduction NOR" and the "if (~matched)" since if the
popcount is zero, adding it to error_count would effectively
be a no-op, assuming worst-case speed is the only
consideration. As Gunther Jehle pointed out pipelining
could provide higher throughput.)


Paul A. Clayton
just a technophile
 
On 2007-06-27, gamer <csanjay@gmail.com> wrote:
else if (~matched)
for (i=0; i<datawidth; i=i+1)
error_count = error_count + mismatch;

Although from a software (and possibly a simulation) perspective, that
looks like it should reduce the workload by computing 'matched' and
"avoiding" the loop if it is not needed, in hardware your speed is
going to be limited by the worst case (and the additions are happening
all the time, only the enables on the error_count register are affected
by your 'if').

If you must have your answer on the very next clock, your options are
severely limited. If you only care about the answer occasionally (or
with a fixed latency that's greater than one clock), think about how
you can take advantage of that.

To follow along your original lines of thinking: If you can really find
'matched' at your desired clockrate, and it is infrequent, then you could
stuff 'mismatch' into a fifo whenever it's nonzero and add it in a
separate loop which only has to keep up with the worst case bit error
rate.

Or, if, as you said, the process is fast enough for 1 bit, make yourself
'datawidth' 1-bit error counters and add their results whenever you want
the total number of errors.

--
Ben Jackson AD7GD
<ben@ben.com>
http://www.ben.com/
 
gamer wrote:
My goal is to implement a bit-error counter targeting 1GHz. The
datawidth is parametrizable. I started off this way,

Verilog code:
----------
assign mismatch[datawidth-1:0] = input_data ^ expected_data;
assign matched = ~( | mismatch); // reduction NOR

always @(posedge clk or posedge reset) begin
if (reset)
error_count = 0;
else if (~matched)
for (i=0; i<datawidth; i=i+1)
error_count = error_count + mismatch;
end
---------/////////----------------------
The above meets timing for small datawidths (like 8-bits) and starts

Remember this, it is a key to a possible solution!

failing to meet timing once the datawidth gets larger. I will be glad
for suggestions of alternate ways to implement this bit-error counter.
In practice our datawidths could go upto 256 bits.
You need to parallelize your error counter, particularly for very wide
(256-bit?) paths:

A few layers of full adders to reduce the 256 input lines into a more
reasonable number, then one or more carry-save accumulators to gather
these into something that can be handled very quickly.

Only when reading out the number of errors do you propagate the carries
to generate the final/real counts.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
 
On 28 Jun., 15:40, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:
A few layers of full adders to reduce the 256 input lines into a more
reasonable number, then one or more carry-save accumulators to gather
these into something that can be handled very quickly.

Only when reading out the number of errors do you propagate the carries
to generate the final/real counts.
In Virtex-5 the 6-LUTs point to a 6-to-3 reduction instead of full
adders.
You can also use BRAMs for an 11 to 4 reduction.
512 bits at 500MHz might be easier than 256 bits at 1GHz.


Kolja Sulimma
 
gamer <csanjay@gmail.com> writes:

My goal is to implement a bit-error counter targeting 1GHz. The
datawidth is parametrizable. I started off this way,
How much resources do you have available and how often do you want to
process the counters?

If you have lots of registers available and just want to sample the
counters at some point in time (e.g. after one hour of operation) and
present the result on a PC or whatever you could:

Use datawidth X linear feedback shift registers (of sequence length
error count, or some more optimal length for the LFSR sequence) as
counters. This is only a single XOR delay between two registers. Then
sample all the counters. Transfer the results to your PC. Convert the
LFSR sequences and add the values before you present the
result.

However, you can't do this if you have to continuously present the
result. You haven't provided enough details about this in your
specification.

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
 
On Jun 27, 9:34 am, gamer <csan...@gmail.com> wrote:
My goal is to implement a bit-error counter targeting 1GHz. The
datawidth is parametrizable. I started off this way,
How is the input data created? If the register values come from bit
set/clear operations, you can keep track of the number of ones with a
separate counter that is updated appropriately when the register is
updated.

For example, suppose that you'd like to know the number of ones in a
shift register. If the input bit and the bit shifted off (and thus
removed from the register) are the same, the separate counter is
unchanged. If the input is 1 and the bit shifted off is 0, the
counter is incremented. If the input is 0 and the bit shifted off is
1, the counter is decremented. When the register as a whole is
cleared, the counter is also cleared.

It's easy enough to extend this to any update scheme where the number
of bits that can change each cycle is small, even if those can be in
arbitrary positions.
 
On 6 28 , 12 34 , gamer <csan...@gmail.com> wrote:
My goal is to implement a bit-error counter targeting 1GHz. The
datawidth is parametrizable. I started off this way,

Verilog code:
----------
assign mismatch[datawidth-1:0] = input_data ^ expected_data;
assign matched = ~( | mismatch); // reduction NOR

I guess the critical path of timming is summation chain.
There're serveral binary tricks to get number of bit "1" in a word,
take C code of 32-bit word for example, there's only 5 summations
instead of 32.

unsigned int bitcount32(unsigned int x)
{
x = (x & 0x55555555UL) + ((x >> 1) & 0x55555555UL); // 0-2 in 2
bits
x = (x & 0x33333333UL) + ((x >> 2) & 0x33333333UL); // 0-4 in 4
bits
#if 1
// Version 1
x = (x & 0x0f0f0f0fUL) + ((x >> 4) & 0x0f0f0f0fUL); // 0-8 in 8
bits
x = (x & 0x00ff00ffUL) + ((x >> 8) & 0x00ff00ffUL); // 0-16 in 16
bits
x = (x & 0x0000ffffUL) + ((x >> 16) & 0x0000ffffUL); // 0-31 in 32
bits
return x;
#else
// Version 2
x = (x + (x >> 4)) & 0x0f0f0f0fUL; // 0-8 in 4 bits
x += x >> 8; // 0-16 in 8 bits
x += x >> 16; // 0-32 in 8 bits
return x & 0xff;
#endif
}

always @(posedge clk or posedge reset) begin
if (reset)
error_count = 0;
else if (~matched)
for (i=0; i<datawidth; i=i+1)
error_count = error_count + mismatch;
end
---------/////////----------------------
The above meets timing for small datawidths (like 8-bits) and starts
failing to meet timing once the datawidth gets larger. I will be glad
for suggestions of alternate ways to implement this bit-error counter.
In practice our datawidths could go upto 256 bits.

Thanks
 
YUAN, Nan wrote:

(snip)

There're serveral binary tricks to get number of bit "1" in a word,
take C code of 32-bit word for example, there's only 5 summations
instead of 32.

unsigned int bitcount32(unsigned int x)
{
x = (x & 0x55555555UL) + ((x >> 1) & 0x55555555UL); // 0-2 in 2
bits
x = (x & 0x33333333UL) + ((x >> 2) & 0x33333333UL); // 0-4 in 4
bits
#if 1
(snip)
#else
// Version 2
x = (x + (x >> 4)) & 0x0f0f0f0fUL; // 0-8 in 4 bits
x += x >> 8; // 0-16 in 8 bits
x += x >> 16; // 0-32 in 8 bits
return x & 0xff;
#endif
But note that you do a lot more arithmetic than is needed.
This takes advantage of the availability of a 32 bit ALU.
You do five 32 bit additions with carry, and four 32 bit AND
operations.

The hardware implementation using carry save adders is much more
efficient, in terms of both logic and speed.

11 full adders for the first level, eight for the second level,
five for the third level, four for the fourth level, three
for the fifth level, three for the sixth level, two each for
the seventh and eighth level, and one final OR gate. If I
counted right. No carry logic is needed.

Some can be half adders for ASIC, and there might be some
other changes to make optimal use of FPGA LUTs.

-- glen
 
comp.arch.fpga wrote:

On 28 Jun., 15:40, Terje Mathisen <terje.mathi...@hda.hydro.com
wrote:

A few layers of full adders to reduce the 256 input lines into a more
reasonable number, then one or more carry-save accumulators to gather
these into something that can be handled very quickly.

Only when reading out the number of errors do you propagate
the carries to generate the final/real counts.
That is an interesting idea...

In Virtex-5 the 6-LUTs point to a 6-to-3 reduction instead of
full adders.
Last time I did it was in the days of 4-LUTs. Also, I didn't
need all the high bits, as I only needed 0, 1, 2, 3, 4 or more
as output values.

You can also use BRAMs for an 11 to 4 reduction.
512 bits at 500MHz might be easier than 256 bits at 1GHz.
Or pipeline the reduction stages. You have to be able to
go through at least one LUT in a clock cycle, anyway, and
the FFs are conveniently positioned after the LUTs.

-- glen
 

Welcome to EDABoard.com

Sponsor

Back
Top