Digesting runs of ones or zeros "well"

John_H · Oct 1, 2003

Greetings,

I need to detect runs. I want to look at 65 bits and show when there are 9
consecutive 1s or 0s from the byte boundaries resulting in 8 values per
clock. This should be comfortably done in two logic levels (I need clean
logic delays).

The idea is simple but the implementation is tough. I'm working with
Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to
design violence to get the results that I believe are "best."

Any thoughts on how to do this "better?" (the following code likes fixed
fonts)

- John_H
=====================================
module testRun ( input clk
, input [64:0] bytePlus1
, output reg [ 7:0] runByte /* synthesis xc_props = "INIT=R"
*/
); // INIT included to force register as FD primitive - bleah

reg [23:0] runBits; // I wanted the syn_keep on this combinatorial "reg"
wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah
reg [ 7:0] runByte_;
integer i,j,k;

always @(*)
begin
runBits = -24'h1;
runByte_ = -8'h1;
k = 0; // overlapping aaa aaaa
for( i=0; i<64; i=i+j ) // consecutive aaaa
begin // bit regions 876543210
for( j=0; (i%8+j<8) && (j<3); j=j+1 )
runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]);
runByte_[i/8] = runByte_[i/8] & runBits_[k];
k = k + 1;
end
end
always @(posedge clk) runByte = runByte_;

endmodule

rickman · Oct 1, 2003

I can't say I really understand your problem statement. I also don't
see how your code is solving the problem. Can you give a better
explanation of the problem? What do you mean "from the byte
boundaries"? Are you counting only the bit sets of {0..8}, {8..16},
{16..24}...? If so, this seems like an easy problem to implement.

John_H wrote:

Greetings,

I need to detect runs. I want to look at 65 bits and show when there are 9
consecutive 1s or 0s from the byte boundaries resulting in 8 values per
clock. This should be comfortably done in two logic levels (I need clean
logic delays).

The idea is simple but the implementation is tough. I'm working with
Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to
design violence to get the results that I believe are "best."

Any thoughts on how to do this "better?" (the following code likes fixed
fonts)

- John_H
=====================================
module testRun ( input clk
, input [64:0] bytePlus1
, output reg [ 7:0] runByte /* synthesis xc_props = "INIT=R"
*/
); // INIT included to force register as FD primitive - bleah

reg [23:0] runBits; // I wanted the syn_keep on this combinatorial "reg"
wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah
reg [ 7:0] runByte_;
integer i,j,k;

always @(*)
begin
runBits = -24'h1;
runByte_ = -8'h1;
k = 0; // overlapping aaa aaaa
for( i=0; i<64; i=i+j ) // consecutive aaaa
begin // bit regions 876543210
for( j=0; (i%8+j<8) && (j<3); j=j+1 )
runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]);
runByte_[i/8] = runByte_[i/8] & runBits_[k];
k = k + 1;
end
end
always @(posedge clk) runByte = runByte_;

endmodule

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

John_H · Oct 1, 2003

"rickman" <spamgoeshere4@yahoo.com> wrote in message
news:3F7B2060.A4FA04B@yahoo.com...

I can't say I really understand your problem statement. I also don't
see how your code is solving the problem. Can you give a better
explanation of the problem? What do you mean "from the byte
boundaries"? Are you counting only the bit sets of {0..8}, {8..16},
{16..24}...? If so, this seems like an easy problem to implement.

You have it right - 8:0, 16:8, 24:16.... In the 41 bits illustrated below I
want to note when the sequence across [16:8] (illustrated by the 9 1s below
the count) in result[1].

11010101000101010111101111111111110010101 pattern
09876543210987654321098765432109876543210 count
00000000000000000000000011111111100000000 (reference)

The trouble is that while it seems easy to implement, getting the logic to
come out clean in the implementation - the "pushing the rope" problem -
doesn't comes easily. If one does a loop looking for 8 adjacent equals or
the 8-wide AND of 8 XNORs, the realization takes up 3 levels of logic with
FDR primitives resulting in 2x-3x the resources and about twice the delay.

The code with the two nested for loops breaks up the 9-bit compare into two
4-bit and one 3-bit compare to verify all the 9 bits are equal to each other
and to break the implementation into 2 levels of logic rather than 3+ (the +
being from the flop's reset input taking some extra routing delay).

I'd love a cleaner approach that doesn't come off so complex and gets
realized in the resources it "should" use. It's tough to get it where I
want by pushing on the rope.

John_H wrote:

Greetings,

I need to detect runs. I want to look at 65 bits and show when there
are 9
consecutive 1s or 0s from the byte boundaries resulting in 8 values per
clock. This should be comfortably done in two logic levels (I need
clean
logic delays).

The idea is simple but the implementation is tough. I'm working with
Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to
design violence to get the results that I believe are "best."

Any thoughts on how to do this "better?" (the following code likes
fixed
fonts)

- John_H
=====================================
module testRun ( input clk
, input [64:0] bytePlus1
, output reg [ 7:0] runByte /* synthesis xc_props =
"INIT=R"
*/
); // INIT included to force register as FD primitive -
bleah

reg [23:0] runBits; // I wanted the syn_keep on this combinatorial
"reg"
wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah
reg [ 7:0] runByte_;
integer i,j,k;

always @(*)
begin
runBits = -24'h1;
runByte_ = -8'h1;
k = 0; // overlapping aaa aaaa
for( i=0; i<64; i=i+j ) // consecutive aaaa
begin // bit regions 876543210
for( j=0; (i%8+j<8) && (j<3); j=j+1 )
runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]);
runByte_[i/8] = runByte_[i/8] & runBits_[k];
k = k + 1;
end
end
always @(posedge clk) runByte = runByte_;

endmodule

Vinh Pham · Oct 1, 2003

John

Pushing rope is a good analogy. Jan Gray, yeah?

When writing HDL, on one extreme you can write your code to explicity tell
the synthesizer how to structure your logic. You'd instantiate logic
primitives and wire them all up. You're basically doing all the work for
the synthesizer. The benefit is total control. The problem is a lot of
code to write, which means increase chances of human error and a lot of
rewriting when you make changes to your design.

On the other extreme, you have code that's structureless. You imply no
logic or wiring inside. It's purely a black box and leaves it up to the
synthesizer to figure things out. The benefits are more compact code and
"ease" of modification (if your algorithm isn't too convoluted). The
problem of course is the synthesizer usually doesn't do exactly what you
want. Much like a 2 year old coming up to you with a smile full of pride
saying, "I went potty!" only to realize they did it on the floor.

So when the synthesizer isn't smart enough, you need to help it out by
putting a little more structure into your code. Basically you're giving the
synthesizer hints on what you want. Also hardware engineers tend to better
understand algorithms that have a bit of hardware structure to them. We
once had a math major write VHDL, and the comment from one engineer was,
"Man, his code looks like C." :_D

Unfortunately I don't use Verilog, but hopefully some psuedo code will be
enough to give you an idea of what I'm thinking.

I noticed your code was bit based. Since you only are checking on byte
boundaries, it can help to write your code byte based. Good luck with this
psuedo code (you should see my C).

data[64:0] -- 65-bit input data signal
eight_0s_consec_flag[7:0] -- intermediate signals
eight_1s_consec_flag[7:0] -- intermediate signals
ninth_bit[7:0] -- intermediate signals
consec_detect_flag[7:0] -- 8-bit output signal

for byte = 0...7

lsb = byte*8
msb = byte*8+7

if data[msb:lsb] = "00000000" then
eight_0s_consec_flag[byte] = '1'
else
eight_0s_consec_flag[byte] = '0'
end if

if data[msb:lsb] = "111111111" then
eight_1s_consec_flag[byte] = '1'
else
eight_1s_consec_flag[byte] = '0'
end if

ninth_bit[byte] = data[msb+1]

if ninth_bit[byte] = '1' then
consec_detect_flag[byte] = eight_1s_consec_flag
else
consec_detect_flag[byte] = eight_0s_consec_flag
end if;

end loop

Well hopefully that's correct, and makes sense. Let me know if you have any
questions or see any problems.

Regards,
Vinh

Martin Euredjian · Oct 2, 2003

John,

1- How many 65 bit words per second (ms, ns?) do you have to process?
2- Where do the 65 bits come from? (internal, external)
3- Do they get into the FPGA in parallel or serially?
4- Why are you saying that you need two levels of logic? (trying to control
delay with combinatorial logic is not a great idea).
5- Why fight with inference? Instantiate what primitives you need.

Two logic levels?

Two LUT's to look at two consecutive nibbles.
One LUT to AND the output of the above with the next most significant bit
(the ninth bit).
That's it. Two levels. 24 LUT's.
Is that what you wanted?

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Euredjian

To send private email:
0_0_0_0_@pacbell.net
where
"0_0_0_0_" = "martineu"

"John_H" <johnhandwork@mail.com> wrote in message
news:Xwoeb.26$XP3.4711@news-west.eli.net...

Greetings,

I need to detect runs. I want to look at 65 bits and show when there are
9
consecutive 1s or 0s from the byte boundaries resulting in 8 values per
clock. This should be comfortably done in two logic levels (I need clean
logic delays).

The idea is simple but the implementation is tough. I'm working with
Verilog in Synplify, targeting a Xilinx Spartan-3. I have to resort to
design violence to get the results that I believe are "best."

Any thoughts on how to do this "better?" (the following code likes fixed
fonts)

- John_H
=====================================
module testRun ( input clk
, input [64:0] bytePlus1
, output reg [ 7:0] runByte /* synthesis xc_props =
"INIT=R"
*/
); // INIT included to force register as FD primitive -
bleah

reg [23:0] runBits; // I wanted the syn_keep on this combinatorial "reg"
wire [23:0] runBits_ /* synthesis syn_keep = 1 */ = runBits; // - bleah
reg [ 7:0] runByte_;
integer i,j,k;

always @(*)
begin
runBits = -24'h1;
runByte_ = -8'h1;
k = 0; // overlapping aaa aaaa
for( i=0; i<64; i=i+j ) // consecutive aaaa
begin // bit regions 876543210
for( j=0; (i%8+j<8) && (j<3); j=j+1 )
runBits[k] = runBits[k] & (bytePlus1[i+j]==bytePlus1[i+j+1]);
runByte_[i/8] = runByte_[i/8] & runBits_[k];
k = k + 1;
end
end
always @(posedge clk) runByte = runByte_;

endmodule

Vinh Pham · Oct 2, 2003

Whoopsy, brain-fart. My previous code will create 3 levels of logic. If we
didn't have to detect both nine 1s or nine 0s, then it'd work okay.

Here's an idea for one that should generate 2 levels, but it looks uglier.
Definately not as compact as rickman's.

data[64:0] -- input signal
ninth_bit[7:0] -- intermediate signal
run_dibble[31:0] -- intermediate signal
run_byte[7:0] -- output signal

for byte 0...7

ninth_bit[byte] = data[(byte+1)*8]

for dibble 0...3

lsb = byte*8 + dibble*2
msb = byte*8 + dibble*2 + 1

if data[lsb] = ninth_bit[byte] AND data[msb] = ninth_bit[byte] then
run_dibble[byte*4 + dibble] = 1
else
run_dibble[byte*4 + dibble] = 0
end

end loop

lsb = byte*4
msb = byte*4 + 3

if run_dibble[msb:lsb] = "1111" then
run_byte[byte] = 1
else
run_byte[byte] = 0
end

end loop

John_H · Oct 2, 2003

rickman <spamgoeshere4@yahoo.com> wrote in message news:<3F7B3F59.191901A4@yahoo.com>...

My experience has been that it does not much matter how you code
combinatorial logic like this. The tools run it through a grinder and
produce an optimal version (in its own mind). When I want to optimize
like this, I either use a "keep" attribute on the wire, or sometimes you
can instantiate primitives. For logic I don't think primitives work
since gates just get remapped.

I overuse the syn_keep attribute and I hate the idea of instantiating
LUTs. My Carnot skills aren't exactly used regularly.

But I still don't understand your code. Why does the outer loop range
over 64 values.

I've had problems with bit ranges in the past where [i+4:i] is a
complaint. Perhaps this isn't an issue with for loops but I've
learned to avoid them in general logic. They do work fine in generate
blocks, however. I stepped through every bit to make a comparison to
the adjacent bit; 3 adjacent comparisons lumped into one variable
(with an eventual syn_keep) would give me 4-input functions that
should pack into LUTs. The complex end of the inside loop is so that
the three "LUTs" per byte are 4-input, 4-input, and 3-input functions.

I would code two nested loops where the outer loop
ranges over the 8 outputs and the inner loop ranges over the 9 inputs
for each output. Or just skip the inner loop and use two outputs from
two sets of four inputs feeding a 3 input function and use keeps on the
first two output arrays. Maybe that is what you are doing, but I can't
figure out the code easily.

I see you are incrementing the i variable by j and ranging j in the
second loop by some complex control expression. Can't you just
increment i by 8?

for( i=0; i<64; i=i+8 ) begin
k = i % 8;
for( j=0; j<4; j=j+1 ) begin
runBitsA_[k] = runBitsA_[k] & bytePlus1[i+j];
runBitsB_[k] = runBitsB_[k] & bytePlus1[i+j+4];
end
runByte_ = runBitsA_ & runBitsB_[k] & bytePlus1[i+9];
end

Put the keep on runBitsA_ and runBitsB_ and you should get your two
level structure.

This works very well for runs of ones only. I need to identify runs
of ones or runs of zeros. The technique can be expanded to my needs
resulting in runBitsA, B, and C where one of them needs to cover 2
comparisons, not 3 like the others. ...which is really is the
approach I was coding but using consecutive bits in a vector rather
than {A,B,C} and using the one statement rather than 3 to make the
assignments, dealing with the 2 comparison exception by terminating
the inside loop early.

Thanks for the help.

John_H · Oct 2, 2003

"Vinh Pham" <a@a.a> wrote in message news:<XcSeb.39218$5z.21702@twister.socal.rr.com>...

Whoopsy, brain-fart. My previous code will create 3 levels of logic. If we
didn't have to detect both nine 1s or nine 0s, then it'd work okay.

Thanks for noticing

I like the code below with respect to its symmetry - it's a lot easier
to read than the stuff I generated. The four 3-input LUTs feed a
single 4-input LUT with (only a) little arguement from the
synthesizer, I'm sure. It can be done with fewer LUTs by using
4-input LUTs covering 3 compares each but then the symmetry gets lost
and the coding gets unpleasant.

I think I have an acceptable solution together that gives me good
speed and good utilization which I'll post separately.

Thanks for your thoughs with this.

Here's an idea for one that should generate 2 levels, but it looks uglier.
Definately not as compact as rickman's.

data[64:0] -- input signal
ninth_bit[7:0] -- intermediate signal
run_dibble[31:0] -- intermediate signal
run_byte[7:0] -- output signal

for byte 0...7

ninth_bit[byte] = data[(byte+1)*8]

for dibble 0...3

lsb = byte*8 + dibble*2
msb = byte*8 + dibble*2 + 1

if data[lsb] = ninth_bit[byte] AND data[msb] = ninth_bit[byte] then
run_dibble[byte*4 + dibble] = 1
else
run_dibble[byte*4 + dibble] = 0
end

end loop

lsb = byte*4
msb = byte*4 + 3

if run_dibble[msb:lsb] = "1111" then
run_byte[byte] = 1
else
run_byte[byte] = 0
end

end loop

John_H · Oct 2, 2003

"Martin Euredjian" <0_0_0_0_@pacbell.net> wrote in message news:<FALeb.6677$fB4.1788@newssvr29.news.prodigy.com>...

John,

1- How many 65 bit words per second (ms, ns?) do you have to process?
This run detection is one part of a 100MHz-200MHz mechanism.
2- Where do the 65 bits come from? (internal, external)
Internal, blindsided from BlockRAMs with a new value per cycle.
3- Do they get into the FPGA in parallel or serially?
Entirely parallel, into the BlockRAMs at full width.
4- Why are you saying that you need two levels of logic? (trying to control
delay with combinatorial logic is not a great idea).
If I go from BlockRAM to registers, I have the (relatively) long

Tcko delay for the BlockRAM read and associated routing leaving little
time to manipulate the data within the period. If I register the data
from the BlockRAM, it's best to generate and use the run values in the
next cycle requiring moe logic after I flag the runs, suggesting
minimum delay is best.

5- Why fight with inference? Instantiate what primitives you need.
The logic primitives are what I've avoided. I don't want to use

LUT4 primitives with INIT attributes since I might mess up the carnot
map. This is why the inference has been broken down into bits that
can be retained (with syn_keep or other method).

Two logic levels?

Two LUT's to look at two consecutive nibbles.
One LUT to AND the output of the above with the next most significant bit
(the ninth bit).
That's it. Two levels. 24 LUT's.
Is that what you wanted?

Almost. The LUTs can't look at full nibbles. Since I need to make
sure all bits are equal to each other, there's a "smear." One attempt
was to XOR adjacent bits, then to do the 8-wide AND of the result,
letting the synth give me the "best" results. It didn't. Thinking
about the XOR-to-AND progression, 4 bits are needed to implement 3
bits of the AND, so two 4-bit LUTs and one 3-bit LUT are needed,
feeding a 3-input AND.

That's it. Two levels. 32 LUTs.
But the synth doesn't like my inferrences.
I think I have a solution that "works."

John_H · Oct 2, 2003

For anyone interested in how I got things together, I ended up using
one generate loop to instantiate 8 MUXF5s. Why MUXF5s?

1) One can make an 8-input AND with 2 LUTs and a little extra delay by
having the first 4-bit AND feed the select and the sel==0 input - if
the AND is false the result is false, if the AND is true, the result
is the other AND.

2) By using a primitive, the logic feeding the primitive's pin isn't
optimized across the primitive.

The synthesizer will produce a nice 2-level implementation for 5
compares but not 8 so splitting it up into 5 compares and 3, the MUXF5
used as an AND can give a nice balance of delays. Its slightly more
than 2 LUTs of delay, but very slightly. The code looks cleaner and
the implementation is tight.

===============================================================
module testRun ( input clk
, input [64:0] bytesPlus1
, output reg [ 7:0] runByte
);

wire [ 7:0] runMux;
wire [63:0] xnorBits = bytesPlus1[63:0] ^~ bytesPlus1[64:1];
// the result of a bit compare a==b is the same as a^~b

genvar h;
generate
for( h=0; h<8; h=h+1 )
begin : run
MUXF5 mux ( .O(runMux[h]), .S ( & xnorBits[h*8+2:h*8+0] )
, .I1( & xnorBits[h*8+7:h*8+3] )
, .I0( & xnorBits[h*8+2:h*8+0] ) );
end
endgenerate

always @(posedge clk) runByte <= runMux;

endmodule

Goran Bilski · Oct 2, 2003

Hi,

Why not use the carry-chain?

You can do any kind of detection on that primitive and it will save you LUTs

Göran

John_H wrote:

For anyone interested in how I got things together, I ended up using
one generate loop to instantiate 8 MUXF5s. Why MUXF5s?

1) One can make an 8-input AND with 2 LUTs and a little extra delay by
having the first 4-bit AND feed the select and the sel==0 input - if
the AND is false the result is false, if the AND is true, the result
is the other AND.

2) By using a primitive, the logic feeding the primitive's pin isn't
optimized across the primitive.

The synthesizer will produce a nice 2-level implementation for 5
compares but not 8 so splitting it up into 5 compares and 3, the MUXF5
used as an AND can give a nice balance of delays. Its slightly more
than 2 LUTs of delay, but very slightly. The code looks cleaner and
the implementation is tight.

===============================================================
module testRun ( input clk
, input [64:0] bytesPlus1
, output reg [ 7:0] runByte
);

wire [ 7:0] runMux;
wire [63:0] xnorBits = bytesPlus1[63:0] ^~ bytesPlus1[64:1];
// the result of a bit compare a==b is the same as a^~b

genvar h;
generate
for( h=0; h<8; h=h+1 )
begin : run
MUXF5 mux ( .O(runMux[h]), .S ( & xnorBits[h*8+2:h*8+0] )
, .I1( & xnorBits[h*8+7:h*8+3] )
, .I0( & xnorBits[h*8+2:h*8+0] ) );
end
endgenerate

always @(posedge clk) runByte <= runMux;

endmodule

rickman · Oct 2, 2003

John_H wrote:

rickman <spamgoeshere4@yahoo.com> wrote in message news:<3F7B3F59.191901A4@yahoo.com>...
My experience has been that it does not much matter how you code
combinatorial logic like this. The tools run it through a grinder and
produce an optimal version (in its own mind). When I want to optimize
like this, I either use a "keep" attribute on the wire, or sometimes you
can instantiate primitives. For logic I don't think primitives work
since gates just get remapped.

I overuse the syn_keep attribute and I hate the idea of instantiating
LUTs. My Carnot skills aren't exactly used regularly.

Actually, I don't think logic primatives will work since the back end
mapper can redo logic at will. The keep attribute is what is required
to define the LUTs and even that is not guaranteed since it only results
in a wire being kept; the LUT can still be split if other logic uses the
same inputs.

But I still don't understand your code. Why does the outer loop range
over 64 values.

I've had problems with bit ranges in the past where [i+4:i] is a
complaint. Perhaps this isn't an issue with for loops but I've
learned to avoid them in general logic. They do work fine in generate
blocks, however. I stepped through every bit to make a comparison to
the adjacent bit; 3 adjacent comparisons lumped into one variable
(with an eventual syn_keep) would give me 4-input functions that
should pack into LUTs. The complex end of the inside loop is so that
the three "LUTs" per byte are 4-input, 4-input, and 3-input functions.

I don't really see what problem you are trying to solve with that, but
then I am not as well versed in verilog compared to my VHDL.

I would code two nested loops where the outer loop
ranges over the 8 outputs and the inner loop ranges over the 9 inputs
for each output. Or just skip the inner loop and use two outputs from
two sets of four inputs feeding a 3 input function and use keeps on the
first two output arrays. Maybe that is what you are doing, but I can't
figure out the code easily.

I see you are incrementing the i variable by j and ranging j in the
second loop by some complex control expression. Can't you just
increment i by 8?

for( i=0; i<64; i=i+8 ) begin
k = i % 8;
for( j=0; j<4; j=j+1 ) begin
runBitsA_[k] = runBitsA_[k] & bytePlus1[i+j];
runBitsB_[k] = runBitsB_[k] & bytePlus1[i+j+4];
end
runByte_ = runBitsA_ & runBitsB_[k] & bytePlus1[i+9];
end

Put the keep on runBitsA_ and runBitsB_ and you should get your two
level structure.

This works very well for runs of ones only. I need to identify runs
of ones or runs of zeros. The technique can be expanded to my needs
resulting in runBitsA, B, and C where one of them needs to cover 2
comparisons, not 3 like the others. ...which is really is the
approach I was coding but using consecutive bits in a vector rather
than {A,B,C} and using the one statement rather than 3 to make the
assignments, dealing with the 2 comparison exception by terminating
the inside loop early.

Again, I may not completely understand your problem. This was intended
to show you how to solve the problem. To cover the adjacent zeros, you
just do the same logic using the OR operator and invert the result.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

Vinh Pham · Oct 2, 2003

Thanks for noticing

Yeah, it hit me when I was laying down to sleep. I wanted to put it off
till today but it annoyed me too much.

synthesizer, I'm sure. It can be done with fewer LUTs by using
4-input LUTs covering 3 compares each but then the symmetry gets lost
and the coding gets unpleasant.

Great catch. Yes, you can save a LUT per byte by comparing on tribbles
instead of dibbles, but the code gets even more ugly than it already is.

I think I have an acceptable solution together that gives me good
speed and good utilization which I'll post separately.

Cool, I'm looking forward to seeing it.

John_H · Oct 3, 2003

Goran Bilski <goran@xilinx.com> wrote in message news:<blhm91$h0f1@cliff.xsj.xilinx.com>...

Hi,

Why not use the carry-chain?

You can do any kind of detection on that primitive and it will save you LUTs

Göran

I tried that approach earlier today but I wasn't getting the carry
chain I was trying to infer. The Virtex-IIs started getting poorer at
getting on/off carry chains timing-wise relative to the general logic
resources so I was trying to get general logic to work; I suspect
Spartan-3s are similar. If I go straight to register, I would need 4
LUTs to go into the register through the XORCY instead of the natural
XORCY so the logic savings isn't a given to achieve the speed but I
could keep it to 3 LUTs with a small routing hit.

I believe I'd need to implement all the carry chain primitives through
the generate block, including the MUXCYs and XORCY elements because
the synthesizer sees that "oh, it's a short chain" and converts my
simple arithmetic form to a cascade of LUTs rather than the carry
chain. I tried my tricks, I stopped pursuing.

Maybe I'll try to coax it again tomorrow. Thank goodness for that
generate!

John_H · Oct 3, 2003

Goran Bilski <goran@xilinx.com> wrote in message news:<blj78j$h0h1@cliff.xsj.xilinx.com>...

Hi,

What synthesis tool did you use?
When I instanciate primitives directly in my code, they tend to stay in
the netlist.

Synplify does a good job of leaving the instantiated primitives in the
code, sure. My first issue was that I believe carry chains are longer
than 2 levels of LUTs. The second issue is that I was trying to infer
- not instantiate - the adder chain by adding 1 to {1,1,1} when all
three LUTs worth of logic are valid, using the sign as my output. The
synthesizer turned the inferrence into LUTs which is probably more
effective in logic delay. I would need to generate 3 MUXCYs per chain
for 8 chains. If I want the quick-register destination, I need to
also instantiate the XOR in the sign bit. 4 primitives replicated 8
times for timing which *may* be worse. I chose not to pursue hte
instantiations because of the expected lower performance.

The synthesis tool usually leaves them alone.
If isn't working attach a U_SET attribute to them so the tools thinks
it's a RPM which is normally leaves alone.
That has been my approach for doing primitives.

Göran

John_H wrote:

Goran Bilski <goran@xilinx.com> wrote in message news:<blhm91$h0f1@cliff.xsj.xilinx.com>...

Hi,

Why not use the carry-chain?

You can do any kind of detection on that primitive and it will save you LUTs

Göran

I tried that approach earlier today but I wasn't getting the carry
chain I was trying to infer. The Virtex-IIs started getting poorer at
getting on/off carry chains timing-wise relative to the general logic
resources so I was trying to get general logic to work; I suspect
Spartan-3s are similar. If I go straight to register, I would need 4
LUTs to go into the register through the XORCY instead of the natural
XORCY so the logic savings isn't a given to achieve the speed but I
could keep it to 3 LUTs with a small routing hit.

I believe I'd need to implement all the carry chain primitives through
the generate block, including the MUXCYs and XORCY elements because
the synthesizer sees that "oh, it's a short chain" and converts my
simple arithmetic form to a cascade of LUTs rather than the carry
chain. I tried my tricks, I stopped pursuing.

Maybe I'll try to coax it again tomorrow. Thank goodness for that
generate!

--

Vinh Pham · Oct 4, 2003

than 2 LUTs of delay, but very slightly. The code looks cleaner and
the implementation is tight.

Yes, it looks really clean. Nice :_)

Cool use of the XNOR and AND functions to do your compare. A lot more
elegant than all my for loops. Learn something new every day.

But something is nagging me. Can't your xnor/and scheme be implemented in
two levels of LUTs without the need of a MUXF5? I must be missing
something, or maybe the problem is the synthesizer isn't mapping it that
way?

The first level LUTs would implement (output = (bit0 xnor bit1) and (bit1
xnor bit2)). You'd have four LUTs in the first level, then the second level
LUT would AND all of them together.

Hal Murray · Oct 4, 2003

I overuse the syn_keep attribute and I hate the idea of instantiating
LUTs. My Carnot skills aren't exactly used regularly.

Are Carnot skills needed? I can generally count to 4. A 4 input LUT
can implement any function of 4 inputs.

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

John_H · Oct 4, 2003

hmurray@suespammers.org (Hal Murray) wrote in message news:<vnsdiro5e8e272@corp.supernews.com>...

I overuse the syn_keep attribute and I hate the idea of instantiating
LUTs. My Carnot skills aren't exactly used regularly.

Are Carnot skills needed? I can generally count to 4. A 4 input LUT
can implement any function of 4 inputs.

Do you realize how patronizing your response was?

Please, quickly give the appropriate INIT for a LUT4 where the desired
output is

&(in[3:1]^~in[2:0])

Since you can count to 4, this should be simple.
Can you guarantee that other engineers looking at your code later will
understand what you're trying to do?

With Regards,
John_H

John_H · Oct 4, 2003

"Vinh Pham" <a@a.a> wrote in message news:<XBnfb.23230$Ak3.11431@twister.socal.rr.com>...

But something is nagging me. Can't your xnor/and scheme be implemented in
two levels of LUTs without the need of a MUXF5? I must be missing
something, or maybe the problem is the synthesizer isn't mapping it that
way?

The first level LUTs would implement (output = (bit0 xnor bit1) and (bit1
xnor bit2)). You'd have four LUTs in the first level, then the second level
LUT would AND all of them together.

The implementation can, indeed, be as you imagine. No nagging
necessary. Your earlier pseudocode did just what you're suggesting
here. One of the reasons I went with the MUXF5 was that the primitive
doesn't need the syn_keeps that I use to coerce the synthesis into
thinking how I want; a LUT savings is nice though at a slight cost in
delay if I want the combinatorial output. But your discussion makes
me wonder if the Xilinx AND3 primitive will work just as well. I'll
test it out when I get back to work. Comparing the implementations
below, the only advantage to "mine" is one fewer LUT.

Maybe the synthesizer will let this stay. I don't have to define
intermediate variables to syn_keep and I don't need nested loops.
Since I don't need to use the xnors, I don't even need to comment the
code any more than usual.

generate
for( h=0; h<8; h=h+1 )
begin : run
AND4 y ( .O(yours[h])
, .I0( bytesPlus1[h*8+2:h*8+1] == bytesPlus1[h*8+1:h*8+0] )
, .I1( bytesPlus1[h*8+4:h*8+3] == bytesPlus1[h*8+3:h*8+2] )
, .I2( bytesPlus1[h*8+6:h*8+5] == bytesPlus1[h*8+5:h*8+4] )
, .I3( bytesPlus1[h*8+8:h*8+7] == bytesPlus1[h*8+7:h*8+6] )
);`
AND3 m ( .O(mine[h]),
, .I0( bytesPlus1[h*8+3:h*8+1] == bytesPlus1[h*8+2:h*8+0] )
, .I1( bytesPlus1[h*8+6:h*8+4] == bytesPlus1[h*8+5:h*8+3] )
, .I2( bytesPlus1[h*8+8:h*8+7] == bytesPlus1[h*8+7:h*8+6] )
);`
end
endgenerate

Thanks for keeping me thinking.
- John_H

Hal Murray · Oct 5, 2003

Do you realize how patronizing your response was?

Please, quickly give the appropriate INIT for a LUT4 where the desired
output is

&(in[3:1]^~in[2:0])

Since you can count to 4, this should be simple.
Can you guarantee that other engineers looking at your code later will
understand what you're trying to do?

Sorry. I wasn't trying to be an asshole.

I thought you were into trying to partitioning logic into LUTs in some
sneaky way.

I assumed software is smart enough to compute an INIT string
from a logic equation. The old Xilinx tools could do that for
3000 series parts. Has that fallen through the cracks with the
newer software?

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Digesting runs of ones or zeros "well"

John_H

Guest

rickman

Guest

John_H

Guest

Vinh Pham

Guest

Martin Euredjian

Guest

Vinh Pham

Guest

John_H

Guest

John_H

Guest

John_H

Guest

John_H

Guest

Goran Bilski

Guest

rickman

Guest

Vinh Pham

Guest

John_H

Guest

John_H

Guest

Vinh Pham

Guest

Hal Murray

Guest

John_H

Guest

John_H

Guest

Hal Murray

Guest

Log in

Welcome to EDABoard.com

Sponsor