16->5 "Sort"

Kevin Neilson
I'm trying to design a circuit (Virtex-7) which you might call either a priority encoder or a sorter. This is what it should do:

Given a 16-bit vector with 5 bits set, create a list of 5 4-bit encoded values of each bit set. These needn't be in order.

This turns out to be a lot harder than I thought. Writing the behavioral RTL isn't hard, but Vivado synthesizes it to 16 levels of logic, and when I draw out an optimized version, I still get at least 5 levels (using 6-input LUTs). I'd like to do it with minimal latency, but I can only do about 3 levels of logic at my clock speed. I've tried thinking about how I can use the carry chain muxes but they don't seem to be helpful. I can do a leading-ones detector and encode the leading 1, but I'd probably have to pipeline each stage so that would take 5 cycles.
 
The list is a 20-bit vector comprising the 5 4-bit values. This is put into a 20-bit wide FIFO; each of the 5 values will be processed simultaneously when read from the FIFO. So another way you could describe this is a 16-bit/5-hot -> 20-bit encoder.
 
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000
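For reference, the behavioral RTL is roughly like this (just a sketch; the
module and signal names are only for illustration):

module onehot5_encode   // illustrative name
(
  input  wire [15:0] din,   // assumed to have exactly 5 bits set
  output reg  [19:0] list   // five 4-bit indices, packed low to high
);

  integer i;
  reg [2:0] cnt;

  always @* begin
    list = 20'd0;
    cnt  = 3'd0;
    for (i = 0; i < 16; i = i + 1)
      if (din[i]) begin
        list[cnt*4 +: 4] = i[3:0];   // record the index of this set bit
        cnt = cnt + 1;
      end
  end

endmodule

It simulates fine, but the long priority chain implied by the loop is exactly
what ends up as all those levels of logic.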
 
On 5/11/2015 10:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

I am old school so I have to picture these things....

I see five priority encoders. The first priority encoder output is used
to disable that input to the second priority encoder, etc.

How did you code it? I can see how this would be a lot more than three
levels of logic. The equations get quite long. When you talk about
levels of logic I assume you mean layers of LUTs? I can see maybe each
16 input priority encoder being no more than 3 levels of LUTs, but all
five layers with the inhibit logic... I don't think so. I expect the
tools did a pretty good job of optimizing it and I don't easily see any
way of using carry chains.
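Something like this is what I am picturing, structurally (only a sketch; I
isolate the lowest set bit with a mask instead of decoding each encoder's
output, and the names are made up):

module cascade_encode   // names made up
(
  input  wire [15:0] din,
  output wire [19:0] list
);

  // the leftover bits for stage k live in bits [16*k +: 16] of one flat vector
  wire [16*6-1:0] remain;
  assign remain[15:0] = din;

  genvar k;
  generate
    for (k = 0; k < 5; k = k + 1) begin : stage
      wire [15:0] cur    = remain[16*k +: 16];
      wire [15:0] lowest = cur & (~cur + 16'd1);   // isolate the lowest set bit

      reg [3:0] idx;                               // encode the position of that bit
      integer i;
      always @* begin
        idx = 4'd0;
        for (i = 0; i < 16; i = i + 1)
          if (lowest[i]) idx = i[3:0];
      end

      assign list[4*k +: 4]         = idx;
      assign remain[16*(k+1) +: 16] = cur & ~lowest;   // inhibit that bit for the next stage
    end
  endgenerate

endmodule

Each stage's inhibit depends on the stage before it, which is why I'd expect
the equations, and the levels of logic, to pile up.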

--

Rick
 
On 5/11/2015 9:27 PM, Kevin Neilson wrote:
I'm trying to design a circuit (Virtex-7) which you might call either
a priority encoder or a sorter. This is what it should do:

Given a 16-bit vector with 5 bits set, create a list of 5 4-bit
encoded values of each bit set. These needn't be in order.

This turns out to be a lot harder than I thought. Writing the
behavioral RTL isn't hard, but Vivado synthesizes it to 16 levels of
logic, and when I draw out an optimized version, I still get at least
5 levels (using 6-input LUTs). I'd like to do it with minimal
latency, but I can only do about 3 levels of logic at my clock speed.
I've tried thinking about how I can use the carry chain muxes but
they don't seem to be helpful. I can do a leading-ones detector and
encode the leading 1, but I'd probably have to pipeline each stage so
that would take 5 cycles.

I'm not sure what you mean by "list". Logic circuits don't normally use
lists. You want to know which five lines are set. It is easy to
encode that into four bit values. But once you have the values what
exactly do you want to do with them to provide them in a "list"?

If you want them accessible on the same four data lines in sequence,
what determines the timing of the sequence? What exactly is your
interface?

--

Rick
 
rickman <gnuarm@gmail.com> wrote:

(snip)
Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

I am old school so I have to picture these things....

I see five priority encoders. The first priority encoder output is used
to disable that input to the second priority encoder, etc.

How did you code it? I can see how this would be a lot more than three
levels of logic. The equations get quite long. When you talk about
levels of logic I assume you mean layers of LUTs? I can see maybe each
16 input priority encoder being no more than 3 levels of LUTs, but all
five layers with the inhibit logic... I don't think so. I expect the
tools did a pretty good job of optimizing it and I don't easily see any
way of using carry chains.

It is slightly easier than that.

(Some time ago I did this with 36 and 3.)

Consider first the case of only two bits high: use one priority encoder
the usual way, and one upside down. (That is, find the highest and the
lowest set bit.) Now, as you said, mask off the highest and lowest,
use two more priority encoders, and finally mask off one more and
encode the last one.

But first I would add the logic to check that exactly
(or at least) five bits are set.

I suspect that, as the OP noted, it is combinatorially hard.
Consider the logic needed to compute each bit of the result
separately. Even just the first one.

It does, however, pipeline very well. In the one I was working on,
it had a fairly fast clock (66MHz) but I could stand some levels
of latency. The OP claims the need for low latency.
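A sketch of the two-ended idea (my own coding of it; the names are invented):

module two_end_encode   // names invented
(
  input  wire [15:0] din,
  output wire [19:0] list
);

  // leftover bits for each round; round r reads bits [16*r +: 16]
  wire [16*3-1:0] rem;
  assign rem[15:0] = din;

  genvar r;
  generate
    for (r = 0; r < 3; r = r + 1) begin : round
      wire [15:0] cur = rem[16*r +: 16];
      wire [15:0] lo  = cur & (~cur + 16'd1);   // lowest set bit

      reg [3:0] lo_idx, hi_idx;
      integer i;
      always @* begin
        lo_idx = 4'd0;
        hi_idx = 4'd0;
        for (i = 0; i < 16; i = i + 1) begin
          if (lo[i])  lo_idx = i[3:0];
          if (cur[i]) hi_idx = i[3:0];          // last hit is the highest set bit
        end
      end

      assign list[8*r +: 4] = lo_idx;           // round 2 yields the single bit left in the middle
      if (r < 2) begin : pair
        assign list[8*r + 4 +: 4]  = hi_idx;    // rounds 0 and 1 also yield the high end
        assign rem[16*(r+1) +: 16] = cur & ~lo & ~(16'd1 << hi_idx);
      end
    end
  endgenerate

endmodule

With five bits set, the first two rounds peel off four of them and the third
round picks up the single bit left in the middle, so the masking only has to
ripple twice.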

-- glen
 
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

64K x 20 ROM?

How badly do you want one level of logic?

Rob.
 
GaborSzakacs <gabor@alacron.com> wrote:
Rob Doyle wrote:
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

64K x 20 ROM?

Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

How badly do you want one level of logic?

Two ROMs, 256 by 23 and 256 by 20. The first encodes the
lower 8 bits into 0 to 5 4-bit numbers right-justified plus
a 3-bit indication of how many bits were set. The second
just encodes 0 to 5 4-bit numbers, left justified (you assume
it has the remaining 1's not covered by the low 8 bits. Then
one more level to select how many bits to take from each ROM
based on the number of ones in the lower 8 bits (each output
bit is only a 2:1 mux in this case - that's why the ROMs
are justified right and left as noted).

If not in BRAM, that seems a good way.

How many levels of logic is a 256 bit ROM in FPGA LUTs?

-- glen
 
Rob Doyle wrote:
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000


64K x 20 ROM?

How badly do you want one level of logic?

Rob.

Two ROMs, 256 by 23 and 256 by 20. The first encodes the
lower 8 bits into 0 to 5 4-bit numbers right-justified plus
a 3-bit indication of how many bits were set. The second
just encodes 0 to 5 4-bit numbers, left justified (you assume
it has the remaining 1's not covered by the low 8 bits). Then
one more level to select how many bits to take from each ROM
based on the number of ones in the lower 8 bits (each output
bit is only a 2:1 mux in this case - that's why the ROMs
are justified right and left as noted).
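In code, the combine looks something like this (a sketch only; here I build
the tables in an initial block rather than from a .hex file, and the names
are made up):

module split_rom_encode   // names made up
(
  input  wire [15:0] din,
  output wire [19:0] list
);

  (* RAM_STYLE = "distributed" *) reg [22:0] rom_lo [0:255];  // low byte: indices right-justified, plus popcount
  (* RAM_STYLE = "distributed" *) reg [19:0] rom_hi [0:255];  // high byte: indices left-justified

  // Build the tables. Addresses with more than five bits set are never
  // looked up when exactly five bits of din are set, so they're just capped.
  integer a, b, n, lo_list, hi_list;
  initial begin
    for (a = 0; a < 256; a = a + 1) begin
      n = 0; lo_list = 0; hi_list = 0;
      for (b = 0; b < 8; b = b + 1)
        if (((a >> b) & 1) && n < 5) begin
          lo_list = lo_list | (b       << (4*n));        // indices 0..7, packed from the right
          hi_list = hi_list | ((b + 8) << (4*(4 - n)));  // indices 8..15, packed from the left
          n = n + 1;
        end
      rom_lo[a] = (n << 20) | (lo_list & 20'hFFFFF);
      rom_hi[a] = hi_list & 20'hFFFFF;
    end
  end

  wire [22:0] lo_out = rom_lo[din[7:0]];
  wire [19:0] hi_out = rom_hi[din[15:8]];
  wire [2:0]  n_lo   = lo_out[22:20];            // how many bits were set in the low byte

  // Take 4*n_lo bits from the low ROM and the rest from the high ROM.
  // Thanks to the right/left justification each output bit is a 2:1 mux.
  wire [19:0] sel = 20'hFFFFF << (4*n_lo);
  assign list = (lo_out[19:0] & ~sel) | (hi_out & sel);

endmodule

Each of the 20 output bits is just lo_out[i] or hi_out[i] selected by one bit
of sel, which is where the 2:1 mux per bit comes from.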

--
Gabor
 
rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)
Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

(snip)

How many levels of logic is a 256 bit ROM in FPGA LUTs?

Like most things, "it depends". The older devices have 16 bits per LUT,
if any, and many of the newer devices have 64 bits per LUT.

Nearly all devices have BRAMs so it's a no brainer by the split table
method. Only trouble is BRAMs use a clock cycle... I'm just sayin'...

OK, going back: the OP says that three levels of logic is fine.
He doesn't say how many pipeline stages.

I presume a clock is available for the BRAM, but I am not sure.

-- glen
 
The idea of working from both sides is useful. Finding the 5th set bit is a lot harder than finding the 1st because you have to keep a running sum of the bits already seen, and that eats up a few inputs of each LUT. If I search from both ends, the running sum never needs to exceed 2 (finding, say, 3 outputs starting from the top and 2 starting from the bottom). When I draw this out I still get at least 3 levels of logic plus a few levels of carry-chain mux, which I'd probably still have to pipeline by one stage, but that's acceptable. Vivado surely won't use the carry-chain muxes unless I instantiate them anyway, and without them it would be 5-6 levels of logic, so I'd definitely need to pipeline that one stage.
 
Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it to work: the first output is the easiest. You just find the leading 1 with a priority encoder and encode it. You can look at the first 5 bits with the first level, using the 6th LUT input as a cascade input from the next level if none of those 5 bits are set, and so on. This requires 4 levels of LUTs. One could use the carry-chain muxes to speed things up, but you'd have to instantiate them because Vivado doesn't seem to know how to do that. So that first output requires 4 levels of LUTs x 4 bits.

But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic. (I can't post pictures here, can I?) So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux. I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly. But it just seems like there should be some easier way.
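To make the running-sum point concrete, here is one output on its own (a
sketch only; the parameter just picks which set bit to encode):

module nth_set_bit   // illustrative
#(parameter K = 4)   // 0 = first set bit, 4 = fifth
(
  input  wire [15:0] din,
  output reg  [3:0]  idx
);

  reg [4:0] cnt;     // running sum of set bits seen so far
  integer i;

  always @* begin
    cnt = 5'd0;
    idx = 4'd0;
    for (i = 0; i < 16; i = i + 1) begin
      cnt = cnt + din[i];
      if (din[i] && cnt == K + 1) idx = i[3:0];   // this is the (K+1)th set bit
    end
  end

endmodule

For K=0 the running sum optimizes away and you're left with an ordinary
priority encoder; for K=4 those extra count bits are what chew up the LUT
inputs.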
 
Yes, that would work. I think it would be about 180 LUTs, which is quite a bit. It would probably work in one cycle: there is a LUT, the F7/F8 muxes, and a second level of LUT for the output muxing, with only 2 levels of nets routed on the fabric.
 
I have plenty of BRAMs and don't mind using them, but they're a pain sometimes. I have to use the output registers so they have 2 cycles of latency, and often I have to add another cycle just to route data to or from the BRAM column. They come in handy, though. Lately I've been using them for a lot of Galois arithmetic, such as lookup tables for 1/x.
 
On Tuesday, May 12, 2015 at 12:10:35 PM UTC-6, Gabor wrote:
glen herrmannsfeldt wrote:
GaborSzakacs <gabor@alacron.com> wrote:
Rob Doyle wrote:
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

64K x 20 ROM?

Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

How badly do you want one level of logic?

Two ROMs, 256 by 23 and 256 by 20. The first encodes the
lower 8 bits into 0 to 5 4-bit numbers right-justified plus
a 3-bit indication of how many bits were set. The second
just encodes 0 to 5 4-bit numbers, left justified (you assume
it has the remaining 1's not covered by the low 8 bits. Then
one more level to select how many bits to take from each ROM
based on the number of ones in the lower 8 bits (each output
bit is only a 2:1 mux in this case - that's why the ROMs
are justified right and left as noted).

If not in BRAM, that seems a good way.

How many levels of logic is a 256 bit ROM in FPGA LUTs?

-- glen


In Xilinx 7-series:

A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three
muxes (two for each pair of LUTS and one to combine the result), so it
would show up as 3 levels of logic, but it all routes internally to the
slice and the muxes are really fast.

Convincing the tools to use LUT memory is the fun part. Here's my test
code:

module simple_lut
(
input wire [7:0] addr,
output wire [7:0] data
);

(* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];

initial $readmemh ("../source/lutcontents.hex", lut_mem);

assign data = lut_mem[addr];

endmodule

--
Gabor

The ROM can be fast if you use the F7/F8 muxes built into the slice. I've found the key thing for V7 is to minimize the number of routes on the fabric. The F7/F8 muxes are slower than LUTs, I think, but since you don't have to route the connecting net onto the fabric you save a lot of time.
 
On 5/12/2015 11:54 AM, glen herrmannsfeldt wrote:
GaborSzakacs <gabor@alacron.com> wrote:
Rob Doyle wrote:
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

64K x 20 ROM?

Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

How badly do you want one level of logic?

Two ROMs, 256 by 23 and 256 by 20. The first encodes the
lower 8 bits into 0 to 5 4-bit numbers right-justified plus
a 3-bit indication of how many bits were set. The second
just encodes 0 to 5 4-bit numbers, left justified (you assume
it has the remaining 1's not covered by the low 8 bits. Then
one more level to select how many bits to take from each ROM
based on the number of ones in the lower 8 bits (each output
bit is only a 2:1 mux in this case - that's why the ROMs
are justified right and left as noted).

If not in BRAM, that seems a good way.

How many levels of logic is a 256 bit ROM in FPGA LUTs?

Like most things, "it depends". The older devices have 16 bits per LUT,
if any, and many of the newer devices have 64 bits per LUT.

Nearly all devices have BRAMs so it's a no brainer by the split table
method. Only trouble is BRAMs use a clock cycle... I'm just sayin'...

--

Rick
 
glen herrmannsfeldt wrote:
GaborSzakacs <gabor@alacron.com> wrote:
Rob Doyle wrote:
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

64K x 20 ROM?

Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

How badly do you want one level of logic?

Two ROMs, 256 by 23 and 256 by 20. The first encodes the
lower 8 bits into 0 to 5 4-bit numbers right-justified plus
a 3-bit indication of how many bits were set. The second
just encodes 0 to 5 4-bit numbers, left justified (you assume
it has the remaining 1's not covered by the low 8 bits. Then
one more level to select how many bits to take from each ROM
based on the number of ones in the lower 8 bits (each output
bit is only a 2:1 mux in this case - that's why the ROMs
are justified right and left as noted).

If not in BRAM, that seems a good way.

How many levels of logic is a 256 bit ROM in FPGA LUTs?

-- glen

In Xilinx 7-series:

A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three
muxes (two for each pair of LUTS and one to combine the result), so it
would show up as 3 levels of logic, but it all routes internally to the
slice and the muxes are really fast.

Convincing the tools to use LUT memory is the fun part. Here's my test
code:

module simple_lut
(
input wire [7:0] addr,
output wire [7:0] data
);

(* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];

initial $readmemh ("../source/lutcontents.hex", lut_mem);

assign data = lut_mem[addr];

endmodule

--
Gabor
 
On 5/12/2015 2:03 PM, glen herrmannsfeldt wrote:
rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)
Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

(snip)

How many levels of logic is a 256 bit ROM in FPGA LUTs?

Like most things, "it depends". The older devices have 16 bits per LUT,
if any, and many of the newer devices have 64 bits per LUT.

Nearly all devices have BRAMs so it's a no brainer by the split table
method. Only trouble is BRAMs use a clock cycle... I'm just sayin'...

OK, going back the OP says that three levels of logic is fine.
Doesn't say how many pipeline stages.

I presume a clock is available for the BRAM, but I am not sure.

The OP said, "minimal latency" and I think I over spec'd that to mean
combinatorial. So one clock for the BRAM and one clock for the muxing
should be ok. Better than the 5 clock cycles he mentions.

--

Rick
 
On 5/12/2015 2:09 PM, GaborSzakacs wrote:
glen herrmannsfeldt wrote:
GaborSzakacs <gabor@alacron.com> wrote:
Rob Doyle wrote:
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1

20-bit Output List:
1111 1110 0010 0001 0000

64K x 20 ROM?

Hmm, that probably does work. Especially with the BRAM in many FPGAs.

When I did it some years ago, it was three out of 36 bits, and ROMs
weren't, and still aren't, that big.

How badly do you want one level of logic?

Two ROMs, 256 by 23 and 256 by 20. The first encodes the
lower 8 bits into 0 to 5 4-bit numbers right-justified plus
a 3-bit indication of how many bits were set. The second
just encodes 0 to 5 4-bit numbers, left justified (you assume
it has the remaining 1's not covered by the low 8 bits. Then
one more level to select how many bits to take from each ROM
based on the number of ones in the lower 8 bits (each output
bit is only a 2:1 mux in this case - that's why the ROMs
are justified right and left as noted).

If not in BRAM, that seems a good way.
How many levels of logic is a 256 bit ROM in FPGA LUTs?

-- glen


In Xilinx 7-series:

A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three
muxes (two for each pair of LUTS and one to combine the result), so it
would show up as 3 levels of logic, but it all routes internally to the
slice and the muxes are really fast.

Convincing the tools to use LUT memory is the fun part. Here's my test
code:

module simple_lut
(
input wire [7:0] addr,
output wire [7:0] data
);

(* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];

initial $readmemh ("../source/lutcontents.hex", lut_mem);

assign data = lut_mem[addr];

endmodule

It would be 43 x 4 or 172 LUTs (23 + 20 output bits, with four 64-bit LUTs
per 256-deep bit). A fair amount, plus the muxes to combine the outputs.
Still, it is mostly in parallel so it should be fairly fast.

--

Rick
 
On Tuesday, May 12, 2015 at 4:29:39 PM UTC-6, rickman wrote:
On 5/12/2015 3:11 PM, Kevin Neilson wrote:
I have plenty of BRAMs and don't mind using them, but they're a pain
sometimes. I have to use the output registers so they have 2 cycles
of latency, and often I have to add another cycle just to route data
to or from the BRAM column. They come in handy, though. Lately I've
been using them for a lot of Galois arithmetic, such as lookup tables
for 1/x.

Why do you have to use the output registers? The clock to out time on a
BRAM has always been very fast as is the setup time. The ones I've
worked with were only slightly slower than a FF in the context of
typical delays in logic and fabric. What is your clock speed?

If you are working in a large part the LUTs are not an unreasonable way
to implement this. Not sure how fast the resulting logic will be, but
it should be in the same ballpark as the BRAM but purely combinatorial.
Do you need to run faster than 100 MHz?

--

Rick

I'm running at 350 MHz, a period of about 2.86 ns. The clk->out time for a V7 -1 BRAM (without the output register) is about 2.1 ns, so if I didn't use the BRAM output register, I'd barely have enough time to get the output across a net to a FF. And I know even that usually won't meet timing, because Vivado is fond of pulling the output registers out of my BRAMs and putting them into slices, I guess because it thinks the path has extra slack and can give some of it to the next path. But then the net to the FF will be 600 ps and the path will fail. I have not figured out how to make Vivado stop doing this (except by instantiating BRAM primitives).
 
