16->5 "Sort"

Kevin Neilson wrote:
> Yes, that would work. I think it would be about about 180 LUTs, which is quite a bit. It would probably work in one cycle: there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.

If you decide to pipeline it, a register placed after the 256-deep LUT
will go into the same slice with the 4 LUTs, 2 F7 muxes and F8 mux.
Then the final 2:1 would go after a standard fabric register, which
has pretty small clock to Q (much better than BRAM without the output
register). Even in a -2 Artix you can run above 500 MHz with this
arrangement.

--
Gabor
 
On 5/12/2015 3:11 PM, Kevin Neilson wrote:
I have plenty of BRAMs and don't mind using them, but they're a pain
sometimes. I have to use the output registers so they have 2 cycles
of latency, and often I have to add another cycle just to route data
to or from the BRAM column. They come in handy, though. Lately I've
been using them for a lot of Galois arithmetic, such as lookup tables
for 1/x.

Why do you have to use the output registers? The clock to out time on a
BRAM has always been very fast as is the setup time. The ones I've
worked with were only slightly slower than a FF in the context of
typical delays in logic and fabric. What is your clock speed?

If you are working in a large part the LUTs are not an unreasonable way
to implement this. Not sure how fast the resulting logic will be, but
it should be in the same ballpark as the BRAM but purely combinatorial.
Do you need to run faster than 100 MHz?

--

Rick
 
On 5/12/2015 2:57 PM, Kevin Neilson wrote:
Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it would work: the first output is the easiest. You just find the leading 1 with a priority encoder and encode it. You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on. This requires 4 levels of LUTs. One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that. So that first output requires 4 LUTs x 4 bits.

But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic. (I can't post pictures here, can I?) So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux. I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly. But it just seems like there should be some easier way.

Sorry, I just can't picture what you are doing. What is the "running
sum" for? I think I might understand. You look at the first 5 inputs
and output codes for all five positions. I'm not sure why you can't
look at the first 6 inputs though. This outputs a three bit code of the
number of 1's found. The second block looks at the next five inputs
and outputs five codes. The last five bits would be like the second
group and have a mux with the second group when in turn is what actually
feeds the first mux. The first group would be one level of LUTs. The
following two groups

Let me try to draw this...

,------, 3 ,-----,
0-5 | |--/------------------------|SEL |
-->--| | 20 | | 20
| |--/------------------------| BUM*|--/-->--
'------' | |
,---| |
,------, 3 ,-----, | '-----'
6-10 | |--/--------|SEL | |
-->--| | 20 | | |
| |--/--------| | |
'------' | | 20 |
| BUM*|--/--'
,------, | |
11-15 | | | |
-->--| | 20 | |
| |--/--------| |
'------' '-----'

*Big, Ugly Mux

The mux might be hard to work out and will surely be more than 1 level
of LUTs.... unless you can use the magic muxes in the slice to combine
multiple LUTs into a 6 input mux. You don't need any adders for the
counts since each 3 bit count controls a separate mux. This might just
work in three levels of LUTs if you can use multiple LUTs to form a 6
input mux.

I just read your post where you said you were running at 350 MHz. I
guess even this will have to be pipelined. But it should be less logic
than the brute force distributed RAM approach. But who knows until the
LUTs are counted? In essence this is the same thing I guess. It might
work better with the larger front end blocks and just one mux.

I'm very surprised the clock to out time on the V7 BRAM is 2.1ns. I
think that is about the same number as the Spartan 3s from long ago. Am
I mistaken?

--

Rick
 
On Tuesday, May 12, 2015 at 5:37:27 PM UTC-6, rickman wrote:
On 5/12/2015 2:57 PM, Kevin Neilson wrote:
Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it would work: the first output is the easiest. You just find the leading 1 with a priority encoder and encode it. You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on. This requires 4 levels of LUTs. One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that. So that first output requires 4 LUTs x 4 bits.

But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic. (I can't post pictures here, can I?) So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux. I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly. But it just seems like there should be some easier way.

Sorry, I just can't picture what you are doing. What is the "running
sum" for? I think I might understand. You look at the first 5 inputs
and output codes for all five positions. I'm not sure why you can't
look at the first 6 inputs though. This outputs a three bit code of the
number of 1's found. The second block looks at the next five inputs
and outputs five codes. The last five bits would be like the second
group and have a mux with the second group when in turn is what actually
feeds the first mux. The first group would be one level of LUTs. The
following two groups

Let me try to draw this...

,------, 3 ,-----,
0-5 | |--/------------------------|SEL |
-->--| | 20 | | 20
| |--/------------------------| BUM*|--/-->--
'------' | |
,---| |
,------, 3 ,-----, | '-----'
6-10 | |--/--------|SEL | |
-->--| | 20 | | |
| |--/--------| | |
'------' | | 20 |
| BUM*|--/--'
,------, | |
11-15 | | | |
-->--| | 20 | |
| |--/--------| |
'------' '-----'

*Big, Ugly Mux

The mux might be hard to work out and will surely be more than 1 level
of LUTs.... unless you can use the magic muxes in the slice to combine
multiple LUTs into a 6 input mux. You don't need any adders for the
counts since each 3 bit count controls a separate mux. This might just
work in three levels of LUTs if you can use multiple LUTs to form a 6
input mux.

I just read your post where you said you were running at 350 MHz. I
guess even this will have to be pipelined. But it should be less logic
than the brute force distributed RAM approach. But who knows until the
LUTs are counted? In essence this is the same thing I guess. It might
work better with the larger front end blocks and just one mux.

I'm very surprised the clock to out time on the V7 BRAM is 2.1ns. I
think that is about the same number as the Spartan 3s from long ago. Am
I mistaken?

--

Rick

The BRAM output is 2.1 ns, but if you use the output register (which I have to) it's 750 ps. Then the BRAM has 2 cycles of latency.

Yes, something like you show would work. The design I'd written up had the sums as inputs to the LUTs. So the top LUT could look at 6 bits (I said 5 originally because I was going to use the MUXCY but I abandoned that). Then the next LUT looks at 4 bits, and the other 2 inputs would be the 2-bit sum of the first 5 bits. And the next LUT looks at 4 more bits and also as a 2-bit sum of the first 10 bits. (This is for the 3rd encoded output so we're looking for the 3rd bit set.) I end up with 4 of these LUTs, 2 levels of LUTs to do the sums, and an F7/F8 mux afterward to pick one of the 4 LUTs. So that's 3 levels of LUTs and an F7/F8, which would work in 1 cycle. The whole thing would be about 100 LUTs.

I couldn't get that to work, though, because I can't get Vivado to synthesize anything right, and I was going to have to instantiate a lot of primitives (including the F7/F8 muxes). I couldn't even get Vivado to do the sums correctly. You should be able to find the mod-2 sum of up to 18 bits with 8 LUTs in 2 levels, but Vivado does 3 levels. It's pitiful.

I ended up doing something else. I did a trailing-one detector like this:

wire [15:0] trailing_1 = ~(input_vec[15:0]-1) & input_vec[15:0];

This uses the carry chain. I think the idea is from Knuth. That gives you a 16-bit vector with just the trailing 1 set.

You encode that for the 1st output. You the same thing with a mirrored version of input_vec to do a leading-one detector and encode that for the 2nd output. Then you XOR those two vectors with the original to get a vector with just the 3 middle bits still set. You do another leading/trailing 1 detector and encode those two and then XOR those with the original and you have a vector with 1 bit set and you encode that.

That's all 200 LUTs and I pipelined it for 3 cycles of latency. There's a lot of slack so I might be able to do it in 2 but I'm not sure if I want to risk it.
 

Welcome to EDABoard.com

Sponsor

Back
Top