RISC-V Support in FPGA

On 04/05/17 18:12, kristoff wrote:
Hi all,


As a follow-up in the RISC-V thread.


On 02-05-17 18:11, kristoff wrote:
Or, you can "mix-match" licenses. SiFive (the company that sells the
E310 CPU and HiFive dev boards) is an interesting example of this.
They open-sourced the RTL design but keep the knowledge of actually
implementing a RISC-V core as optimised as possible for themselves, as a
service to sell.

This was on eenews Europe today:
http://www.eenewseurope.com/news/sifive-launches-commercial-risc-v-processor-cores





As a small follow-up question:
Does anybody have any idea how to get the HiFive boards in Europe?

I got one from the crowdsupply site. I haven't got round to trying it
yet :-(

For the last thing I ordered in the US (a PandaBoard), I had to pay VAT
(ok, that's normal), but also a handling fee for the shipping company
and the customs service to get the thing shipped in.
In the end, these additional costs were more than the VAT itself.



Cheerio! Kr. Bonne.
 
On Thu, 04 May 2017 10:56:56 -0700, Kevin Neilson wrote:

I use Vivado to do GF multiplications that wide using purely
behavioural VHDL. BTW, a straightforward behavioural implementation
will *not* give good results with a wide bus.
I believe the problem is that most tools (in particular Vivado) do a
poor job of synthesising xor trees with a massive fanin (e.g. >> 100
bits). The optimisers have a poor complexity (I guess at least O(N^2),
but it might be exponential) wrt the size of the function.

You can use all sorts of mathematical tricks to make it work without
needing to go "low level".
For example, to deal with large fanin, partition your 512 bit input
into N slices of 512/N bits each. Use N multipliers, one for each
slice, put a keep (or equivalent) attribute on the outputs, then xor
the outputs together. This gives the same result, uses about the same
number of LUTs, but gives the optimiser in the tool a chance to do a
good job.
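
The slicing trick relies on GF multiplication being linear over GF(2): zero out all but one slice of the input, multiply each piece separately, and the XOR of the partial products equals the full product. The Python sketch below only checks that arithmetic on a small field. It is not the VHDL from this thread; the function names, the GF(2^8) field and the AES polynomial 0x11B are assumptions picked for illustration.

```python
def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m); poly is the field polynomial including x^m."""
    result = 0
    for i in range(m):                      # carry-less (XOR) multiply
        if (b >> i) & 1:
            result ^= a << i
    for bit in range(2 * m - 2, m - 1, -1): # reduce modulo poly
        if (result >> bit) & 1:
            result ^= poly << (bit - m)
    return result


def gf_mul_sliced(a, b, poly, m, n_slices):
    """Same product, computed as the XOR of n_slices partial products.

    Assumes m is divisible by n_slices (hypothetical helper, for illustration).
    """
    width = m // n_slices
    mask = (1 << width) - 1
    acc = 0
    for s in range(n_slices):
        part = a & (mask << (s * width))    # slice s in place, other bits zero
        acc ^= gf_mul(part, b, poly, m)     # one "multiplier" per slice
    return acc


# Small check in GF(2^8) with the AES polynomial (assumed example field).
assert gf_mul_sliced(0x57, 0x83, 0x11B, 8, 4) == gf_mul(0x57, 0x83, 0x11B, 8)
```

In hardware each partial product would be its own multiplier instance, with the final XOR stage kept separate so the optimiser only ever sees functions with a modest fanin.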


I use the same GF multiplier code in ISE and Quartus, too (but not on
buses that wide).

The entire flow is in VHDL and works in any LRM-compliant tool. It's
parameterised, too, so I don't need to rewrite for a different bus
width.


I've been using similar approaches in VHDL since the turn of the
century and have never been burned.

YMMV.

Regards,
Allan

I used to do big GF matrix multiplications in which you could set
parameters for the field size and field generator poly, etc. Vivado
just gets bogged down. Now I just expand that into a GF(2) matrix in
Matlab and dump it to a parameter and all Vivado has to know how to do
is XOR.
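
As a rough illustration of that expansion (in Python rather than Matlab, and for a single field element rather than a whole matrix): multiplying by a constant c in GF(2^m) is a linear map, so it can be written as an m x m matrix over GF(2) whose column j is c * x^j. Each output bit is then just the XOR of the input bits selected by one row. The names and the GF(2^8) example field are assumptions, not anything taken from the posts above.

```python
def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m); poly is the field polynomial including x^m."""
    result = 0
    for i in range(m):
        if (b >> i) & 1:
            result ^= a << i
    for bit in range(2 * m - 2, m - 1, -1):
        if (result >> bit) & 1:
            result ^= poly << (bit - m)
    return result


def const_mul_matrix(c, poly, m):
    """Return the rows of the GF(2) matrix for y = c * x.

    Row i is a bitmask: bit j set means input bit j is XORed into output bit i.
    """
    cols = [gf_mul(c, 1 << j, poly, m) for j in range(m)]  # column j = c * x^j
    rows = []
    for i in range(m):
        row = 0
        for j in range(m):
            if (cols[j] >> i) & 1:
                row |= 1 << j
        rows.append(row)
    return rows


def apply_matrix(rows, x):
    """Apply the GF(2) matrix to x using only AND and XOR (parity)."""
    y = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y


# Assumed example: constant 0x83 in GF(2^8) with polynomial 0x11B.
rows = const_mul_matrix(0x83, 0x11B, 8)
assert apply_matrix(rows, 0x57) == gf_mul(0x83, 0x57, 0x11B, 8)
```

For a matrix over GF(2^m), every element would expand into one such m x m block, and the resulting bit matrix is what gets dumped into the HDL parameter.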

I also have problems with the wide XORs. Multiplication by a big GF(2)
matrix means a wide XOR for each column. Vivado tries to share LUTs
with common subexpressions across the columns. Too much sharing. That
sounds like a good thing, but it's not smart enough to know how much
it's impacting timing. You save LUTs, but you end up with a routing
mess and too many levels of logic and you don't come close to meeting
timing at all. So then I have to make a generate loop and put
subsections of the matrix in separate modules and use directives to
prevent optimizing across boundaries. (KEEPs don't work.) It's all a
pain. But then I end up with something a little bigger but which meets
timing.

I thought about my historical code some more, and I realised that I did
have some examples of behavioural GF multipliers that didn't work as well
as the same function expressed as a bunch of wide xors.

The particular example I'm thinking of had a 128 in, 128 xor tree that
really shouldn't be any harder to synth than a CRC. It's a linear
mapping stage in an SP block cipher (like AES, but not AES (which has a
relatively weak mixing function)).

Vivado gave (IIRC) 11 or 12 levels of logic rather than the expected 3
levels of logic. Hmmm. The revised source code (expressed as a bunch of
xors) produced 4 levels of logic, and routed to speed.

BTW, I used my VHDL testbench for the original function to write out the
VHDL for the xor tree.
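
A generator of that kind can be very small. The sketch below does it in Python rather than in a VHDL testbench, and it is not the code used here: it takes a GF(2) matrix as a list of row bitmasks and prints one explicit xor assignment per output bit. The signal names d and q and the function name are hypothetical.

```python
def emit_xor_vhdl(rows, in_name="d", out_name="q"):
    """Emit one VHDL xor assignment per output bit.

    rows[i] is a bitmask; bit j set means input bit j feeds output bit i.
    """
    lines = []
    for i, row in enumerate(rows):
        taps = [f"{in_name}({j})" for j in range(row.bit_length()) if (row >> j) & 1]
        rhs = " xor ".join(taps) if taps else "'0'"   # empty row drives a constant
        lines.append(f"{out_name}({i}) <= {rhs};")
    return "\n".join(lines)


# Tiny assumed example: q(0) = d(0) xor d(2), q(1) = d(1).
print(emit_xor_vhdl([0b101, 0b010]))
```

The generated lines are meant to be pasted into (or included from) an architecture body where d and q are std_logic_vector signals.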


> I really wish there were a way to use the carry chains for wide XORs.

I think that carry chains (and similar structures) became less important
for wide functions once six input LUTs became commonplace.

The Xilinx DSP48E2 has a wide xor mode that I think can give a 96 input
xor in a single DSP48E2 slice. I've never tried it.

Regards,
Allan
 
> The particular example I'm thinking of had a 128 in, 128 xor tree that
> really shouldn't be any harder to synth than a CRC. It's a linear
> mapping stage in an SP block cipher (like AES, but not AES (which has a
> relatively weak mixing function)).
>
> Vivado gave (IIRC) 11 or 12 levels of logic rather than the expected 3
> levels of logic. Hmmm. The revised source code (expressed as a bunch of
> xors) produced 4 levels of logic, and routed to speed.

Same here. I have constant multiplier matrices and each has a column
weight of about 160, so I end up with a 160-input XOR for each column.
Ideally that would be log6(160)=2.8 levels. First I have to use very
low-level code, and even then Vivado shares subexpressions too much and
I end up with 6 levels unless I isolate column groups in different
modules. If I isolate each column in its own module I can get the 3
levels. Isolating column groups also means they are placed as a group,
which reduces wire lengths.
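
The level count can be checked with a few lines of arithmetic: log6(160) is about 2.83, and since a level cannot be fractional the practical minimum is 3. The Python sketch below counts levels by repeatedly packing the inputs into 6-input LUTs. It assumes an ideal balanced tree with no sharing between columns, and the function name is made up for illustration.

```python
def xor_levels(n_inputs, lut_size=6):
    """Minimum LUT levels for an n-input XOR built from lut_size-input LUTs."""
    levels = 0
    while n_inputs > 1:
        n_inputs = -(-n_inputs // lut_size)  # ceiling division: LUTs needed at this level
        levels += 1
    return levels


# 160 inputs -> 27 LUTs -> 5 LUTs -> 1 LUT, so three levels.
assert xor_levels(160) == 3
# The 128-input case mentioned earlier in the thread also needs three levels.
assert xor_levels(128) == 3
```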

> The Xilinx DSP48E2 has a wide xor mode that I think can give a 96 input
> xor in a single DSP48E2 slice. I've never tried it.

Yeah, I looked into this at one point but decided against it for a few
reasons. I thought a nice feature would be to be able to turn off the
carries in the DSP48 and then you could use them for GF multipliers. I
have used DSP48s as GF(2) accumulators and I've used them as transposers
to extract column data from rows stored in RAMs.
 
