RISC-V Support in FPGA

I probably have 20 lines of ARM assembly written, and in retrospect that
could just as well be carefully-crafted C. Assuming that FreeRTOS makes
a port, everything else is C or C++, and could just be compiled for the
new target.

I don't know about the cell phone companies -- are they really that
heavily invested in processor-specific stuff?

I know it's not trivial to port to a new processor, but if there are savings to be had, even small ones, there is no loyalty that will keep somebody stuck with ARM. Apple switched processors after decades in the business and it didn't seem to hurt them much. I think people are conflating ubiquity with monopoly. People can switch, and as cellphone (for example) margins erode, this is one of the places where savings might be eked out.
 
On Tuesday, May 2, 2017 at 02:15:05 UTC+2, Rob Gaddi wrote:
On 05/01/2017 04:46 PM, Tim Wescott wrote:
On Mon, 01 May 2017 16:07:02 -0700, Kevin Neilson wrote:

I don't know how small the RISC-V can be made. I know there is a
version designed in an ASIC that can compete with the ARM CPUs and
there are more than one version for FPGAs. I would hope they had a
version similar to the ARM CM-1 which is specifically targeted to
programmable logic and not overly large.

Speaking of ARM, I still can't figure out how ARM was acquired for $32B.
If even a student can make a synthesizable 32-bit processor in a few
weeks, how much value can there be in a processor? It's almost a
commodity. I know there is a lot of value in prediction pipelines,
cache logic, compilers, etc., but not $32B worth.

So, maybe the people who SOLD it are laughing their way to the bank.

ARM processor variants have a huge installed base -- I suspect that went
a long way to justifying the $32B. But, if ST started offering parts
with the RISC-V core tomorrow, at a better price, I'd switch.


You would. I probably wouldn't, having a larger team to drag around and
all of the associated infrastructure.

But the cell phone companies, with all that already written codebase and
10s of millions of units sold per year? Not a chance they do. That's
billions of dollars of inertia.

I'm also sure that ARM has a crap ton of patents.
 
On 02/05/17 17:24, rickman wrote:
On 5/2/2017 3:12 AM, David Brown wrote:

A sizeable part of that is hidden in the three key components - the OS
kernel, the basic libraries, and the compiler. The huge majority of the
code on a telephone is cpu agnostic. Most of it got bumped from 32-bit
ARM to 64-bit ARM without much bother, and the 32 to 64 bit jump is
often a bigger port issue than moving between different 32-bit
architectures.

The proportion of the code to be modified is not relevant, only the
amount. It is also not relevant where the code resides. If you want to
port to a new processor you will have to touch every bit of code that is
specific to the processor, period.

That is true - but it is rather important that the parts that are
processor specific are limited in scope.

Of course, the rest of the code (at least, the C or C++ code) needs to
be checked and tested - there may be accidental processor specific code
such as alignment errors that happen to work fine on one processor but
cause bus errors on another one.

Changing architectures is not a small job, but it is not /that/ bad. In
the Linux world, the great majority of code works fine on a range of
32-bit and 64-bit targets, big-endian and little-endian, with little
more than a re-compile with the right gcc version (perhaps with a
./configure first). And of course, code in Java, Javascript, Python,
Lua, HTML5, etc., is all independent of the target cpu anyway.

I don't know if the current state of the RISC-V tools is good enough,
however - I believe the Linux port of RISC-V is quite new, and the gcc
port has just been redone. The big customers will want to see a bit of
maturity before considering RISC-V.

For us mere mortals, however, RISC-V is a great idea. If nothing else,
it gives ARM some much-needed competition (which should have come from
MIPS).

I had the impression MIPS is still a viable contender in many markets,
but mostly built into ASICs.

MIPS certainly still exists, and is used in a fair number of devices.
MIPS makes cores that are at least as good as ARM's in many areas,
including some that would be ideal in microcontrollers. But the only
off-the-shelf microcontrollers you can get with MIPS cores are the PIC32
- a device which was launched long before it was working, had
intentionally crippled development tools, and a name designed to evoke
terror in any software developer. IMHO, that greatly reduced MIPS's
chances of being a significant player in the microcontroller market, and
I find that a great shame.

I wonder how important the royalties are when designing a CPU into an
SOC. I believe the RISC-V is totally royalty free. I'm not familiar
with the BSD license, but I think it allows for commercial versions. So
a company may spring up that adds significant value and charges
royalties.

The ISA is royalty free, but as far as I know there is nothing to stop
you charging for a particular implementation (either as HDL source code,
pre-generated macros, silicon, or whatever).
 
On Tue, 02 May 2017 22:10:30 +0200, David Brown wrote:

On 02/05/17 17:24, rickman wrote:

I wonder how important the royalties are when designing a CPU into an
SOC. I believe the RISC-V is totally royalty free. I'm not familiar
with the BSD license, but I think it allows for commercial versions. So
a company may spring up that adds significant value and charges
royalties.


The ISA is royalty free, but as far as I know there is nothing to stop
you charging for a particular implementation (either as HDL source code,
pre-generated macros, silicon, or whatever).

If there's a value to the royalty-freeness it'll be in the wide usage,
and the fact that manufacturers won't have to screw around with the legal
issues.

If it ever came to it that ARM was losing out to RISC-V implementations,
I could see them taking their considerable ARM-spertise and applying it
to RISC-V-compatible cores.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!
 
Mark Curry <gtwrek@sonic.net> wrote:
I gave it a quick glance. It all looks synthesizable to me. We've used
SystemVerilog in both Vivado, and Synplify, and I think the code should
work fine. YMMV.

A primary motivation was to teach SystemVerilog to undergrads - rather than
teach them lowest-common-denominator Verilog that's universally accepted by
tools but is pretty tedious as a learning environment.

We tested it pretty extensively with Modelsim and Intel FPGA tools; we
didn't have enough summer to put it through Xilinx or ASIC tools, but
we're happy to fix things if there are any issues.

Theo
 
The soft core world is always changing, but looking at
www.ijireeice.com/upload/2015/december-15/IJIREEICE%2041.pdf seems to
indicate that 300 MHz is fast for a soft core, with many soft cores (at
opencores.org for example) running much slower than this. Note that
Table 1 in the referenced article does not seem to distinguish between
soft cores running in an FPGA and those implemented in an ASIC.

On 05/02/2017 09:55 AM, rickman wrote:
On 5/2/2017 11:59 AM, Robert F. Jarnot wrote:
See http://fpga.org/grvi-phalanx/ Processor clock speed is 300-375 MHz
in a Kintex UltraScale. Placement constraints are required to get this
kind of clock speed, something that Jan Gray is very good at. Follow the
link to the 'Best Short Paper Award' for more details on how the logic
is partitioned between processor and router. For another perspective on
RISC-V see
http://www.adapteva.com/andreas-blog/why-i-will-be-using-the-risc-v-in-my-next-chip/


I suppose 300 MHz is pretty fast in an FPGA. But what would the
comparison point be to call it "fast"? It should be compared to other
processors.
 
On 5/2/2017 2:12 PM, Kevin Neilson wrote:
I probably have 20 lines of ARM assembly written, and in retrospect
that could just as well be carefully-crafted C. Assuming that
FreeRTOS makes a port, everything else is C or C++, and could just
be compiled for the new target.

I don't know about the cell phone companies -- are they really that
heavily invested in processor-specific stuff?

I know it's not trivial to port to a new processor, but if there are
savings to be had, even small ones, there is no loyalty that will
keep somebody stuck to ARM.

I don't doubt that for a minute. It would be a business driven decision
and both the cost and the benefit would be considered.


Apple switched processors after decades
in the business and it didn't seem to affect business too much.

The issue is not how it would "affect business" since there should only
be benefits business-wise, otherwise why switch? The issue is the cost
of switching.


I
think people are conflating ubiquity with monopoly. But people can
switch, and as cellphone (for example) margins erode, this is one of
the places where savings might be eked out.

Yep.

--

Rick C
 
We tested it pretty extensively with Modelsim and Intel FPGA tools; we
didn't have enough summer to put it through Xilinx or ASIC tools but happy
to fix things if there's any issues.

Theo

At first glance I thought I'd seen some object-oriented stuff in there but it was just structs. I actually used a lot of SystemVerilog a few years ago when I was only using Synplify, but now I write cores that have to work in a broad range of synthesizers which sadly don't even accept many Verilog-2005 constructs.
 
On 5/2/2017 5:52 PM, Kevin Neilson wrote:
We tested it pretty extensively with Modelsim and Intel FPGA tools; we
didn't have enough summer to put it through Xilinx or ASIC tools but happy
to fix things if there's any issues.

Theo

At first glance I thought I'd seen some object-oriented stuff in there but it was just structs. I actually used a lot of SystemVerilog a few years ago when I was only using Synplify, but now I write cores that have to work in a broad range of synthesizers which sadly don't even accept many Verilog-2005 constructs.

I wonder what is behind that. Much of VHDL-2008 is supported in most
tools, at least all the good stuff. I believe the Xilinx tools don't
include 2008, but I haven't tried it. Otherwise I'm told the third
party vendors support it and the Lattice tools I've used do a nice job
of it.

I can't understand a vendor being so behind the times.

--

Rick C
 
In article <oebfia$o4a$2@dont-email.me>, rickman <gnuarm@gmail.com> wrote:
On 5/2/2017 5:52 PM, Kevin Neilson wrote:
We tested it pretty extensively with Modelsim and Intel FPGA tools; we
didn't have enough summer to put it through Xilinx or ASIC tools but happy
to fix things if there's any issues.

Theo

At first glance I thought I'd seen some object-oriented stuff in there but it was just structs. I actually used a lot of SystemVerilog a few
years ago when I was only using Synplify, but now I write cores that have to work in a broad range of synthesizers which sadly don't even accept
many Verilog-2005 constructs.

I wonder what is behind that. Much of VHDL-2008 is supported in most
tools, at least all the good stuff. I believe the Xilinx tools don't
include 2008, but I haven't tried it. Otherwise I'm told the third
party vendors support it and the Lattice tools I've used do a nice job
of it.

I can't understand a vendor being so behind the times.

Rick - yeah, it's pathetic. The synthesizable subset of SystemVerilog was
actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.
We're just now - 12 years later - really finding an acceptable solution for
FPGA designs. To repeat myself - it's really pathetic.

Vivado seems to actually have BETTER language support for SystemVerilog than
Synplify - believe it or not. But this only works so far until you hit some
sort of corner case and the tool spits out a netlist which doesn't match the
RTL. (We've hit too many of those issues in the past 2-3 years).

Synplify, on the other hand barfs on perfectly acceptable, synthesizable code
(i.e. SystemVerilog features that already have parallels in VHDL). But
Synplify has never (for us) produced a netlist which doesn't match RTL...

Regards,

Mark
 
On 5/3/2017 11:22 AM, Mark Curry wrote:
In article <oebfia$o4a$2@dont-email.me>, rickman <gnuarm@gmail.com> wrote:
On 5/2/2017 5:52 PM, Kevin Neilson wrote:
We tested it pretty extensively with Modelsim and Intel FPGA tools; we
didn't have enough summer to put it through Xilinx or ASIC tools but happy
to fix things if there's any issues.

Theo

At first glance I thought I'd seen some object-oriented stuff in there but it was just structs. I actually used a lot of SystemVerilog a few
years ago when I was only using Synplify, but now I write cores that have to work in a broad range of synthesizers which sadly don't even accept
many Verilog-2005 constructs.

I wonder what is behind that. Much of VHDL-2008 is supported in most
tools, at least all the good stuff. I believe the Xilinx tools don't
include 2008, but I haven't tried it. Otherwise I'm told the third
party vendors support it and the Lattice tools I've used do a nice job
of it.

I can't understand a vendor being so behind the times.

Rick - yeah, it's pathetic. The synthesizable subset of SystemVerilog was
actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.
We're just now - 12 years later really finding an acceptable solution for
FPGA designs. To repeat myself - It's really pathetic.

Vivado seems to actually have BETTER language support for SystemVerilog than
Synplify - believe it or not. But this only works so far until you hit some
sort of corner case and the tool spits out a netlist which doesn't match the
RTL. (We've hit too many of those issues in the past 2-3 years).

Synplify, on the other hand barfs on perfectly acceptable, synthesizable code
(i.e. SystemVerilog features that already have parallels in VHDL). But
Synplify has never (for us) produced a netlist which doesn't match RTL...

Am I hearing a justification for staying with VHDL rather than learning
Verilog, as I've been intending for some time? My understanding is that
to write test benches like those VHDL can support, it is useful to have
SystemVerilog. Or is this idea overblown?

--

Rick C
 
In article <oed1dk$a7b$2@dont-email.me>, rickman <gnuarm@gmail.com> wrote:
On 5/3/2017 11:22 AM, Mark Curry wrote:
In article <oebfia$o4a$2@dont-email.me>, rickman <gnuarm@gmail.com> wrote:
On 5/2/2017 5:52 PM, Kevin Neilson wrote:
We tested it pretty extensively with Modelsim and Intel FPGA tools; we
didn't have enough summer to put it through Xilinx or ASIC tools but happy
to fix things if there's any issues.

Theo

At first glance I thought I'd seen some object-oriented stuff in there but it was just structs. I actually used a lot of SystemVerilog a few
years ago when I was only using Synplify, but now I write cores that have to work in a broad range of synthesizers which sadly don't even accept
many Verilog-2005 constructs.

I wonder what is behind that. Much of VHDL-2008 is supported in most
tools, at least all the good stuff. I believe the Xilinx tools don't
include 2008, but I haven't tried it. Otherwise I'm told the third
party vendors support it and the Lattice tools I've used do a nice job
of it.

I can't understand a vendor being so behind the times.

Rick - yeah, it's pathetic. The synthesizable subset of SystemVerilog was
actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.
We're just now - 12 years later really finding an acceptable solution for
FPGA designs. To repeat myself - It's really pathetic.

Vivado seems to actually have BETTER language support for SystemVerilog than
Synplify - believe it or not. But this only works so far until you hit some
sort of corner case and the tool spits out a netlist which doesn't match the
RTL. (We've hit too many of those issues in the past 2-3 years).

Synplify, on the other hand barfs on perfectly acceptable, synthesizable code
(i.e. SystemVerilog features that already have parallels in VHDL). But
Synplify has never (for us) produced a netlist which doesn't match RTL...

Am I hearing a justification for staying with VHDL rather than learning
Verilog as I've been intending for some time? My understanding is that
to write test benches like what VHDL can do it is useful to have
SystemVerilog. Or is this idea overblown?

Rick - I was speaking of Synthesizer support within FPGA tools only.

Simulation support depends entirely on your vendor, and is an entirely
different beast. We've been happy with Modelsim for all our SystemVerilog
simulations - for many years. Can't comment much on other simulation
vendors, and their support. I've not used VCS, or NCSIM (or whatever
they're now called) in many years. Never tried Xilinx "free" simulators,
but for "free" I'd expect you'd get what you pay for.

I'll not wade any deeper into language wars - use what you're most
comfortable with. Doesn't hurt to have experience with both.

Regards,

Mark
 
Rick - yeah, it's pathetic. The synthesizable subset of SystemVerilog was
actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.
We're just now - 12 years later really finding an acceptable solution for
FPGA designs. To repeat myself - It's really pathetic.

In my case, I mostly write for Vivado, but I have to write code which will also work for some ASIC synthesis tools which don't like anything too modern. I'm not sure why; I just know I have to keep to a lowest common denominator.

Anyway, and this is a different topic altogether, I've reverted to writing very low-level code for Vivado. I've given up the dream of parameterizable HDL. I do a lot of Galois Field arithmetic and I put all my parameterization in Matlab and generate Verilog include files (mostly long parameters) from that. The Verilog then looks about as understandable as assembly and I hate doing it but I have to. It's the same thing I was doing over ten years ago with Perl but now do with Matlab. Often Vivado will synthesize the high-level version with functions and nested loops, but it is an order of magnitude slower (synthesis time) than the very low-level version. And sometimes it doesn't synthesize how I like. I've just given up on high-level synthesizable code.
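The generate-an-include-file flow Kevin describes can be sketched in a few lines of script. Here's a hedged Python equivalent (the LFSR width, taps, and parameter name are made-up examples for illustration, not taken from his actual Matlab flow):

```python
# Sketch (in Python rather than Matlab) of the generate-an-include-file
# flow described above.  Width, taps, and parameter name are illustrative.

def lfsr_matrix(taps, width):
    """GF(2) companion matrix for a Fibonacci LFSR: bits shift down one
    place and the tap bits XOR into bit 0."""
    m = [[0] * width for _ in range(width)]
    for i in range(1, width):
        m[i][i - 1] = 1          # plain shift
    for t in taps:
        m[0][t] ^= 1             # feedback taps
    return m

def pack_rows(m):
    """Flatten the matrix row by row into one integer, so the whole
    matrix fits in a single long Verilog parameter."""
    width = len(m)
    value = 0
    for row in m:
        bits = 0
        for j, b in enumerate(row):
            bits |= b << j
        value = (value << width) | bits
    return value

WIDTH = 16
TAPS = [15, 13, 12, 10]          # illustrative polynomial
packed = pack_rows(lfsr_matrix(TAPS, WIDTH))
print("localparam [%d:0] LFSR_M = %d'h%x;"
      % (WIDTH * WIDTH - 1, WIDTH * WIDTH, packed))
```

The printed `localparam` line would go into a `.vh` include file; the synthesizer then only ever sees a constant, which is the point of the whole exercise.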
 
In article <e4a7957e-c42e-4846-b8ab-ebcf170238dd@googlegroups.com>,
Kevin Neilson <kevin.neilson@xilinx.com> wrote:
Rick - yeah, it's pathetic. The synthesizable subset of SystemVerilog was
actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.
We're just now - 12 years later really finding an acceptable solution for
FPGA designs. To repeat myself - It's really pathetic.

In my case, I mostly write for Vivado, but I have to write code which will also work for some ASIC synthesis tools which don't like anything too
modern. I'm not sure why; I just know I have to keep to a low common denominator.

Anyway, and this is a different topic altogether, I've reverted to writing very low-level code for Vivado. I've given up the dream of
parameterizable HDL. I do a lot of Galois Field arithmetic and I put all my parameterization in Matlab and generate Verilog include files (mostly
long parameters) from that. The Verilog then looks about as understandable as assembly and I hate doing it but I have to. It's the same thing I
was doing over ten years ago with Perl but now do with Matlab. Often Vivado will synthesize the high-level version with functions and nested loops,
but it is an order of magnitude slower (synthesis time) than the very low-level version. And sometimes it doesn't synthesize how I like. I've just
given up on high-level synthesizable code.

(continuing a bit OT...)

Kevin,

That's unfortunate. We've been very successful with writing parameterizable
code - even before SystemVerilog, heck, even before Verilog-2001. Things
like N-tap FIRs, 2-D FIRs, FFTs, video blenders, etc. - all with
configurable settings: bit widths, rounding/truncation options, and so on.
I think in a previous job I had a parameterizable Galois Field multiplier too.

I'm not sure what trouble you had with the tools. It takes a bit more
up-front work, but pays off quite a bit in the end. We really had no choice,
given the number of FPGAs we do, along with how many engineers support them.
Lots of shared code was the only way to go.

If you've got something you like, then I suggest keeping it. But for others,
I think writing parameterizable HDL isn't too much trouble - and is made
even easier with SystemVerilog. And higher level too.

Regards,

Mark
 
(continuing a bit OT...)

Kevin,

That's unfortunate. We've been very successful with writing parameterizable code - even
before SystemVerilog. Heck even before Verilog-2001. Things like N-Tap FIRs,
Two-D FIRs. FFTs, Video Blenders, etc... All with configurable settings -
bit widths, rounding/truncation options/etc.. I think in a previous job I had a
parametizable Galois Field Multiplier too.

I'm not sure what trouble you had with the tools. It takes a bit more up front work,
but pays off quite a bit in the end. We really had no choice, given the number of
FPGAs we do, along with how many engineers support them. Lot's of shared code
was the only way to go.

If you've got something you like, then I suggest keeping it. But for others,
I think writing parameterizable HDL isn't too much trouble - and is made
even easier with SystemVerilog. And higher level too.

Regards,

Mark

I've just been burned too many times. I know better now. The last time I made the mistake I was just making a simple PN generator (LFSR). The only complication was that it was highly parallel--I think I had to generate maybe 512 bits per cycle, so it ends up being a big matrix multiplication over GF(2). First I made the high-level version where you could set parameters for the width and taps and so on. It took forever for Vivado to crank on it. This is just a few lines of code, mind you, and is just a bunch of XORs. Then I had Matlab generate an include file with the matrix packed into a long parameter which essentially sets up XOR taps. That was, I think, ~20x faster, which translated into hours of synthesis time saved. The synthesized circuit was also better for various reasons.

This is just one example. I also still have to instantiate primitives frequently for various reasons. The level of abstraction doesn't seem like it's changed much in 15 years if you really need performance. This doesn't really have anything to do with the SystemVerilog constructs; I'm just talking about high-level code in general. If I were allowed, I would still use modports, structs, enums, etc.
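For reference, the "big matrix multiplication over GF(2)" view works like this: stepping an LFSR k bits per clock is applying the k-th power of the one-bit transition matrix. A Python sketch (widths are illustrative, nothing here is from Kevin's actual design):

```python
# Parallel LFSR as a GF(2) matrix power: one application of M^k advances
# the state k bits.  Illustrative only.

def gf2_matmul(a, b):
    """Multiply two square matrices over GF(2) (AND for *, XOR for +)."""
    n = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(n)) & 1
             for j in range(n)] for i in range(n)]

def gf2_matpow(m, k):
    """Square-and-multiply matrix power over GF(2)."""
    n = len(m)
    r = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    while k:
        if k & 1:
            r = gf2_matmul(r, m)
        m = gf2_matmul(m, m)
        k >>= 1
    return r

def step(m, state):
    """One application of the transition matrix to a bit-vector state."""
    return [sum(m[i][j] & state[j] for j in range(len(state))) & 1
            for i in range(len(m))]
```

Applying `gf2_matpow(m, 512)` once per clock is equivalent to 512 single-bit shifts, and each row of the powered matrix is exactly one wide XOR - which is why the synthesized result is "just a bunch of XORs".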
 
In article <a66c4c17-6f43-4aec-9dd5-c06badf5b11f@googlegroups.com>,
Kevin Neilson <kevin.neilson@xilinx.com> wrote:
(continuing a bit OT...)

Kevin,

That's unfortunate. We've been very successful with writing parameterizable code - even
before SystemVerilog. Heck even before Verilog-2001. Things like N-Tap FIRs,
Two-D FIRs. FFTs, Video Blenders, etc... All with configurable settings -
bit widths, rounding/truncation options/etc.. I think in a previous job I had a
parametizable Galois Field Multiplier too.

I'm not sure what trouble you had with the tools. It takes a bit more up front work,
but pays off quite a bit in the end. We really had no choice, given the number of
FPGAs we do, along with how many engineers support them. Lot's of shared code
was the only way to go.

If you've got something you like, then I suggest keeping it. But for others,
I think writing parameterizable HDL isn't too much trouble - and is made
even easier with SystemVerilog. And higher level too.

Regards,

Mark

I've just been burned too many times. I know better now. The last time I made the mistake I was just making a simple PN generator (LFSR). The
only complication was that it was highly parallel--I think I had to generate maybe 512 bits per cycle, so it ends up being a big matrix
multiplication over GF(2). First I made the high-level version where you could set a parameters for the width and taps and so on. It took forever
for Vivado to crank on it. This is just a few lines of code, mind you, and is just a bunch of XORs. Then I had Matlab generate an include file
with the matrix packed into a long parameter which essentially sets up XOR taps. That was, I think, ~20x faster, which translated into hours of
synthesis time. The synthesized circuit was also better for various reasons. This is just one example. I also still have to instantiate
primitives frequently for various reasons. The level of abstraction doesn't seem like it's changed much in 15 years if you really need performance.
This doesn't really have anything to do with the SystemVerilog constructs. I'm just talking about high-level code in general. If I were allowed, I
would still use modports, structs, enums, etc.

Ah, we did find something similar in Vivado. For us it was a large parallel
CRC - which is pretty much functionally identical to your LFSR (big XOR trees).

We had code that calculated, basically a shift table to calculate the CRC of a long word.
The RTL code worked fine for ISE. But when we hit Vivado, it'd pause 10 minutes or so
over each instance (we had lots) which significantly hit our build times.

So, I changed this code to "almost" self-modifying code. The code would by default
calculate the shift matrix using our "normal" RTL, which looked something like:
assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );
where H_zero was a "matrix" of constants, and NUM_ZEROS_MINUS_ONE a static
parameter. The end result is a matrix of constants as well, but "dynamically"
calculated. (Here "dynamically" means once, at elaboration time, since all
inputs to the function were static.)

Then we just added code to dump each unknown table entry sort-of like:
if( ( POLY_WIDTH == 8 ) && ( NUM_ZEROS_MINUS_ONE == 7 ) && ( POLYNOMIAL == 'h2f ) )
assign H_n_o = 'hd4eaf52e175ffba9;
...
else // no table entry - use default RTL calc
assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );

We "closed" the loop by hand. If the "table" entry didn't exist, the tool
would use the RTL definition and spit out the pre-calculated entry. All done
in Verilog. We'd insert that new table entry into our source code by hand
and continue - the next time, the build would be quicker.

This *workaround* was a bit of a kludge, but was the rare (really the only)
exception for us in our parameterized code. Normally the tools just handled
things fine. And again, to be clear, the only thing we were working around
was long synthesis times. The quality of results was fine in either case.

Maybe with the code you were creating, the pendulum swings the other way,
and it was more the norm than the exception to see things like this.

Interesting topic, I'm glad to hear of your (and others) experiences.

Regards,

Mark
 
On Wed, 03 May 2017 13:39:38 -0700, Kevin Neilson wrote:

(continuing a bit OT...)

Kevin,

That's unfortunate. We've been very successful with writing
parameterizable code - even before SystemVerilog. Heck even before
Verilog-2001. Things like N-Tap FIRs,
Two-D FIRs. FFTs, Video Blenders, etc... All with configurable
settings -
bit widths, rounding/truncation options/etc.. I think in a previous
job I had a parametizable Galois Field Multiplier too.

I'm not sure what trouble you had with the tools. It takes a bit more
up front work, but pays off quite a bit in the end. We really had no
choice, given the number of FPGAs we do, along with how many engineers
support them. Lot's of shared code was the only way to go.

If you've got something you like, then I suggest keeping it. But for
others,
I think writing parameterizable HDL isn't too much trouble - and is
made even easier with SystemVerilog. And higher level too.

Regards,

Mark

I've just been burned too many times. I know better now. The last time
I made the mistake I was just making a simple PN generator (LFSR). The
only complication was that it was highly parallel--I think I had to
generate maybe 512 bits per cycle, so it ends up being a big matrix
multiplication over GF(2). First I made the high-level version where
you could set a parameters for the width and taps and so on. It took
forever for Vivado to crank on it. This is just a few lines of code,
mind you, and is just a bunch of XORs. Then I had Matlab generate an
include file with the matrix packed into a long parameter which
essentially sets up XOR taps. That was, I think, ~20x faster, which
translated into hours of synthesis time. The synthesized circuit was
also better for various reasons. This is just one example. I also
still have to instantiate primitives frequently for various reasons.
The level of abstraction doesn't seem like it's changed much in 15 years
if you really need performance. This doesn't really have anything to do
with the SystemVerilog constructs. I'm just talking about high-level
code in general. If I were allowed, I would still use modports,
structs, enums, etc.

I use Vivado to do GF multiplications that wide using purely behavioural
VHDL. BTW, a straightforward behavioural implementation will *not* give
good results with a wide bus.
I believe the problem is that most tools (in particular Vivado) do a poor
job of synthesising XOR trees with a massive fanin (e.g. >> 100 bits).
The optimisers have poor complexity (I guess at least O(N^2), but it
might be exponential) wrt the size of the function.

You can use all sorts of mathematical tricks to make it work without
needing to go "low level".
For example, to deal with large fanin, partition your 512-bit input into
N slices of 512/N bits each. Use N multipliers, one for each slice, put
a keep (or equivalent) attribute on the outputs, then XOR the outputs
together. This gives the same result and uses about the same number of
LUTs, but gives the optimiser in the tool a chance to do a good job.
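The slicing trick is easy to sanity-check in software. A Python model (illustrative sizes, none of it from Allan's actual design) showing that XORing per-slice partial products reproduces the full GF(2) matrix-vector product:

```python
# Model of the slicing trick: compute a wide GF(2) matrix-vector product
# in N independent slices of the input, then XOR the partial results.

def gf2_matvec(m, v):
    """Full product: out[i] = XOR over j of m[i][j] & v[j]."""
    return [sum(m[i][j] & v[j] for j in range(len(v))) & 1
            for i in range(len(m))]

def gf2_matvec_sliced(m, v, n_slices):
    """Same product, as n_slices narrow partial products XORed together -
    the structure that keep attributes preserve for the synthesizer."""
    width = len(v)
    step = width // n_slices
    out = [0] * len(m)
    for s in range(n_slices):
        lo, hi = s * step, (s + 1) * step
        # partial product over one slice of the input columns only
        part = [sum(m[i][j] & v[j] for j in range(lo, hi)) & 1
                for i in range(len(m))]
        out = [a ^ b for a, b in zip(out, part)]
    return out
```

Each partial product only looks at width/N input bits, so the tool sees N narrow XOR trees plus a final shallow XOR, instead of one enormous tree.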


I use the same GF multiplier code in ISE and Quartus, too (but not on
buses that wide).

The entire flow is in VHDL and works in any LRM-compliant tool. It's
parameterised, too, so I don't need to rewrite for a different bus width.
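The parameterised-multiplier pattern is also easy to model in software. A hedged Python sketch (not Allan's code) where the field width m and generator polynomial are parameters, playing the role of VHDL generics:

```python
# Bit-level model of a parameterisable GF(2^m) multiplier: field width
# and generator polynomial are parameters, like generics in VHDL.

def gf_mul(a, b, m, poly):
    """Carry-less multiply, then reduce by the generator polynomial.
    `poly` holds the low m bits; the implicit x^m term is added back
    during reduction."""
    acc = 0
    for i in range(m):
        if (b >> i) & 1:
            acc ^= a << i                    # partial products, XOR-summed
    for i in range(2 * m - 2, m - 1, -1):    # fold high bits into the field
        if (acc >> i) & 1:
            acc ^= (poly | (1 << m)) << (i - m)
    return acc
```

For example, in the AES field (m = 8, polynomial x^8 + x^4 + x^3 + x + 1, so low bits 0x1b), `gf_mul(0x53, 0xca, 8, 0x1b)` returns 1 - the classic inverse pair.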


I've been using similar approaches in VHDL since the turn of the century
and have never been burned.

YMMV.

Regards,
Allan
 
Hi all,


As a follow-up in the RISC-V thread.


On 02-05-17 18:11, kristoff wrote:
Or, you can "mix-match" licenses. SiFive (the company that sells the
E310 CPU and HiFive devboards) is an interesting example of this.
They open-sourced the RTL design but keep the knowledge of implementing
a RISC-V core as optimised as possible to themselves, as a service to
sell.

This was on eenews Europe today:
http://www.eenewseurope.com/news/sifive-launches-commercial-risc-v-processor-cores




As a small follow-up question:
does anybody have any idea how to get the HiFive boards in Europe?

For the last thing I ordered in the US (a PandaBoard), I had to pay VAT
(OK, that's normal), but also a handling fee for the shipping company
and the customs service to get the thing shipped in.
In the end, these additional costs were more than the VAT itself.



Cheerio! Kr. Bonne.
 
We had code that calculated, basically a shift table to calculate the CRC of a long word.
The RTL code worked fine for ISE. But when we hit Vivado, it'd pause 10 minutes or so
over each instance (we had lots) which significantly hit our build times.

So, I changed this code to "almost" self-modifying code. The code would by default
calculate the shift matrix using our "normal" RTL, which looked something like:
assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );
where H_zero was an "matrix" of constants, and NUM_ZEROS_MINUS_ONE a static
parameter. The end result is a matrix of constants as well, but "dynamically"
calculated. (Here "dynamically" means once at elaboration time, since all inputs
to the function were static).

Then we just added code to dump each unknown table entry sort-of like:
if( ( POLY_WIDTH == 8 ) && ( NUM_ZEROS_MINUS_ONE == 7 ) && ( POLYNOMIAL == 'h2f ) )
assign H_n_o = 'hd4eaf52e175ffba9;
...
else // no table entry - use default RTL calc
assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );

We "closed" the loop by hand. If the "table" entry didn't exist, the tool would use the
RTL definition, and spit out the pre-calculated entry. All done in
verilog. We insert that new table entry into our source code by hand, and continue - next
time the build would be quicker.

This *workaround* was a bit kludge, but was the rare (only really) exception for us
in our parameterized code. Normally the tools just handled things fine.
And again to be clear the only thing we were working around was long synthesis times.
The quality of results was fine in either case.

Maybe the code you were creating the pendulum swings the other way
and it was more the norm, rather than the exception to see things like this.

Interesting topic, I'm glad to hear of your (and others) experiences.

Regards,

Mark

I looked up my notes for the LFSR I was referring to and one instance of the more-abstract version took 16 min to synthesize and the less-abstract version took less than a minute. (And we needed many instances.) When I try to do something at a higher level it ends up like your experience: I have to do a lot of experiments to see what works and then tweak things endlessly. It eats up a lot of time.
 
I use Vivado to do GF multiplications that wide using purely behavioural
VHDL. BTW, A straightforward behavioural implementation will *not* give
good results with a wide bus.
I believe the problem is that most tools (in particular Vivado) do a poor
job of synthesising xor trees with a massive fanin (e.g. >> 100 bits).
The optimisers have a poor complexity (I guess at least O(N^2), but it
might be exponential) wrt the size of the function.

You can use all sorts of mathematical tricks to make it work without need
to go "low level".
For example, to deal with large fanin, partition your 512 bit input into
N slices of 512/N bits each. Use N multipliers, one for each slice, put
a keep (or equivalent) attribute on the outputs, then xor the outputs
together. This gives the same result, uses about the same number of LUTs,
but gives the optimiser in the tool a chance to do a good job.


I use the same GF multiplier code in ISE and Quartus, too (but not on
buses that wide).

The entire flow is in VHDL and works in any LRM-compliant tool. It's
parameterised, too, so I don't need to rewrite for a different bus width.


I've been using similar approaches in VHDL since the turn of the century
and have never been burned.

YMMV.

Regards,
Allan

I used to do big GF matrix multiplications in which you could set parameters for the field size and field generator poly, etc. Vivado just gets bogged down. Now I just expand that into a GF(2) matrix in Matlab and dump it to a parameter and all Vivado has to know how to do is XOR.

I also have problems with the wide XORs. Multiplication by a big GF(2) matrix means a wide XOR for each column. Vivado tries to share LUTs with common subexpressions across the columns. Too much sharing. That sounds like a good thing, but it's not smart enough to know how much it's impacting timing. You save LUTs, but you end up with a routing mess and too many levels of logic and you don't come close to meeting timing at all. So then I have to make a generate loop and put subsections of the matrix in separate modules and use directives to prevent optimizing across boundaries. (KEEPs don't work.) It's all a pain. But then I end up with something a little bigger but which meets timing.

I really wish there were a way to use the carry chains for wide XORs.
 
