synthesis and module interfaces

Spaced Cowboy · Apr 12, 2017

I've been wondering about module interfaces recently...

I'm a total newbie at this ASIC stuff, learning as I go, so I figured I'd ask what the collective wisdom is on wide busses in the real world. The module interface to a crypto core looks like:

module bmw512
(
input clk,
input rst,

input [511:0] partialMsg,
input msgReady,

output [511:0] hash,
output reg hashReady
);

In general, is a 512-bit bus a good idea? Or, in terms of routing the design, would I be better off clocking it in 64 bits at a time and paying the clock-cycle cost? The internal state of the module is in fact 1024 bits wide, so the number of D flip-flops isn't going to change much, just the expense of having a large bus running around the chip.

The reason for asking is that the design turned out to be significantly larger than I'd expected (about 5mm^2 on a 180nm process), and even with all the internal register state (~1600 bits overall), that was a bit of a surprise.

Kevin Neilson · Apr 12, 2017

> In general, is a 512-bit bus a good idea? Or, in terms of routing the design, would I be better off clocking it in 64 bits at a time and paying the clock-cycle cost? The internal state of the module is in fact 1024 bits wide, so the number of D flip-flops isn't going to change much, just the expense of having a large bus running around the chip.

> The reason for asking is that the design turned out to be significantly larger than I'd expected (about 5mm^2 on a 180nm process), and even with all the internal register state (~1600 bits overall), that was a bit of a surprise.

I don't know if there's enough information here. You want the bus width to be as small as possible, but the bus width is probably dictated by the data rate. You take the number of message bits you need to process per second, divide by the clock frequency you can achieve, and there is your bus width.
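
As a rough illustration of the sizing rule above, here is a minimal Python sketch. The function names and the throughput and clock figures are made up for the example; none of them come from the thread.

```python
import math


def required_bus_width(bits_per_second: int, clock_hz: int) -> int:
    """Smallest whole number of bits per clock that sustains the data rate."""
    return math.ceil(bits_per_second / clock_hz)


def cycles_per_block(block_bits: int, bus_width: int) -> int:
    """Clocks needed to move one block across a bus of the given width."""
    return math.ceil(block_bits / bus_width)


# Hypothetical requirement: 1.6 Gbit/s of message data at a 50 MHz core clock.
width = required_bus_width(1_600_000_000, 50_000_000)
print("bus width:", width, "bits")

# Cost of narrowing the 512-bit message port to 64 bits.
print("clocks to load 512 bits over 64 wires:", cycles_per_block(512, 64))
```

If the required rate works out well below 512 bits per clock, the narrower port costs only the extra load cycles.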

Spaced Cowboy · Apr 12, 2017

On Wednesday, April 12, 2017 at 12:13:20 PM UTC-7, Kevin Neilson wrote:

> I don't know if there's enough information here. You want the bus width to be as small as possible, but the bus width is probably dictated by the data rate. You take the number of message bits you need to process per second, divide by the clock frequency you can achieve, and there is your bus width.

Thanks.

What you're saying is what I was suspecting - I guess I didn't think it through properly in the first place, I just coded to what made it easiest to solve the problem.

Getting spoilt with all these enormous FPGAs where 512 bits of bus really isn't that much.

Cheers
Simon

rickman · Apr 13, 2017

On 4/12/2017 3:54 PM, Spaced Cowboy wrote:

> On Wednesday, April 12, 2017 at 12:13:20 PM UTC-7, Kevin Neilson wrote:

>> I don't know if there's enough information here. You want the bus width to be as small as possible, but the bus width is probably dictated by the data rate. You take the number of message bits you need to process per second, divide by the clock frequency you can achieve, and there is your bus width.

> Thanks.

> What you're saying is what I was suspecting - I guess I didn't think it through properly in the first place, I just coded to what made it easiest to solve the problem.

> Getting spoilt with all these enormous FPGAs where 512 bits of bus really isn't that much.

There can be issues with ground bounce if you have 512 outputs changing
at once. Driving 10 pF per output with a 1 ns rise time gives 1.5 amps
of current during the transition. That's a lot of current. Figure the
di/dt and the package inductance and I bet you have quite some ground
bounce.
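
A back-of-the-envelope version of this estimate can be sketched in Python. The post does not state the voltage swing or how many outputs it counted, so the 3.3 V swing and the 0.1 nH effective ground inductance below are assumptions chosen only for illustration, and the result will not necessarily reproduce the 1.5 A figure. The function names are hypothetical.

```python
def switching_current_amps(n_outputs, load_farads, swing_volts, rise_seconds):
    """Total charging current if n_outputs ramp linearly at the same time:
    I = N * C * dV/dt."""
    return n_outputs * load_farads * swing_volts / rise_seconds


def ground_bounce_volts(inductance_henries, delta_amps, rise_seconds):
    """First-order bounce across the ground inductance: V = L * dI/dt."""
    return inductance_henries * delta_amps / rise_seconds


LOAD = 10e-12    # 10 pF per output (from the post)
RISE = 1e-9      # 1 ns rise time (from the post)
SWING = 3.3      # ASSUMPTION: 3.3 V I/O swing, not stated in the thread
L_EFF = 0.1e-9   # ASSUMPTION: 0.1 nH effective ground inductance

per_output = switching_current_amps(1, LOAD, SWING, RISE)
all_outputs = switching_current_amps(512, LOAD, SWING, RISE)
print(f"per output:  {per_output * 1e3:.1f} mA")
print(f"512 outputs: {all_outputs:.1f} A")
print(f"bounce for a 1.5 A step: {ground_bounce_volts(L_EFF, 1.5, RISE):.2f} V")
```

The effective inductance depends heavily on the package and on how many ground pins share the return current, so treat the bounce figure as an order-of-magnitude guide at best.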

512 outputs changing at once would be hard to deal with in an FPGA or
any device. To mitigate it the current should be spread out in time by
phasing the I/O clocks. Routing delays might be adequate depending.
Another way to mitigate this is by providing lots of power and ground
pins. You can also use differential outputs. But you don't want to
ignore it. It can and will bite you.

--

Rick C

Spaced Cowboy · Apr 13, 2017

It's a crypto core, so on average I guess you'd expect to see 50% changing per clock, so 256 in this case.

Question, though: are you talking about internal or external outputs? I haven't personally done it, but I'm pretty sure there are people at work who regularly drive very wide busses (1024, 2048) through internal module connections within the FPGA for video processing and digital effects.

These are purely internal connections - the external I/O is only a slow 20MHz SPI interface.

rickman · Apr 13, 2017

On 4/12/2017 7:09 PM, Spaced Cowboy wrote:
> It's a crypto core, so on average I guess you'd expect to see 50% changing per clock, so 256 in this case.

Average? Where does the average of the current over time enter into the
picture? With raw data you might have been able to make a case for any
given pattern of data never appearing. But if the data has been XORed
with a pseudo-random sequence you are pretty much assured
degenerate patterns will show up. It may be a long time before you see
all ones, but you can easily find 90% ones.
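
One way to put numbers on how far a word strays from the 50% average is an exact binomial tail. The Python sketch below assumes every bit is an independent fair coin flip, which is an idealisation of the core's output and not something the thread establishes. The function name is hypothetical.

```python
from math import comb


def prob_at_least(n_bits: int, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n_bits, 1/2), summed exactly in integers."""
    favourable = sum(comb(n_bits, k) for k in range(k_min, n_bits + 1))
    return favourable / 2 ** n_bits


# Chance that at least 90% of the bits in a word are ones,
# for an 8-bit slice and for the full 512-bit bus.
print("8 bits, all ones:      ", prob_at_least(8, 8))
print("512 bits, >= 461 ones: ", prob_at_least(512, 461))
```

Under this assumption, skewed patterns show up often in narrow slices of the bus and become rarer as the word gets wider. That does not change the point that the supply and ground network has to be sized for the peak rather than the average.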

> Question, though: are you talking about internal or external outputs ? I haven't personally done it, but I'm pretty sure there are people at work who regularly drive very wide busses (1024, 2048) through internal module-connections within the FPGA for video processing and digital effects.

> These are purely internal connections - the external i/o is only a slow 20MHz SPI interface

Well sure, the 10 pF in the calculation was intended to represent the load
of an external line. For internal connections there is no issue in general,
although it will depend on the details of your layout.

As to the tradeoffs with multiplexing your data over a wide or narrow
bus, I expect that will greatly depend on your layout. If you can
arrange modules to use short, direct routes it will be better to leave
it as a wide bus. If the design needs to route the signals over longer
distances or to multiple destinations, then more multiplexing might be
in order. Do you know the general layout of the modules yet?

--

Rick C

Spaced Cowboy · Apr 13, 2017

On Wednesday, April 12, 2017 at 4:35:43 PM UTC-7, rickman wrote:

> On 4/12/2017 7:09 PM, Spaced Cowboy wrote:
>> It's a crypto core, so on average I guess you'd expect to see 50% changing per clock, so 256 in this case.

> Average? Where does the average of the current over time enter into the
> picture? With raw data you might have been able to make a case for any
> given pattern of data never appearing. But if the data has been XORed
> with a pseudo-random sequence you are pretty much assured
> degenerate patterns will show up. It may be a long time before you see
> all ones, but you can easily find 90% ones.

Point.

>> Question, though: are you talking about internal or external outputs? I haven't personally done it, but I'm pretty sure there are people at work who regularly drive very wide busses (1024, 2048) through internal module connections within the FPGA for video processing and digital effects.

>> These are purely internal connections - the external I/O is only a slow 20MHz SPI interface.

> Well sure, the 10 pF in the calculation was intended to represent the load
> of an external line. For internal connections there is no issue in general,
> although it will depend on the details of your layout.

> As to the tradeoffs with multiplexing your data over a wide or narrow
> bus, I expect that will greatly depend on your layout. If you can
> arrange modules to use short, direct routes it will be better to leave
> it as a wide bus. If the design needs to route the signals over longer
> distances or to multiple destinations, then more multiplexing might be
> in order. Do you know the general layout of the modules yet?

It's basically a chain of well-defined modules, so msg -> core1 -> core2 -> core3 -> hash in register form. If I could specify the layout, I'd love to, but I'm using yosys / qflow and although it's frankly amazing there's *anything* like this, I haven't played with the tools enough to get a feel for how to place things by hand yet, if it's in fact possible.

At the moment, the flow takes the Verilog source files, converts them to one huge netlist and uses simulated annealing to place the primitive cells, then does coarse routing and finally detail routing. I do intend to dig a lot deeper into the place/route stuff because I ultimately want to incorporate the free memory compiler OpenRAM into the flow.

I have a lot ahead of me in this - learning and understanding LEF and DEF files, Liberty archives, GDS-II / Calma file formats etc. Still, the fun in teaching yourself is you get to self-direct.

Cheers
Simon.
