Automatic latency balancing in VHDL-implemented complex pipe

Guest
Hi,
Recently I have spent a lot of time developing quite complex high-speed data processing systems in FPGAs. They all had a pipeline architecture, and data were processed in parallel in multiple pipelines with different latencies.

The worst thing was that those latencies kept changing during development. For example, some operations were performed by blocks with a tree structure, so the number of levels depended on the number of inputs handled by each node. The number of inputs per node was varied to find an acceptable balance between the number of levels and the maximum clock speed. I also had to add some pipeline registers to improve timing.

The entire designs were written in pure VHDL, so I had to adjust the latencies manually to ensure that data coming from different paths arrive at the next block in the same clock cycle. It was a real nightmare, so I dreamed of an automated way to ensure proper equalization of latencies. The usual manual workaround is a generic delay line instantiated on the shorter path, with a length that must be updated by hand whenever the other path changes; a sketch follows.
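A minimal sketch of such a manually maintained delay line (the entity and signal names are purely illustrative, not taken from the lateq sources):

library ieee;
use ieee.std_logic_1164.all;

entity dly_line is
  generic (
    WIDTH  : natural  := 8;
    STAGES : positive := 3  -- must be kept equal to the other path's latency by hand!
  );
  port (
    clk  : in  std_logic;
    din  : in  std_logic_vector(WIDTH-1 downto 0);
    dout : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity dly_line;

architecture rtl of dly_line is
  type t_pipe is array (0 to STAGES-1) of std_logic_vector(WIDTH-1 downto 0);
  signal pipe : t_pipe;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      pipe(0) <= din;               -- stage 0 captures the input
      for i in 1 to STAGES-1 loop   -- remaining stages shift it along
        pipe(i) <= pipe(i-1);
      end loop;
    end if;
  end process;
  dout <= pipe(STAGES-1);
end architecture rtl;

Every time the depth of any parallel block changes, someone has to find all such STAGES generics and update them consistently. That is exactly the bookkeeping I wanted to automate.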

After some work I have elaborated a solution which I'd like to share with the community. It is available under the BSD license on the OpenCores website: http://opencores.org/project,lateq . A paper with a detailed description is available on arXiv.org: http://arxiv.org/abs/1509.08111 .

I'd appreciate any comments.
I hope the proposed method will be useful for others.

With best regards,
Wojtek
 
wzab01@gmail.com wrote:

(snip)

I have heard that some synthesis software now knows how to move
around pipeline registers to optimize timing. I haven't tried
using the feature yet, though.

I think it can move registers, but maybe not add them. You might
need enough registers in place for it to move them around.

I used to work on systolic arrays, which are really just very long
(hundreds or thousands of stages) pipelines. It is pretty hard to
hand-optimize pipelines that long.

-- glen
 
On Tuesday, 29 September 2015 at 07:49:09 UTC+1, glen herrmannsfeldt wrote:

(snip)

I have heard that some synthesis software now knows how to move
around pipeline registers to optimize timing. I haven't tried
using the feature yet, though.

I think it can move registers, but maybe not add them. You might
need enough registers in place for it to move them around.

Yes, of course the pipeline registers may be moved (e.g. using the "retiming" feature). I usually keep this option switched on for implementation.
My method only ensures that the number of pipeline stages is the same in all parallel paths. Keeping track of that by hand was a really huge problem in bigger designs.
--
Wojtek
 
On Tuesday, 29 September 2015 at 11:50:53 UTC+1, kaz wrote:

(snip)

Not sure why you expect the tool to do what you should do, and to do
so for the simulation tool as well. How can you simulate a design in
which synthesis inserts the registers for you?

The tool is supposed to ensure that the appropriate number of registers is added.
In the case of a high-level parametrized description it is really difficult to avoid mistakes; therefore an automated tool is preferable.
The registers are inserted not only for synthesis, but also for simulation.
I hope that my preprint explains both the motivation and the implementation more clearly.

Regards,
Wojtek
 
Tim Wescott <seemywebsite@myfooter.really> wrote:

(snip, I wrote)

I have heard that some synthesis software now knows how to move around
pipeline registers to optimize timing. I haven't tried using the feature
yet, though.

I knew about this sort of thing ten years ago, although I've never used
it (for FPGA I'm mostly an armchair coach).

At the time that my FPGA friends were rhapsodizing about it, the designer
still needed to specify the total delay, but the tools took the
responsibility for distributing it.

Some time ago, before I knew about this, I was working on designs
for some very long pipelines, thousands of steps. Each step is
fairly simple, and all are alike (except for the data values).

I figured that in an FPGA, the pipeline would go across the array,
then down and back across, until it got to the end.

I then figured that the delay at the end, where it turned around to
go back, would be longer than the other delays, but I didn't know
how to modify my code.

As with many pipelines, I can add registers to all the signals without
affecting the results, though they will come out a little later.
But where to add the registers?

It turned out to be too expensive, so it never got built, or even close.
Some time later I learned about this feature, but never went back
to try it.

It makes sense to do it that way, because you're the one that has to
decide how much delay is right, and who has to make sure that the timing
for section A matches the timing for section B -- for the moment at least
that's really beyond the tool's ability to cope.

One could put in sets of optional registers, such that either all or
none of a set get implemented. That might not be so hard, but you
do need a way to say it; one possibility is sketched below.
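One way to say it in VHDL might be a boolean generic plus a pair of generate statements, so that a whole set of optional registers is switched together from one top-level constant (just a sketch, all names made up):

library ieee;
use ieee.std_logic_1164.all;

entity opt_reg is
  generic (
    WIDTH     : natural := 8;
    IMPLEMENT : boolean := true  -- all members of a set share this generic
  );
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(WIDTH-1 downto 0);
    q   : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity opt_reg;

architecture rtl of opt_reg is
begin
  g_reg : if IMPLEMENT generate      -- the optional pipeline stage
    process (clk)
    begin
      if rising_edge(clk) then
        q <= d;
      end if;
    end process;
  end generate;

  g_wire : if not IMPLEMENT generate -- or just a wire
    q <= d;
  end generate;
end architecture rtl;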

-- glen
 
On Tuesday, 29 September 2015 at 21:41:26 UTC+1, rickman wrote:

(snip)

I'm not picturing the model you are describing. If all sections have
the same clock, they all have the same timing constraint, no? As to the
tools distributing the delays: again, each stage has the same timing
constraint, so unless there are complications such as inputs with
separately specified delays, the tool just has to move logic across
register boundaries to make each section meet the timing spec, or
better, to balance all the delays in case you wish to have the fastest
possible clock rate.

Maybe by timing you mean the clock cycles the OP is talking about?

The problem I'm dealing with is just about the number of clock cycles by which the data in each data path are delayed.

The equal distribution of delay between the stages of a pipeline is so technology-specific that it probably must be handled by the vendor-provided tools, and in fact it usually is. In old Xilinx tools it was called "register balancing"; in Altera tools and in new Xilinx tools it is "register retiming".

So my problem is not that complex. And yes, it was solved in GUI-based tools many years ago.
In the old Xilinx System Generator a special "sync" block did exactly that. Just see Fig. 4 in my old paper from 2003 ( http://tesla.desy.de/new_pages/TESLA_Reports/2003/pdf_files/tesla2003-05.pdf ).

The importance of the problem is still emphasized by the vendors of block-based
tools (e.g. http://www.mathworks.com/help/hdlcoder/examples/delay-balancing-and-validation-model-workflow-in-hdl-coder.html ).

However, I have never seen a tool like this available for designs written in pure HDL,
not composed from blocks in a GUI-based tool...

I have found that for designs with pipelines whose lengths depend on different parameters, and which are interconnected in a complex way, there is a real need for a tool for automatic verification, or even better automatic adjustment, of those lengths.
Without that you can easily get an incorrect design which processes data misaligned in time. The simplest form of such verification is sketched below.
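A minimal illustration of the verification idea (this is only the principle, not the lateq implementation): let every parallel path carry a "valid" strobe, and check in simulation that the strobes stay aligned:

library ieee;
use ieee.std_logic_1164.all;

entity latency_check is
  port (
    clk     : in std_logic;
    valid_a : in std_logic;  -- strobe accompanying path A
    valid_b : in std_logic   -- strobe accompanying path B
  );
end entity latency_check;

architecture sim of latency_check is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- if one path has one stage more than the other, the strobes
      -- diverge and the simulation stops immediately
      assert valid_a = valid_b
        report "latency mismatch between parallel paths"
        severity failure;
    end if;
  end process;
end architecture sim;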

So that was the motivation.
Sorry if my original post was somehow misleading.

Regards,
Wojtek
 
On Tue, 29 Sep 2015 06:49:02 +0000, glen herrmannsfeldt wrote:

wzab01@gmail.com wrote:

(snip)

I have heard that some synthesis software now knows how to move around
pipeline registers to optimize timing. I haven't tried using the feature
yet, though.

I knew about this sort of thing ten years ago, although I've never used
it (for FPGA I'm mostly an armchair coach).

At the time that my FPGA friends were rhapsodizing about it, the designer
still needed to specify the total delay, but the tools took the
responsibility for distributing it.

It makes sense to do it that way, because you're the one that has to
decide how much delay is right, and who has to make sure that the timing
for section A matches the timing for section B -- for the moment at least
that's really beyond the tool's ability to cope.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
 
On 9/29/2015 4:22 PM, Tim Wescott wrote:

(snip)

It makes sense to do it that way, because you're the one that has to
decide how much delay is right, and who has to make sure that the timing
for section A matches the timing for section B -- for the moment at least
that's really beyond the tool's ability to cope.

I'm not picturing the model you are describing. If all sections have
the same clock, they all have the same timing constraint, no? As to the
tools distributing the delays: again, each stage has the same timing
constraint, so unless there are complications such as inputs with
separately specified delays, the tool just has to move logic across
register boundaries to make each section meet the timing spec, or
better, to balance all the delays in case you wish to have the fastest
possible clock rate.

Maybe by timing you mean the clock cycles the OP is talking about?

--

Rick
 
On Tue, 29 Sep 2015 16:41:08 -0400, rickman wrote:

(snip)

Maybe by timing you mean the clock cycles the OP is talking about?

The way I've seen it, rather than carefully hand-designing a pipeline,
you just design a system that's basically

           .---------------------.     .-------.
data in -->| combinatorial logic |---->| delay |----> data out
           '---------------------'     '-------'

where the "delay" block just delays all the outputs from the
combinatorial block by some number of clocks.

Then you tell the tool "move delays as you see fit", and it magically
distributes the delay in a hopefully-optimal way within the combinatorial
logic, making it pipelined.

As I said, I've never done it -- I couldn't even tell you what search
terms to use to find out what the tool vendors call the process.
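In VHDL that structure might look something like the sketch below, with a multiplier standing in for the combinational block (all names are illustrative); with retiming enabled, the synthesizer may then spread the DELAY-1 trailing registers back through the logic:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity comb_then_delay is
  generic (DELAY : positive := 4);  -- total pipeline depth, chosen by the designer
  port (
    clk  : in  std_logic;
    a, b : in  unsigned(15 downto 0);
    y    : out unsigned(31 downto 0)
  );
end entity comb_then_delay;

architecture rtl of comb_then_delay is
  type t_pipe is array (0 to DELAY-1) of unsigned(31 downto 0);
  signal pipe : t_pipe;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      pipe(0) <= a * b;             -- one big combinational operation
      for i in 1 to DELAY-1 loop    -- followed by DELAY-1 movable registers
        pipe(i) <= pipe(i-1);
      end loop;
    end if;
  end process;
  y <= pipe(DELAY-1);
end architecture rtl;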

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
 
On 9/29/2015 5:01 PM, Tim Wescott wrote:

(snip)

Then you tell the tool "move delays as you see fit", and it magically
distributes the delay in a hopefully-optimal way within the combinatorial
logic, making it pipelined.

Yes, but you talked about the tool not being able to "cope" with
matching the delays in sections A and B. I'm not following that.

--

Rick
 
On 9/29/2015 5:58 PM, wzab01@gmail.com wrote:
(snip)

The problem I'm dealing with is just about the number of clock
cycles by which the data in each data path are delayed.

Yes, I understand the problem you are addressing. I have never done a
design where this was much of a problem, but I'm sure some designs are
much larger and more complex than the ones I have done.


(snip)

The importance of the problem is still emphasized by the vendors of
block-based tools (e.g.
http://www.mathworks.com/help/hdlcoder/examples/delay-balancing-and-validation-model-workflow-in-hdl-coder.html
).

Yes, it is important to have a tool to do this when the design is large
or your timing margins are tight. It can save a lot of work.


(snip)

So that was the motivation. Sorry if my original post was somehow
misleading.

Not to me. :)

--

Rick
 
On Tue, 29 Sep 2015 18:33:32 -0400, rickman wrote:

(snip)

Yes, but you talked about the tool not being able to "cope" with
matching the delays in sections A and B. I'm not following that.

Basically I meant that you need to be responsible for lining up the
delays in all the sections -- you can't make one section delay by five
more clocks without identifying all the other pertinent sections that
depend on it and making them delay by five more clocks, too.

If the tool could do everything we'd all be wiring houses for a living.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
 
On 9/29/2015 7:19 PM, Tim Wescott wrote:

(snip)

Basically I meant that you need to be responsible for lining up the
delays in all the sections -- you can't make one section delay by five
more clocks without identifying all the other pertinent sections that
depend on it and making them delay by five more clocks, too.

If the tool could do everything we'd all be wiring houses for a living.

Ok, but that is not the tool CAD vendors provide. That is the tool the
OP is talking about.

--

Rick
 
 
On Wednesday, 30 September 2015 at 09:41:15 UTC+1, kaz wrote:
A VHDL compiler cannot be a useful compiler unless it respects the
user-entered registers, though it may fit an equivalent arrangement, as
in register retiming for timing purposes.

Register delay stages are obviously what we are talking about, rather
than combinatorial/routing delays, which are a concern for each
register's timing and which the tool decides together with any
constraints from the user.

It is up to the user to decide the register delay stages. It cannot be
technology-sensitive unless you are doing some high-level coding that
does not specify registers. I don't know what that level is, though.

How can a user build a design without being correct about register delays?
How do you add streams or multiply or switch etc. and ask the tool to do
the job?

In the systems which I have to build there are some parametrized components whose latency depends on their parameters. Unfortunately I cannot publish the original designs, but a simplified version of one of those systems is provided on OpenCores as a demonstration of the method.
For example, I have a block for finding the maximum value among a certain number of inputs. It is a tree built from elementary comparators.
When looking for the optimal implementation (in terms of resource usage and maximum clock frequency) I have to select the number of values compared simultaneously in such a basic comparator. My implementation automatically adjusts the number of stages to the number of inputs of an elementary comparator and of the whole system. Of course the number of stages affects the latency (the delay in number of clock cycles), and there are many such blocks which may be adjusted independently (see the sketch below).
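Just to illustrate how the latency follows from the parameters (this is only the underlying arithmetic, not code from the lateq package): an N-input tree built from K-input comparators has ceil(log_K(N)) levels, and with one register per level that is also its latency in clock cycles:

-- Illustrative only: number of levels (= latency in clock cycles,
-- with one register per level) of an N-input tree of K-input nodes.
-- Requires k >= 2, otherwise the loop never terminates.
function tree_levels (n : positive; k : positive) return natural is
  variable nodes  : positive := n;
  variable levels : natural  := 0;
begin
  while nodes > 1 loop
    nodes  := (nodes + k - 1) / k;  -- one tree level: K values -> 1
    levels := levels + 1;
  end loop;
  return levels;
end function;

For example, 16 inputs with 2-input comparators give 4 levels, but with 4-input comparators only 2 levels; change K anywhere in the design and the latency of that whole path changes with it.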
Trying to keep the design adjusted properly (in the sense that all latencies in parallel pipelines are equal) is really difficult and error-prone.
So that's why I needed a tool which does it for me.
Of course I have to analyze the results, and sometimes introduce manual corrections...
Does that answer the question above?

Regards,
Wojtek
 
On 29/09/2015 22:01, Tim Wescott wrote:

(snip)

As I said, I've never done it -- I couldn't even tell you what search
terms to use to find out what the tool vendors call the process.

As mentioned before, just search for register retiming. It works exactly
as you described, although it is not perfect. It can move combinational
logic between register pairs to balance the slack. Register retiming is
a relatively old technology and has been available in most independent
tools (like Mentor's Precision and Synopsys' Synplify) and in vendor
synthesis tools for many years. From what I understand, vendor tools can
only move logic in one direction due to a patent owned by Mentor Graphics.

# Info: [7004]: Starting retiming program ...
# Info: [7012]: Phase 1
# Info: [7012]: Phase 2
# Info: [7012]: Phase 3
# Info: [7012]: Phase 4
# Info: [7012]: Total number of DSPs processed : 0
# Info: [7012]: Total number of registers added : 138
# Info: [7012]: Total number of registers removed : 66
# Info: [7012]: Total number of logic elements added : 0

Register retiming is something you want to enable by default, unless you
are planning to use an equivalence checker.
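From memory, the independent tools also let you control retiming per register; Synplify, for example, has a syn_allow_retiming attribute. Treat this sketch as unverified and check your tool's documentation:

-- Unverified sketch: per-register retiming control in Synplify.
-- These declarations go in the architecture declarative part.
signal acc_reg : std_logic_vector(31 downto 0);
attribute syn_allow_retiming : boolean;
attribute syn_allow_retiming of acc_reg : signal is true;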

Hans
www.ht-lab.com
 
(snip: Hans's post about register retiming, quoted in full above)

Register retiming is a technique to help the setup/hold timing of a
given path.
It does not and should not change the latency of a path in terms of
clock periods.
The OP is referring to the latency of a path in terms of clock periods,
rather than to delay issues within a given path.

Kaz
 
(snip: Wojtek's reply about parametrized components, quoted in full above)

So in short, you regenerate some components with a new latency,
different from the intended and tested one. I would just balance the
latency manually and run the test. I don't see much practical scope for
automating such a change.

Kaz
 
