Speeding up computations in sequential algorithm

Marios Barlas · Nov 19, 2014

Dear all,

I'm trying to figure out how to speed up my computation on the following algorithm:

Following my previous post, I wrote a code that implements an FSMD of 3 states calculating the square root of a number according to a clock. All is working fine, but i'lm trying to figure out if I can make it more efficient by making more calculations on the same clock cycle.

I have a process and in this a case statement with me FSMD states. Since I can't use the for...generate command inside the process I tried to use some buffer variables like this :

res_buff <= (res_c-(root_c + delta_c)) when (((root_c + delta_c) <= res_c) ) else
res_c;
root_buff <= shift_right(root_c + shift_left(delta_c,1),1) when ((root_c + delta_c) <= res_c) else
shift_right(root_c,1);
delta_buff <= shift_right(delta_c,2);
-- Parallelization
res_next <= (res_buff-(root_buff + delta_buff)) when (((root_buff + delta_buff) <= res_buff) ) else
res_buff;
root_next <= shift_right(root_buff + shift_left(delta_buff,1),1) when ((root_buff + delta_buff) <= res_buff) else
shift_right(root_buff,1);
delta_next <= shift_right(delta_buff,2);

but it seems to mess up my results. From the little I know I think this is pipelining but I'm not sure how to infer this logic in VHDL. Anyone could give me an idea?

Thanks in Advance,
Marios Barlas

Brian Drummond · Nov 20, 2014

On Wed, 19 Nov 2014 03:24:54 -0800, Marios Barlas wrote:

Dear all,

I'm trying to figure out how to speed up my computation on the following
algorithm:
....
I have a process and in this a case statement with me FSMD states. Since
I can't use the for...generate command inside the process I tried to use
some buffer variables like this :
....
but it seems to mess up my results. From the little I know I think this
is pipelining but I'm not sure how to infer this logic in VHDL. Anyone
could give me an idea?

The first stage is to have a clear idea what you want to happen in each
cycle of the pipeline - I find it helps to name signals appropriately and
to organise the pipelined process into its logical stages.

Here's a simple example of a badly pipelined design (WARNING : I am not
recommending this practice!)

http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-
with.html

and my approach to cleaning it up, in response to a StackExchange
question on this topic.

http://stackoverflow.com/questions/14765205/how-can-i-speed-up-my-math-
operations-in-vhdl/14777458#14777458

These may help get you started

- Brian

rickman · Nov 20, 2014

On 11/19/2014 6:24 AM, Marios Barlas wrote:

Dear all,

I'm trying to figure out how to speed up my computation on the following algorithm:

Following my previous post, I wrote a code that implements an FSMD of 3 states calculating the square root of a number according to a clock. All is working fine, but i'lm trying to figure out if I can make it more efficient by making more calculations on the same clock cycle.

I have a process and in this a case statement with me FSMD states. Since I can't use the for...generate command inside the process I tried to use some buffer variables like this :

res_buff <= (res_c-(root_c + delta_c)) when (((root_c + delta_c) <= res_c) ) else
res_c;
root_buff <= shift_right(root_c + shift_left(delta_c,1),1) when ((root_c + delta_c) <= res_c) else
shift_right(root_c,1);
delta_buff <= shift_right(delta_c,2);
-- Parallelization
res_next <= (res_buff-(root_buff + delta_buff)) when (((root_buff + delta_buff) <= res_buff) ) else
res_buff;
root_next <= shift_right(root_buff + shift_left(delta_buff,1),1) when ((root_buff + delta_buff) <= res_buff) else
shift_right(root_buff,1);
delta_next <= shift_right(delta_buff,2);

but it seems to mess up my results. From the little I know I think this is pipelining but I'm not sure how to infer this logic in VHDL. Anyone could give me an idea?

Thanks in Advance,
Marios Barlas

To start with, I don't know what an FSMD is. Is this like an FSM
(Finite State Machine)?

I can't comment on the correctness of your code because I don't know
what it is supposed to do.

You are only showing part of your work. Should we assume this is inside
a clocked process? I am guessing yes since you talk about pipelining
and clocks. In that case, each of the signal assignment statements
above will generate registers.

The assignment to res_buff looks like it only includes signals from
outside the process. The assignment to root_buff looks the same so this
will run in parallel without depending on the assignment to res_buff.
The assignment to delta_buff is the same. So all three of these are
running at the same point in the pipeline.

The assignment to res_next depends on all three of the above, so it
would be in the second stage of the pipeline. Same for root_next and
delta_next. So all three of these signals are in the second stage of
the pipeline and will produce results with the same cycle timing.

This is a two stage pipeline producing a result every clock cycle with a
two clock cycle latency.

There are some things you can do to improve the efficiency of this
algorithm. The expression root_c + delta_c is used more than once, but
will likely produce multiple adders coded this way. This sum is
subtracted from res_c as well as compared to res_c. This can share the
same logic if coded to do so. This is where variables can be used.

Variables in a clocked process will not generate registers if they are
assigned before they are used. So you could do this...

sum_c := root_c + delta_c;
diff_c := res_c - sum_c;
res_buff <= res_c when diff_c < 0 else diff_c;

This provides the tools a better chance at using just two adders (or
subtractor - same thing) and using the carry out from the second one to
control the mux selecting the input to res_buff.

In addition the same variables can be used to control the assignment for
root_buff since it uses the same comparison, again saving logic and
likely improving the speed at least a little.

If you want to speed this up further, you can pipeline the above two
variable assignments as signal assignments giving two more stages to
your pipeline. You will need to delay the use of root_c and delta_c in
the assignment to root_buff to keep them at the same delay. Likewise
for delta_buff so all three x_buff results are ready at the same time.

You can use all the same methods to improve the assignments to res_next,
root_next and delta_next.

If you are not familiar with variable assignments, they work like
variables in a regular C program completing their assignments as they
are executed while signal assignments are not completed until the
process stops. Using a variable in a clocked process gets you the
current value while using a signal inside the process gets you the old
value no matter what.

I hope this helps some.

--

Rick

Nikolaos Kavvadias · Nov 22, 2014

Dear Marios and rickman,

> > @rickman:

An FSMD (Finite-State Machine with Datapath) is a microarchitectural paradigm for implementing non-programmable/hardwired processors with their control unit and datapath combined/merged. Essentially, the datapath actions are embedded within the actual next state and output logic decoder of your RTL FSM description (if you view it this way).

I have a published work on FSMDs: http://cdn.intechweb.org/pdfs/29207.pdf while my high-level synthesis tool, HercuLeS ( http://www.nkavvadias.com/hercules/ ) produces such FSMDs. A manual for HercuLeS is here: http://www.nkavvadias.com/hercules-reference-manual/hercules-refman.pdf.

I have based my version of FSMDs to Prof. Gajski's and Pong P. Chu's work, mostly on some of their books and published papers. I had rented a couple of Gajski's books from the local library [I was a lecturer at the time, now I ain't but still can drive 2km to closest higher education library and rent various works] and have actually bought two of P.P. Chu's works; the RTL Hardware Design using VHDL book is brilliant. Further, Gajski's work on SpecC and the classic TRs (technical reports) from his group were at some point night (by the bed) and day (by the desk) readings...

I believe Vivado HLS (aka AutoESL/xPilot) and the others do the same thing, with one key difference on how the actual RTL FSMD code is presented. Their datapath code is implemented with concurrent assignments and there are lots of control and status signals going in and out of the next state logic decoder. On the contrary I prefer datapath actions embedded within state decoding; produces a little slower and fatter (sic) hardware overall, but the user's intention in the RTL is much more clear and it is to grasp and follow.

@Marios:

I'm trying to figure out how to speed up my computation on the following algorithm:

I know this might not be an scientifically appropriate proposal from my side, but do you have a plain software description of the algorithm? E.g. when I was enumerating topological sorts I started from Knuth's pseudocode, then coded an ANSI/ISO C version, then modified it for HLS with HercuLeS, and voila: a meaningless hardware version for the code as well

Following my previous post, I wrote a code that implements an FSMD of 3
states calculating the square root of a number according to a clock. All is
working fine, but i'm trying to figure out if I can make it more efficient by > > making more calculations on the same clock cycle.

Having an FSMD with only 3 states for a square-rooting algorithm seems counter-intuitive, unless one of the states is a multi-cycle state or a "multi-state".
Or you are doing a lot of operation chaining within a single state.

Try to follow basic principles from Pong P. Chu:

1) Have a _reg and _next version for each register as declared signals in VHDL code.
2) In each state, read the _reg's and assign the _next's.
3) Donnot reassign the same _next version of a register within a single FSMD state.
4) You can totally avoid variables from your code. Not all tools provide equally mature support for synthesizing code with variables.
5) Operation chaining is possible but requires that you write _next versions and read them in the same state. Then these are plain wires and donnot implement registers. Again, you can peruse the same _next version more than once in the same state.

I had developed an ingenious technique for automatically modifying a VHDL FSMD code for adding controlled operation chaining; I was about to patent it but spared on the money of course. My technique just uses a lexer and to read more about it, see chapter III.E of this work: http://www.nkavvadias.com/publications/kavvadias_asap12_cr.pdf

There are also newer works on HercuLeS like: http://www.nkavvadias.com/publications/hercules-pci13.pdf and a journal paper accepted for publication.

Best regards
Nikolaos Kavvadias
http://www.nkavvadias.com

Nikolaos Kavvadias · Nov 22, 2014

> 5) Operation chaining is possible but requires that you write _next versions and read them in the same state. Then these are plain wires and donnot implement registers. Again, you can peruse the same _next version more than once in the same state.

Of course I meant that you cannot peruse **for writing** the same _next version more than once in the same state!

I just needed to clarify this.

Best regards
Nikolaos Kavvadias
http://www.nkavvadias.com

Speeding up computations in sequential algorithm

Marios Barlas

Guest

Brian Drummond

Guest

rickman

Guest

Nikolaos Kavvadias

Guest

Nikolaos Kavvadias

Guest

Welcome to EDABoard.com

Sponsor

Online statistics

Forum statistics

Speeding up computations in sequential algorithm

Marios Barlas

Guest

Brian Drummond

Guest

rickman

Guest

Nikolaos Kavvadias

Guest

Nikolaos Kavvadias

Guest

Log in

Welcome to EDABoard.com

Sponsor