A basic code-hardware-timing question

Marwan

Peace,

Consider: -

The following represents the kind of code I am dealing with.

Task A:
(Clocked) Iterative structure (value of iteration)
begin
c <= a*b;
d <= b*f;
g <= c/a;
k <= g+d;
end

(Ignore the sizes of these variables and any fixed point
arithmetic considerations; they are straightforward to deal with and
beside the point for my question.)

Typically later calculations depend on previous ones and series of
calculations like these take place in iterative loops or nested
iterative loops (using counters, not for loops). I have numerous such
calculation sequences.

I only want to use 1 multiplier, 1 adder and 1 divider, so I cannot
have 2 or more multiplications, additions or divisions at the same
time. So 2 options I see are: -

1- If I use blocking assignments, I will fulfil the hardware usage
constraint, however, clock speed will suffer, as all the sequence of
calculations in Task A will need to take place in one clock cycle.
And where there are nested cycles of numerous calculations, the
clocking frequency would be greatly affected.
2- If I use nonblocking assignments then each calculation will take
place on a clock cycle, but all the calculations in the sequence of
Task A take place simultaneously, which can imply the use of several
multipliers, adders and dividers.

My question: -

Is it possible to use nonblocking assignments for Task A, yet still: -

* Use only 1 multiplier, 1 adder and 1 divider
* Have one calculation per clock cycle

If not, how would you suggest this situation is dealt with in a
minimum hardware situation? And what are the timing implications of
this?

Thank you for your time and consideration.
 
Marwan wrote:
My question: -
Is it possible to use nonblocking assignments for Task A, yet still: -
* Use only 1 multiplier, 1 adder and 1 divider
* Have one calculation per clock cycle
If not, how would you suggest this situation is dealt with in a
minimum hardware situation? And what are the timing implications of
this?
No, because a practical divide
will require more than one clock cycle.
However, if this is a homework problem,
practicality may not be needed.

Non-blocking assignments are safe iff those registers
are declared and used in one block. Example:
http://mysite.verizon.net/miketreseler/count_enable.v

To spread logic over time, I like to use
a case of a counter register. I work out
timing implications by simulation and trial synthesis.

Sharing logic means adding code to mux the inputs
and outputs of the shared section. This may
or may not save resources. Note that devices
with dsp blocks often have unused hardware multipliers.
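
As an illustration of the counter-and-case idea, here is a minimal Verilog sketch of Task A spread over four clock cycles with a single shared multiplier. The module name task_a_seq, the ports, the k_valid flag and WIDTH are assumptions made up for this example; it is not the code from the linked file. The inputs are assumed to stay stable for all four steps, and the divide is written with a plain "/" only to keep the sketch short.

```verilog
// Sketch only: Task A spread over four steps with one shared multiplier.
// All names and WIDTH are illustrative assumptions.
// The "/" stands in for a real divider, which would normally take
// several cycles; a == 0 is not handled here.
module task_a_seq #(parameter WIDTH = 16) (
    input  wire             clk,
    input  wire             rst,
    input  wire [WIDTH-1:0] a, b, f,   // assumed stable during steps 0..3
    output reg  [WIDTH-1:0] k,
    output reg              k_valid    // high for one cycle when k is new
);
    reg [1:0]       step;              // plays the role of a program counter
    reg [WIDTH-1:0] c, d, g;

    // Input muxes for the single multiplier: the step selects the operands.
    wire [WIDTH-1:0] mul_x   = (step == 2'd0) ? a : b;
    wire [WIDTH-1:0] mul_y   = (step == 2'd0) ? b : f;
    wire [WIDTH-1:0] mul_out = mul_x * mul_y;   // the only "*" in the design

    always @(posedge clk) begin
        if (rst) begin
            step    <= 2'd0;
            k_valid <= 1'b0;
        end else begin
            k_valid <= 1'b0;                    // default, overridden in step 3
            case (step)
                2'd0: c <= mul_out;             // c = a*b
                2'd1: d <= mul_out;             // d = b*f
                2'd2: g <= c / a;               // g = c/a, uses c from step 0
                2'd3: begin
                    k       <= g + d;           // k = g+d
                    k_valid <= 1'b1;
                end
            endcase
            step <= step + 2'd1;                // 2-bit counter wraps 3 -> 0
        end
    end
endmodule
```

The two conditional operators in front of the multiplier are the muxes referred to above. Whether they cost less than a second multiplier depends on the device, so trial synthesis is still the way to find out.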

Thank you for your time and consideration.
You are welcome.

-- Mike Treseler
 
On Aug 9, 4:58 pm, "Robert Miles" <robertmi...@bellsouthNOSPAM.net>
wrote:
In the particular case of Task A, see if you can apply the following
optimization:

g <= c/a;

g <= a*b/a;

g <= b;

Won't work in all similar tasks, though.
A more general solution is needed because as you rightly stated, your
approach has limited applicability. Task A was just something quick
typed up to show mult, div, add and inter-relationship between outputs
of one calculation and the input of another... so your little trick
there highlights that small oversight.

Peace
 
Marwan wrote:
On Aug 9, 5:58 pm, Mike Treseler <mtrese...@gmail.com> wrote:
No, because a practical divide
will require more than one clock cycle.
However, if this is a homework problem,
practicality may not be needed.

How about if I did not have to have one calculation per clock cycle?
That makes things easier.

Also are you saying that I would have to somehow control for the
cycles taken by a divider and multiplier (by analogy?) within a
sequence of calculations which include calculations that can take
place in one cycle?
If the complete calculation takes longer
than one clock period (1/clock_frequency), I have to break it up somehow.

That sounds difficult to deal with in a synchronous system... Somehow
making everything wait until the divider is finished before moving on
to the next iteration of the set of calculations...
I think of it as a program counter in a cpu.
If you don't like that, you can break it up structurally
or use a register pipeline.
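
One way to picture the program-counter approach when one step takes several cycles is a state register that simply stays in the divide state until the divider reports that it has finished. The sketch below makes several assumptions: the module and signal names are invented, and the divider is outside the module and is assumed to take a one-cycle div_start pulse and answer with a one-cycle div_done pulse while div_q holds the quotient.

```verilog
// Sketch only: add, then a multi-cycle divide, then add.
// Names and the start/done handshake are illustrative assumptions.
module add_div_add #(parameter WIDTH = 16) (
    input  wire             clk,
    input  wire             rst,
    input  wire             go,            // request one pass
    input  wire [WIDTH-1:0] e, w, r, d,    // assumed stable during a pass
    output reg              div_start,     // to the external divider
    output reg  [WIDTH-1:0] div_num,
    output reg  [WIDTH-1:0] div_den,
    input  wire             div_done,      // from the external divider
    input  wire [WIDTH-1:0] div_q,
    output reg  [WIDTH-1:0] result,
    output reg              result_valid
);
    localparam S_IDLE = 2'd0, S_ADD1 = 2'd1, S_DIV = 2'd2, S_ADD2 = 2'd3;
    reg [1:0]       state;
    reg [WIDTH-1:0] g;

    always @(posedge clk) begin
        if (rst) begin
            state        <= S_IDLE;
            div_start    <= 1'b0;
            result_valid <= 1'b0;
        end else begin
            div_start    <= 1'b0;          // defaults: one-cycle pulses
            result_valid <= 1'b0;
            case (state)
                S_IDLE: if (go) state <= S_ADD1;
                S_ADD1: begin              // 1 clock: first addition
                    div_num   <= e + w;
                    div_den   <= r;
                    div_start <= 1'b1;
                    state     <= S_DIV;
                end
                S_DIV: if (div_done) begin // n clocks: wait here
                    g     <= div_q;
                    state <= S_ADD2;
                end
                S_ADD2: begin              // 1 clock: second addition
                    result       <= g + d;
                    result_valid <= 1'b1;
                    state        <= S_IDLE;
                end
            endcase
        end
    end
endmodule
```

Nothing has to predict how long the divide takes; the state register just does not advance until div_done arrives, so the same code works for any divider latency.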

Also, why would a divide need more than one clock cycle? Surely it
can calculate asynchronously, and as long as the operating frequency
of the system is not too fast, you can get your result in one clock
cycle?
You might be able to arrange that with a 1 MHz clock.
Try it and see.

To spread logic over time, I like to use
a case of a counter register. I work out
timing implications by simulation and trial synthesis.

Where the counter counts up to the number of stages/cycles you wish to
spread your logic over?
That's it.

If I was not to reuse hardware I would need 10's of multipliers and
dividers... totally impractical. Also, I do not see how muxing inputs
and outputs could possibly be more expensive than dividers and
multipliers...
Depends what the hardware is and how much it costs.
Try it and see.
Start with a simple case.

Is there any way to make a general estimate for the number of clock
cycles taken for a Task B type situation... Or a Task A even?
Write code for a simple example.
Run a sim to check the cycle timing.
Look at the RTL viewer to check synthesis.
Edit code.
Repeat.

Thank you for your time.
You are welcome.
Good luck.

-- Mike Treseler
 
On Aug 9, 5:58 pm, Mike Treseler <mtrese...@gmail.com> wrote:
No, because a practical divide
will require more than one clock cycle.
However, if this is a homework problem,
practicality may not be needed.
How about if I did not have to have one calculation per clock cycle?

Also are you saying that I would have to somehow control for the
cycles taken by a divider and multiplier (by analogy?) within a
sequence of calculations which include calculations that can take
place in one cycle?

So if I have

(Clocked) Iterative structure (value of iteration)
begin
...
a <= b*c;
g <= d/r;
k <= e+w;
...
end

Then it could take

(Clocked) Iterative structure (value of iteration)
begin
1 clock cycle
n clock cycles
1 clock cycle
end
?

That sounds difficult to deal with in a synchronous system... Somehow
making everything wait until the divider is finished before moving on
to the next iteration of the set of calculations...

Also, why would a divide need more than one clock cycle? Surely it
can calculate asynchronously, and as long as the operating frequency
of the system is not too fast, you can get your result in one clock
cycle?

Non-blocking assignments are safe iff those registers
are declared and used in one block. Example: http://mysite.verizon.net/miketreseler/count_enable.v

To spread logic over time, I like to use
a case of a counter register. I work out
timing implications by simulation and trial synthesis.
Where the counter counts up to the number of stages/cycles you wish to
spread your logic over?

Sharing logic means adding code to mux the inputs
and outputs of the shared section. This may
or may not save resources. Note that devices
with dsp blocks often have unused hardware multipliers.
If I was not to reuse hardware I would need 10's of multipliers and
dividers... totally impractical. Also, I do not see how muxing inputs
and outputs could possibly be more expensive than dividers and
multipliers...

Ok an important question for me is this, if I do have the following
kind of code:-

Task B
(Clocked) Iterative structure (value of iteration)
begin
.
.
.
series of nonblocking mult, divide and add operations
.
.
(Clocked) Iterative structure (value of iteration)
begin
.
.
.
series of nonblocking mult, divide and add operations
.
.
end
end

Is there any way to make a general estimate for the number of clock
cycles taken for a Task B type situation... Or a Task A even?

Thank you for your time.
 
Thank you for your time. It is kind of you to give this time to a
stranger.

Peace
 
Marwan wrote:
(snip)
How about if I did not have to have one calculation per clock cycle?
For the divider, you probably want it pipelined.
Either that or iterative (less logic).

Also are you saying that I would have to somehow control for the
cycles taken by a divider and multiplier (by analogy?) within a
sequence of calculations which include calculations that can take
place in one cycle?

(Clocked) Iterative structure (value of iteration)
begin
...
a <= b*c;
g <= d/r;
k <= e+w;
...
end
It all depends.

It is possible to write a combinatorial multiplier or divider,
but there isn't much reason to do it. Both pipeline very
easily (especially on FPGA where registers are pretty much
free). A pipelined multiplier or divider will take N
clock cycles (N might be the width of the multiplier or
quotient), but can do two operations in N+1 cycles.

A pipelined multiplier takes about 2N times as much logic
as an adder, and a divider about 3N times an adder.

(There isn't much reason to limit the number of adders,
as the logic to reuse one will be bigger than the adder itself.)

If you have N clock cycles, but no need to pipeline the
adder or multiplier, then an iterative one takes about
three or four times the logic of an adder, and still
N clock cycles to complete the operations.
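
For reference, an iterative divider along these lines could look roughly like the sketch below. It is a plain restoring divider for unsigned operands that produces one quotient bit per clock. The module name div_iter, the start/done handshake and the register names are assumptions for the example, and a zero divisor is not handled.

```verilog
// Sketch only: iterative (restoring) unsigned divider, one quotient bit
// per clock, N iterations after a one-cycle load. Names are illustrative.
// Requires N >= 2. Division by zero gives a meaningless result.
module div_iter #(parameter N = 16) (
    input  wire         clk,
    input  wire         rst,
    input  wire         start,      // one-cycle pulse, operands valid
    input  wire [N-1:0] dividend,
    input  wire [N-1:0] divisor,
    output reg  [N-1:0] quotient,   // valid while done is high
    output reg          done        // one-cycle pulse
);
    reg [N-1:0]           part_rem; // partial remainder
    reg [N-1:0]           dvr;      // latched divisor
    reg [$clog2(N+1)-1:0] count;    // iterations left
    reg                   busy;

    // Shift the next dividend bit into the remainder, then try to subtract.
    wire [N:0] shifted = {part_rem, quotient[N-1]};
    wire [N:0] diff    = shifted - {1'b0, dvr};
    wire       fits    = ~diff[N];  // no borrow: the divisor fits

    always @(posedge clk) begin
        if (rst) begin
            busy <= 1'b0;
            done <= 1'b0;
        end else begin
            done <= 1'b0;
            if (start && !busy) begin
                part_rem <= {N{1'b0}};
                quotient <= dividend;   // doubles as the dividend shift register
                dvr      <= divisor;
                count    <= N;
                busy     <= 1'b1;
            end else if (busy) begin
                part_rem <= fits ? diff[N-1:0] : shifted[N-1:0];
                quotient <= {quotient[N-2:0], fits};
                count    <= count - 1'b1;
                if (count == 1) begin   // last bit is being shifted in
                    busy <= 1'b0;
                    done <= 1'b1;
                end
            end
        end
    end
endmodule
```

The datapath is essentially one subtractor plus a few registers, and it is reused for every bit instead of being replicated once per bit as in a pipelined divider.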

Then it could take

(Clocked) Iterative structure (value of iteration)
begin
1 clock cycle
n clock cycles
1 clock cycle
end
?

That sounds difficult to deal with in a synchronous system... Somehow
making everything wait until the divider is finished before moving on
to the next iteration of the set of calculations...

Also, why would a divide need more than one clock cycle? Surely it
can calculate asynchronously, and as long as the operating frequency
of the system is not too fast, you can get your result in one clock
cycle?
Yes, but it is a waste of logic. Only if logic is especially
cheap and you don't have a faster clock is there any reason
to do that.

Non-blocking assignments are safe iff those registers
are declared and used in one block. Example: http://mysite.verizon.net/miketreseler/count_enable.v
(snip)

If I was not to reuse hardware I would need 10's of multipliers and
dividers... totally impractical. Also, I do not see how muxing inputs
and outputs could possibly be more expensive than dividers and
multipliers...
Muxes are fairly expensive for FPGAs, but not so bad for others.
(snip)

Is there any way to make a general estimate for the number of clock
cycles taken for a Task B type situation... Or a Task A even?
Adders should work in one clock cycle. Multipliers in N cycles,
where N is the width of one of the operands. Divider in N
cycles, where N is the width of the quotient.

-- glen
 
Thank you everyone for the responses.

I did some research on Xilinx boards and did come across a lot of what
was written here. However, there is one question I am still
interested in and cannot find a nice answer to:

(Clocked) Iterative structure (value of iteration)
begin
1 clock cycle
n clock cycles
1 clock cycle
end
How does one practically deal with such a situation? Blocking
assignments would ensure that the order of processes works nicely;
however, operating frequency would suffer dramatically.

I can appreciate that with a pipelined multiplier or divider, one can
get outputs clock after clock with a continuous stream of input data
after a fixed number of clock cycles with a good clock frequency...
However, how can such a process be properly set up in a situation such
as the example above? The series of calculations in every iteration
cycle is subject to three constraints: -

1- The same data needs to be available to all the calculations.
2- The calculations preceding and following a multiplication or
division can depend on each other (the output of one can be the input
of another).
3- The outputs of the calculations in one iteration need to be
available to the next.

Nonblocking assignments would work nicely if one could somehow
properly align or set up the fully pipelined multiplier/divider to be
in synch with the data being dealt with in the surrounding
calculations, however this seems ridiculous as it implies hardware
that predicts the future!

Any gurus around here with experience dealing with such a situation?
 
To clarify: -

(Clocked) Iterative structure (value of iteration)
begin
1 clock cycle (an ADDITION)
n clock cycles (a MULTIPLICATION/DIVISION)
1 clock cycle (an ADDITION)
end
 
Marwan wrote:
(snip)

(Clocked) Iterative structure (value of iteration)
begin
1 clock cycle
n clock cycles
1 clock cycle
end

How does one practically deal with such a situation, blocking
assignments would ensure that the order of processes works nicely,
However operating frequency would suffer dramatically.
How would you build it using 7400 series TTL parts?

Mostly you have to think at different levels. Think about
it at the block level (what else can it do while doing the
multiply?), down to the gate and wire level.

-- glen
 
On Aug 20, 1:17 am, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
Marwan wrote:

(snip)

(Clocked) Iterative structure (value of iteration)
begin
1 clock cycle
n clock cycles
1 clock cycle
end
How does one practically deal with such a situation, blocking
assignments would ensure that the order of processes works nicely,
However operating frequency would suffer dramatically.

How would you build it using 7400 series TTL parts?

Mostly you have to think at different levels.  Think about
it at the block level (what else can it do while doing the
multiply?), down to the gate and wire level.

-- glen
I considered your suggestion; it makes sense. However, within the
context of what I am doing it can be very challenging. The solution
to the case that I highlighted above is to pipeline, or stagger, the
adders. All this will do is align the outputs of the adders with those
of the multipliers/dividers, and the speed will remain the same.

What do you think?
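
To make the alignment idea concrete, here is a small sketch in which a value that bypasses the multiplier is delayed by the same number of clocks as the multiplier pipeline, so that both arrive at the adder together. Everything in it is an assumption for illustration: the names, the LATENCY parameter, and the multiplier itself, which is only a stand-in (a "*" followed by registers) for whatever pipelined core would really be used.

```verilog
// Sketch only: match a side operand to a pipelined multiplier's latency.
// Names, WIDTH and LATENCY (>= 1) are illustrative assumptions.
module align_add #(parameter WIDTH = 16, parameter LATENCY = 3) (
    input  wire             clk,
    input  wire [WIDTH-1:0] x, y,     // multiplier operands
    input  wire [WIDTH-1:0] s,        // value to be added to the product
    output reg  [WIDTH-1:0] sum       // x*y + s, LATENCY+1 clocks later
);
    // Stand-in for a pipelined multiplier with LATENCY register stages.
    reg [WIDTH-1:0] prod_pipe [0:LATENCY-1];
    // Matching delay line so s stays in step with its own product.
    reg [WIDTH-1:0] s_pipe    [0:LATENCY-1];
    integer i;

    always @(posedge clk) begin
        prod_pipe[0] <= x * y;
        s_pipe[0]    <= s;
        for (i = 1; i < LATENCY; i = i + 1) begin
            prod_pipe[i] <= prod_pipe[i-1];
            s_pipe[i]    <= s_pipe[i-1];
        end
        sum <= prod_pipe[LATENCY-1] + s_pipe[LATENCY-1];
    end
endmodule
```

This keeps one result per clock for a stream of independent inputs. When each iteration needs the result of the previous one, the next input cannot enter until that result has come out, so the delay line aligns the data correctly but each iteration still costs the full latency.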
 