Style for Highly-Pipelined State Machines

K

Kevin Neilson

Guest
I'm designing another FSM and I've run into the problem I always have
when trying to pipeline them for high-speed designs. I'll show a simple
example.

STATE2: begin
if (condition)
begin
state <= STATE3;
y <= a*b+c;
end
end

The problem occurs if I need to add pipelining to meet speed
requirements: in this case, the multiplier inputs, output, and adder
output must be registered. So I have to end up doing this in the FSM:

STATE2: begin
if (condition)
begin
state <= STATE3;
a_pipe <= a;
b_pipe <= b;
end
end

and outside the FSM I have to have concurrent FSMs or other logic:

always@(posedge clk)
begin
m <= a_pipe * b_pipe;
p <= m + c;
y <= p;
end

The whole simplicity of the FSM is destroyed. Debugging and maintaining
it becomes very difficult. The output y isn't available for three
cycles, so I have to figure out what states the FSM might be in at that
point and ensure that I don't use y before it's ready. Using any sort
of FSM editor (like those bubble diagram schematic editors) is
completely precluded. A similar problem crops up when 'condition' is a
complex equation and has to be pipelined into the past. A common
example if 'condition' has a comparison that doesn't meet timing:

if (stuff && wide_counter==32'b3) ...

If this doesn't meet timing, I can instead pipeline the condition
outside the FSM:

wide_counter_eq_3 <= wide_counter==32'b2; // will be 3 on next cycle

and then inside the FSM:

if (stuff && wide_counter_eq_3) ...

but this can get messy and get screwed up if the counter doesn't always
increment from 2 to 3.

My question: what is the cleanest way to describe an FSM requiring
pipelining? Is there some sort of tool that will let me make a nice
bubble diagram but also indicate which operations need pipelined and
will then generate the proper synthesizable HDL?
-Kevin
 
Kevin Neilson wrote:

My question: what is the cleanest way to describe an FSM requiring
pipelining?
I don't know, but I use a step counter
and a case of that step counter in
a synchronous block.

-- Mike Treseler
 
KJ wrote:
"Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message
news:fv7i38$69n6@cnn.xsj.xilinx.com...
My question: what is the cleanest way to describe an FSM requiring
pipelining?
....
The other thing to consider is whether the latency being introduced by this
outsourced logic needs to be 'compensated for' in some fashion or is it OK
to simply wait for the acknowledge. In some instances, it is fine for the
FSM to simply wait in a particular state until the acknowledge comes back.
In others you need to be feeding new data into the hunk-o-logic on every
clock cycle even though you haven't got the results from the first back. In
that situation you still have the req/ack pair but now the ack is simply
saying that the request has been accepted for processing, the actual results
will be coming out later. Now the hunk-o-logic needs an additional output
to flag when the output is actually valid. This output data valid signal
would typically tend to feed into a separate FSM or some other logic (i.e.
'usually' not the first FSM). The first FSM controls feeding stuff in, the
second FSM or other processing logic is in charge of taking up the outputs
and doing something with it.

....

Kevin Jennings

In this case I do indeed have to continue to keep the pipe full, so
inserting wait states is not an option. And the latency of the "hunk of
logic", aka concurrent process, is actually significant because I have
to get the result and feed it back into the FSM. This example shows why:

STATE2: begin
if (condition)
begin
state <= STATE3;
y <= (a*b+c)*d;
end
end

I have to get the result (a*b+c) and then feed it back into the FSM so I
can multiply by d. Why not just let the concurrent process handle that?
Because I want to limit my resource usage to a single DSP48, so I have
to schedule the multiplications inside the FSM. But I'll have to check
out the Wishbone thing you're talking about.
-Kevin
 
Kevin Neilson wrote:

I have to get the result (a*b+c) and then feed it back into the FSM so I
can multiply by d. Why not just let the concurrent process handle that?
Because I want to limit my resource usage to a single DSP48, so I have
to schedule the multiplications inside the FSM.
I still like the idea of a step counter.

On tick one, I do x <= a * b *c;
On tick two, I do y <= x * d;
and so on ...

-- Mike Treseler
 
I think you should do is put your "piplied" .
the number of state you need to add == the number of cycle you need to
use for (a*b+c)*d. that's you exactly add the operation inside the
states. but not out side the FSM


On May 2, 2:20 pm, Kevin Neilson <kevin_neil...@removethiscomcast.net>
wrote:
KJ wrote:
"Kevin Neilson" <kevin_neil...@removethiscomcast.net> wrote in message
news:fv7i38$69n6@cnn.xsj.xilinx.com...
My question:  what is the cleanest way to describe an FSM requiring
pipelining?
...
The other thing to consider is whether the latency being introduced by this
outsourced logic needs to be 'compensated for' in some fashion or is it OK
to simply wait for the acknowledge.  In some instances, it is fine for the
FSM to simply wait in a particular state until the acknowledge comes back.
In others you need to be feeding new data into the hunk-o-logic on every
clock cycle even though you haven't got the results from the first back.  In
that situation you still have the req/ack pair but now the ack is simply
saying that the request has been accepted for processing, the actual results
will be coming out later.  Now the hunk-o-logic needs an additional output
to flag when the output is actually valid.  This output data valid signal
would typically tend to feed into a separate FSM or some other logic (i.e.
'usually' not the first FSM).  The first FSM controls feeding stuff in, the
second FSM or other processing logic is in charge of taking up the outputs
and doing something with it.

...

Kevin Jennings

In this case I do indeed have to continue to keep the pipe full, so
inserting wait states is not an option.  And the latency of the "hunk of
logic", aka concurrent process, is actually significant because I have
to get the result and feed it back into the FSM.  This example shows why:

STATE2: begin
   if (condition)
     begin
       state <= STATE3;
       y     <= (a*b+c)*d;
     end
end

I have to get the result (a*b+c) and then feed it back into the FSM so I
can multiply by d.  Why not just let the concurrent process handle that?
  Because I want to limit my resource usage to a single DSP48, so I have
to schedule the multiplications inside the FSM.  But I'll have to check
out the Wishbone thing you're talking about.
-Kevin
 
KJ wrote:
"Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message
news:fvfm29$on81@cnn.xsj.xilinx.com...
KJ wrote:
"Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message
news:fv7i38$69n6@cnn.xsj.xilinx.com...
My question: what is the cleanest way to describe an FSM requiring
pipelining?
...
The other thing to consider is whether the latency being introduced by
this outsourced logic needs to be 'compensated for' in some fashion or is
it OK to simply wait for the acknowledge. In some instances, it is fine
for the FSM to simply wait in a particular state until the acknowledge
comes back. In others you need to be feeding new data into the
hunk-o-logic on every clock cycle even though you haven't got the results
from the first back. In that situation you still have the req/ack pair
but now the ack is simply saying that the request has been accepted for
processing, the actual results will be coming out later. Now the
hunk-o-logic needs an additional output to flag when the output is
actually valid. This output data valid signal would typically tend to
feed into a separate FSM or some other logic (i.e. 'usually' not the
first FSM). The first FSM controls feeding stuff in, the second FSM or
other processing logic is in charge of taking up the outputs and doing
something with it.

...
Kevin Jennings
In this case I do indeed have to continue to keep the pipe full, so
inserting wait states is not an option. And the latency of the "hunk of
logic", aka concurrent process, is actually significant because I have to
get the result and feed it back into the FSM. This example shows why:

STATE2: begin
if (condition)
begin
state <= STATE3;
y <= (a*b+c)*d;
end
end

I have to get the result (a*b+c) and then feed it back into the FSM so I
can multiply by d. Why not just let the concurrent process handle that?
Because I want to limit my resource usage to a single DSP48, so I have to
schedule the multiplications inside the FSM. But I'll have to check out
the Wishbone thing you're talking about.

Well, just the fact that you're time sharing the DSP48 means that you're not
processing something new on every clock cycle which just screams out to me
that you'd want to implement this with a request/acknowledge type of
framework. ...

Kevin Jennings


But I *do* have to process something on every cycle. Consider that I
have to process these two equations:

y0 <= (a0*b0+c0)*d0;
y1 <= (a1*b1+c1);

Now, if you look at the structure of the DSP48, you can see that I can't
even process these two sequentially. I can send off (a0*b0+c0)*d0 to
the black box thingy you speak of, but this can't be processed without
dead cycles: I have to get the result of (a0*b0+c0) before I multiply
it with d0, and if the DSP48 is fully pipelined, that means that the
multiplier is unused for three cycles. It's similar to a superscalar
process with dependencies. So I have to reschedule: I put (a0*b0+c0)
into the pipe, then put in (a1*b1+c1) (which has no dependency on what
is in the pipe), and then when the result of (a0*b0+c0) pops out I can
feed it back into the DSP48 and multiply it with d0 to get y0. In the
meantime y1 pops out. Without this intermixed scheduling I end up with
too many dead cycles and then I need to use too many DSP48s.

Maybe what I really want is a way to take the superscalar scheduling
techniques that have been used for a long time and apply them in a
similar way to HDL.
-Kevin
 
Mike Treseler wrote:
Kevin Neilson wrote:

I have to get the result (a*b+c) and then feed it back into the FSM so I
can multiply by d. Why not just let the concurrent process handle that?
Because I want to limit my resource usage to a single DSP48, so I have
to schedule the multiplications inside the FSM.

I still like the idea of a step counter.

On tick one, I do x <= a * b *c;
On tick two, I do y <= x * d;
and so on ...

-- Mike Treseler
That is essentially what I'm doing; I'm just trying to find a
syntactically better way to design this pipelined stuff without having a
bunch of interdependent concurrent FSMs (or a single FSM and a bunch of
logic outside). -Kevin
 

Welcome to EDABoard.com

Sponsor

Back
Top