need a cheap student edition FPGA


Having two bits hot in a one-hot FSM would normally be a bad thing. But I
was wondering if anybody does this purposely, in order to fork, which
might be a syntactically nicer way to have a concurrent FSM.
I wonder about this too. I am currently doing a pipeline and
some code is shown below. So I wrote out the states without
an array so when the ModelSim comes up I don't have to expand
the states to see them. I also group signals that I want to
see associated with the states in the declarative region so
I don't have to futz too much with the ModelSim.

Some states stay on longer with the state_count condition.
I read one header record on state31 and follow it with three
of another type of data record with state32.

Now the "branching" happens because these states address
a NoBL SRAM and there is a two cycle lag between the
address and the data. Not shown below, I also have clock
delays on these states, state32_1, state32_2, and so on
so when the address goes out on state32, I then have data
to process on state32_2.
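
A minimal behavioral sketch of the delayed-strobe idea described above, written in Python rather than VHDL so it can be run directly. It assumes the two-cycle SRAM lag mentioned in the post; the function name run_delay_line and the example trace are made up for illustration and are not taken from the actual design.

```python
from collections import deque

def run_delay_line(state32_trace, latency=2):
    """Return, per clock, the tuple (state32, state32_1, ..., state32_<latency>)."""
    # Registered copies of state32; index 0 is state32_1, index 1 is state32_2.
    taps = deque([0] * latency, maxlen=latency)
    history = []
    for state32 in state32_trace:
        history.append((state32, *taps))
        taps.appendleft(state32)  # clock edge: shift the strobe one stage down
    return history

# Hypothetical trace: the address is driven on clocks 1, 2 and 3.
trace = [0, 1, 1, 1, 0, 0, 0]
for clk, row in enumerate(run_delay_line(trace)):
    print(clk, row)
```

In this model the last tap (state32_2) is high exactly two clocks after each address clock, which is when the SRAM data would be available to process.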

In my Zen thinking about this,
I always have a When state associated with every What.

It actually gets deeper than that because there are FIFOs
involved as well. You'll need FIFOs in your design if
you are going to tackle a Sobel function. Here the trick
is to start thinking about your processes starting from
the READ data and figure out how many delays you need to
deliver an answer, then figure out where the WRITE data
should marry into the flow. I have now states out to _7.

Perhaps someone could suggest a better term than state
machine "forking"? And are there some guidelines on how
to code and debug a pipelined architecture? I'm with Kevin,
it gets real messy, real soon.

Brad Smallridge

if(clk'event and clk='1')then
inner_cell_restart <='0';
state30<='0'; state31<='0'; state32<='0'; state33<='0';
state34<='0'; state35<='0'; state36<='0'; state37<='0';
inner_cell_rd_ad <= std_logic_vector(to_unsigned(inner_cell_start,18));
inner_cell_wr_ad <= std_logic_vector(to_unsigned(inner_cell_start,18));
state_count <= (others=>'0');
elsif(state29='1')then -- state29 automatically turns off from
--State30 Initial inner cell state
state_count <= (others=>'0');
-- State31 Read the inner cell
state31 <='0';
inner_cell_rd_ad <= inner_cell_rd_ad+1;
-- State32 Read the inner cell connections
state_count <= (others=>'0');
state_count <= state_count+1;
end if;
inner_cell_rd_ad <= inner_cell_rd_ad+1;
-- State33 Wait for SRAM to deliver first connection
state_count <= (others=>'0');
-- State34 Read connection
state_count <= (others=>'0');
state_count <= state_count+1;
end if;
. . .
"Brad Smallridge" <> wrote in message

Having two bits hot in a one-hot FSM would normally be a bad thing.
'One hot' is a particular implementation of an FSM, but from a logic
perspective (i.e. how you go about designing your state machine, the states
needed, the branching, etc.) it means absolutely nothing.

But I was wondering if anybody does this purposely, in order to fork,
which might be a syntactically nicer way to have a concurrent FSM.
'Concurrent' state machines though is simply another way of saying state
machines that either are totally independent of one another, or only loosely
connected (i.e. there is some signalling going on between the two, but you
can *usually* futz with one without breaking the other).

As I mentioned in more detail in my response on 'Style for Highly-Pipelined
State Machines', I only really see two basic approaches:

- The first is some form of counting or adding states where you have a known
fixed number of states between 'doing this' and 'doing that'. This method
works but it really quickly leads to rather complicated code that is
difficult to understand and (because of the complexity) probably has some
logic holes as well that may take some time to surface. In certain designs
though this method is just fine and the results are fine and easy to
maintain. The problem though is when the realization sets in that the code
is getting out of control and how to manage it (which was the point of the
other thread).

- The second method uses request/acknowledge handshaking between the
'concurrent' state machines. This method scales very nicely from a design
perspective and is just as efficient from an implementation perspective as

Bottom line here is to realize that a 'fork in an FSM' is really a call to
think of it as two separate state machines that have a
communication/signalling requirement and don't try to force your mental
model as being 'one' state machine. After all, the entire design can be
considered to be a single state machine...but it is generally of no use to
think of it that way from a design perspective.

Now the "branching" happens because these states address
a NoBL SRAM and there is a two cycle lag between the
address and the data. Not shown below, I also have clock
delays on these states, state32_1, state32_2, and so on
so when the address goes out on state32, I then have data
to process on state32_2.

In my Zen thinking about this,
I always have a When state associated with every What.

It actually gets deeper than that because there are FIFOs
involved as well. You'll need FIFOs in your design if
you are going to tackle a Sobel function. Here the trick
is to start thinking about your processes starting from
the READ data and figure out how many delays you need to
deliver an answer, then figure out where the WRITE data
should marry into the flow. I have now states out to _7.
Try looking at it now from a somewhat different perspective. Let's say you
have one state machine whose sole purpose is to generate read and write
commands and addresses to the memory but not to process the data at all. In
addition there is a second state machine whose sole purpose is to process
the data that gets read back from memory and produce some sort of result,
that maybe goes to memory, maybe goes somewhere else, it doesn't matter.

If the 'address generator' state machine needs the results from some
computation in order to proceed, then it sends a read request to the
'data processor' state machine, and waits until it gets the acknowledge. It
may sound queer, but what that does is allow you to design with any sort of
latency and does not require any a priori knowledge of how many clock ticks
it will take to get that result back; the address generator state machine is
waiting for the acknowledge back from the data processor state machine
(which in turn is waiting for the data to come back from the memory).
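
As a rough illustration of that point, here is a small Python model (not HDL) of an address generator that issues a request and then sits in a wait state until the acknowledge arrives. The state names ISSUE and WAIT_ACK, the function simulate and the latency values are all assumptions made up for this sketch; the only thing it is meant to show is that the generator's logic does not change when the latency does.

```python
def simulate(mem_latency, n_reads=3, max_clocks=200):
    """Return the clocks at which the address generator sees an acknowledge."""
    gen_state = "ISSUE"   # address generator: ISSUE -> WAIT_ACK -> ISSUE ...
    timer = None          # data processor: clocks left until the data is back
    ack_clocks = []
    for clk in range(max_clocks):
        req = gen_state == "ISSUE"   # request is asserted for one clock
        ack = timer == 0             # processor acknowledges once data arrived

        # Next state of the address generator: it only ever looks at ack.
        if req:
            next_state = "WAIT_ACK"
        elif ack:
            next_state = "ISSUE"
            ack_clocks.append(clk)
        else:
            next_state = gen_state

        # Next state of the data processor: it alone knows the latency.
        if req:
            next_timer = mem_latency
        elif ack:
            next_timer = None
        elif timer is not None:
            next_timer = timer - 1
        else:
            next_timer = None

        gen_state, timer = next_state, next_timer
        if len(ack_clocks) == n_reads:
            break
    return ack_clocks

print(simulate(mem_latency=2))  # short latency, e.g. an SRAM-like memory
print(simulate(mem_latency=7))  # longer latency, same generator logic
```

Only the mem_latency argument differs between the two runs; the generator's next-state logic is untouched, which is the property being argued for.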

Now you could argue that the data processor state machine can't just
*process data*, it most likely needs to know what it is supposed to be doing
with it, and that knowledge likely lives in the address generator state
machine. Fair enough, but all that means is that the address generator
needs to be able to send commands over to the data processor. This could be
done in an ad hoc manner by setting some signal (or more likely multiple
signals) that are inputs to the data processor. This method works fine.
You can also view this interface somewhat more abstractly as the data
processor having a command port that is written to by the address
generator....and again, a simple request/acknowledge handshake is all you
need here as well. In many cases though the simple signal(s) is sufficient,
but in other cases, the two state machines interact a bit more closely and a
more well defined communication channel between the two will clear up a lot
from the perspective of getting to a functional design that is easy to

Perhaps someone could suggest a better term than state
machine "forking"?
Separate state machines that signal each other in some fashion.

And are there some guidelines on how
to code and debug a pipelined architecture? I'm with Kevin,
it gets real messy, real soon.
Ponder a bit on breaking up things as I've suggested, starting with something
small, and see how it comes out. It may take a bit to get used to, but the
end result is smaller, easier to understand and debug state machines (albeit
more of them) that communicate over well defined internal interfaces.
You'll also find that changes (like switching the NoBL SRAM to DRAM as an
example) can be accommodated without having to change *everything*.

Kevin Jennings
"Kevin Neilson" <> wrote in message
KJ wrote:
"Kevin Neilson" <> wrote in message
My question: what is the cleanest way to describe an FSM requiring
The other thing to consider is whether the latency being introduced by
this outsourced logic needs to be 'compensated for' in some fashion or is
it OK to simply wait for the acknowledge. In some instances, it is fine
for the FSM to simply wait in a particular state until the acknowledge
comes back. In others you need to be feeding new data into the
hunk-o-logic on every clock cycle even though you haven't got the results
from the first back. In that situation you still have the req/ack pair
but now the ack is simply saying that the request has been accepted for
processing, the actual results will be coming out later. Now the
hunk-o-logic needs an additional output to flag when the output is
actually valid. This output data valid signal would typically tend to
feed into a separate FSM or some other logic (i.e. 'usually' not the
first FSM). The first FSM controls feeding stuff in, the second FSM or
other processing logic is in charge of taking up the outputs and doing
something with it.
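
A hedged Python sketch of that second situation: a fully pipelined block that accepts a new input on every clock and flags its outputs with a valid signal some clocks later. The three-clock latency, the squaring function and the name pipelined_unit are placeholders invented for this example, and None is used to model an empty stage (valid low).

```python
from collections import deque

def pipelined_unit(inputs, latency=3, func=lambda x: x * x):
    """Feed one input per clock; return (clock, result) for clocks where valid is high."""
    # Each stage holds a result in flight; None marks an empty stage.
    pipe = deque([None] * latency, maxlen=latency)
    outputs = []
    feed = list(inputs) + [None] * latency   # extra clocks to drain the pipe
    for clk, item in enumerate(feed):
        out = pipe[-1]                       # oldest stage drives the output
        if out is not None:                  # output data valid flag
            outputs.append((clk, out))
        # Clock edge: accept the new request (or a bubble) and shift the pipe.
        pipe.appendleft(None if item is None else func(item))
    return outputs

print(pipelined_unit([1, 2, 3, 4]))
```

The feeding side never waits for a result; a separate consumer only has to watch the valid flag, which matches the split between the first FSM and the second FSM described above.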


"Jack" <> wrote in message
Hi ~
I have a question about assign.


output a;
output b;
output c;

reg a;
reg b;
reg [2:0] count;
wire c;

assign c = (a || b) ? count + 1 : 0;

Here, a and b are used as inputs.
I think that is not good, but some other people said it is fine.
I don't know whether it is right or wrong.

If you declare them as outputs, and then don't drive them to any
value in particular, that's not likely to be good.
You'll also find that changes (like switching the NoBL SRAM to DRAM as an
example) can be accommodated without having to change *everything*.
That has been on my mind because there is a DRAM on my board. Not only
will the DRAM require more cycles but perhaps too a varying number of
cycles depending on the sequentiality or randomness of the addressing.

I have seen controllers on the Xilinx site, but nothing that talks about
several ports and how the handshaking is handled. My FAE has said that
some multiport examples are available.

Brad Smallridge
"Kevin Neilson" <> wrote in message
KJ wrote:
On May 5, 12:13 pm, "Brad Smallridge" <
You'll also find that changes (like switching the NoBL SRAM to DRAM as an
example) can be accommodated without having to change *everything*.
Designing a request/acknowledge interface to some other process or
entity (in this case the 'other' being a DRAM controller) results in a
much easier to maintain design.

Using the exact same interface signal functionality whether one is
talking to internal FPGA memory, NoBL or SDRAM or SPI results in a
design that can be reused, retargeted and improved upon if necessary.
Kevin Jennings

This is a great example, because switching from one type of RAM to another
means you *do* have to change everything, if you want the controller to be
The methodology I use makes use of every clock cycle, DRAMs are running full
tilt, transfers from fast FPGA through a PCI bus to some other processor,
etc., the whole 9 yards.

You can certainly modularize the code and make concurrent SMs with
handshaking and this is easy to maintain. And a lot of DRAM controllers
are designed this way. But here is the problem: while you are waiting
around for acknowledges, you have just wasted a bunch of memory bandwidth.
"Kevin Neilson" <> wrote in message
KJ wrote:
"Kevin Neilson" <> wrote in message
KJ wrote:
"Kevin Neilson" <> wrote in message
My question: what is the cleanest way to describe an FSM requiring

Well, just the fact that you're time sharing the DSP48 means that you're
not processing something new on every clock cycle which just screams out
to me that you'd want to implement this with a request/acknowledge type
of framework. ...

Kevin Jennings
But I *do* have to process something on every cycle.
You're not able to process a new set of 'a', 'b', 'c' and 'd' on every clock
cycle since the DSP48 is time shared (by your choice) and that was my point.
Time multiplexing the DSP48 to keep *it* busy on every clock cycle is not
the same thing.

Consider that I have to process these two equations:

y0 <= (a0*b0+c0)*d0;
y1 <= (a1*b1+c1);

Now, if you look at the structure of the DSP48, you can see that I can't
even process these two sequentially. I can send off (a0*b0+c0)*d0 to the
black box thingy you speak of, but this can't be processed without dead
cycles: I have to get the result of (a0*b0+c0) before I multiply it with
d0, and if the DSP48 is fully pipelined, that means that the multiplier is
unused for three cycles.
That's only true if the addition can't be done combinatorially. If it can
then the calculation of 'y0' takes two clock cycles and the DSP48 is fully
utilized. The answer pops out after two clock cycles of latency, the DSP48
hums along doing something useful on every tick.

It's similar to a superscalar process with dependencies. So I have to
reschedule: I put (a0*b0+c0) into the pipe, then put in (a1*b1+c1) (which
has no dependency on what is in the pipe), and then when the result of
(a0*b0+c0) pops out I can feed it back into the DSP48 and multiply it with
d0 to get y0. In the meantime y1 pops out. Without this intermixed
scheduling I end up with too many dead cycles and then I need to use too
many DSP48s.
And depending on just what the bottlenecks in the design are, one can do all
kinds of things. But no matter what, you still need to interface *to* that
thing, no matter what it does and no matter how wide of an input vector it
takes (i.e. a0, b0, c0, d0, a1, b1, c1...if that's what it takes). In other
words, a0, b0, c0, d0, a1, b1 and c1 all need to get in somehow; y0 and y1
both need to make it out and you need to flag when they are valid and that
flagging is functionally the same thing as handshaking.
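
To make the rescheduling and the valid flagging concrete, here is a small Python model of a single time-shared multiply-add pipe computing y0 = (a0*b0+c0)*d0 and y1 = a1*b1+c1. The three-clock latency, the operand values and the tag names are assumptions chosen for illustration; this is not a model of the real DSP48 register structure.

```python
def interleaved_schedule(latency=3):
    """Issue t0 = a0*b0+c0, then y1, then y0 = t0*d0 once t0 has popped out."""
    a0, b0, c0, d0 = 2, 3, 4, 5      # hypothetical operands
    a1, b1, c1 = 1, 2, 3
    in_flight = {}                   # tag -> (clock the result appears, value)
    results = {}                     # tag -> (clock, value), i.e. "valid" events
    issued = []                      # (clock, tag); at most one issue per clock

    def issue(clk, tag, a, b, c):
        in_flight[tag] = (clk + latency, a * b + c)
        issued.append((clk, tag))

    clk = 0
    while "y0" not in results or "y1" not in results:
        # Collect whatever the pipe delivers on this clock.
        for tag, (ready, value) in list(in_flight.items()):
            if ready == clk:
                results[tag] = (clk, value)
                del in_flight[tag]
        if clk == 0:
            issue(clk, "t0", a0, b0, c0)
        elif clk == 1:
            issue(clk, "y1", a1, b1, c1)          # independent work fills the gap
        elif "t0" in results and all(tag != "y0" for _, tag in issued):
            issue(clk, "y0", results["t0"][1], d0, 0)  # feed t0 back in
        clk += 1
    return issued, results

issued, results = interleaved_schedule()
print(issued)
print(results["y0"], results["y1"])
```

Whatever does the scheduling, the entries of results play the role of the valid flags: something downstream has to be told on which clock y0 and y1 are actually there.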

Kevin Jennings
== NOTE: I automatically delete all Google Group posts due to uncontrolled
John_H wrote:
googler wrote:
I am trying to design an asynchronous FIFO that works with two clock
domains clk_A and clk_B. Writes to the FIFO happen on clk_A and reads
happen on clk_B. clk_A is faster than clk_B. I thought about using
Gray-coded write and read pointers, which are usually recommended for
designing async fifos, but the problem is that in this case the depth
of the FIFO is 688, which is not a power of 2. So it seems I cannot
use the Gray pointer technique. Is that right?
Gray counters are often easier to design as binary counters with a
binary-to-gray conversion such that you make the increment from binary
343 to binary -344 and let the Gray conversion do its stuff.
But make sure that the Gray value comes directly out of a flop before it
crosses into the other clock domain. Don't put the conversion logic
in-between the clock domains.
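
The 343 to -344 trick can be checked with a few lines of Python. This is only a sanity check of the counting scheme, assuming a 10-bit two's-complement counter that runs from -344 to 343 (688 values); the helper names are made up for the example.

```python
def to_gray(value, width):
    """Binary-to-Gray conversion of a `width`-bit two's-complement pattern."""
    bits = value & ((1 << width) - 1)   # map negative counts into `width` bits
    return bits ^ (bits >> 1)

def gray_sequence(depth=688, width=10):
    """Gray codes for a counter running from -depth/2 to depth/2 - 1."""
    half = depth // 2                   # the scheme needs an even depth
    return [to_gray(count, width) for count in range(-half, half)]

def one_bit_steps(seq):
    """True if every step, including the wrap back to the start, flips one bit."""
    pairs = zip(seq, seq[1:] + seq[:1])
    return all(bin(a ^ b).count("1") == 1 for a, b in pairs)

seq = gray_sequence()
print(len(set(seq)), one_bit_steps(seq))
```

The wrap works because 343 and -344 are bitwise complements of each other, so their Gray codes differ only in the top bit.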


John Penton, posting as an individual unless specifically indicated
googler wrote:
I am trying to design an asynchronous FIFO that works with two clock
domains clk_A and clk_B. Writes to the FIFO happen on clk_A and reads
happen on clk_B. clk_A is faster than clk_B. I thought about using
Gray-coded write and read pointers, which are usually recommended for
designing async fifos, but the problem is that in this case the depth
of the FIFO is 688, which is not a power of 2. So it seems I cannot
use the Gray pointer technique. Is that right?

Can this be done using binary pointers instead of using Gray pointers?
Maybe we can use handshaking technique, exchanging 'ready' and
'acknowledge' between the two sides whenever the write/read pointer is
to be sent across clock domain. So suppose we want to send the write
pointer from clk_A domain to clk_B domain. When write pointer in clk_A
domain changes, it generates a 'ready' that goes to clk_B domain. This
'ready' is synchronized in clk_B domain and then clk_B reads the value
of the write pointer. Then clk_B domain sends 'acknowledge' back to
clk_A domain (where it is synchronized). Only after clk_A domain gets
this 'acknowledge' back, it can change the write pointer value. Is
this concept correct? If yes, then I guess there is still a problem.
In the above example, the write side is on faster clock. If the write
pointer cannot change until write side gets back 'acknowledge', then
we probably need to stall writes to the FIFO, correct?

Please advise how this can be done. Thanks.
Others have commented that a Gray scheme can be made to work. Your scheme
above is also reasonable (if rather high latency). Of course you need a
ready and ack for both the pointers. FIFOs almost always need a method to
stall the input side - unless you can guarantee (by analysis) that it can
never fill up. However, to avoid the excessive delay that you envisage you
could have two sets of registers for each pointer. A current state -
recording how many items have been written or read, and then a state for
transmission - which captures the current state and holds it stable while
the handshake is ongoing.
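
A toy Python model of the two-register idea, for one pointer only. It assumes a write on every clock, a fixed handshake duration of four clocks, and no full/empty checking at all; the names current, held and snapshot_handshake are invented for the sketch.

```python
def snapshot_handshake(n_clocks, handshake_clocks=4):
    """Return the final pointer value and the snapshots the other side received."""
    current = 0   # current state: advances on every write, never stalls here
    held = 0      # transmission copy: stable while a handshake is in progress
    busy = 0      # clocks left in the ongoing ready/acknowledge exchange
    seen = []     # pointer values delivered to the other clock domain
    for _ in range(n_clocks):
        if busy == 0:
            held = current            # capture a snapshot, start a new handshake
            busy = handshake_clocks
        else:
            busy -= 1
            if busy == 0:
                seen.append(held)     # acknowledge: the held value was read safely
        current += 1                  # a write happens on every clock in this model
    return current, seen

print(snapshot_handshake(12))
```

The receiving side sees an out-of-date but stable pointer, which is conservative for full/empty decisions, while the writer keeps going between handshakes.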

A final thought I have is that 688 entries is pretty big. In particular, I
think it requires a 688-input mux (or muxes) to read out the data. I
suppose that is only 11 gates deep, but if the FIFO is particularly wide, it
will also involve a lot of routing (and of course, it is possible to break
timing across an asynchronous interface). I would be thinking about a
two-stage system, with a large synchronous FIFO on the front (possibly built
with a RAM if the data is wide) and a small asynchronous FIFO on the back.
I guess it all depends on the data characteristics and the design


John Penton, posting as an individual unless specifically indicated
SweetMusic <> wrote:
solved thx^^
Uwe Bonnes

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------


