Combinatorial loops and false paths


Rob Doyle

I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU.
At this stage, most of the instruction set diagnostics simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings about
combinatorial loops involving the ALU - and an associated "Minimum
period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

My understanding is that if combinatorial loops really existed then the
simulation wouldn't stabilize. I can't really add pipelining or
registers to the design without affecting the microcode - and I don't
want to do that.

Most of the information that I've read about "false paths" assumes two
clocked processes, not a combinatorial loop.

Anyway. I'm not sure how to resolve this. I can mark the path as a
false path but I think that it will ignore /all/ the timing (even the
desired timing) through that path.

What should I do?

Rob.
 
Rob Doyle wrote:
I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU.
At this stage, most of the instruction set diagnostics simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings about
combinatorial loops involving the ALU - and an associated "Minimum
period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

My understanding is that if combinatorial loops really existed then the
simulation wouldn't stabilize. I can't really add pipelining or
registers to the design without affecting the microcode - and I don't
want to do that.
Combinatorial loops _with delay_ will simulate correctly. Otherwise
you couldn't simulate a ring oscillator.
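For example, a ring oscillator simulates only because each stage has a delay; a minimal Verilog sketch (the module, names, and delays are illustrative, not from the KS10 sources):

```verilog
// A deliberate combinatorial loop that simulates fine because each
// stage has an explicit #1 delay; with zero-delay gates the event
// queue would never settle. The NAND's enable breaks the initial X.
module ring_osc (
  input  wire en,
  output wire q
);
  wire a, b;
  assign #1 a = ~(q & en);  // stage 1 (NAND, so en=0 gives a stable state)
  assign #1 b = ~a;         // stage 2
  assign #1 q = ~b;         // stage 3: odd inversion count => oscillates
endmodule
```

With en low the loop settles; with en high it oscillates with a period set by the three unit delays, which a delayless loop could not do.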

Most of the information that I've read about "false paths" assumes two
clocked processes, not a combinatorial loop.

Anyway. I'm not sure how to resolve this. I can mark the path as a
false path but I think that it will ignore /all/ the timing (even the
desired timing) through that path.
Not necessarily true. False paths have a FROM and a TO specification,
and would not affect other paths that don't start at the FROM or
don't end at the TO timing group. This allows you for example to
say that you don't care how long a control register bit takes to
get through some logic, but you want the streaming data to get through
in the standard PERIOD time.
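In ISE UCF terms that is a FROM/TO TIMESPEC; a sketch, where the timing-group and register names are invented for illustration:

```ucf
# Time-ignore only paths that start at the control register and end
# at the datapath registers; every other path is still constrained
# by the normal PERIOD spec.
TIMEGRP "ctrl_ffs" = FFS("ctrl_reg*");
TIMEGRP "data_ffs" = FFS("data_reg*");
TIMESPEC "TS_ctrl_false" = FROM "ctrl_ffs" TO "data_ffs" TIG;
```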

What should I do?
You could always run your machine at 1.5 MHz. After all, how fast
was the PDP-10? Other than that, we'd probably need to analyze
this path to give any useful advice.

 
On 1/15/2013 12:54 AM, Rob Doyle wrote:
I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU.
At this stage, most of the instruction set diagnostics simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings about
combinatorial loops involving the ALU - and an associated "Minimum
period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

My understanding is that if combinatorial loops really existed then the
simulation wouldn't stabilize. I can't really add pipelining or
registers to the design without affecting the microcode - and I don't
want to do that.

Most of the information that I've read about "false paths" assumes two
clocked processes, not a combinatorial loop.

Anyway. I'm not sure how to resolve this. I can mark the path as a
false path but I think that it will ignore /all/ the timing (even the
desired timing) through that path.

What should I do?
Do you know why the tool is complaining? Did you write the code
describing the ALU? Off the top of my head, I can't think of why an ALU
would have a combinatorial loop. It should have two input data busses,
a number of control inputs, an output data bus and some output status
signals.

I don't recall the details of the 2901 bit slice and my data books are
not handy. That's the problem with paper books, you can't just shove
them on your hard drive... Does this part have an internal register
file? Even so, that means the part would have a clock and not a
combinatorial loop. Maybe this is because of some internal bus that is
shared in a way that looks like a loop even though it would never be
used that way?

I may have to find my old AMD data book. That could be an archeological
dig!

Rick
 
On Mon, 14 Jan 2013 22:54:53 -0700, Rob Doyle wrote:

I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU.
At this stage, most of the instruction set diagnostics simulate
correctly.

When I synthesize this design using Xilinx ISE I get warnings about
combinatorial loops involving the ALU - and an associated "Minimum
period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

What should I do?
Look at the critical path reported by synthesis. Sounds like a VHDL
coding error; that delay would equate to a chain of 300-400 LUTs between
FFs, which strongly suggests a mistake somewhere.

- Brian
 
On 1/15/2013 12:54 AM, Rob Doyle wrote:
I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU.
At this stage, most of the instruction set diagnostics simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings about
combinatorial loops involving the ALU - and an associated "Minimum
period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

My understanding is that if combinatorial loops really existed then the
simulation wouldn't stabilize. I can't really add pipelining or
registers to the design without affecting the microcode - and I don't
want to do that.

Most of the information that I've read about "false paths" assumes two
clocked processes, not a combinatorial loop.

Anyway. I'm not sure how to resolve this. I can mark the path as a
false path but I think that it will ignore /all/ the timing (even the
desired timing) through that path.

What should I do?

Rob.
I took a look at the block diagram and I don't see any combinatorial
loops. However, they use latches for the RAM outputs. These are
combinatorial if implemented that way. Typically latches are used
because they can provide a speed advantage: data flows through while
the latch is transparent, whereas D-type registers don't change outputs
until the clock edge.

Are the RAM and output latches in the path being reported as too long?
If so, I would recommend changing the latches to rising edge registers.
This should cut these loops. The RAM is level sensitive which may be
a problem in an FPGA. I think all of the sequential elements are edge
sensitive these days. I suppose you could make it out of latches; it's
not that many elements.
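The difference, in Verilog terms, looks something like this sketch (the names and the 36-bit width are illustrative):

```verilog
// A transparent latch passes data combinationally while 'en' is high,
// so it can take part in a combinatorial loop; a rising-edge register
// only updates at the clock edge and therefore cuts any such loop.
module latch_vs_reg (
  input  wire        clk, en,
  input  wire [35:0] d,
  output reg  [35:0] q_latch,  // level-sensitive: loop-capable
  output reg  [35:0] q_reg     // edge-triggered: loop-breaking
);
  always @* if (en) q_latch = d;      // infers a latch
  always @(posedge clk) q_reg <= d;   // infers a flip-flop
endmodule
```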

The clock runs down the center of the block diagram cutting the data
paths and preventing any internal loops I can see. Of course, there
could be combinatorial loops created by the way it is used. The only
one I see is created if you loop the ALU output Y back to the ALU input
D. This could happen if you try to put these busses on a tri-state bus.
Since tri-states aren't used in an FPGA, you are better off using
multiple separate busses to drive all the various inputs hanging on the
tri-state bus.

I think it would be very interesting to implement this in a low power
FPGA and see just how efficient it can become. What target were you
thinking of? I am currently working with the iCE40 family from Lattice
and it has very impressive power consumption. You likely could run a
design at some 10's of MHz while drawing the power level of an LED... a
small LED.

Any interest in making this a group project? Or are you keeping all the
fun to yourself?

BTW, where did you get documentation on the PDP-10 sufficient to design
an equivalent? Just from the instruction manual? Or do you have more
details?

Rick
 
rickman <gnuarm@gmail.com> wrote:
On 1/15/2013 12:54 AM, Rob Doyle wrote:

I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU.
At this stage, most of the instruction set diagnostics simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings about
combinatorial loops involving the ALU - and an associated "Minimum
period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

My understanding is that if combinatorial loops really existed then the
simulation wouldn't stabilize. I can't really add pipelining or
registers to the design without affecting the microcode - and I don't
want to do that.
Loops with an odd number of inversions won't stabilize, but with an
even number they should be fine.

(snip)

I took a look at the block diagram and I don't see any combinatorial
loops. However, they use latches for the RAM outputs. These are
combinatorial if implemented that way. Typically latches are used
because they can provide a speed advantage: data flows through while
the latch is transparent, whereas D-type registers don't change outputs
until the clock edge.
The tools should be good enough to figure out latches.

As I wrote above, though, be sure that there is no (odd number)
of inverters in the loop.

Are the RAM and output latches in the path being reported as too long?
If so, I would recommend changing the latches to rising edge registers.
This should cut these loops. The RAM is level sensitive which may be
a problem in an FPGA. I think all of the sequential elements are edge
sensitive these days. I suppose you could make it out of latches; it's
not that many elements.
The BRAM on most FPGAs are synchronous (clocked). That might not
match what you need for some older designs. If it isn't too big,
and you really need asynchronous RAM, you have to make it out
of CLB logic.

The clock runs down the center of the block diagram cutting the data
paths and preventing any internal loops I can see. Of course, there
could be combinatorial loops created by the way it is used. The only
one I see is created if you loop the ALU output Y back to the ALU input
D. This could happen if you try to put these busses on a tri-state bus.
Since tri-states aren't used in an FPGA, you are better off using
multiple separate busses to drive all the various inputs hanging on the
tri-state bus.
As I understand it, the Xilinx tools, at least, know how to convert
tristate logic to MUX logic. I suppose in some cases that might
generate unexpected, or even false, loops.

I believe that the KA-10 was done in asynchronous (non-clocked) logic.
That might make an interesting FPGA project.

-- glen
 
You might be able to get around the async/sync ram issues by using the
other edge of your clock (and if necessary, a dual-port bram could run
off different edges for read & write).

There are also ways to build a DDR register out of two registers and 3
XOR (or XNOR) gates, without gating the clock. Google "flancter
circuit". It is STA-friendly too. That might be another trick you
could use.

If the loops were not stable, it would show up even in RTL sim
(assuming the conditions needed to make it unstable were met). Since
it works with the loops in there (in simulation), I assume it is at
least not always unstable.

Andy
 
On 1/17/2013 3:49 PM, glen herrmannsfeldt wrote:
rickman<gnuarm@gmail.com> wrote:
On 1/15/2013 12:54 AM, Rob Doyle wrote:

I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU.
At this stage, most of the instruction set diagnostics simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings about
combinatorial loops involving the ALU - and an associated "Minimum
period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

My understanding is that if combinatorial loops really existed then the
simulation wouldn't stabilize. I can't really add pipelining or
registers to the design without affecting the microcode - and I don't
want to do that.

Loops with an odd number of inversions won't stabilize, but with an
even number they should be fine.

(snip)

I took a look at the block diagram and I don't see any combinatorial
loops. However, they use latches for the RAM outputs. These are
combinatorial if implemented that way. Typically latches are used
because they can provide a speed advantage: data flows through while
the latch is transparent, whereas D-type registers don't change outputs
until the clock edge.

The tools should be good enough to figure out latches.
Figure out in what context? The tool can't know when the latch is
enabled or disabled. When enabled it is transparent and so is just
combinational logic. I'm not sure what your point is. For timing, a
latch is combinational logic and has to be figured into the timing
paths. In fact, that is usually why latches are used: they improve timing.


As I wrote above, though, be sure that there is no (odd number)
of inverters in the loop.

Are the RAM and output latches in the path being reported as too long?
If so, I would recommend changing the latches to rising edge registers.
This should cut these loops. The RAM is level sensitive which may be
a problem in an FPGA. I think all of the sequential elements are edge
sensitive these days. I suppose you could make it out of latches; it's
not that many elements.

The BRAM on most FPGAs are synchronous (clocked). That might not
match what you need for some older designs. If it isn't too big,
and you really need asynchronous RAM, you have to make it out
of CLB logic.
Yes, not only are the block RAMs synchronous, the LUT RAMs (distributed)
are also synchronous. That is why I say you have to make async RAM out
of latches.


The clock runs down the center of the block diagram cutting the data
paths and preventing any internal loops I can see. Of course, there
could be combinatorial loops created by the way it is used. The only
one I see is created if you loop the ALU output Y back to the ALU input
D. This could happen if you try to put these busses on a tri-state bus.
Since tri-states aren't used in an FPGA, you are better off using
multiple separate busses to drive all the various inputs hanging on the
tri-state bus.

As I understand it, the Xilinx tools, at least, know how to convert
tristate logic to MUX logic. I suppose in some cases that might
generate unexpected, or even false, loops.
That should not create loops unless the loop is already there in the
connected logic. Translating tristate busses creates multiple sets of
multiplexors which are all distinct, preventing loops... unless the rest
of the logic connects input to output.


I believe that the KA-10 was done in asynchronous (non-clocked) logic.
That might make an interesting FPGA project.
Doing async logic in an FPGA is not so easy. You need timing info that
is hard to get.

Rick
 
On 1/15/2013 7:21 PM, rickman wrote:
On 1/15/2013 12:54 AM, Rob Doyle wrote:

I'm creating an FPGA implementation of an old DEC PDP-10 (KS-10,
specifically) mainframe computer. Why? Because I always wanted
one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the
ALU. At this stage, most of the instruction set diagnostics
simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings
about combinatorial loops involving the ALU - and an associated
"Minimum period: 656.595ns (Maximum Frequency: 1.523MHz)"
message...

My understanding is that if combinatorial loops really existed then
the simulation wouldn't stabilize. I can't really add pipelining
or registers to the design without affecting the microcode - and I
don't want to do that.

Most of the information that I've read about "false paths" assumes
two clocked processes, not a combinatorial loop.

Anyway. I'm not sure how to resolve this. I can mark the path as a
false path but I think that it will ignore /all/ the timing (even
the desired timing) through that path.

What should I do?

Do you know why the tool is complaining? Did you write the code
describing the ALU? Off the top of my head, I can't think of why an
ALU would have a combinatorial loop. It should have two input data
busses, a number of control inputs, an output data bus and some
output status signals.
Oops. Sorry, I guess I replied to rickman instead of following up with
the group. I'm resending...

I guess I'm using the terms ALU and am2901 interchangeably. I'll be more
specific.

There is nothing wrong with the am2901 proper. It is what it is.

I don't recall the details of the 2901 bit slice and my data books
are not handy. That's the problem with paper books, you can't just
shove them on your hard drive... Does this part have an internal
register file? Even so, that means the part would have a clock and
not a combinatorial loop.

Maybe this is because of some internal bus that is shared in a way
that looks like a loop even though it would never be used that way?
Exactly.

The problem is that the am2901 output goes to a bus that eventually routes
back to the am2901 input for some unused (as best I can tell)
configuration of the microcode. This all happens with no registers in
the loop.

The am2901 does have an internal dual-ported register file. Register
file writes from the ALU output are clocked. Register file reads to the
ALU input are latched only. The am2901 control inputs and register file
addresses all originate from the microcode which is registered.

The am2901 has a single input bus which is combinatorial through the ALU
to the output bus. Therefore all am2901 ops require at least one register
(or the constant zero) as an ALU source.

I think I know what to do. It looks like ISE supports a
FROM-THRU-THRU-THRU-THRU-TO timing constraint - with an indefinite
number of THRUs. I think I just want to very specifically exclude the
paths that the tool is whining about and leave everything else.
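In UCF syntax that would look roughly like the following, where the THRU points are declared on nets along the offending loop (all group and net names here are made up; the real ones come from the synthesized netlist):

```ucf
# Declare THRU points on the nets the unused feedback path traverses.
NET "alu_y*"    TPTHRU = "thru_alu_y";
NET "dbm_out*"  TPTHRU = "thru_dbm";
NET "dbus_out*" TPTHRU = "thru_dbus";
# Ignore only FROM the ALU registers, THRU those nets, back TO the
# ALU registers; all other paths through the same nets stay timed.
TIMEGRP "alu_ffs" = FFS("alu_reg*");
TIMESPEC "TS_alu_loop" = FROM "alu_ffs" THRU "thru_alu_y"
                         THRU "thru_dbm" THRU "thru_dbus"
                         TO "alu_ffs" TIG;
```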

I may have to find my old AMD data book. That could be an
archeological dig!
I guess that it is just a design from another day - a whole lot less
synchronous than anything I've done in an FPGA before.

I have enjoyed going back through that all. I even found my "Mick and
Brick" book. I'll probably do a VAX 11/780 next which also used
bit-sliced parts.

Bob.
 
On 1/17/2013 1:03 PM, rickman wrote:

BTW, where did you get documentation on the PDP-10 sufficient to
design an equivalent? Just from the instruction manual? Or do you
have more details?
Folks have done a remarkable job at archiving design information,
software, and hardware for these historic machines. Notably Paul Allen
(of Microsoft fame) sponsors a museum that maintains a few of these
machines in working order. See http://www.livingcomputermuseum.org/

The folks at bitsavers.org are frantically scanning documents/books and
imaging magnetic media before it becomes unreadable.

AMD Info is at:

http://bitsavers.org/pdf/amd/

All the KS10/PDP10 information is at:

http://bitsavers.org/pdf/dec/pdp10/KS10/

Microcode source/listings and processor diagnostics are available from:

http://pdp-10.trailing-edge.com/klad_sources/index.html

A webpage that describes my/our project is at:

http://www.techtravels.org/KS10FPGA/

I put a block diagram of the KS10 CPU on the techtravels website.
Referring to that block diagram, the false paths are from the ALU output
through the DBM mux, through the DBUS Mux, and back into the ALU.
There is another false path through the SCAD. (The SCAD is a 12-bit
mini-ALU built from 3x 74S181s that is used for managing floating-point
exponents and loop constructs in the microcode).

Any interest in making this a group project? Or are you keeping all
the fun to yourself?
This is definitely a group project. Right now, I'm doing all the FPGA
work by myself.

If you are interested in participating, contact me off-list.

You used the term 'archeology'. It sure feels like that...

Rob.
 
rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)

The BRAM on most FPGAs are synchronous (clocked). That might not
match what you need for some older designs. If it isn't too big,
and you really need asynchronous RAM, you have to make it out
of CLB logic.

Yes, not only are the block RAMs synchronous, the LUT RAMs (distributed)
are also synchronous. That is why I say you have to make async RAM out
of latches.
They are now? They didn't use to be. I am somewhat behind in the
generations of FPGAs.

(snip, I also wrote)

I believe that the KA-10 was done in asynchronous (non-clocked) logic.
That might make an interesting FPGA project.

Doing async logic in an FPGA is not so easy. You need timing info that
is hard to get.
The whole idea behind asynchronous logic is that you don't need to know
any timing information. Otherwise known as self-timed logic, there is
enough hand-shaking such that every signal changes when it is ready, no
sooner and no later. If you use dual-rail logic:

http://en.wikipedia.org/wiki/Asynchronous_system#Asynchronous_datapaths

then all timing just works.

If you mix synchronous and asynchronous logic, then things get more
interesting.

-- glen
 
Rob Doyle <radioengr@gmail.com> wrote:

(snip)

I guess I'm using the terms ALU and am2901 interchangeably.
I'll be more specific.

There is nothing wrong with the am2901 proper. It is what it is.
I suppose, but it was designed way before the tools we use now.

(snip)

The problem is that the am2901 output goes to a bus that eventually routes
back to the am2901 input for some unused (as best I can tell)
configuration of the microcode. This all happens with no registers in
the loop.
(snip)

I guess that it is just a design from another day - a whole lot less
synchronous than anything I've done in an FPGA before.

I have enjoyed going back through that all. I even found my "Mick and
Brick" book. I'll probably do a VAX 11/780 next which also used
bit-sliced parts.
Years ago, maybe just about when it was new, I bought "Mick and Brick."

Then, about 20 years ago, it got lost in a move. A few weeks ago I
bought a used one from half.com for a low price. (In case I decide
to do some 2901 designs in FPGAs.)

The discussion on combinatorial loops reminds me of the wrap-around
carry on ones' complement adders. If done the obvious way, it is a
combinatorial loop, but hopefully one that, in actual use, resolves
itself.
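One loop-free way to express the end-around carry is to add the carry-out back in with a second adder stage rather than feeding it around; a Verilog sketch for a 36-bit ones' complement add (the width matches the PDP-10; names are illustrative, and negative-zero handling is left aside):

```verilog
module ones_comp_add (
  input  wire [35:0] a, b,
  output wire [35:0] sum
);
  // First pass: ordinary binary add, keeping the carry-out.
  wire [36:0] t = {1'b0, a} + {1'b0, b};
  // Second pass: wrap the carry-out back in. The second add cannot
  // itself carry out (the partial sum can't be all ones when the
  // carry is set), so there is no feedback and no combinatorial loop.
  assign sum = t[35:0] + {35'b0, t[36]};
endmodule
```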

-- glen
 
On 1/17/2013 8:57 PM, glen herrmannsfeldt wrote:
Rob Doyle<radioengr@gmail.com> wrote:

(snip)

I guess I'm using the terms ALU and am2901 interchangeably.
I'll be more specific.

There is nothing wrong with the am2901 proper. It is what it is.

I suppose, but it was designed way before the tools we use now.

(snip)

The problem is that the am2901 output goes to a bus that eventually routes
back to the am2901 input for some unused (as best I can tell)
configuration of the microcode. This all happens with no registers in
the loop.

(snip)

I guess that it is just a design from another day - a whole lot less
synchronous than anything I've done in an FPGA before.

I have enjoyed going back through that all. I even found my "Mick and
Brick" book. I'll probably do a VAX 11/780 next which also used
bit-sliced parts.

Years ago, maybe just about when it was new, I bought "Mick and Brick."

Then, about 20 years ago, it got lost in a move. A few weeks ago I
bought a used one from half.com for a low price. (In case I decide
to do some 2901 designs in FPGAs.)

The discussion on combinatorial loops reminds me of the wrap-around
carry on ones' complement adders. If done the obvious way, it is a
combinatorial loop, but hopefully one that, in actual use, resolves
itself.
Mick and Brick was not just about the 2901; it covered the basic
concepts of designing a processor. One of the things that stuck with me
was the critical path they described, which I believe was in a
conditional branch calculating the next address (I guess it finally got
away from me again). I found that to be true on every processor design
I looked at, including the MISC designs I did in FPGAs on my own. These
guys had some pretty good insight into processor design.

I had my own book too and will have to dig around for it. But I am
pretty sure it is gone as I haven't seen it in other searches I've done
for other books the last ten years or so. I think I got it free from
AMD at one point. Now they are over $100 for one in good condition.
I'm not sure what "adequate" means for a book condition. They say it is
all legible, but I've seen some pretty rough books in "good" condition.

Rick
 
On 1/17/2013 9:09 PM, glen herrmannsfeldt wrote:
rickman<gnuarm@gmail.com> wrote:

(snip, I wrote)

The BRAM on most FPGAs are synchronous (clocked). That might not
match what you need for some older designs. If it isn't too big,
and you really need asynchronous RAM, you have to make it out
of CLB logic.

Yes, not only are the block RAMs synchronous, the LUT RAMs (distributed)
are also synchronous. That is why I say you have to make async RAM out
of latches.

They are now? They didn't use to be. I am somewhat behind in the
generations of FPGAs.
Actually, I think Xilinx made their XC4000 series with clocks for
writing the distributed RAM. They had too much trouble with poor
designs trying to generate a write pulse with good timing and decided
they were better off giving the user a clock. I used the ACEX from
Altera in 2000 or so which had an async read block RAM. It made a
processor easier to design saving a clock cycle on reads. Block RAMs
have always been synchronous on the writes and now they are synchronous
on reads as well... so many generations of FPGAs...


(snip, I also wrote)

I believe that the KA-10 was done in asynchronous (non-clocked) logic.
That might make an interesting FPGA project.

Doing async logic in an FPGA is not so easy. You need timing info that
is hard to get.

The whole idea behind asynchronous logic is that you don't need to know
any timing information. Otherwise known as self-timed logic, there is
enough hand-shaking such that every signal changes when it is ready, no
sooner and no later. If you use dual-rail logic:

http://en.wikipedia.org/wiki/Asynchronous_system#Asynchronous_datapaths

then all timing just works.
You need to read that section again... Nowhere does it say the timing
"just works". It describes two ways to communicate a signal, one is to
send one of two pulses for a 1 or a 0 and the other is to use a
handshake signal which has a delay longer than the data it is clocking.
In both cases you have to use timing to generate the control signal
(or combined data and control in the first case). The advantage is that
the timing issues are "localized" to the unit rather than being global.

The problem with doing this in an FPGA is that the tools are all
designed for fully synchronous systems. This sort of local timing with
an emphasis on relative delays rather than simple maximum delays is
difficult to do using the standard tools.


If you mix synchronous and asynchronous logic, then things get more
interesting.
All real time systems are at some point synchronous. They have
deadlines to meet and often there are recurring events that have to be
synced to a clock such as an interface or an ADC. In the end an async
processor buys you very little other than saving power in the clock
tree. Even this is just a strawman as the real question is the power it
takes to get the job done, not how much power is used to distribute the
clock.

The GA144 is an array of 144 fully async processors. I have looked at
using the GA144 for real world designs twice. In each case the I/O had
to be clocked which is awkwardly supported in the chip. In the one case
the limitations made it very difficult to even analyze timing of a
clocked interface, much less meet timing. In the other case low power
was paramount and the GA144 could not match the power requirements while
I am pretty sure I can do the job with a low power FPGA. Funny
actually, the GA144 has an idle current of just 55 nA per processor or
just 7 uA for the chip. The FPGA I am working with has an idle current
of some 40 uA but including the processing the total should be under 100
uA. In the GA144 I calculated over 100 uA just driving the ADC not
counting any real processing. Actually, most of the power is used in
timing the ADC conversion. Without a high resolution clock the only way
to time the ADC conversion is to put the processor in an idle loop...

There is many a slip 'twixt cup and lip.

Rick
 
rickman <gnuarm@gmail.com> wrote:

(snip)
Mick and Brick was not just about the 2901; it covered the basic
concepts of designing a processor. One of the things that stuck with me
was the critical path they described, which I believe was in a
conditional branch calculating the next address (I guess it finally got
away from me again). I found that to be true on every processor design
I looked at, including the MISC designs I did in FPGAs on my own. These
guys had some pretty good insight into processor design.
Yes, but with 29xx for all the examples. I also have some books
on microprogramming, independent of the processor. Well, maybe
not completely independent.

I had my own book too and will have to dig around for it. But I am
pretty sure it is gone as I haven't seen it in other searches I've done
for other books the last ten years or so. I think I got it free from
AMD at one point. Now they are over $100 for one in good condition.
I'm not sure what "adequate" means for a book condition. They say it is
all legible, but I've seen some pretty rough books in "good" condition.
half.com has $1.49 (plus shipping) for acceptable condition, $5.52 for
good condition, and $8.88 for very good condition.

The one I got has the dust jacket a little worn and torn, and the
spine might be a little weak, but plenty usable.

-- glen
 
rickman <gnuarm@gmail.com> wrote:

(snip)
Yes, not only are the block RAMs synchronous, the LUT RAMs (distributed)
are also synchronous. That is why I say you have to make async RAM out
of latches.

They are now? They didn't use to be. I am somewhat behind in the
generations of FPGAs.

Actually, I think Xilinx made their XC4000 series with clocks for
writing the distributed RAM. They had too much trouble with poor
designs trying to generate a write pulse with good timing and decided
they were better off giving the user a clock. I used the ACEX from
Altera in 2000 or so which had an async read block RAM. It made a
processor easier to design saving a clock cycle on reads. Block RAMs
have always been synchronous on the writes and now they are synchronous
on reads as well... so many generations of FPGAs...
I am not sure about writes now. BRAMs are synchronous read, but LUT
RAM better not be, as the LUTs are the same as used for logic.

Some designs just won't work with a synchronous read RAM (which
is sometimes a ROM).
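As I understand the Xilinx inference templates, this is exactly the split: a clocked write with a combinational read maps to distributed LUT RAM, while a registered read allows block RAM. A sketch with illustrative names and sizes:

```verilog
module dist_ram (
  input  wire        clk, we,
  input  wire [3:0]  addr,
  input  wire [35:0] din,
  output wire [35:0] dout
);
  reg [35:0] mem [0:15];
  // Write is clocked, as it has been since the XC4000 era.
  always @(posedge clk)
    if (we) mem[addr] <= din;
  // Read is combinational (asynchronous), so this infers distributed
  // LUT RAM; registering 'dout' would let the tools use block RAM.
  assign dout = mem[addr];
endmodule
```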

(snip on asynchronous logic)

You need to read that section again... Nowhere does it say the timing
"just works". It describes two ways to communicate a signal, one is to
send one of two pulses for a 1 or a 0 and the other is to use a
handshake signal which has a delay longer than the data it is clocking.
In both cases you have to use timing to generate the control signal
(or combined data and control in the first case). The advantage is that
the timing issues are "localized" to the unit rather than being global.
I meant the one they call dual-rail logic. There are two wires sending
the signal, in one of three states (0, 1, or none), and one coming
back acknowledging the signal. To generate a signal, either the 0
or the 1 wire goes active until the ack comes back, at which time the
output signal is removed until the ack goes away. A full handshake
both ways.

The problem with doing this in an FPGA is that the tools are all
designed for fully synchronous systems. This sort of local timing with
an emphasis on relative delays rather than simple maximum delays is
difficult to do using the standard tools.
Yes. Besides all those useless FF's in each cell.

If you mix synchronous and asynchronous logic, then things get more
interesting.

All real time systems are at some point synchronous. They have
deadlines to meet and often there are recurring events that have to be
synced to a clock such as an interface or an ADC. In the end an async
processor buys you very little other than saving power in the clock
tree. Even this is just a strawman as the real question is the power it
takes to get the job done, not how much power is used to distribute the
clock.
As I understand it, there are some current processors with asynchronous
logic blocks, such as a multiplier. The operands are clocked in and,
an unknown (data dependent) number of cycles later the result comes
out, is latched, and sent on. So, 0*0 might be very fast, where
full width operands might be much slower.

The GA144 is an array of 144 fully async processors. I have looked at
using the GA144 for real world designs twice. In each case the I/O had
to be clocked which is awkwardly supported in the chip. In the one case
the limitations made it very difficult to even analyze timing of a
clocked interface, much less meet timing. In the other case low power
was paramount and the GA144 could not match the power requirements while
I am pretty sure I can do the job with a low power FPGA. Funny
actually, the GA144 has an idle current of just 55 nA per processor or
just 7 uA for the chip. The FPGA I am working with has an idle current
of some 40 uA but including the processing the total should be under 100
uA. In the GA144 I calculated over 100 uA just driving the ADC not
counting any real processing. Actually, most of the power is used in
timing the ADC conversion. Without a high resolution clock the only way
to time the ADC conversion is to put the processor in an idle loop...
Sounds like an interesting design.

-- glen
 
On 1/19/2013 5:35 PM, glen herrmannsfeldt wrote:
rickman<gnuarm@gmail.com> wrote:

(snip)
Yes, not only are the block RAMs synchronous, the LUT RAMs (distributed)
are also synchronous. That is why I say you have to make async RAM out
of latches.

They are now? They didn't used to be. I am somewhat behind in the
generations of FPGAs.

Actually, I think Xilinx made their XC4000 series with clocks for
writing the distributed RAM. They had too much trouble with poor
designs trying to generate a write pulse with good timing and decided
they were better off giving the user a clock. I used the ACEX from
Altera in 2000 or so which had an async read block RAM. It made a
processor easier to design saving a clock cycle on reads. Block RAMs
have always been synchronous on the writes and now they are synchronous
on reads as well... so many generations of FPGAs...

I am not sure about writes now. BRAMs are synchronous read, but LUT
RAM better not be, as the LUTs are the same as used for logic.

Some designs just won't work with a synchronous read RAM (which
is sometimes a ROM).
That's right. My processor design on the ACEX with async reads had to
be modified to work in nearly any other part with fully sync block RAM.
I could possibly clock the RAM on the negative edge with the rest of
the design clocked on the positive. Or I could do a read on every clock
using the address precursor which is available on the prior clock cycle
at the input to the address register. Both methods reduce timing
margins along with other tradeoffs.
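The "address precursor" workaround above can be sketched as a small cycle simulation: the synchronous RAM's address port is fed from the *input* of the address register, so the registered read data arrives in the same cycle the address register updates, making the RAM look asynchronous to the rest of the design. This is a minimal sketch; the names (`addr_reg`, `ram_out`, `clock_edge`) are illustrative, not from any real HDL or tool flow.

```python
# Model one rising clock edge: the address register and the sync-read
# RAM output register both update. Because the RAM is addressed by the
# register's *input* (the precursor), its output lands in step with the
# address register rather than one cycle behind it.
ram = [10, 20, 30, 40]

def clock_edge(state, next_addr):
    """One rising edge: both registers capture their inputs."""
    state["addr_reg"] = next_addr      # ordinary address register
    state["ram_out"] = ram[next_addr]  # RAM clocked from the precursor

state = {"addr_reg": 0, "ram_out": ram[0]}
for nxt in [2, 1, 3]:
    clock_edge(state, nxt)
    # After each edge, ram_out already matches addr_reg - the read
    # appears "async" even though the RAM output is registered.
    assert state["ram_out"] == ram[state["addr_reg"]]
print(state)  # {'addr_reg': 3, 'ram_out': 40}
```

The cost, as noted above, is timing margin: the RAM's setup path now starts at the precursor logic instead of at a register output.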


You need to read that section again... Nowhere does it say the timing
"just works". It describes two ways to communicate a signal, one is to
send one of two pulses for a 1 or a 0 and the other is to use a
handshake signal which has a delay longer than the data it is clocking.
In both cases you have to use timing to generate the control signal
(or combined data and control in the first case). The advantage is that
the timing issues are "localized" to the unit rather than being global.

I meant the one they call dual rail logic. There are two wires sending
the signal, in one of three states, 0, 1, or none, and one coming
back acknowledging the signal. To generate a signal, either the 0
or 1 goes active, until the ack comes back, at which time the output
signal is removed until the ack goes away. Full handshake both ways.
I haven't seen the logic, but how do they generate the timing for the
handshakes? I don't think it "just works". My understanding is that
the handshakes are generated by a delay line that is designed to have a
longer delay than the logic. This is hard to do with the timing tools
designed for synchronous systems.
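The dual-rail exchange described above is the classic four-phase handshake, which can be sketched as a sequence of phases. This is only a behavioral model, assuming the usual encoding where (0,0) is the "spacer" (no data), (1,0) carries a 0, and (0,1) carries a 1; the names (`transfer`, `ch`) are made up for illustration.

```python
def transfer(bit, ch, log):
    """One full four-phase dual-rail handshake for a single bit."""
    # Phase 1: sender drives exactly one rail (data becomes valid)
    ch["d0"], ch["d1"] = (1, 0) if bit == 0 else (0, 1)
    # Phase 2: receiver sees an active rail, latches it, raises ack
    log.append(0 if ch["d0"] else 1)
    ch["ack"] = 1
    # Phase 3: sender sees ack and returns to the spacer (0, 0)
    ch["d0"] = ch["d1"] = 0
    # Phase 4: receiver sees the spacer and drops ack
    ch["ack"] = 0

ch = {"d0": 0, "d1": 0, "ack": 0}
log = []
for b in (1, 0, 1, 1):
    transfer(b, ch, log)
print(log)  # -> [1, 0, 1, 1]
```

In real hardware each phase completes whenever the gates settle, with no clock anywhere; the model above just serializes the phases, so it says nothing about the relative-delay analysis that makes this hard with synchronous FPGA tools.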


The problem with doing this in an FPGA is that the tools are all
designed for fully synchronous systems. This sort of local timing with
an emphasis on relative delays rather than simple maximum delays is
difficult to do using the standard tools.

Yes. Besides all those useless FF's in each cell.
I don't follow. I think typical async logic still has FFs, they just
don't use a global clock. I suppose if you have handshakes back and
forth you are making latches in the combinatorial logic if nothing else.


If you mix synchronous and asynchronous logic, then things get more
interesting.

All real time systems are at some point synchronous. They have
deadlines to meet and often there are recurring events that have to be
synced to a clock such as an interface or an ADC. In the end an async
processor buys you very little other than saving power in the clock
tree. Even this is just a strawman as the real question is the power it
takes to get the job done, not how much power is used to distribute the
clock.

As I understand it, there are some current processors with asynchronous
logic blocks, such as a multiplier. The operands are clocked in and,
an unknown (data dependent) number of cycles later the result comes
out, is latched, and sent on. So, 0*0 might be very fast, where
full width operands might be much slower.
I haven't heard of that. How would that benefit a sync processor? I
can only think that would be useful if the design were compared to one
with a very slow multiplier which required the processor to wait for
many clock cycles. A multiplier with a "ready" flag could shorten the
wait. But that can also be done for a fully sync multiplier. In fact,
in the async GA144 there is no multiply instruction. Instead there is a
multiply step instruction which can be used to do multiplies in a loop.
The loop can be terminated when the multiply has detected the rest of
the bits are all zero (or all ones maybe?). I haven't seen the code
that does this, but this is reported in some of their white papers.
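The multiply-step loop with early termination can be sketched as a shift-and-add multiply that stops once the remaining multiplier bits are all zero, which is what gives the data-dependent step count. This is only an illustration of the technique, not GA144 code (the actual instruction and the all-ones case mentioned above are not modeled); the function name and `width` parameter are assumptions.

```python
def multiply_steps(a, b, width=16):
    """Shift-and-add multiply, one 'multiply step' per loop pass.
    Terminates early once the remaining multiplier bits are all
    zero, so the step count depends on the operand values."""
    product = 0
    steps = 0
    for _ in range(width):
        if b == 0:        # remaining bits all zero: stop early
            break
        if b & 1:         # add the shifted multiplicand for a 1 bit
            product += a
        a <<= 1           # one multiply step: shift both operands
        b >>= 1
        steps += 1
    return product, steps

print(multiply_steps(0, 0))      # (0, 0)      - fastest case
print(multiply_steps(7, 1))      # (7, 1)      - one step
print(multiply_steps(300, 255))  # (76500, 8)  - 8 significant bits
```

Note how 0*0 finishes without taking a single step, matching the data-dependent timing described for the async multiplier above.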


The GA144 is an array of 144 fully async processors. I have looked at
using the GA144 for real world designs twice. In each case the I/O had
to be clocked which is awkwardly supported in the chip. In the one case
the limitations made it very difficult to even analyze timing of a
clocked interface, much less meet timing. In the other case low power
was paramount and the GA144 could not match the power requirements while
I am pretty sure I can do the job with a low power FPGA. Funny
actually, the GA144 has an idle current of just 55 nA per processor or
just 7 uA for the chip. The FPGA I am working with has an idle current
of some 40 uA but including the processing the total should be under 100
uA. In the GA144 I calculated over 100 uA just driving the ADC not
counting any real processing. Actually, most of the power is used in
timing the ADC conversion. Without a high resolution clock the only way
to time the ADC conversion is to put the processor in an idle loop...

Sounds like an interesting design.
I still need to verify that the LVDS input will detect the still very
low level signal from the antenna. Once I show that to work, I've got
the rest covered. If it doesn't work, I'll either need to use a
separate comparator or if that won't work I might be able to provide
some feedback to keep the detector on its sensitive edge.

All the other parts have been analyzed well enough that I am very
confident I'll meet my goal.

BTW, I am thinking of using a cheap analog battery driven clock as an
output device. So I bought one for $4 and took it apart. It has the
tiny circuit board for the clock chip and crystal and a very simple coil
driving a gear that turns 180° each tick. The rest of the clock is the
same as any analog clock except it is *all* plastic. Plastic gears,
plastic pivot, plastic box. I guess once you do the timing with
electronics there is no longer a need for the fancy stuff in the
mechanism. Checking on Aliexpress I found the mechanisms for only $2!
Sometimes technology is amazing in just how cheaply it can be produced.

Rick
 
Rob Doyle wrote:


I guess that it is just a design from another day - a whole lot less
synchronous than anything I've done in an FPGA before.

Yes, some PDP-10s were not rigidly clocked at all, so that when there were
no carries from the ALU after a few ns, the operation was considered
complete and the result stored. Really nasty way to design a machine!

I have enjoyed going back through that all. I even found my "Mick and
Brick" book. I'll probably do a VAX 11/780 next which also used
bit-sliced parts.
No, not true. The 780 was all 74S chips (some LS on non-critical paths)
but nothing LSI at all. The 730 and 750 used TI mask-programmed logic
array parts. I actually read the print set of a 780 about 30 years ago
and at one time knew the design pretty well.

Jon
 
Jon Elson <jmelson@wustl.edu> wrote:

(snip)
I guess that it is just a design from another day - a whole lot less
synchronous than anything I've done in an FPGA before.

Yes, some PDP-10s were not rigidly clocked at all, so that when there were
no carries from the ALU after a few ns, the operation was considered
complete and the result stored. Really nasty way to design a machine!

I have enjoyed going back through that all. I even found my "Mick and
Brick" book. I'll probably do a VAX 11/780 next which also used
bit-sliced parts.

No, not true. The 780 was all 74S chips (some LS on non-critical paths)
but nothing LSI at all. The 730 and 750 used TI mask-programmed logic
array parts. I actually read the print set of a 780 about 30 years ago
and at one time knew the design pretty well.
Story I knew from about 30 years ago was that the 730 was built
from 2900 series parts. That was supposed to be related to
H-float being included (at no extra charge) when it wasn't for
the earlier models. So, H-float on the 730 was faster than the
software emulation on the 750. (But then again, I never tried.)

-- glen
 
