Mill: FPGA version?

Joseph H Allen wrote:
I've been wondering if there could be a successful (not too large,
reasonable clock frequency) soft-core FPGA version of a low end variant of
the Mill. One issue with FPGAs is that only two-port memory is available.
It's not so hard to make extra read ports with these, but extra write ports
are a big problem. This pretty much precludes any sane implementation of
Tomasulo's algorithm for OOE.

If I understand "the belt" properly, the number of write ports is set to
just the number of pipelines (or maybe to two to support the spilling). The
pipes need to be able to read from anywhere, but we can do that with extra
read ports and one extra mux stage. Anyway, it seems like a nice way to get
superscalar on an FPGA.
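
(A minimal C model of the read-port trick described above, with invented
names and sizes: replicate the 2-port block RAM, fan every write out to
all copies, and let each copy serve one read port. This is exactly why
extra reads are cheap on an FPGA while extra writes are not.)

/* Software model of N read ports built from N copies of a 2-port RAM:
   one port of each copy is the shared write, the other is a private
   read.  All copies hold identical data.  Write ports cannot be added
   this way, which is what hurts Tomasulo-style designs. */
#include <stdint.h>

#define NREAD 4                            /* read ports wanted  */
#define DEPTH 64                           /* words per RAM copy */

static uint32_t ram[NREAD][DEPTH];         /* NREAD identical BRAM copies */

void regfile_write(unsigned addr, uint32_t data)
{
    for (unsigned i = 0; i < NREAD; i++)   /* one logical write, fanned out */
        ram[i][addr % DEPTH] = data;
}

uint32_t regfile_read(unsigned port, unsigned addr)
{
    return ram[port % NREAD][addr % DEPTH];  /* each port reads its own copy */
}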

Perhaps Out of the Box Computing already has an FPGA version for bring-up?
Very curious what clock frequency they are getting.

--
/* jhallen@world.std.com AB1GO */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
 
On 11/22/2013 7:20 AM, Joseph H Allen wrote:
> I've been wondering if there could be a successful (not too large,
> reasonable clock frequency) soft-core FPGA version of a low end variant of
> the Mill. One issue with FPGAs is that only two-port memory is available.
> It's not so hard to make extra read ports with these, but extra write ports
> are a big problem. This pretty much precludes any sane implementation of
> Tomasulo's algorithm for OOE.
>
> If I understand "the belt" properly, the number of write ports is set to
> just the number of pipelines (or maybe to two to support the spilling). The
> pipes need to be able to read from anywhere, but we can do that with extra
> read ports and one extra mux stage. Anyway, it seems like a nice way to get
> superscalar on an FPGA.
>
> Perhaps Out of the Box Computing already has an FPGA version for bring-up?
> Very curious what clock frequency they are getting.

We're working on an FPGA, but it's behind patents, fundraising, and
talks, so it won't be soon :-(

As for your technical question, I am a software guy and don't know the
innards of FPGAs, so I have bounced your question to our hardware team
and will post their answer.

Ivan
 
In comp.arch.fpga Joseph H Allen <jhallen@theworld.com> wrote:
> I've been wondering if there could be a successful (not too large,
> reasonable clock frequency) soft-core FPGA version of a low end variant of
> the Mill. One issue with FPGAs is that only two-port memory is available.
> It's not so hard to make extra read ports with these, but extra write
> ports are a big problem. This pretty much precludes any sane
> implementation of Tomasulo's algorithm for OOE.

I have wondered about an FPGA implementation of the 360/91,
which I believe is where Tomasulo's algorithm was developed.
As far as I know, though, not enough details are available.

-- glen
 
In article <l6oir7$am4$1@speranza.aioe.org>,
glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
> In comp.arch.fpga Joseph H Allen <jhallen@theworld.com> wrote:
> > I've been wondering if there could be a successful (not too large,
> > reasonable clock frequency) soft-core FPGA version of a low end variant of
> > the Mill. One issue with FPGAs is that only two-port memory is available.
> > It's not so hard to make extra read ports with these, but extra write
> > ports are a big problem. This pretty much precludes any sane
> > implementation of Tomasulo's algorithm for OOE.
>
> I have wondered about an FPGA implementation of the 360/91,
> which I believe is where Tomasulo's algorithm was developed.
> As far as I know, though, not enough details are available.

Heh, if you are really brave, start Herrmannsfeldt Corporation (like Amdahl
Corporation): make an FPGA implementation of any recent (64-bit) IBM
mainframe. There is probably a market even if it's a slow, simple, in-order,
single-issue design. Use the Hercules emulator as the predictor in your
design verification suite.

--
/* jhallen@world.std.com AB1GO */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
 
On 11/22/2013 7:20 AM, Joseph H Allen wrote:
> I've been wondering if there could be a successful (not too large,
> reasonable clock frequency) soft-core FPGA version of a low end variant of
> the Mill. One issue with FPGAs is that only two-port memory is available.
> It's not so hard to make extra read ports with these, but extra write ports
> are a big problem. This pretty much precludes any sane implementation of
> Tomasulo's algorithm for OOE.
>
> If I understand "the belt" properly, the number of write ports is set to
> just the number of pipelines (or maybe to two to support the spilling). The
> pipes need to be able to read from anywhere, but we can do that with extra
> read ports and one extra mux stage. Anyway, it seems like a nice way to get
> superscalar on an FPGA.

The belt is not implemented as SRAM, nor as a shift register. The belt
is implemented in the same manner as a conventional register bypass,
with bypass muxes.

In the case of a conventional machine, the bypass mux selects are based
upon qualified equality comparators. So if a source of the current operation
has the same register number as the result of a previous operation, then
that previous value is used, as long as it is not old enough to have been
written to the register file.
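
(A rough C sketch of that qualified compare, with invented field names:
scan the in-flight results from newest to oldest, take the first whose
destination register matches the source, and otherwise fall back to the
register file.)

/* Conventional bypass select, modeled in C.  'inflight' holds results
   that have not yet reached the register file, newest first. */
#include <stdbool.h>
#include <stdint.h>

struct result { bool valid; unsigned dest_reg; uint32_t value; };

#define INFLIGHT 3                          /* results still in the pipe */
static struct result inflight[INFLIGHT];
static uint32_t regfile[32];

uint32_t read_source(unsigned src_reg)
{
    for (unsigned i = 0; i < INFLIGHT; i++)            /* newest first */
        if (inflight[i].valid && inflight[i].dest_reg == src_reg)
            return inflight[i].value;                  /* bypassed value */
    return regfile[src_reg & 31];                      /* already written back */
}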

In the case of the belt, the belt result selects are based upon
qualified equality comparators as well, but what is being compared are belt
"temporal location" tags. The register file isn't used roughly 80% of
the time, so why use it and its associated "physical location" tags? A
register file can take 3 or 4 clocks for a result written to it to show
up at a functional unit's source. That turns out to be on the order of the
average lifetime of a result's "temporal location" tag before it gets
"incremented off the end" of the belt.

So in the case of both a conventional bypass and the belt, each
functional unit's source has a multiplexer to select from each possible
slot result output. So each slot source has a quick path thru the result
select mux for latency 1 results, and nearly a full pipe stage path thru
the mux for all the rest.
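
(The same structure in C, but with the comparators matching "temporal
location" tags instead of register numbers. How the real Mill encodes
these tags is not public; the sketch simply counts drops since each
result was produced, and all names here are invented.)

/* Belt-style select: a source names a belt position (0 = newest drop),
   and the mux picks whichever latched result carries the matching tag. */
#include <stdbool.h>
#include <stdint.h>

struct drop { bool valid; unsigned tag; uint32_t value; };

#define SLOTS 8
static struct drop belt_results[SLOTS];   /* FU result latches */
static unsigned now;                      /* advances on every drop */

void drop_result(unsigned slot, uint32_t value)   /* an FU retires a result */
{
    belt_results[slot].valid = true;
    belt_results[slot].tag   = ++now;             /* newest temporal location */
    belt_results[slot].value = value;
}

uint32_t read_belt(unsigned belt_pos)
{
    unsigned want = now - belt_pos;       /* temporal tag the source asks for */
    for (unsigned i = 0; i < SLOTS; i++)
        if (belt_results[i].valid && belt_results[i].tag == want)
            return belt_results[i].value;
    return 0;   /* fell off the belt: would come from the spiller instead */
}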

I hope that answers your question; I wasn't sure if it was regarding
"normal" belt operation or spilling/filling for calls, returns,
interrupts, traps, and the like.

--
Arthur Kahlich
Chief Engineer
Out-of-the-Box Computing
 
In article <l6ocds$h6s$1@dont-email.me>,
Ivan Godard <ivan@ootbcomp.com> wrote:

> So in the case of both a conventional bypass and the belt, each
> functional unit's source has a multiplexer to select from each possible
> slot result output. So each slot source has a quick path thru the result
> select mux for latency 1 results, and nearly a full pipe stage path thru
> the mux for all the rest.

On an FPGA the "all the rest" bypass muxing is too expensive to do in one
stage. The way I know to deal with this is to spread the bypass muxing
across a pipeline which feeds the function unit. This bypass pipeline is
the same length as the datapath pipeline. The FU results are broadcast to
all of these pipeline stages; if a request needs that particular result,
it's muxed in. With one FU, you only need 2:1 muxes. With N FUs, you need
(N+1):1 muxes in each stage (unless there is clustering...).
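
(A little C model of that staged bypass, with STAGES, NFU, and the tag
matching all invented for illustration: each operand rides down a
pipeline the same length as the datapath, and at every stage an (N+1):1
mux either keeps it or captures a matching broadcast result.)

#include <stdbool.h>
#include <stdint.h>

#define STAGES 4
#define NFU    4

struct operand { unsigned tag; uint32_t value; bool resolved; };
struct bcast   { bool valid; unsigned tag; uint32_t value; };

static struct operand pipe[STAGES];       /* operands riding toward the FU */

void bypass_clock(const struct bcast results[NFU])
{
    for (int s = STAGES - 1; s > 0; s--)  /* advance one stage per clock */
        pipe[s] = pipe[s - 1];
    for (int s = 0; s < STAGES; s++)      /* per-stage (N+1):1 mux: keep the  */
        for (int f = 0; f < NFU; f++)     /* operand, or capture a matching   */
            if (!pipe[s].resolved &&      /* broadcast result from some FU    */
                results[f].valid && results[f].tag == pipe[s].tag) {
                pipe[s].value    = results[f].value;
                pipe[s].resolved = true;
            }
}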

Anyway, on an FPGA you have SRAM, so you definitely want to use it for the
register file, but you are stuck with 2 addressed ports per SRAM. So I'm
envisioning that each FU has its own 2-port register file, plus copies so
other FUs can read results (older than in the bypass). In other words, one
write port, many read ports is OK.

That you can make a compiler for the implied temporal-location write
addressing tells me that you should also be able to make one for a more
conventional design with explicit physical write addressing, but with
the restriction that each FU can only write to its own 1/N of the locations
in the register file.

With 4 FUs perhaps the load unit can only write to registers 0, 4, 8, 12,
etc., ALU1 can only write to registers 1, 5, 9, 13, etc., and so on. It
seems this is a reasonable architectural restriction. I'm wondering now if
there were any other machines like this (well, except for the FP/integer
split).
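
(A concrete C model of that banked arrangement, with made-up sizes:
register r lives in bank r % N, each FU owns the single write port of
its own bank, and every FU keeps a read copy of every bank, so writes
fan out but no block RAM ever needs more than one port per use.)

#include <stdint.h>

#define NFU   4
#define NREGS 32
#define BANKDEPTH (NREGS / NFU)

/* copies[reader][bank][index]: every FU (reader) has its own copy of
   every bank, so reads never contend for ports. */
static uint32_t copies[NFU][NFU][BANKDEPTH];

int fu_write(unsigned fu, unsigned reg, uint32_t data)
{
    if (reg % NFU != fu)                      /* the architectural restriction:     */
        return -1;                            /* FU f writes only f, f+N, f+2N, ... */
    for (unsigned rd = 0; rd < NFU; rd++)
        copies[rd][fu][reg / NFU] = data;     /* fan the write out to all read copies */
    return 0;
}

uint32_t fu_read(unsigned fu, unsigned reg)
{
    return copies[fu][reg % NFU][reg / NFU];  /* local copy, one read port */
}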

I also like that load completions get the most recent store results to avoid
aliasing, but I thought other systems did this too (basically bypass the
dcache). Hmm, now that I think about it, it's not fully solving the alias
problem. You are not free to move a load completion earlier than any store
that you can't prove is alias-hazard free.
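
(Generic store-to-load forwarding, sketched in C with an invented
store_buf; whether the Mill does anything like this is not stated in
the thread. A load checks the newest-first store buffer for a matching
address and takes that data in preference to the dcache copy.)

#include <stdbool.h>
#include <stdint.h>

struct store { bool valid; uintptr_t addr; uint32_t data; };

#define SBUF 8
static struct store store_buf[SBUF];      /* pending stores, newest first */

uint32_t load_word(uintptr_t addr, uint32_t dcache_data)
{
    for (unsigned i = 0; i < SBUF; i++)   /* newest matching store wins */
        if (store_buf[i].valid && store_buf[i].addr == addr)
            return store_buf[i].data;     /* forwarded, dcache bypassed */
    return dcache_data;                   /* no pending store to this address */
}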

Thanks!

--
/* jhallen@world.std.com AB1GO */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
 
On 11/23/2013 7:12 AM, Joseph H Allen wrote:
> In article <l6ocds$h6s$1@dont-email.me>,
> Ivan Godard <ivan@ootbcomp.com> wrote:
>
> > So in the case of both a conventional bypass and the belt, each
> > functional unit's source has a multiplexer to select from each possible
> > slot result output. So each slot source has a quick path thru the result
> > select mux for latency 1 results, and nearly a full pipe stage path thru
> > the mux for all the rest.
>
> On an FPGA the "all the rest" bypass muxing is too expensive to do in one
> stage. The way I know to deal with this is to spread the bypass muxing
> across a pipeline which feeds the function unit. This bypass pipeline is
> the same length as the datapath pipeline. The FU results are broadcast to
> all of these pipeline stages; if a request needs that particular result,
> it's muxed in. With one FU, you only need 2:1 muxes. With N FUs, you need
> (N+1):1 muxes in each stage (unless there is clustering...).
>
> Anyway, on an FPGA you have SRAM, so you definitely want to use it for the
> register file, but you are stuck with 2 addressed ports per SRAM. So I'm
> envisioning that each FU has its own 2-port register file, plus copies so
> other FUs can read results (older than in the bypass). In other words, one
> write port, many read ports is OK.
>
> That you can make a compiler for the implied temporal-location write
> addressing tells me that you should also be able to make one for a more
> conventional design with explicit physical write addressing, but with
> the restriction that each FU can only write to its own 1/N of the locations
> in the register file.
>
> With 4 FUs perhaps the load unit can only write to registers 0, 4, 8, 12,
> etc., ALU1 can only write to registers 1, 5, 9, 13, etc., and so on. It
> seems this is a reasonable architectural restriction. I'm wondering now if
> there were any other machines like this (well, except for the FP/integer
> split).

I think that I can reply to this without putting my foot in it (Art is
even more swamped than I am).

Your suggestions would be appropriate were we trying to actually make a
usable CPU in an FPGA, but that's not our plan; the Mill FPGA is purely
a (faster-than-software) simulator for a regular hardware Mill. Its
purpose is to test out correctness and to speed up software development,
and we don't care how fast it runs. Consequently, the FPGA will work
*exactly* the same way that the corresponding chip Mill will, using the
same compilers. If that means that each Mill clock requires five, or 50,
FPGA stages and the bypass requires multiple SRAMs and muxes, then so
be it. It's still going to be 10,000 times as fast as the software sim.


> I also like that load completions get the most recent store results to avoid
> aliasing, but I thought other systems did this too (basically bypass the
> dcache). Hmm, now that I think about it, it's not fully solving the alias
> problem. You are not free to move a load completion earlier than any store
> that you can't prove is alias-hazard free.

Load retires (completions) retain the same ordering vis-à-vis stores as
program order. Only load issue is moved on a Mill. Of course, when the
compiler can prove the absence of aliasing then it can move the load
retire too, but that's the compiler; the hardware does not do moves, and
views a compiler-revised order as "program order". The Mill is not
responsible for compiler bugs :)
 
