Building the 'uber processor'

mikegw
Hello all,

Firstly, I would like to say that beyond knowing what an FPGA is at the
most basic level, my knowledge of the subject is nil. I am looking at this
from the standpoint of an application that needs a solution. I have seen
add-on boards for PCs here and there that act as co-processors, and this
is the interesting bit to me. Our research group is looking into building
a computer (a cluster, perhaps) for particle dynamics calculations,
similar to CFD in application. Our programs are in C/C++ running on Linux
(any flavour will do).

My questions are:

a) Will FPGA co-processor board(s) offer a speed improvement in running
our simulation jobs over using a 'traditional' cluster (MOSIX/Beowulf)?
Bearing in mind that ours will be the only job on the machine, can we
reconfigure our FPGA boards to speed up the calculation?

b) Can anyone recommend a good book that I can read, so that I can
hopefully ask more informed questions?

Cheers

Mike
 
I don't know of any good books, but FPGAs can run rings around code,
especially if you can define exactly what you want them to do; that's the
tricky part. As far as parallel processing is concerned, they will blow
your mind, or just sit there flashing a light. Xilinx is working on a Java
compiler for FPGAs. I think it's a student partnership thing so I'm not
sure how good it is, but it converts Java into hardware.

And FPGAs will eat any cluster, just see above. But if you can't define
the problem in a way the FPGA can handle, then it would be no faster.
FPGAs are literally ORs, ANDs and flip-flops (latches), and that's what
you need to start with. They also have adders and even processors, small
memories and the like. If you need large memory they can do that too; it's
hardware. Want SDRAM? Just connect it up and design a controller to access
it (just don't forget to refresh it too :).

There are already a number of super-cluster FPGA projects around. One of
the fusion reactor projects uses several hundred of them. I read an
article once, but I don't remember the web site, sorry.


Simon


"mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote in message
news:bo4na0$5qk$1@tomahawk.unsw.edu.au...
Hello all,

Firstly I would like to say that other than knowing what a FPGA is on a
most
basic level my knowledge about the subject is nil. I am looking at this
from an application that needs a solution. I have seen about the place
add
on boards for PC's that act as co-processors. This is the interesting bit
to me. Our research group is looking into building a computer (cluster
perhaps) for calculation of particle dynamics, similar to CFD in
application. Our programs are in C/C++ running on Linux ( any flavour
will
do).

My questions are

a) Will a FPGA co-processor board(s) offer a speed improvement in running
our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
Bearing in mind that ours will be the only job on the machine so can we
reconfigure our FPGA boards to speed calculation?

b) Can anyone recommend a good book that I can read and hopefully be able
to
ask more informed questions?

Cheers

Mike
 
"Simon Peacock" <nowhere@to.be.found> wrote in message
news:3fa621eb@news.actrix.gen.nz...
I don't know of any good books.. but.. FPGA's can run rings around code...
especially if you can define what you want them to do. that's the tricky
part... and as far as parallel processing is concerned.. they will blow
your
mind.. or sit there flashing a light...Xilinx are working on a JAVA
compiler
for FPGA's. I think its a student partnership thing so am not sure how
good
it is but it converts java into hardware.

And FPGA's will eat any cluster.. just see above.. But if you can't define
the problem in a way the FPGA can handle then it would be no faster.
FPGA's
are literally OR's AND's and flip flops (latches) and that's what you need
to start with.. they also have adders and even processors.. small memories
and stuff like that.. if you need large memory they can do that too. its
hardware.. want SDRAM ? just connect it up and write a program to access
it.
(just don't forget to refresh it too :)

There are already a number of super cluster FPGA projects around. One of
the fusion reactor projects uses several hundred of them .. I read an
article once.. don't remember the web site sorry.


Simon
Thanks

Just so I understand you: if I want to "realise" my C code in an FPGA
array, I can upload the code, the data and the processing array, run it,
and download the data?

The code (not actually mine; I am just seeing if this is all possible)
basically applies an equation to a data set, looping over all particles
for each time step. The tricky bit (in the programming sense at least) is
to constantly calculate the relative positions of the particles in order
to calculate their effect on each other.

I would really like it if there were a book that could take someone who
has a C/C++ program and hold their hand through a whole "realisation" of
that code.

Cheers

Mike
 
"mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote in message news:<bo4na0$5qk$1@tomahawk.unsw.edu.au>...
Hello all,

Firstly I would like to say that other than knowing what a FPGA is on a most
basic level my knowledge about the subject is nil. I am looking at this
from an application that needs a solution. I have seen about the place add
on boards for PC's that act as co-processors. This is the interesting bit
to me. Our research group is looking into building a computer (cluster
perhaps) for calculation of particle dynamics, similar to CFD in
application. Our programs are in C/C++ running on Linux ( any flavour will
do).

My questions are

a) Will a FPGA co-processor board(s) offer a speed improvement in running
our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
Bearing in mind that ours will be the only job on the machine so can we
reconfigure our FPGA boards to speed calculation?

b) Can anyone recommend a good book that I can read and hopefully be able to
ask more informed questions?

Cheers

Mike
Hi Mike

Think of a co-processor as a black box with input/output channels that
sits in your PC. The computing elements may be a fraction of the speed
of a 3 GHz P4 at some things, or maybe many orders of magnitude faster. I
am guessing that your app needs FP calculations; maybe IEEE, maybe any
ad hoc FP will do. IEEE is still costly to do in an FPGA, but see a
previous post for some pointers. An ad hoc FP format may be all that's
needed, but you would have to implement a matching version in SW for an
unaccelerated node to get the same results.

Where FPGA boards really shine is when you can arrange for them to sit
in series with streaming data; that may be orders of magnitude faster
than a PC could normally handle. If your data is on a hard disk and has
to come through the PCI bus, then you are IO bound. That may be OK if you
can perform N million computations per word transferred (say, crypto),
but if you only need to do minimal computation per point, an FPGA can be
the wrong solution.

Figure out how much parallelism you can extract. A P4 may run at 3 GHz;
an FPGA board may run at 50 MHz to 200 MHz, and if you perform integer
multiply-adds that may limit you to 100 MHz. So you need to be doing at
least 30x more in parallel just to match one P4. If you can do an order
of magnitude more in parallel than that, then you could be doing fine,
as long as you aren't IO bound. Consider a faster PCI bus, which will get
you a few times more throughput. Consider whether you can dump all the
data once into onboard RAM on the PCI board, i.e. get the PC out of the
equation except for basic system support.

Take a look at the TimeLogic DeCypher board as an example of
bioinformatics being accelerated at rates similar to what your app needs,
though AFAIK it's mostly pattern matching and integer computation.

Can't say I've heard of any books on this matter, as it's still an
immature field!

Good luck

johnjaksonATusaDOTcom
 
Hi Mike,

mikegw wrote:
a) Will a FPGA co-processor board(s) offer a speed improvement in running
our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
Bearing in mind that ours will be the only job on the machine so can we
reconfigure our FPGA boards to speed calculation?
To parallel what Jon said earlier - the biggest gotcha that seems to
bite people is IO bandwidth. It's not necessarily hard to develop
highly pipelined FPGA designs that will crunch your numbers at 100M
sample/sec, but can you keep it busy?

I read of an interesting approach a while ago - do a search for
Pilchard, it's an FPGA coprocessor board developed at a Hong Kong
university. Basically it fits in the standard PC memory module form
factor, with custom Linux drivers to access it. The bandwidth on the
memory bus is much greater than on PCI.

Regards,

John
 
"mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote in message news:<bo4na0$5qk$1@tomahawk.unsw.edu.au>...
Hello all,

Firstly I would like to say that other than knowing what a FPGA is on a most
basic level my knowledge about the subject is nil. I am looking at this
from an application that needs a solution. I have seen about the place add
on boards for PC's that act as co-processors. This is the interesting bit
to me. Our research group is looking into building a computer (cluster
perhaps) for calculation of particle dynamics, similar to CFD in
application. Our programs are in C/C++ running on Linux ( any flavour will
do).
In München, Germany, there is a research group that uses Xilinx parts a
lot; they do some 'particle' search. I think the FPGAs are mostly used to
filter the data coming from the experiment. As you are also in a heavy
research area, it may be a good idea to contact them. I have no
addresses, but there are not so many nuclear labs, so the one I mentioned
should be easy for you to find.

antti
 
"John Williams" <jwilliams@itee.uq.edu.au> wrote in message
news:bo6l2c$u39$1@bunyip.cc.uq.edu.au...
Hi Mike,

mikegw wrote:
[snipped]

To parallel what Jon said earlier - the biggest gotcha that seems to
bite people is IO bandwidth. It's not necessarily hard to develop
highly pipelined FPGA designs that will crunch your numbers at 100M
sample/sec, but can you keep it busy?

As we will be stepping through time, the data (particle information:
position, etc.) will be the output of the previous step. The only bit
that might be messy is calculating the relative distances between
particles.

I think these devices might be the way to go. To me it seems odd that we
seem to be taking a step back to the old analogue-computer days, when you
'built' your program.



I read of an interesting approach a while ago - do a search for
Pilchard, it's an FPGA coprocessor board developed at a Hong Kong
university. Basically it fits in the standard PC memory module form
factor, with custom Linux drivers to access it. The bandwidth on the
memory bus is much greater than on PCI.
I took a look; it seems fairly interesting. Given my particular data set,
I might be on the wrong track in thinking of an accelerator card. Maybe
a stand-alone device to which the input is uploaded and which is then
sent forth to do the work.

So much to learn...

Mike
 
"mikegw" <mikegw20@hotmail.spammers.must.die.com> writes:
As we will be stepping time, the data (particle information position etc...)
will be the output of the previous 'step'. The only bit that might be messy
is to calculate the relative distances between particles.

I think that these devices might be the way to go. To me it seems odd that
we seem to be taking a step back to the old analogue computer days when you
'built' your program.
This is getting away from hardware, and you haven't said how much
expertise you have there to bring to bear on the problem, but I remember
a series of books published by MIT Press in the '90s. Each was the
summary of a different PhD thesis. One of them described breakthroughs in
the simulation of many-body problems that led to orders-of-magnitude
increases in simulation speed. I don't know whether those results would
apply in your case or not.

It seems to be a general rule that hardware can speed up a problem
k-fold, where k is usually a modestly small number. But finding a better
algorithm can speed up a problem n-fold, where n is the number of items
you have to deal with. With both, you might get k*n.

So much to learn.........
Someone once said to me, "It takes six or eight years to really learn
something well, and you don't have very many six-or-eights, so don't you
go waste one." Only now do I realize I should have understood what he
meant then.
 
On Mon, 3 Nov 2003 16:05:52 +1100, "mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote:
Hi Mike,

[original post snipped]
So that we may better help you, please answer the following questions:

Is the arithmetic Floating Point (FP) or Integer?

If mixed, what is the ratio of the two?
(i.e. 10000 integer ops to every floating point op)
(If the ratio is greater than 100000:1, could you do the integer
stuff in the FPGAs, and the FP in a host X86 processor?)

If floating point:
Does it need to be IEEE FP (i.e. identical to a software execution
on the same data set)?
OR
floating point with N bits of mantissa, M bits of exponent, X guard bits,
etc.?

What is the ratio of Mult, Div, Add, Sub, Sqrt, Sin, Cos, Exp, Log, ...?
(Are integer approximations usable?)

For integer operations, how many bits of precision are needed?
Is this precision required all the way through the algorithm, or can the
precision be adjusted at each step?

How many arithmetic/logic ops per data item?

What is the data set size needed before calculations can start
(i.e. 20 3D points, 10 scan lines, a 512 by 512 2D set, ...)

Can the calculations be partitioned into multiple identical sets that
perform the same operation on different parts of the total data set?

If partitioning is possible, how much communication (number of data
items) needs to be passed between the separate calculation clusters? How
often does this need to happen (what is the inter-processor bandwidth)?

How much local data is created while calculations take place?
(What bandwidth is needed to support it?)

How much table/look-up data is required by the algorithm?
(What bandwidth is needed to support it?)

Can the data be thought of as a continuous stream in and out, or is it
one big chunk that must all arrive, then be calculated on until done,
then spat out as a result (what is the size of the input chunk and the
output)? Is there a constant flow of chunks (size, arrival rate, expected
FP/int ops per chunk)?

Since you want an Über processor, do you have an Über hardware designer?
(It takes considerable effort to create one of these, especially if what
you start with is an Über software designer. It is an order of magnitude
easier to get a HW engineer to write passable SW than it is to get a SW
programmer to design passable HW.)

Are you aware that SW is basically written for sequential execution, or
for extremely chunky parallelism (threads)? Hardware designs (for Über
processors) typically require ultra-parallelism (100s to 1000s of
operations running in parallel), which means that your algorithms will
have to be totally re-arranged to match such application-specific
hardware. Although this is daunting, there are hundreds of real-life
systems that have done it (i.e. the answer to your basic question of
whether it makes sense to consider FPGAs for an application-specific
co-processor is YES). Implementing these successful systems was never
achieved by just taking the SW (C/C++, for example) and re-crafting it as
hardware. You will need to go back to the basics of the algorithm's
intent, then design for the extreme parallelism that FPGAs offer. This is
not always possible, as discussed by others who have answered your
original question.

Are you thinking of a single co-processor board in a PC, or something
more like a Beowulf cluster with each node having its own accelerator
board?

There are many more such questions, but this would be a good start.

My questions are

a) Will a FPGA co-processor board(s) offer a speed improvement in running
our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
Bearing in mind that ours will be the only job on the machine so can we
reconfigure our FPGA boards to speed calculation?
Can't answer this without far more information from you. See above :)

Note that your
"so can we reconfigure our FPGA boards to speed calculation?"
is no trivial thing. The design of the hardware may take many months
even if you have an Über hardware designer.

b) Can anyone recommend a good book that I can read and hopefully be able to
ask more informed questions?
There is an annual conference held in Napa, California, where all the
people who do this type of thing meet: the IEEE FCCM conference. You
would be well served by looking at the titles of the proceedings for the
last 7 years at http://www.fccm.org/ . You can probably get copies of the
proceedings from the IEEE for way too much money.

Happiness to you too.

Philip





Philip Freidin
Fliptronics
 
mikegw wrote:

Just so I understand you, if I want to "realise" my c code in a FPGA
array, I can upload the code, data and the processing array. Run it and
download the data?

The code (not actually mine I am just seeing if this is all possible) is
basically applying an equation on a data set looping for all particles for
each time step. The tricky bit (in at least the programming sense) is to
constantly calculate the relative positions of each particle to calculate
their effect on each other.
Mike,

Surely, you might put something like a processor into an FPGA, to which
you can download your code and data. But you will very likely not gain
much from this, as you are still stuck with your "program code
execution" paradigm.

Depending on the application, you might get a little gain by placing a
very special processor into the FPGA that is optimised for your
application. DSPs are a good example here: they have special features
that make them very fast for some algorithms. This would also require a
special compiler that compiles the code (that you want to reuse)
optimised for your special processor. But many things you would probably
need to code in assembly language anyway, because there is no direct
translation from a high-level language to a special machine feature.
As far as I know, this is the same for DSPs.

However, a real speed-up you will only achieve by throwing the processor
concept overboard and thinking purely in terms of distributed state
machines. This is a completely different thing from implementing an
algorithm in some language. First of all, you have to be an experienced
digital designer to do that. (Btw, you have to be the same when designing
a special CPU, of course.)

Regards,
Mario
 
Thanks

Just so I understand you, if I want to "realise" my c code in a FPGA
array, I can upload the code, data and the processing array. Run it and
download the data?
Yes. But you are likely to spend a lot of effort designing the
processing array.

The code (not actually mine I am just seeing if this is all possible) is
basically applying an equation on a data set looping for all particles for
each time step. The tricky bit (in at least the programming sense) is to
constantly calculate the relative positions of each particle to calculate
their effect on each other.
I guess that if you post the equation (maybe a simplified version), the
precision you need, and the number of elements in a typical data set,
you will get a pretty good estimate from this group of how well this can
be solved in FPGAs.

Kolja Sulimma
 
"mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote in message
news:bo5bfr$ad4$1@tomahawk.unsw.edu.au...
Just so I understand you,
if I want to "realise" my c code in a FPGA array,
I can upload the code, data and the processing array.
Run it and download the data?
No.

Short answer:

C/Pascal/etc compile to machine code instructions to run on a
general-purpose processor,
only one executes at a time.

VHDL/Verilog compile to a description of many specific-purpose hardware
processes,
all executing at once.


Longer answer:

Microprocessors execute a single conceptual process at a time.

In the real world there are many processes running concurrently.

Conventional micros and software require blocks of sequential instructions.

Occam was a language to describe processing in terms of communicating
sequential processes.
These could then be farmed out over multiple processors and done in
parallel.
The transputer was designed in tandem with occam, optimised for this
programming model and communication between processors.

My old tutor said that hardware engineers grasped these concepts much
faster, because they are already comfortable with thinking in terms of many
things happening at once in hardware. Software engineers had to unlearn
their usual sequential thinking.

In the past, the general-purpose microprocessor was a great alternative
to single-purpose machines. The latter could be much faster, but took
ages to design, build and modify.

FPGA chips change that balance of power.

Like occam, VHDL and Verilog allow you to describe processing in terms of
communicating sequential processes
(occam has been used as a hardware description language).

However, instead of creating machine-code instructions to perform a process,
they create descriptions of hardware to do all these processes. The 'fitter'
then fits the design into particular makes of FPGA.

I can see that conventional programmers would love to be able to just chuck
their old C programs into an FPGA and have it run faster, but I feel this is
not sensible (although Handel C seems to be trying it). No pain, no gain.

I didn't find VHDL all that hard to pick up. In fact it is quite liberating
to throw off the shackles of conventional software design. Instead of
getting a single micro to rapidly poll, process and toggle dozens of
real-time inputs and outputs, I can now simply declare dozens of independent
hardware processors.

Benefits depend on the problem you want to solve. You can beat
microprocessors easily at some tasks but not others. Ideal tasks are simple
and easily scaled up, like a systolic processor for finding matches in DNA
sequences, or sifting keys for the enigma machine. The wartime machine
weighed tons, used kilowatts, and clocked at 5 kHz. It would beat many
modern chips, which shows the advantage of customised hardware. You might be
able to make an equivalent weighing grammes, using milliwatts, and clocked
at 50 MHz! I wonder if the government kept the 95% of Enigma messages that
they didn't have time to crack? I'm sure military historians would be
interested in the contents...
 
To add more support to IO bandwidth being one of the major issues: one
thing I often see overlooked when people start clustering machines with
regular networking is the overhead of just running the network
connections. There was an interesting article in EE Times sometime last
year (I don't recall which issue) showing how much of a GHz Pentium it
took just to run a 1 Gb Ethernet connection. If I recall, it was on the
order of 50% of the processor, assuming of course you were keeping the
Ethernet busy. Of course, just plugging in PCI boards has the same issue
if all the data has to move over the PCI bus, as the bus itself becomes
the bottleneck.

If you are serious about building a monster machine out of multiple
processors, don't overlook the data movement aspect.

Now, it just so happens that the architectures we use on our boards have IO
capabilities that scale with system size, and that isn't a coincidence, as
our customers build large multiprocessor systems out of them. The
underlying support for this is inherent in the SHARC and TigerSHARC
processors from Analog Devices, which have a built in IO Processor for
moving data into and out of the DSP's large internal memory so the core can
number crunch while data movement happens in the background. These DSPs
also have multiple high speed point to point interconnects called link ports
(the TigerSHARC 101S has four 250 MByte/sec links as well as its 64 bit 100
MHz external bus) which can be used for shipping data around. We also use
large FPGAs and connect them to the DSPs using these links for the data
flows.

While some will argue that the best approach is a bunch of GHz PCs, and
others will say use traditional DSP, and yet more will say FPGA, there is no
one magic approach that applies to all systems. Usually some combination of
these processing types will get the job done, it's a matter of deciding
which parts of your system are better served by which. And this of course is
dependent on the type of number crunching you need and the associated data
movement requirements.

-----
Ron Huizen
BittWare

"John Williams" <jwilliams@itee.uq.edu.au> wrote in message
news:bo6l2c$u39$1@bunyip.cc.uq.edu.au...
[quoted post snipped]
 
Another area to research could be electric circuit simulation (i.e.
SPICE). There are similarities: each circuit node can influence every
other (at least potentially). SPICE basically revolves around inverting a
mega-matrix. There's been quite a lot of work put into building hardware
accelerators for SPICE; you may be able to leverage off that.

antti@case2000.com (Antti Lukats) wrote:

: [quoted post snipped]
 
Philip Freidin <philip@fliptronics.com> wrote in message news:<2hseqvsmbtbs1o23kedfpbjsucv884m80f@4ax.com>...
On Mon, 3 Nov 2003 16:05:52 +1100, "mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote:
[original post snipped]

So that we may better help you, please answer the following questions:

Is the arithmetic Floating Point (FP) or Integer?

If mixed, what is the ratio of the two?
(i.e. 10000 integer ops to every floating point op)
(If the ratio is greater than 100000:1, could you do the integer
stuff in the FPGAs, and the FP in a host X86 processor?)

rest snipped

A lot of really good points covered by Philip and in the above posts.
It's clear that many FPGA pros would like to get their grubby hands on
such a project, as long as it pays of course. I can only wish there were
more of these projects, but it's a bit off the normal path.

I am going to suggest instrumenting the code to find out the answers,
unless they are obvious from inspecting the code. The FP question is
paramount, though. A lot of DSP written in C uses lazy math, because the
FP is basically free. If you were to largely eliminate FP, the way we
used to more than 20 years ago, you can often get just as good a result
and a much better understanding of what's really important: where
precision and dynamic range are really needed, and where they are wasted.

In the DSP world, 16-18 bits has proven to be adequate for most tasks, as
long as special care is taken to keep signals in range. Many algorithms
can block up the range so that only one exponent is needed for a whole
group of points, and this exponent can be as little as a common divide by
1, 2, 4, etc. in the FFT case. I suspect it won't be as easy for your
problem.

As an aside, most CPUs don't even perform integer math very well for DSP
tasks. For instance, when rounding, many uninformed programmers will use
>> to divide by 2^N, not realizing that this introduces round-off errors
that bias a signed signal toward negative. Doing it correctly requires
more integer ops to check MSBs and LSBs etc., but it reduces the need for
extra width and can bring the results much closer to an all-FP result
converted back to a same-size int.

One approach I used to turn a wavelet algorithm into RTL HW was to
rewrite the original C code, together with the Matlab guy, so that the
main engine module called small C functions to evaluate each and every
operator: *s, +s, /s, and even movs etc. These could also tally the use
counts and do more precise math on words of odd bit widths. My busses
were 17, 18 and 37 bits wide. By tweaking the widths of the various
operators we could reduce the cost of the HW function blocks to an
acceptable value in an ASIC. When Dr Matlab was happy with the C code,
all the functions were easily turned into equivalent Verilog (or VHDL)
modules; params became ports. The operator counts are needed to design
suitable datapaths with fixed/variable arithmetic units. The C program
that had been calling these functions in a C dataflow was then used to
construct an FSM to keep data moving between the various arithmetic
modules/functions and multiple memories. It gets harder because the HW
is 10 stages of overlapping pipelines, very difficult to express in a C
HDL. Even so, it took 6 months just to get from Fortran-C to RTL
Verilog. Of course the C and Verilog results were identical, and very
close to the FP Matlab model.


Since SHARCs, Transputers and Occam were mentioned, I would note that the
ADI links on the SHARCs (and TI chips, IIRC) are a variation of the
Transputer links that supported Occam channels across multiple CPUs. If
only a modern Transputer existed that was comparable to today's embedded
CPUs/DSPs. KRoC, anyone? Then it would be perfectly reasonable to build
the project in Occam or C-CSP and spread it across as many tightly
coupled CPUs as needed. The resulting code will be HW-like, but with a
lot less pain, and could also lead to HW synthesis (Handel-C).

I happen to be working on such an Uber Transputer plus compiler, but
it's still some way off. The native programming language for this is
V++, or Verilog + C + Occam + asm all rolled into one (horrid)
language: a mini version of SystemVerilog, but tuned to run natively
on the event/process scheduler already in the cpu.

The Verilog part allows HW to be described directly, whether
behavioural, dataflow or RTL, and most of that remains synthesizable
if written so. Processes can be engineered back & forth between
synthed HW (coprocessors) & SW if the cpu is in an FPGA. Some might
call this a HW accelerator on the cheap, or a simulation engine, but
that's like calling conventional cpus Turing accelerators.

The Occam part is just the !? alt,par,seq,chan model in C syntax. The
underlying scheduler is not so different from the HW event timing
wheel.

The C part adds data types and allows conventional seq programming and
importing of code. Asm touches the HW directly. The compiler is based
on the lcc design.

I am curious to know what folks think of combining HDL,CSP,C rather
than keeping HW & SW miles apart as in conventional engineering.

regards
John

johnjaksonATusaDOTcom
(ignore yahoo)
 
"mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote:
"John Williams" <jwilliams@itee.uq.edu.au> wrote in message
news:bo6l2c$u39$1@bunyip.cc.uq.edu.au...
mikegw wrote:
a) Will a FPGA co-processor board(s) offer a speed improvement in
running
our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
Bearing in mind that ours will be the only job on the machine so can we
reconfigure our FPGA boards to speed calculation?

To parallel what Jon said earlier - the biggest gotcha that seems to
bite people is IO bandwidth. It's not necessarily hard to develop
highly pipelined FPGA designs that will crunch your numbers at 100M
sample/sec, but can you keep it busy?

As we will be stepping time, the data (particle information position etc...)
will be the output of the previous 'step'. The only bit that might be messy
is to calculate the relative distances between particles.

I think that these devices might be the way to go. To me it seems odd
that we are taking a step back to the old analogue computer days, when
you 'built' your program.
When using FPGAs instead of CPUs without major changes in your
algorithms, you could simply build a CPU with an improved datapath.
You may improve the number of operations per cycle, but if you manage
to get 6 operations done in the number of cycles a CPU needs for one,
you end up with no gain, as an actual CPU will run ~6 times as many
cycles as your FPGA [1].

When inventing an algorithm that exploits the strengths of an FPGA,
you could end up with orders of magnitude of speedup. For example,
when solving SAT, you may create an FPGA containing exactly the
formula plus a counter, which will test one set of variables per
cycle, instead of merely speeding up your integer operations. This
approach is of course limited by the size of your FPGA.

So it doesn't seem 'odd' to me that you have to leave "normal"
sequential algorithms and think about completely new ideas.

bye Thomas

[1] With IO limitations like PCI this is _very_ optimistic for an FPGA.
 
"john jakson" <johnjakson@yahoo.com> wrote in message
news:adb3971c.0311042304.43735b02@posting.google.com...

[snip much]

I am curious to know what folks think of combining HDL,CSP,C rather
than keeping HW & SW miles apart as in conventional engineering.
See Celoxica's products.

And observe how they've backed away from a reasonably pure CSP
approach like yours, and put more emphasis on the pure-C thing.

Software people have an irrational and passionate distaste for
fine-grained parallelism. I don't think you have much chance
of changing their collective mind.

That observation should not distract you from an interesting
and (one hopes) ultimately fruitful project.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * Perl * Tcl/Tk * Verification * Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, Hampshire, BH24 1AW, UK
Tel: +44 (0)1425 471223 mail: jonathan.bromley@doulos.com
Fax: +44 (0)1425 471573 Web: http://www.doulos.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
"Jonathan Bromley" <jonathan.bromley@doulos.com> wrote in message news:<bobfb4$2gq$1$8300dec7@news.demon.co.uk>...
"john jakson" <johnjakson@yahoo.com> wrote in message
news:adb3971c.0311042304.43735b02@posting.google.com...

[snip much]

I am curious to know what folks think of combining HDL,CSP,C rather
than keeping HW & SW miles apart as in conventional engineering.

See Celoxica's products.

And observe how they've backed away from a reasonably pure CSP
approach like yours, and put more emphasis on the pure-C thing.

Software people have an irrational and passionate distaste for
fine-grained parallelism. I don't think you have much chance
of changing their collective mind.
Just as HW people generally view plain C as a likely HDL with great
discomfort, as can be seen from the number of deceased C-HDL
companies. But Celoxica is aiming at a different crowd, not hardcore
HW ASIC guys. I am familiar with HandelC to some degree; I see them a
couple of times a year at different shows. Every time I see them I get
better insight, but the question remains: why use plain C with CSP
semantics (Occam is underneath it, right?) when HDLs are far better at
describing HW? Every conversation with Celoxica people tells me that
you still have to describe the parallelism directly, which is why it
can be synthesized. It would surprise me if it's possible to use
HandelC in a meaningful way without using any of the inherited Occam
keywords.

Then again, HDLs are not too good for describing purely seq processes
or SW, so bridging the two worlds is difficult with either an HDL or a
general seq SW language. So I am addressing the audience that is
comfortable on both sides and wants to move processes between HW & SW.
This is not quite the same thing as SystemVerilog, as that is clearly
aimed only at big-$ ASIC engineers. VHDL perhaps already is a bridge
language, but I have never been partial to it. If Celoxica included a
Transputer IP core with their tool, I imagine HandelC could also be a
bridge, since code could either run as synthed HW or run as plain
Occam-style code.

Regards
John

johnjaksonATusaDOTcom
 
Hi,

I have been following this thread with great interest.

If you need a processor with links to/from the processor register file
then MicroBlaze could be the answer.

MicroBlaze has 18 direct links (in the current version, the ISA allows
up to 2048) and 8 new instructions for sending
or receiving data to/from the register file.

The connection is called LocalLink (or FSL) and has these features:
- Unshared non-arbitrated communication
- Control and Data support
- Uni-directional point-to-point
- FIFO based
- 600 MHz standalone operation

Göran
 
"Kolja Sulimma" <news@sulimma.de> wrote in message
news:b890a7a.0311040449.77eabb34@posting.google.com...
Thanks

Just so I understand you: if I want to "realise" my C code in an FPGA
array, I can upload the code, data and the processing array, run it,
and download the data?
Yes. But you are likely to spend a lot of effort designing the
processing array.

I guess that if you post the equation (maybe a simplified version),
the precision you need and the number of elements in a typical data
set you will get a pretty good estimate from this group about how well
this can be solved in FPGAs.

Kolja Sulimma
I will post the equations if I am able; this particular project is not
mine, and I do not yet know whether I may post their work. I will know
in the next week. In the meantime, the basic premise of the
calculation is as follows...

From time zero until time x, for each time step, calculate the
next-step positions of n particles (typically hundreds to thousands).
Factors affecting the new position are 1) the interactions between the
particles, 2) particle velocity and mass, and 3) the medium the
particles are in.

Currently the system is simplified by grouping the particles into
neighbourhoods, so that effects from distant particles are ignored.


Again thanks all for your help

Mike
 
