CPU vs. FPGA vs. RAM


Valentin Tihomirov

Guest
The massive parallelism is considered the main advantage of FPGAs.
Meanwhile, the bottleneck of modern systems is memory performance. How do I
benefit in, e.g., image processing by using a wide, low-speed FPGA over a
high-speed CPU when the image is located in SRAM? Today more and more FPGAs
are equipped with embedded RAM. How can an FPGA benefit from concurrent
processing when it has to serialize memory access?
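A back-of-the-envelope sketch of the worry (in Python, with made-up
clock rates, not figures for any real device): however many parallel
functional units the fabric provides, a single serialized memory port
caps throughput.

    # Hypothetical numbers for illustration only.
    PIXELS = 1_000_000        # image size
    FABRIC_MHZ = 50           # "wide, low-speed" FPGA clock
    CPU_MHZ = 2000            # "high-speed" CPU clock
    WORDS_PER_CYCLE = 1       # a single SRAM port serializes access

    # If every unit needs one pixel per cycle, the single memory port,
    # not the number of units, sets the pace:
    fabric_rate = FABRIC_MHZ * 1e6 * WORDS_PER_CYCLE   # pixels/s
    cpu_rate = CPU_MHZ * 1e6 * WORDS_PER_CYCLE         # pixels/s
    print(f"FPGA, memory-bound: {PIXELS / fabric_rate * 1e3:.1f} ms")
    print(f"CPU,  memory-bound: {PIXELS / cpu_rate * 1e3:.1f} ms")
    # The FPGA only wins by doing many operations per pixel fetched
    # (deep pipelining, data reuse) or by fetching from many ports at once.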
 
In article <3f92b7c1$1_1@news.estpak.ee>,
Valentin Tihomirov <valentin@abelectron.com> wrote:
The massive parallelism is considered the main advantage of FPGAs.
Meanwhile, the bottleneck of modern systems is memory performance. How do I
benefit in, e.g., image processing by using a wide, low-speed FPGA over a
high-speed CPU when the image is located in SRAM? Today more and more FPGAs
are equipped with embedded RAM. How can an FPGA benefit from concurrent
processing when it has to serialize memory access?
What do you mean by "memory performance?" Latency for sequential
access? Latency for parallel accesses? Throughput for a single
stream? Throughput for multiple streams sharing memory? Throughput
for multiple streams from independent memories?



--
Nicholas C. Weaver nweaver@cs.berkeley.edu
 
What do you mean by "memory performance?" Latency for sequential
access? Latency for parallel accesses? Throughput for a single
stream? Throughput for multiple streams sharing memory? Throughput
for multiple streams from independent memories?
I don't think it matters. He's just saying that FPGAs provide such an
abundance of functional units that memory performance (however you
define it -- the ability of a given system to supply data to the
functional units) is the limiting factor. He's got a 10,000-pound
gorilla and he's trying to feed it bananas with a teaspoon.

Jake
 
"Valentin Tihomirov" <valentin@abelectron.com> wrote in message news:<3f92b7c1$1_1@news.estpak.ee>...
The massive parallelism is considered the main advantage of FPGAs.
Meanwhile, the bottleneck of modern systems is memory performance. How do I
benefit in, e.g., image processing by using a wide, low-speed FPGA over a
high-speed CPU when the image is located in SRAM? Today more and more FPGAs
are equipped with embedded RAM. How can an FPGA benefit from concurrent
processing when it has to serialize memory access?
Good question, Valentin. I personally think you'll see a lot more use
made of the on-chip embedded RAM as 'cache' memory than previously.
Of course, FPGAs are applied to so many different problems that in
some of them this use will never be applicable, while in others it
has already been applied. The bottom line is that on-chip memory is
going to be faster than off-chip memory. Its usefulness in keeping the
FPGA's functional units supplied with data may come in the form of
transforming it into a more useful cache structure.

That may also mean that a larger part of an FPGA design will be
relegated to the function of cache controller: perhaps using block
RAMs as an L2-type cache and distributed local memory as an L1-type
cache, and trying to keep the external memory pipe as active as
possible in whatever way is most efficient for the functionality
implemented.
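A toy sketch of the hit-rate accounting behind such an L1/L2
arrangement (Python; all latencies and hit rates are assumptions for
illustration, not figures from any datasheet):

    # Toy two-level cache cost model: distributed RAM as a small, fast
    # "L1", block RAM as a larger "L2", external memory behind both.
    L1_LAT, L2_LAT, EXT_LAT = 1, 3, 20   # assumed latencies in cycles

    def avg_access_cycles(l1_hit, l2_hit):
        """Average memory access time for the given hit rates (0..1)."""
        return (l1_hit * L1_LAT
                + (1 - l1_hit) * l2_hit * L2_LAT
                + (1 - l1_hit) * (1 - l2_hit) * EXT_LAT)

    # A streaming kernel with good locality:
    print(f"{avg_access_cycles(l1_hit=0.9, l2_hit=0.9):.2f} cycles")  # 1.37
    # No on-chip caching at all:
    print(f"{avg_access_cycles(l1_hit=0.0, l2_hit=0.0):.2f} cycles")  # 20.00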

Jake
 
Followup to: <d6ad3144.0310191907.619e5da3@posting.google.com>
By author: jakespambox@yahoo.com (Jake Janovetz)
In newsgroup: comp.arch.fpga
What do you mean by "memory performance?" Latency for sequential
access? Latency for parallel accesses? Throughput for a single
stream? Throughput for multiple streams sharing memory? Throughput
for multiple streams from independent memories?

I don't think it matters. He's just saying that FPGAs provide such an
abundance of functional units that memory performance (however you
define it -- the ability of a given system to supply data to the
functional units) is the limiting factor. He's got a 10,000-pound
gorilla and he's trying to feed it bananas with a teaspoon.
This is of course why FPGAs have extremely high-speed onboard memory in
small chunks that can be independently wired. The quantities are
limited, of course, just like the caches on a CPU, but a lot of FPGA
designs wouldn't be possible or practical without it.

-hpa


--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64
 
In article <d6ad3144.0310191907.619e5da3@posting.google.com>,
Jake Janovetz <jakespambox@yahoo.com> wrote:
What do you mean by "memory performance?" Latency for sequential
access? Latency for parallel accesses? Throughput for a single
stream? Throughput for multiple streams sharing memory? Throughput
for multiple streams from independent memories?

I don't think it matters. He's just saying that FPGAs provide such an
abundance of functional units that memory performance (however you
define it -- the ability of a given system to supply data to the
functional units) is the limiting factor. He's got a 10,000-pound
gorilla and he's trying to feed it bananas with a teaspoon.
It matters a LOT. Latency and large memory: you're fucked.
Overlapping latency is not so bad. Streaming access from multiple
banks? If it's predictable enough, you can be running those pins at
266 MHz DDR at full rate all the time.
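A rough sketch of the difference (Python; the 8-cycle access latency is
an assumption for illustration): dependent, serialized accesses pay the
full latency every time, while pipelined streaming from independent
banks hides it after the pipeline fills.

    # Assumed: 266 MHz clock, 8-cycle memory access latency.
    ACCESSES = 1_000_000
    LATENCY = 8          # cycles per access (assumed)
    CLOCK_HZ = 266e6

    # Serialized (e.g. pointer chasing): each access waits on the last.
    serial_cycles = ACCESSES * LATENCY
    # Overlapped streaming: one access retires per cycle once the
    # pipeline is full.
    overlap_cycles = ACCESSES + LATENCY

    for name, cyc in (("serialized", serial_cycles),
                      ("overlapped", overlap_cycles)):
        print(f"{name}: {cyc / CLOCK_HZ * 1e3:.2f} ms")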



--
Nicholas C. Weaver nweaver@cs.berkeley.edu
 
I don't think it matters. He's just saying that FPGAs provide such an
abundance of functional units that memory performance (however you
define it -- the ability of a given system to supply data to the
functional units) is the limiting factor. He's got a 10,000-pound
gorilla and he's trying to feed it bananas with a teaspoon.

It matters a LOT. Latency and large memory: you're fucked.
Overlapping latency is not so bad. Streaming access from multiple
banks? If it's predictable enough, you can be running those pins at
266 MHz DDR at full rate all the time.
Hmm. Best-case comparisons. Let's see:
For an XC2VP125 that's about 1200 I/Os streaming 533 Mbps each, resulting
in a total of 640 Gbps of data, assuming that you need no address lines
or address cycles.
Internally you have 556 BRAMs with 72 data bits each running at 266 MHz;
that's almost 11 Tbps, random access, with 10,008 address lines provided
additionally.

That's an order of magnitude better internal bandwidth PLUS random
access.
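For reference, the arithmetic behind those figures (Python; the pin and
BRAM counts are as quoted above, and the 18 address lines per BRAM are
presumably two ports of 9 bits each):

    # Reproducing the best-case bandwidth comparison above.
    io_pins, mbps_per_pin = 1200, 533
    print(f"I/O streaming: {io_pins * mbps_per_pin / 1e3:.0f} Gbps")  # ~640

    brams, bits, mhz = 556, 72, 266
    bram_tbps = brams * bits * mhz * 1e6 / 1e12
    print(f"Internal BRAM: {bram_tbps:.1f} Tbps")                     # ~10.6

    addr_lines = brams * 2 * 9   # two ports x 9 address bits (512 x 36)
    print(f"Address lines: {addr_lines}")                             # 10008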

I agree with the original poster that it is difficult to find real-world
applications that can use the potential computational power of
an FPGA. It would be nice if there were only Smith-Waterman-like
algorithms out there.

For reconfigurable computing, an order of magnitude larger internal
memory would help a lot and would not cost that much more (think HSRA).
But I do not think that RC people are the customers that the current
FPGAs are optimized for.

Kolja Sulimma
 
In article <b890a7a.0310210440.ae989cf@posting.google.com>,
Kolja Sulimma <news@sulimma.de> wrote:

For reconfigurable computing, an order of magnitude larger internal
memory would help a lot and would not cost that much more (think HSRA).
But I do not think that RC people are the customers that the current
FPGAs are optimized for.
Actually, mixed DRAM/logic (what gets you the ~10x greater memory
density for the HSRA) has effectively failed. It required too much
process fiddling, which nobody wants to do these days.

Also, then you have the joy of the DRAM page access time anyway, so
you save on the pin crossings, but the rest of the latency is still
there, and still measured in several clock cycles.



As for the streaming access example, it's still an order of magnitude
greater than a CPU's, where you only have 128 pins at 533 Mbps/pin.
But if it is random access from a single block (e.g. pointer chasing),
why not just get an Athlon 64 and write it in software?
--
Nicholas C. Weaver nweaver@cs.berkeley.edu
 
Followup to: <bn3f57$41o$1@agate.berkeley.edu>
By author: nweaver@ribbit.CS.Berkeley.EDU (Nicholas C. Weaver)
In newsgroup: comp.arch.fpga
Actually, mixed DRAM/logic (what gets you the ~10x greater memory
density for the HSRA) has effectively failed. It required too much
process fiddling, which nobody wants to do these days.

Also, then you have the joy of the DRAM page access time anyway, so
you save on the pin crossings, but the rest of the latency is still
there, and still measured in several clock cycles.
It is, but you get *HUGE* amounts of data for each access. Existing
DRAMs mux away an amazing amount of the data read for each array
access.
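To put a rough number on that (Python; the row and I/O widths are
illustrative assumptions, not figures from the posts above):

    # Illustrative DRAM geometry: an array access senses an entire row,
    # but only a narrow slice is muxed out to the pins.
    row_bits = 8 * 1024   # bits sensed per array access (assumed)
    io_bits = 16          # bits delivered per beat at the pins (assumed)
    print(f"fraction of each row used: {io_bits / row_bits:.3%}")  # 0.195%
    # Logic sitting next to the array could consume the whole row at
    # once, which is the appeal of mixed DRAM/logic despite the process
    # headaches.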

I suspect that DRAM integration is going to happen sooner or later;
however, it's not going to happen just yet.

-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64
 
