FPGA tool benchmarks on Linux systems

B. Joshua Rosen

I've put together a webpage on the performance of NCSim and Xilinx
tools on various systems, specifically a dual PIII, a dual Xeon, an Athlon 64
3400+ and an Athlon 64 3800+.

http://www.polybus.com/linux_hardware/index.htm
 
Jason Zheng wrote:

3. You are comparing state-of-the-art AMD workstations with mediocre
Intel servers. It's like comparing oranges with apples.
I have some measurements with more current Xeon processors. Unfortunately
I had only one not-so-state-of-the-art Opteron to measure.

These results were published by me at a local Mentor Graphics conference
(these are only a small part of the numbers). The simulations were done with
Modelsim on a ~8 Mgate chip (plus all memories). The numbers are simulation times
in seconds.


RTL, one CPU active                  Time (s)
Sun V880 UltraSPARC III/900 MHz          3531
P4 Xeon 2.2 GHz/512k                     2224
P4 Xeon 2.4 GHz/512k                     2087
P4 Xeon 2.8 GHz/512k                     1928
P4 Xeon 3.06 GHz/512k                    1634
P4 Xeon 3.4 GHz EM64T (32-bit)           1239
AMD Opteron 848 (32-bit)                 1584

RTL, both CPUs active                Time (s)
Sun V880 UltraSPARC III/900 MHz          3520
P4 Xeon 2.2 GHz                          2540
P4 Xeon 2.4 GHz                          2680
P4 Xeon 2.8 GHz                          2650
P4 Xeon 3.06 GHz                         2120
P4 Xeon 3.4 GHz EM64T (32-bit)           1450
AMD Opteron 848 (32-bit)                 1587

One thing that amazes me is that with the Xeons, even in RTL simulation, the performance
degrades very quickly. I guess that with 4 processors the Xeons degrade very badly. With
the Opterons there was no degradation to be seen.

For the gate-level simulations the results are almost identical, although the dataset
is 15-20x larger and the simulation times for the same case are longer. Also, if 64-bit mode
was used, the Opteron became faster and the Xeon EM64T a little slower (very small differences
compared to 32-bit mode, though).


--Kim
 
On Tue, 01 Mar 2005 11:49:57 -0800, Jason Zheng wrote:

Dave Colson wrote:
I was told that the benefit of two CPUs is that you can run another
application while simulating and not have your computer slow down
because the other application will run from the other CPU.


That depends on the process scheduler in the kernel. The most
significant benefit is that a multi-threaded application, such as an HDL
simulator, can run multiple threads at the same time. For example the
following Verilog structure lends itself to multi-threading:
Parallel simulators are apparently a much harder problem than you might
suspect. A number of years ago I was discussing this issue with the CTO of
IKOS (since bought by Mentor). To me it seemed that simulation should be
a highly parallel problem, but he claimed that there had been a number of
attempts at parallel software simulators (as opposed to hardware
acceleration engines) and that no one had succeeded. With the advent of
multi-core processors this year I suspect that the issue will be
revisited. In the meantime a dual-processor machine is useful for running
multiple simultaneous simulations, like regression suites, assuming that
you have more than one license.
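
To make the difficulty concrete, here is a minimal, purely illustrative
Verilog sketch (not from any of the original posts): the two always blocks
below are coupled through each other's outputs, so a simulator that tried to
place them on different CPUs would have to synchronize them on every clock
edge, and the communication cost quickly eats into any speedup.

// Illustrative only: tightly coupled processes are hard to partition
// across CPUs, because they must exchange values every clock cycle.
module coupled;
  reg clk = 0;
  reg [7:0] a = 0, b = 0;

  always #5 clk = ~clk;               // common clock for both "partitions"

  always @(posedge clk) a <= b + 1;   // "partition 1" reads b every cycle
  always @(posedge clk) b <= a + 1;   // "partition 2" reads a every cycle

  initial begin
    #100 $display("a=%0d b=%0d", a, b);
    $finish;
  end
endmodule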
 
I think we should all encourage the FPGA- and EDA-tool vendors to adapt
their software to parallel algorithms (especially place and route), as
dual-core CPUs are really coming soon and most of us will buy the fastest
machine we can get for reasonable money. In fact, a parallel algorithm
would already help a little bit today on P4s with hyper-threading.

Regards,

Thomas

www.entner-electronics.com

"B. Joshua Rosen" <bjrosen@polybus.com> schrieb im Newsbeitrag
news:pan.2005.03.01.21.40.50.480146@polybus.com...
On Tue, 01 Mar 2005 11:49:57 -0800, Jason Zheng wrote:

Dave Colson wrote:
I was told that the benefit of two CPUs is that you can run another
application while simulating and not have your computer slow down
because the other application will run from the other CPU.


That depends on the process scheduler in the kernel. The most
significant benefit is that your multi-threaded application such as HDL
simulator can run multiple threads at the same time. For example the
following verilog structure lends itself to multi-threading:


Parallel simulators are apparently a much harder problem then you might
suspect. A number of years ago I was discussing this issue with the CTO of
IKOS (since bought by Mentor). To me it seemed that simulation should be
a highly parallel problem but he claimed that there had be a number of
attempts at parallel software simulators (as opposed to hardware
acceleration engines) and that no one had succeeded. With the advent of
multi-core processors this year I suspect that the issue will be
revisited. In the mean time a dual processor machine is useful for running
multiple simultaneous simulations, like regression suites, assuming that
you have more than one license.
 
Christian Schneider wrote:

Thanks for all the benchmarks! Very interesting information!

If I interpret the data correctly, two CPUs result in the same simulation
time, so they are of no benefit? That's a pity!
The 2-CPU result is 2 copies of the same simulation running at the same time.
There are no multithreaded RTL simulators available commercially.

That measurement shows how the memory bus and machine architecture scale.

--Kim
 
"Dave Colson" <dscolson@rcn.com> writes:

I was told that the benefit of two CPUs is that you can run another
application while simulating and not have your computer slow down
because the other application will run from the other CPU.
I would rather have two single CPU systems. In some cases it's
cheaper.

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
 
"B. Joshua Rosen" <bjrosen@polybus.com> writes:

Parallel simulators are apparently a much harder problem than you
might suspect. A number of years ago I was discussing this issue
I find this quite surprising as well. It has been discussed a few
times in comp.lang.verilog (like in http://tinyurl.com/6z9jv).

IKOS (since bought by Mentor). To me it seemed that simulation
should be a highly parallel problem but he claimed that there had been
a number of attempts at parallel software simulators (as opposed to
hardware acceleration engines) and that no one had succeeded. With
I think many of the EDA vendors are expecting linear speedup so they
can apply a linear pricing policy. If the license cost were flat, e.g.
if they viewed a cluster as a single fast machine, I would be happy to
accept less than linear speedup and just throw sub-$1k PCs at my
simulation to increase its performance.

I'm also surprised that there aren't many parallel synthesis tools and
place & route tools out there either (Xilinx par supports a very
coarse-grained parallelism). Must be a great opportunity for MPI
programmers with EDA knowledge...

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
 
Jason Zheng <xin.zheng@jpl.nasa.gov> writes:

Place-n-route can already be done in pseudo-parallel fashion with
xilinx's modular design. You can simply run two processes that each
Actually, the Xilinx par tool has supported coarse-grained
parallelism for many years (using the -m option, on Solaris that is).
I remember having my par jobs running on half a dozen or so dual and
quad SPARC systems.

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
 
One thing that amazes me is that with the Xeons, even in RTL simulation, the
performance degrades very quickly. I guess that with 4 processors the Xeons
degrade very badly. With the Opterons there was no degradation to be seen.
Well, at my previous workplace, there was a dual Pentium III/S (1.26 GHz,
512K cache) server. Running two *independent* memory-intensive jobs
simultaneously basically incurred a 60% performance hit. Another way
to put it is this: if job A takes 1 hour by itself, and job B takes
1 hour by itself, launching A and B simultaneously causes the
completion time to increase to 1.6 hours (for both jobs). ACKK!!!
(This was with NC-Verilog 4.0.)

For the gate-level simulations the results are almost identical, although the
dataset is 15-20x larger and the simulation times for the same case are longer.
Also, if 64-bit mode was used, the Opteron became faster and the Xeon EM64T a
little slower (very small differences compared to 32-bit mode, though).
That's very interesting to know. At my current workplace, we have an
'unofficial' (i.e., unsanctioned by management -- we're a Solaris
department!) Athlon 64 3200+. From my firsthand experience, Cadence's
WarpRoute and Buildgates/PKS5 benefit tremendously from 64-bit x86_64 vs
32-bit IA32 mode, something like a +30% boost in throughput. (The job's
RAM footprint increases a bit, as expected and noted in the product's
documentation.) The main problem is that a lot of older CAD tools
just "don't work right" under the 64-bit Linux OS. Ironically, our ancient
"signalscan" waveform viewer still runs, while our tool guy can't figure out
why Tetramax U-2003.06-SP1 refuses to work...

It's really funny when a manager comes along and asks why the engineers
like the Athlon/64 so much, and the engineers tell him "because it runs
a synthesis job up to 3x faster than our fastest Solaris boxes."
 
Another research project in this arena is DVS.

http://csdl.computer.org/comp/proceedings/pads/2003/1970/00/19700173abs.htm

and http://www.cs.mcgill.ca/~carl/dvs.pdf

And as mentioned the communications bottleneck is an issue.

Maybe the IBM Cell Processor is what we've been waiting for? :)

http://www-1.ibm.com/press/PressServletForm.wss?MenuChoice=pressreleases&TemplateName=ShowPressReleaseTemplate&SelectString=t1.docunid=7502&TableName=DataheadApplicationClass&SESSIONKEY=any&WindowTitle=Press+Release&STATUS=publish

/Ed
 
Dave Colson wrote:
I was told that the benefit of two CPUs is that you can run another
application while simulating and not have your computer slow down
because the other application will run from the other CPU.
ACK. But it would have been nice to see multithreaded simulations, which
benefit from more CPUs, especially since simulations are parallelizable
and some vendors support clusters. So I think this is just a small
step ... one that has not been taken yet. And the dual-core CPUs are ante portas ...

BR,
Chris

"Christian Schneider" <please_reply_to_the@newsgroup.net> wrote in message
news:d02amb$r5a$1@online.de...

Thanks for all the benchmaks! Very interessting information!

If I interpret the data correctly, two CPU result in the same simulation
time, so they are of no benefit? That's a pity!

BR,
Chris

Kim Enkovaara wrote:

Jason Zheng wrote:


3. You are comparing state-of-the-art AMD workstations with mediocre
Intel servers. It's like comparing oranges with apples.


I have some measurements with more current Xeon processors.

Unfortunately

I had only one not so state of the art Opteron to measure.

These results were published by me in one local Mentor Graphics

conference

(these are only small part of the numbers). The simulations are done

with

Modelsim for a ~8Mgate chip (+all memories). The numbers are simulation
time
in seconds.


RTL One CPU active
Sun V880 UIII/900 3531
P4 Xeon 2.2/512k 2224
P4 Xeon 2.4/512k 2087
P4 Xeon 2.8/512k 1928
P4 Xeon 3.06/512k 1634
P4 Xeon 3.4EMT (32b) 1239
AMD Opteron 848(32b) 1584

RTL Both CPUs active
Sun V880 UIII/900 3520
P4 Xeon 2.2 2540
P4 Xeon 2.4 2680
P4 Xeon 2.8 2650
P4 Xeon 3.06 2120
P4 Xeon 3.4EMT (32bit) 1450
AMD Opteron 848(32b) 1587

One thing that amazes me is that in Xeons even with RTL simulation the
performance
degrades very guickly. I guess with 4 processors Xeons degrade very
badly. In
Opterons there was no degradation to be seen.

For the gate level simulations the results are almost identical, altough


the dataset
is 15-20x larger and simulation times for the same case are longer. Also
if 64b mode
was used Opteron became faster and Xeon EMT was little slower (very
small differences
compared to 32b mode tough).


--Kim
 
B. Joshua Rosen wrote:
I've put together a webpage on the performance of NCSim and Xilinx
tools on various systems, specifically a dual PIII, dual Xeon, Athlon 64
3400+ and an Athlon 64 3800+.

http://www.polybus.com/linux_hardware/index.htm

A nice testbench but...

1. What's the amount of memory used on each system?
2. I think you misused the term "CPU-bound." Rather, they are all
memory-bound computations. Had they been purely CPU-bound, the Xeon
machines might have won. The point that you really want to make is that
the AMD CPUs have an on-chip memory controller and thus much lower memory
latency, which makes memory-intensive applications run much faster than
on Intel CPUs.
3. You are comparing state-of-the-art AMD workstations with mediocre
Intel servers. It's like comparing oranges with apples.

Lastly, might I ask, are you affiliated with AMD?

-jz
 
Thanks for all the benchmarks! Very interesting information!

If I interpret the data correctly, two CPUs result in the same simulation
time, so they are of no benefit? That's a pity!

BR,
Chris

 
I was told that the benefit of two CPUs is that you can run another
application while simulating and not have your computer slow down
because the other application will run from the other CPU.


"Christian Schneider" <please_reply_to_the@newsgroup.net> wrote in message
news:d02amb$r5a$1@online.de...
Thanks for all the benchmaks! Very interessting information!

If I interpret the data correctly, two CPU result in the same simulation
time, so they are of no benefit? That's a pity!

BR,
Chris

Kim Enkovaara wrote:
Jason Zheng wrote:

3. You are comparing state-of-the-art AMD workstations with mediocre
Intel servers. It's like comparing oranges with apples.


I have some measurements with more current Xeon processors.
Unfortunately
I had only one not so state of the art Opteron to measure.

These results were published by me in one local Mentor Graphics
conference
(these are only small part of the numbers). The simulations are done
with
Modelsim for a ~8Mgate chip (+all memories). The numbers are simulation
time
in seconds.


RTL One CPU active
Sun V880 UIII/900 3531
P4 Xeon 2.2/512k 2224
P4 Xeon 2.4/512k 2087
P4 Xeon 2.8/512k 1928
P4 Xeon 3.06/512k 1634
P4 Xeon 3.4EMT (32b) 1239
AMD Opteron 848(32b) 1584

RTL Both CPUs active
Sun V880 UIII/900 3520
P4 Xeon 2.2 2540
P4 Xeon 2.4 2680
P4 Xeon 2.8 2650
P4 Xeon 3.06 2120
P4 Xeon 3.4EMT (32bit) 1450
AMD Opteron 848(32b) 1587

One thing that amazes me is that in Xeons even with RTL simulation the
performance
degrades very guickly. I guess with 4 processors Xeons degrade very
badly. In
Opterons there was no degradation to be seen.

For the gate level simulations the results are almost identical, altough

the dataset
is 15-20x larger and simulation times for the same case are longer. Also
if 64b mode
was used Opteron became faster and Xeon EMT was little slower (very
small differences
compared to 32b mode tough).


--Kim
 
Dave Colson wrote:
I was told that the benefit of two CPUs is that you can run another
application while simulating and not have your computer slow down
because the other application will run from the other CPU.


That depends on the process scheduler in the kernel. The most
significant benefit is that a multi-threaded application, such as an HDL
simulator, can run multiple threads at the same time. For example the
following Verilog structure lends itself to multi-threading:

fork
begin
...
end
begin
...
end
join

Now, whether that piece of code is actually run as two threads is up to
the HDL compiler's design. It might just run as a single thread. Even if
it were run as two threads, the kernel might decide that another process is
more important and give only one CPU to the HDL simulator.
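
As a concrete, runnable version of the idea, here is a minimal self-contained
sketch (purely illustrative; whether any given simulator really maps the
branches to separate CPUs is implementation-dependent):

// Illustrative sketch: the two fork branches are independent, so an ideal
// multi-threaded simulator could in principle evaluate them concurrently.
module fork_join_demo;
  reg [31:0] a, b;
  integer i;

  initial begin
    a = 0; b = 0;
    fork
      // branch 1: could run on one CPU
      for (i = 0; i < 1000; i = i + 1)
        #1 a = a + 1;
      // branch 2: could run on another CPU
      repeat (1000)
        #1 b = b + 2;
    join                              // both branches finish at time 1000
    $display("a=%0d b=%0d", a, b);    // a=1000, b=2000
    $finish;
  end
endmodule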

The real advantage of the AMD CPUs is that each CPU has its own memory
interface and a high-bandwidth link to the others. Intel CPUs have a
different architecture, where all CPUs share the same memory interface.
AMD's design is more scalable: adding CPUs to the mainboard only
slightly affects the memory bandwidth each CPU receives, whereas Intel
CPUs get much less memory bandwidth each as the number of CPUs goes up.
Although the memory bandwidth can be improved with a higher FSB frequency
(1066 MHz now) and larger L2 (2 MB now) and L3 caches (8-16 MB?), the
Intel approach does not scale. This is why an AMD Opteron system can easily
come in an 8-way configuration while you rarely even see a quad Xeon.

-jz
 
On 2005-03-01, B. Joshua Rosen <bjrosen@polybus.com> wrote:
Parallel simulators are apparently a much harder problem than you might
suspect. A number of years ago I was discussing this issue with the CTO of
Savant tries to do parallel VHDL simulation:
http://www.ececs.uc.edu/~paw/savant/

"The SAVANT project has been integrated with UC's WARPED parallel simulation
research project and provides an end-to-end VHDL-to-batch simulation
capability. WARPED provides a general purpose discrete event simulation API
that can be executed in parallel or sequentially. Built on top of WARPED is a
VHDL simulation kernel called TyVIS that links with the C++ code generated from
SAVANT for batch sequential or parallel simulation."

But as it is a research project, I don't know how well it succeeds.
 
