Need to speed up Stratix compiles.

rickman · Mar 3, 2004

Max wrote:

On 03 Mar 2004 18:46:34 +0100, Petter Gustad wrote:

I would like to see synthesis and place and route tools I could
run on a cluster of cheap PCs. I would be happy with less than linear
speedups, e.g. using a 16-node cluster to get an 8x speedup.

I doubt you'd get anywhere near. Trying to implement those algorithms
efficiently on the sort of loosely-coupled architecture you propose
would be nigh-on impossible. It's not easy on a single SMP box, but
it's doable.

A quad Xeon (8 x CPU) box would cost less than four single decent-spec
machines anyway.

Not if the four machines are sitting around all night running screen
savers.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

Jim Granville · Mar 3, 2004

rickman wrote:

I have always been surprised that the FPGA vendors don't put some effort
into evaluating platforms and releasing the results.

Would seem a very good idea.

On this topic, I see Intel released a new Xeon with 3GHz
and 4MB (!) cache, and they claim it is 25% faster.
Of course, you pay - $3692 (quantity tier not given).

The PR claims this is the last release before Intel adds 64-bit
extensions....

Petter Gustad · Mar 3, 2004

Max <mtj2@btopenworld.com> writes:

On 03 Mar 2004 18:46:34 +0100, Petter Gustad wrote:

I would like to see synthesis and place and route tools I could
run on a cluster of cheap PCs. I would be happy with less than linear
speedups, e.g. using a 16-node cluster to get an 8x speedup.

I doubt you'd get anywhere near. Trying to implement those algorithms
efficiently on the sort of loosely-coupled architecture you propose
would be nigh-on impossible. It's not easy on a single SMP box, but
it's doable.

I disagree. Synthesis as well as P&R involve exploring many
alternatives, ordered by some underestimate of expense/delay
(typically using an A* search algorithm or similar). This can be done
in parallel. The datasets can be copied to each node and there will be
very little information which has to be exchanged over the
interconnect. Of course there is not much to gain if your P&R takes 1
minute, but for larger designs and/or more accurate wire delay models
(e.g. non-linear delay modeling and physical synthesis) the benefit
will be larger.
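
To make the "underestimate of expense/delay" idea concrete, here is a minimal sketch of A* on a toy routing grid. It is only an illustration, not how any vendor router works: the grid format, the function name `astar_route` and the unit cost per step are all assumptions made for the example. The Manhattan distance plays the role of the underestimate, since it can never exceed the true remaining path length on a 4-connected grid.

```python
import heapq

def astar_route(grid, start, goal):
    """Return the length of a shortest 4-connected path, or None if unroutable.

    grid is a list of equal-length strings; '#' marks a blocked cell
    (a hypothetical format chosen for this sketch).
    """
    rows, cols = len(grid), len(grid[0])

    def estimate(cell):
        # Manhattan distance: an underestimate of the remaining cost.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}
    frontier = [(estimate(start), 0, start)]  # (cost + estimate, cost, cell)
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best[cell]:
            continue  # stale queue entry, a cheaper path was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(frontier, (new_cost + estimate(nxt), new_cost, nxt))
    return None

# A wall in the middle column forces a detour around the bottom row.
DETOUR = [".#.",
          ".#.",
          "..."]
print(astar_route(DETOUR, (0, 0), (0, 2)))
```

Each net (or each candidate placement) could in principle be searched like this on a different node, which is the sense in which little data would need to cross the interconnect; real routers also have to resolve conflicts between nets, which this sketch ignores.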

This has been implemented in some ASIC tools already. Actually Xilinx
has been doing some very simple parallel processing in ISE (on Solaris
and now Linux) for a long time. Multiple iterations of "par" can run
in parallel on multiple hosts, then you pick the best result. This is,
of course, extremely coarse-grained compared to what I indicated
above.

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Petter Gustad · Mar 3, 2004

Max <mtj2@btopenworld.com> writes:

A quad Xeon (8 x CPU) box would cost less than four single
decent-spec machines anyway.

My experience is the opposite. I've heard from users in the high
performance computing industry that the most cost efficient systems
are clusters of dual CPU nodes (assuming your application will run
efficiently on a cluster).

A 4 CPU Xeon system like a Dell PowerEdge 6650 with 4x Xeon, 3.0GHz
and 4GB RAM costs $28,070. A single PowerEdge 750 (1U server) with
3.4GHz P4 (higher clock frequency, but smaller cache) with 1GB RAM
costs $3,165.

8 CPU Xeon SMP's (Profusion architecture) are very expensive. A
Proliant 8500 costs $100,000+ if memory serves me right.
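
The comparison can be put in per-CPU terms with a few lines of arithmetic. This sketch uses only the two Dell prices quoted above; the dictionary layout and the `cost_per_cpu` helper are made up for the example, and it ignores everything else that differs between the machines (cache size, clock speed, interconnect, rack space).

```python
# Prices as quoted in this post (USD, March 2004).
quad_xeon_smp = {"name": "PowerEdge 6650", "price": 28_070, "cpus": 4, "ram_gb": 4}
single_p4_node = {"name": "PowerEdge 750", "price": 3_165, "cpus": 1, "ram_gb": 1}

def cost_per_cpu(system):
    """Purchase price divided by the number of physical CPUs."""
    return system["price"] / system["cpus"]

# Four single-CPU nodes give the same CPU count and total RAM as the SMP box.
cluster_price = 4 * single_p4_node["price"]

print("SMP cost per CPU:    ", cost_per_cpu(quad_xeon_smp))   # 7017.5
print("Node cost per CPU:   ", cost_per_cpu(single_p4_node))  # 3165.0
print("Four-node cluster:   ", cluster_price)                 # 12660
print("SMP / cluster ratio: ", round(quad_xeon_smp["price"] / cluster_price, 2))
```

On these list prices the SMP box costs a bit more than twice as much as four single-CPU nodes, which only pays off if the application scales much better on shared memory.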

Petter

Petter Gustad · Mar 3, 2004

Max <mtj2@btopenworld.com> writes:

On 03 Mar 2004 22:58:46 +0100, Petter Gustad wrote:

A 4 CPU Xeon system like a Dell PowerEdge 6650 with 4x Xeon, 3.0GHz
and 4GB RAM costs $28,070. A single PowerEdge 750 (1U server) with
3.4GHz P4 (higher clock frequency, but smaller cache) with 1GB RAM
costs $3,165.

The hyperthreaded Xeons run as two processors, so a quad Xeon board
appears to a HT-aware OS as an 8-CPU system.

Then would you also call a system with a single HyperThreading P4 a
dual-processor system? That would be a little "unfair" when
comparing to a full dual-core CPU like the rumored UltraSparc-IV.

Why pay for all the extra high-end hardware in a top-end server if you
don't need it? When I was last looking at building systems like this,

My point was that you usually get lots of extra high-end hardware when
you buy large SMP systems, especially when you need to go beyond
4-way. Also, it's usually cheaper to get 4x4GB RAM rather than 16GB
RAM for a single motherboard (unless you have a large enough number
of DIMM slots).

about 18 months or so ago, a quad-Xeon mobo from Supermicro was
<$2000, and the processors were around $450 apiece.

This is pretty good; I was not aware of the low cost of the Supermicro
motherboard. You would end up at close to $4000, i.e. in the same
ballpark as buying 4 P4 systems. So if the application performed better
on the SMP than on the cluster I would definitely go with the SMP.

Petter

Max · Mar 3, 2004

On 03 Mar 2004 22:58:46 +0100, Petter Gustad wrote:

A 4 CPU Xeon system like a Dell PowerEdge 6650 with 4x Xeon, 3.0GHz
and 4GB RAM costs $28,070. A single PowerEdge 750 (1U server) with
3.4GHz P4 (higher clock frequency, but smaller cache) with 1GB RAM
costs $3,165.

The hyperthreaded Xeons run as two processors, so a quad Xeon board
appears to a HT-aware OS as an 8-CPU system.

Why pay for all the extra high-end hardware in a top-end server if you
don't need it? When I was last looking at building systems like this,
about 18 months or so ago, a quad-Xeon mobo from Supermicro was
<$2000, and the processors were around $450 apiece.

--
Max

Kenneth Land · Mar 4, 2004

I was just trying to be helpful by sharing my experience.
We're only interested in speeding up Quartus builds in this thread and some
have been suggesting more memory (32 GB in some instances) and faster
drives. I've done both in two different machines and the biggest
improvement came from tweaking the memory subsystem, not adding more memory
above 512MB or a faster drive.

The 7200 RPM drive is very much faster, as can be seen from the much faster
boot times. It didn't mean much on Quartus builds though. It seems Quartus
needs (for my Nios system) a fast CPU with at least several hundred MB of
tweaked memory.

I write not slow image processing algorithms and use as many wires as the
system can provide. If it's an 8-bit CPU then I use 8-bit optimizations, if
it's 32-bit then 32-bit optimizations. Haven't tried 64-bit yet, but I plan
to. Can't imagine any developer worth their salt who wouldn't.

Ken

"Max" <mtj2@btopenworld.com> wrote in message
news:5vtb40lq1kmtcfqefbhdr69ei29kpq6h60@4ax.com...

On Tue, 2 Mar 2004 20:05:37 -0600, Kenneth Land wrote:

On the disk speed issue I have one data point. I upgraded my 1GHz PIII-M
laptop drive from a slow 4200 RPM to the fastest 7200 RPM available (for
laptops) and my Nios system build went from about 16 min. to about 15 min.
Not worth the pain and expense of swapping the drive.

Not in a low-spec machine like that, no. The options in a laptop are
limited, and there's no way to increase the disk controller bandwidth.
But the effect on a powerful workstation of installing a RAID with a
high-bandwidth controller and drives such as U-320 SCSI can have a
dramatic impact. As always though, it depends on the application.

On memory, I upgraded the memory in my 3.2 GHz P4 from 512MB to 1GB and
there was no noticeable difference until I set the memory from 333MHz to
400MHz dual channel. Then my system build went from 5 min. to 4 min. - 20%.

That doesn't mean a lot. You only need to add more memory if you're
running out of it ;o)

--
Max

Hal Murray · Mar 4, 2004

This has been implemented in some ASIC tools already. Actually Xilinx
has been doing some very simple parallel processing in ISE (on Solaris
and now Linux) for a long time. Multiple iterations of "par" can run
in parallel on multiple hosts, then you pick the best result. This is,
of course, extremely coarse-grained compared to what I indicated
above.

Last I checked, multiple PAR runs didn't gain much if you
had a well floor-planned system. That was a long time ago.
Has anything changed?

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Paul Leventis (at home) · Mar 4, 2004

Hi Rick,

We can all speculate about the relative merits of processor
enhancements, but these machines are very complex and the only real way
to tell what helps is to try it. Since we are not all ancient Greeks
philosophizing in our armchairs, it would be a good idea to pick a
design and to run it on a few different workstations, hopefully
including an AMD64.

I agree. That's why my original posting makes reference to some SPEC
results showing that 64-bit code on Athlon64 is ~5% slower than the same
programs compiled in 32-bit code. One specific SPEC sub-component is a tool
called VPR, which is an academic place & route tool for FPGAs. It shows an
8% slow-down. While by no means comprehensive, I think this gives an idea
of how much speed to expect out of 64-bit vs. 32-bit code, at least for now.

I've forwarded your comments on how nice it would be to see some results for
different system configurations on to the relevant groups in Altera. My
personal experience (going from PII to PIII to P4) has been that SPEC2000 is
a pretty good proxy for Quartus performance, especially for place & route
limited designs.

Regards,

Paul

Paul Leventis (at home) · Mar 4, 2004

Hi Max,

Is there any possibility of making Quartus multi-threaded? That
strikes me as the most likely way to get a dramatic performance
increase, though I know it's not always easy to achieve with heuristic
apps.

It's hard to use fine-grained parallelism on place-and-route tools like
Quartus. This doesn't mean that people (academia, industry) haven't tried
and aren't still trying, but I wouldn't hold my breath. See my previous
posting on this topic:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=Paul+Leventis+multi-threaded

Of course, coarse-grained parallelism (running multiple place-and-route runs
on multiple machines) is much easier. Quartus II ships with a pretty cool
tool called Design Space Explorer. This tool tries out a whole bunch of
Quartus settings and random seeds on your design in order to find the
settings that optimize performance. This requires multiple runs of Quartus.
DSE is capable of farming these runs off to multiple CPUs/computers through
LSF or a built-in distributed computing engine.
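
The coarse-grained approach boils down to "run the same job with different seeds, keep the best result". A minimal sketch of that pattern follows. It does not use DSE or call Quartus: `run_fit` is a hypothetical stand-in that fakes a repeatable Fmax from the seed, and a local thread pool stands in for the farm of machines. A real script would launch the tool in `run_fit` and parse the timing report instead.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_fit(seed):
    """Hypothetical stand-in for one independent place-and-route run."""
    rng = random.Random(seed)  # same seed always gives the same result
    return {"seed": seed, "fmax_mhz": round(100 + rng.uniform(-10, 10), 2)}

def sweep(seeds, workers=4):
    """Run every seed independently and return (best_run, all_runs)."""
    # The runs share nothing, so no communication is needed between
    # workers; only the final results are compared.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(run_fit, seeds))
    best = max(runs, key=lambda run: run["fmax_mhz"])
    return best, runs

best, all_runs = sweep(range(1, 9))
print("best seed:", best["seed"], "fmax:", best["fmax_mhz"])
```

Because the runs are independent, the wall-clock time is roughly that of one run when there are as many workers as seeds, but no single run gets any faster.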

To find out more about DSE and coarse-grained parallelism, please see the
section entitled "DSE Advanced Information" in
http://www.altera.com/literature/hb/qts/qts_qii52008.pdf

Regards,

Paul Leventis
Altera Corp.

rickman · Mar 4, 2004

"Paul Leventis (at home)" wrote:

I agree. That's why my original posting makes reference to some SPEC
results showing that 64-bit code on Athlon64 is ~5% slower than the same
programs compiled in 32-bit code. One specific SPEC sub-component is a tool
called VPR, which is an academic place & route tool for FPGAs. It shows an
8% slow-down. While by no means comprehensive, I think this gives an idea
of how much speed to expect out of 64-bit vs. 32-bit code, at least for now.

I've forwarded your comments on how nice it would be to see some results for
different system configurations on to the relevant groups in Altera. My
personal experience (going from PII to PIII to P4) has been that SPEC2000 is
a pretty good proxy for Quartus performance, especially for place & route
limited designs.

That is very interesting information. I was not aware that AMD 64-bit
code was running slower than 32-bit code. I am sure that you won't see
much of that on the AMD web site. I may check in the PC building
newsgroups to see what results they are finding. They seem to be a
bunch that get to the skinny of things like this.

--

Rick "rickman" Collins


Petter Gustad · Mar 4, 2004

hmurray@suespammers.org (Hal Murray) writes:

Last I checked, multiple PAR runs didn't gain much if you
had a well floor-planned system. That was a long time ago.

True. If you don't have a highly congested design with a high degree of
utilization you will probably not gain that much.

My point was that this was an example of a *very simple* parallelism
done by Xilinx. It would be more optimal (and much more difficult) to
make a parallel version of a single iteration of "par".
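
For a sense of how hard that is, Amdahl's law gives the fraction of a single run that would have to parallelise perfectly to reach the 8x speedup on 16 nodes asked for earlier in the thread. This is a back-of-the-envelope sketch that assumes zero communication overhead, which a real cluster would not achieve; the function names are just for illustration.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup when parallel_fraction of the work is spread over nodes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nodes)

def required_parallel_fraction(target_speedup, nodes):
    """Solve 1 / ((1 - p) + p / n) = s for p."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / nodes)

p = required_parallel_fraction(8, 16)
print(f"parallel fraction needed: {p:.4f}")              # 0.9333
print(f"check: speedup = {amdahl_speedup(p, 16):.2f}")   # 8.00
```

So about 93% of the run time would need to be perfectly parallel, leaving under 7% for everything serial, before any interconnect cost is counted.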

Petter

Max · Mar 4, 2004

On 04 Mar 2004 00:30:13 +0100, Petter Gustad wrote:

The hyperthreaded Xeons run as two processors, so a quad Xeon board
appears to a HT-aware OS as an 8-CPU system.

Then would you also call a system with a single HyperThreading P4 a
dual-processor system? That would be a little "unfair" when
comparing to a full dual-core CPU like the rumored UltraSparc-IV.

Windows XP sees my dual-Xeon workstation as a quad CPU machine, so it
can schedule four separate simultaneous threads. If it's behaving as a
quad-processor, then I'm not sure what else I should call it.

Why pay for all the extra high-end hardware in a top-end server if you
don't need it? When I was last looking at building systems like this,

My point was that you usually get lots of extra high-end hardware when
you buy large SMP systems, especially when you need to go beyond
4-way. Also, it's usually cheaper to get 4x4GB RAM rather than 16GB
RAM for a single motherboard (unless you have a large enough number
of DIMM slots).

I agree, it's difficult to buy a ready-made high-end system that
doesn't have redundant PSUs, hot-swap RAID etc. This is why I haven't
bought off the shelf for over 5 years now, but buy the components I
actually want and assemble it myself.

about 18 months or so ago, a quad-Xeon mobo from Supermicro was
<$2000, and the processors were around $450 apiece.

This is pretty good; I was not aware of the low cost of the Supermicro
motherboard. You would end up at close to $4000, i.e. in the same
ballpark as buying 4 P4 systems. So if the application performed better
on the SMP than on the cluster I would definitely go with the SMP.

Bare high-end mobos are cheaper than most people think. At the time, I
paid around $850 for a Supermicro P4DC6+ dual Xeon board, but I
haven't looked at current prices. There is a hike when you want more
than two physical processors, though - presumably due to low demand
and less competition. The P4 Xeons are hugely cheaper than the PIII
versions for some reason.

If you're an AMD fan, then Tyan make nice multi-CPU boards at sensible
prices.

--
Max

Petter Gustad · Mar 4, 2004

Max <mtj2@btopenworld.com> writes:

Windows XP sees my dual-Xeon workstation as a quad CPU machine, so

I hardly ever use Windows so I haven't had a chance to observe this. I
don't have a HT system at hand now, but what does

grep ^processor /proc/cpuinfo

return on a Linux based HT system?

If you're an AMD fan, then Tyan make nice multi-CPU boards at
sensible prices.

We have a small cluster of Quad Opterons at work. They give superb
performance when I run Synopsys Design Compiler and similar tools.
Unfortunately I can't run Quartus II (3.0) on these as I have
mentioned earlier.

Petter


Max · Mar 5, 2004

On 04 Mar 2004 22:48:21 +0100, Petter Gustad wrote:

I hardly ever use Windows so I haven't had a chance to observe this. I
don't have a HT system at hand now, but what does

grep ^processor /proc/cpuinfo

return on a Linux based HT system?

Sorry, don't know from personal experience. I don't use Linux much,
and when I do, I run it under VMware, which emulates a uniprocessor.

I've heard from other users that Linux understands HT, but I don't
know any more than that, really. I daresay the folks in some of the
hardware groups could say more - try
alt.comp.periphs.mainboard.supermicro

--
Max

Petter Gustad · Mar 5, 2004

Max <mtj2@btopenworld.com> writes:

On 04 Mar 2004 22:48:21 +0100, Petter Gustad wrote:

I hardly ever use Windows so I haven't had a chance to observe this. I
don't have a HT system at hand now, but what does

grep ^processor /proc/cpuinfo

return on a Linux based HT system?

Sorry, don't know from personal experience. I don't use Linux much,
and when I do, I run it under VMware, which emulates a uniprocessor.

I got the answer from a local Linux group. It appears as 4 processors:

$ grep ^processor /proc/cpuinfo
processor : 0
processor : 1
processor : 2
processor : 3

So it will be difficult to distinguish between two physical CPU
packages, dual-core and HT...

Petter


Marius Vollmer · Mar 5, 2004

Petter Gustad <newsmailcomp5@gustad.com> writes:

So it will be difficult to distinguish between two physical CPU
packages, dual-core and HT...

I think you can look at the "flags":

$ grep ^flags /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

This is on a uni-processor Xeon box. The "ht" flag might hint at
hyperthreading, but I'm not sure...
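
The "ht" flag alone only says the CPU supports the feature. On x86 Linux kernels that expose them, the "physical id" and "core id" fields in /proc/cpuinfo are a more direct way to tell packages, cores and logical processors apart. The sketch below assumes those fields are present (older kernels may omit them, in which case it reports None); the function name and the sample text are made up for illustration.

```python
def summarize_cpuinfo(text):
    """Count logical CPUs, physical packages and cores in /proc/cpuinfo text."""
    logical = 0
    packages = set()
    cores = set()
    current = {}

    def flush():
        # Called at the end of each blank-line-separated processor entry.
        nonlocal logical
        if "processor" in current:
            logical += 1
            package = current.get("physical id")
            if package is not None:
                packages.add(package)
                # Assume a single core per package if "core id" is absent.
                cores.add((package, current.get("core id", "0")))
        current.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    flush()
    return {"logical": logical,
            "packages": len(packages) or None,
            "cores": len(cores) or None}

# Made-up sample: two single-core packages, each with two HT siblings.
SAMPLE = """processor : 0
physical id : 0

processor : 1
physical id : 0

processor : 2
physical id : 3

processor : 3
physical id : 3
"""
print(summarize_cpuinfo(SAMPLE))
```

On a real machine the input would be `open("/proc/cpuinfo").read()`. More logical processors than cores suggests HyperThreading is active; more cores than packages indicates a multi-core part.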
