True multicore simulators, or just marketing gimmicks?

Jason Zheng wrote:
Hi,

I have read in several EDA trade magazines that Synopsys, Mentor, and
Cadence are offering multicore-capable EDA tools. I have asked before
on this forum when true multi-threaded simulation engines would be
available, and the answer back then was that it would be very
difficult to do. Have the engineers at the big 3 already figured out
how to do multi-threaded simulation, or are those claims about being
multicore-capable really just marketing gimmicks?
Anyone?

~Jason Zheng
 
In article <20091006133205.7b370591@starslayer.jpl.nasa.gov>,
Jason Zheng <Xin.Zheng@jpl.nasa.gov> wrote:
I have read in several EDA trade magazines that Synopsys, Mentor, and
Cadence are offering multicore-capable EDA tools. I have asked before
on this forum when true multi-threaded simulation engines would be
available, and the answer back then was that it would be very
difficult to do. Have the engineers at the big 3 already figured out
how to do multi-threaded simulation, or are those claims about being
multicore-capable really just marketing gimmicks?
Anyone?
Many of the back-end tools have been multi-threaded for a little while -
I know the static timing tools are, probably the Place-and-Route
tools as well. Some of the FPGA tools are starting to support
it in the back end too.

I haven't heard anything yet about simulators, however. Did the references
you read specifically call out simulation?

--Mark
 
In article <20091006133205.7b370591@starslayer.jpl.nasa.gov>,
Jason Zheng <Xin.Zheng@jpl.nasa.gov> wrote:

I have read in several EDA trade magazines that Synopsys, Mentor, and
Cadence are offering multicore-capable EDA tools. I have asked before
on this forum when true multi-threaded simulation engines would be
available, and the answer back then was that it would be very
difficult to do. Have the engineers at the big 3 already figured out
how to do multi-threaded simulation, or are those claims about being
multicore-capable really just marketing gimmicks?
Anyone?

Many of the back-end tools have been multi-threaded for a little
while - I know the static timing tools are, probably the
Place-and-Route tools as well. Some of the FPGA tools are starting
to support it in the back end too.

I haven't heard anything yet about simulators, however. Did the
references you read specifically call out simulation?

Synopsys VCS has a vague reference to multicore support for
"Application-Level Parallelism". I take that to mean they can run the
simulator, code coverage, and assertions in parallel (duh). A Cadence
blog mentions that their Incisive has "first phase multicore support,"
which is again very vague:

http://www.synopsys.com/TOOLS/VERIFICATION/FUNCTIONALVERIFICATION/Pages/VCSMulticore-faq.aspx
http://www.synopsys.com/Tools/Verification/FunctionalVerification/Documents/vcs-ds.pdf

http://www.cadence.com/Community/blogs/fv/archive/2009/09/30/the-power-of-parallel-thinking-multi-core-eda.aspx

~Jason Zheng
 
On Wed, 7 Oct 2009 07:54:41 -0700 (PDT)
Chris Briggs <chris@engim.com> wrote:

Want to know if it's real? Contact your simulator vendor and ask for
an eval license. If they can't or won't let you try it, and on your
design, it's not real yet. I've tried it on VCS and it's real. Haven't
tried the others.

-cb
Chris, can you elaborate on why you think VCS has real multicore support?
Can you run a single Verilog/VHDL testbench and see total CPU utilization
go above 100% (i.e., more than one core busy)? How does it scale as more
cores become available?

~Zheng
 
On Oct 7, 11:01 am, Jason Zheng <Xin.Zh...@jpl.nasa.gov> wrote:
Chris, can you elaborate on why you think VCS has real multicore support?
Can you run a single Verilog/VHDL testbench and see total CPU utilization
go above 100% (i.e., more than one core busy)? How does it scale as more
cores become available?

~Zheng
I don't have access to it right now (I'm at a different job), and I
was using it under NDA, plus I tried it a while ago (late alpha/early
beta), so I'm not sure how much I can elaborate. Based on what they've
published, I can say the design-level parallelism is where you can get
the most gain and fully load multiple cores. The application-level
parallelism was easier to enable but isn't likely to gain you as much;
think about how much of a performance hit you take for enabling waves
or coverage. How many cores you can use effectively will depend on
your hardware (obviously) and how well your design can be partitioned.

Talk to Synopsys for more info.

-cb
 
hairyotter <marco.bramb@gmail.com> writes:

We have one testcase in which on an 8-core machine we got a 7x
speedup.
Impressive.

It was a verilog testbench and a gatelevel netlist.
Do you have any numbers for behavioural simulations?

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
 
Hi Zheng.
It actually does.

We have one testcase in which on an 8-core machine we got a 7x
speedup.
It was a verilog testbench and a gatelevel netlist.

However, even though that testcase was real and ours, not SNPS's, it
was a perfect scenario for the way VCS parallelizes the simulation
(which I'm not at liberty to discuss here). In a more conventional
case, you'll see lower speedups.

The VCS solution is not yet perfect and/or completely automated, but
it's there for real. I'll be buying licenses for my team for 2010.

Marco
 
Hi Marco,

Is this the "Design-Level Parallelism" that Chris was referring to? What
was the 7x speed-up in comparison to? I'm assuming that this is a single
testbench that involves a single UUT. Did the simulation keep all 8
cores busy? What do you mean by not completely automated?

If you are not at liberty to discuss in public, would you be able to
e-mail me directly?

thanks,

~Zheng

 
On Tue, 06 Oct 2009 13:32:05 -0700, Jason Zheng wrote:

I have read in several EDA trade magazines that Synopsys, Mentor, and
Cadence are offering multicore-capable EDA tools. I have asked before
on this forum when true multi-threaded simulation engines would be
available, and the answer back then was that it would be very
difficult to do. Have the engineers at the big 3 already figured out
how to do multi-threaded simulation, or are those claims about being
multicore-capable really just marketing gimmicks?
Anyone?
Hi Jason,

As others note, Amdahl's law may limit the benefit from multicore for any
one simulation run.
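
To put rough numbers on that, here is a quick back-of-the-envelope
sketch (mine, not Jeremy's; the figures are illustrative only):

  #include <cstdio>

  // Amdahl's law: if a fraction p of the runtime parallelizes
  // perfectly over n cores, speedup = 1 / ((1 - p) + p / n).
  static double amdahl(double p, int n) {
      return 1.0 / ((1.0 - p) + p / n);
  }

  int main() {
      // Even if 80% of a simulation parallelizes perfectly,
      // 8 cores buy only ~3.3x, and 100 cores only ~4.8x.
      std::printf("8 cores:   %.2fx\n", amdahl(0.80, 8));
      std::printf("100 cores: %.2fx\n", amdahl(0.80, 100));
      return 0;
  }

The serial fraction dominates quickly, which is why per-run gains
flatten out.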

The other way to benefit from any parallelism is to run more than one
simulation at once. For verification regression, you typically want to
run hundreds or thousands of tests, and this wins. It works just as well
on multicore, or on processor farms.

This can get pricey on licensing, which is why advocates of this approach
tend to use open source tools like Verilator. Rich Porter, CTO of Art of
Silicon, recently spoke on their experience of this approach at the
National Microelectronics Institute in Bristol. They use a farm of
several hundred cores running Linux, a standard batch scheduler and
Verilator, so that complete system regression becomes a routine activity
completed in a very short time.
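
As a minimal sketch of that farm-style flow (the test names and the
simulator command are invented; a real setup would drive a batch
scheduler rather than local threads):

  #include <cstdlib>
  #include <string>
  #include <thread>
  #include <vector>

  int main() {
      // Hypothetical regression list; in practice hundreds or
      // thousands of tests fed to a scheduler across the farm.
      const std::vector<std::string> tests = {"smoke", "dma", "uart", "pcie"};

      std::vector<std::thread> jobs;
      for (const auto& t : tests) {
          jobs.emplace_back([t] {
              // One independent simulator process per test; the OS
              // spreads them across cores just as a scheduler spreads
              // them across machines. "./obj_dir/Vtop" is Verilator's
              // default output binary; "+test=" is an invented plusarg.
              const std::string cmd =
                  "./obj_dir/Vtop +test=" + t + " > " + t + ".log 2>&1";
              std::system(cmd.c_str());
          });
      }
      for (auto& j : jobs) j.join();   // wait for the whole regression
      return 0;
  }

Because each test is a separate process, this scales with zero
simulator-side threading work.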

HTH,


Jeremy
 
On Tue, 13 Oct 2009 04:16:00 -0500
Jeremy Bennett <jeremy.bennett@embecosm.com> wrote:

This can get pricey on licensing, which is why advocates of this
approach tend to use open source tools like Verilator. Rich Porter,
CTO of Art of Silicon, recently spoke on their experience of this
approach at the National Microelectronics Institute in Bristol. They
use a farm of several hundred cores running Linux, a standard batch
scheduler and Verilator, so that complete system regression becomes a
routine activity completed in a very short time.
That's a very good idea. I was planning on doing something similar with
Icarus Verilog but was concerned about scaling to large designs. In
my past experience, Icarus Verilog can be very fast with small
designs, almost comparable to NC-Verilog. But as the design becomes
larger, the speed advantage of commercial simulators quickly becomes
overwhelming. Have you any experience with Verilator's scalability
limits?

thanks,

~Zheng
 
Jason Zheng wrote:
On Tue, 13 Oct 2009 04:16:00 -0500
Jeremy Bennett <jeremy.bennett@embecosm.com> wrote:


That's a very good idea. I was planning on doing something similar with
Icarus Verilog but was concerned about scaling to large designs. [...]

cver? I use it for my gate-level simulations; it's pretty good.

Andy
 
On Tue, 13 Oct 2009 19:45:12 +0100
Andy Botterill <andy@plymouth2.demon.co.uk> wrote:

Jason Zheng wrote:
On Tue, 13 Oct 2009 04:16:00 -0500
Jeremy Bennett <jeremy.bennett@embecosm.com> wrote:


That's a very good idea. I was planning on doing something similar
with Icarus Verilog but was concerned about scaling to large
designs. [...]

cver? I use it for my gate-level simulations; it's pretty good. Andy

In my experience, cver also has scaling issues compared to commercial
simulators like NC-Verilog.
 
On Tue, 6 Oct 2009 13:32:05 -0700, Jason Zheng <Xin.Zheng@jpl.nasa.gov> wrote:
Hi,

I have read in several EDA trade magazines that Synopsys, Mentor, and
Cadence are offering multicore-capable EDA tools. I have asked before
on this forum when true multi-threaded simulation engines would be
available, and the answer back then was that it would be very
difficult to do. Have the engineers at the big 3 already figured out
how to do multi-threaded simulation, or are those claims about being
multicore-capable really just marketing gimmicks?
Most if not all EDA tools are written in C or C++. Neither of these
languages has any notion of threads built in. This means relying on
libraries (pthreads or Win32 threads). Writing new multithreaded apps,
whilst not easy, is not that hard. Converting old apps to be
multithreaded is more of a headache: making code threadsafe usually
means significant re-writing. I doubt that anyone needs to do any
"figuring-out", unless it's the folks trying to find ways of
auto-parallelizing code. But those are the compiler vendors, not the
EDA developers.
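
To illustrate the retrofit headache with an invented fragment (modern
C++ threading shown for brevity; a 2009 codebase would use pthreads):
a global event queue that was fine single-threaded needs a lock on
every access once evaluation runs in parallel, and that lock then
serializes the hot path.

  #include <mutex>
  #include <queue>

  struct Event { long long time; int net; };

  std::queue<Event> g_events;      // legacy global state, safe single-threaded
  std::mutex        g_events_mtx;  // retrofit: every access must now lock

  void post_event(const Event& e) {
      std::lock_guard<std::mutex> lock(g_events_mtx);
      g_events.push(e);            // correct now, but the lock serializes
  }                                // the simulator's hottest code path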

For the speedup that multicore can bring, check out Amdahl's law.

A bientot
Paul
[not speaking for Mentor Graphics]
--
Paul Floyd http://paulf.free.fr
 
In reply to both Petter and Jason

Sorry for the delay in answering.

Petter: no, I do not have data on behavioural simulations. It was a
temp license and we only ran a trial on the testcase that was most
pressing for us. My team does ASICs, so we get the netlist from the
customer.

However, and here I'm responding to Jason as well, the tool is
implemented in a way that sort of "splits" your simulation into
smaller, parallel ones. This means that the fewer communications
between blocks, the better the sim speed will be.

This is independent of RTL vs. gate level; it's just how the
partitioning works.
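
A toy model of why communication matters (entirely mine, not how VCS
actually works): say the per-timestep work splits evenly over n
partitions, but each timestep also pays a fixed cross-partition
synchronization cost c, expressed as a fraction of the
single-partition runtime.

  #include <cstdio>

  // speedup(n, c) = 1 / (1/n + c): the parallel term shrinks with
  // more partitions, the communication term does not.
  static double partition_speedup(int n, double c) {
      return 1.0 / (1.0 / n + c);
  }

  int main() {
      // A well-partitioned netlist (c = 0.02) approaches the 7x
      // reported above on 8 cores; a chatty design (c = 0.10) doesn't.
      std::printf("low comm:  %.1fx\n", partition_speedup(8, 0.02));
      std::printf("high comm: %.1fx\n", partition_speedup(8, 0.10));
      return 0;
  }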

The partitioning is also the portion that is not automated: you have
to tell the simulator how many partitions to create and at which
levels of hierarchy to split.

The tool has a way of helping you partition, but only after you run
the simulation once.

Going back to other posts: running regressions in parallel is a very
good idea and is being done. However, sometimes you have SINGLE
simulations that take a week to complete. The problem is not when
you're running the regression, it's when you're debugging the
testbench or code and you find a mistake or bug towards the end of the
simulation. If you could speed the sim up by 7x, you'd be fixing bugs
much more quickly.

Another case I can think of is the validation of ATPG patterns. You
have to simulate those, and we're talking 30-50 GB of RAM needed for
the simulation. On a 128 GB machine you can run 4 of those, and the
remaining CPUs are just idling. It would be better if they could be
used to speed things up, given how much you paid for them.

Ciao, Marco.
 
On Tue, 13 Oct 2009 06:53:59 -0700, Jason Zheng wrote:

That's a very good idea. I was planning on doing something similar with
Icarus Verilog but was concerned about scaling to large designs. In my
past experience, Icarus Verilog can be very fast with small designs,
almost comparable to NC-Verilog. But as the design becomes larger, the
speed advantage of commercial simulators quickly becomes overwhelming.
Have you any experience with Verilator's scalability limits?
Hi Zheng,

I've never put anything really big through Verilator, but it runs the
OpenRISC Reference SoC (about 150k gates of logic) at 130kHz on my
elderly PC. Rich Porter at Art of Silicon reckons Verilator is
consistently quicker than the best commercial simulators.

It's important to remember that Verilator is *not* an event-driven
simulator - it compiles synthesizable Verilog to cycle-accurate,
2-state, zero-delay C++/SystemC. That has many benefits besides speed
(particularly its good linting), but it does mean it can't show you
anything about 4-state behavior or intra-cycle timing.
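
For anyone who hasn't used it, a minimal Verilator harness looks
roughly like this (assuming a module named "top" with a single clk
input, built with "verilator --cc top.v --exe sim_main.cpp" and then
make in obj_dir):

  // sim_main.cpp - drives the C++ model Verilator generates from top.v
  #include "Vtop.h"
  #include "verilated.h"

  int main(int argc, char** argv) {
      Verilated::commandArgs(argc, argv);
      Vtop top;                       // the 2-state, zero-delay compiled model
      for (int cycle = 0; cycle < 1000 && !Verilated::gotFinish(); ++cycle) {
          top.clk = 0; top.eval();    // settle combinational logic
          top.clk = 1; top.eval();    // rising edge: clocked logic updates
      }
      top.final();                    // run final blocks, flush coverage
      return 0;
  }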

The same approach should work for Icarus Verilog. I haven't tried it on
anything really big, but the same OpenRISC example ran at 1.4kHz. From
past experience, I reckon that is probably 10-50 times slower than the
best commercial alternatives. However for the price of a single
commercial simulator license you could buy a rack with 100+ cores running
Linux, so you would still be better off.

Best wishes,


Jeremy
 
The same approach should work for Icarus Verilog. I haven't tried it
on anything really big, but the same OpenRISC example ran at 1.4kHz.
From past experience, I reckon that is probably 10-50 times slower
than the best commercial alternatives. However for the price of a
single commercial simulator license you could buy a rack with 100+
cores running Linux, so you would still be better off.
I'm thinking along the same lines. The problem is to speed up a single
testbench in simulators like Icarus or cver using multiple cores, such
that debugging does not take days. Synopsys's approach could be
something worth trying on free simulators, given an easy-to-use
framework to partition designs.

~Zheng
 

Jason Zheng wrote:
The same approach should work for Icarus Verilog. I haven't tried it
on anything really big, but the same OpenRISC example ran at 1.4KHz.
From past experience, I reckon that is probably 10-50 times slower
than the best commercial alternatives. However for the price of a
single commercial simulator license you could buy a rack with 100+
cores running Linux, so you would still be better off.

I'm thinking along the same lines. The problem is to speed up a single
testbench in simulators like Icarus or cver using multiple cores, such
that debugging does not take days. Synopsys's approach could be
something worth trying on free simulators, given an easy-to-use
framework to partition designs.
It would also be helpful to simply push the runtime of Icarus Verilog
down. I'm aware of the relative performance issues, and I'm motivated
to work on them. I've also every once in a while pondered possible
ways to make the run time more multi-core friendly, but I haven't
stumbled on the right magic cookie.

What I have dabbled with is breaking simulations up at higher levels
of abstraction, i.e. the board level, so that at least the DUT and
the test bench can be run in separate processes. I have a "simbus"
package that helps there. Here's some documentation:

<http://iverilog.wikia.com/wiki/SIMBUS>
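
The flavor of that split, reduced to a toy (my own sketch, not the
SIMBUS protocol): the testbench and the DUT live in separate OS
processes and exchange pin values over a socket each cycle.

  #include <cstdio>
  #include <sys/socket.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main() {
      int sv[2];
      socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   // bidirectional link

      if (fork() == 0) {              // child process plays the "DUT"
          close(sv[0]);
          unsigned char in, out;
          while (read(sv[1], &in, 1) == 1) {
              out = ~in;              // stand-in for real model evaluation
              write(sv[1], &out, 1);
          }
          _exit(0);
      }

      close(sv[1]);                   // parent process plays the testbench
      for (unsigned char v = 0; v < 4; ++v) {
          unsigned char r;
          write(sv[0], &v, 1);        // drive stimulus
          read(sv[0], &r, 1);         // sample response
          std::printf("drove %u, got %u\n", (unsigned)v, (unsigned)r);
      }
      close(sv[0]);                   // EOF lets the DUT process exit
      wait(nullptr);
      return 0;
  }

The two processes can then be scheduled on different cores (or
different machines) with no shared-memory locking at all.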

--
Steve Williams "The woods are lovely, dark and deep.
steve at icarus.com But I have promises to keep,
http://www.icarus.com and lines to code before I sleep,
http://www.picturel.com And lines to code before I sleep."
 
On Oct 13, 1:56 pm, Jason Zheng <Xin.Zh...@jpl.nasa.gov> wrote:
On Tue, 13 Oct 2009 19:45:12 +0100

Andy Botterill <a...@plymouth2.demon.co.uk> wrote:
Jason Zheng wrote:
On Tue, 13 Oct 2009 04:16:00 -0500
Jeremy Bennett <jeremy.benn...@embecosm.com> wrote:

That's a very good idea. I was planning on doing something similar
with Icarus Verilog but was concerned about scaling to large
designs. [...]

cver? I use it for my gate-level simulations; it's pretty good. Andy

In my experience, cver also has scaling issues compared to commercial
simulators like NC-Verilog.
I want to comment on Pragmatic C's development of our new commercial
CVC compiled simulator, and on what I think is the most obvious
algorithm for multi-core Verilog simulators.

We no longer support open source Cver. Instead, we have put our effort
into developing a commercial Verilog compiler called CVC. We think it
is about the same speed as VCS, but priced about the same as Modelsim.
CVC supports full P1364-2005, including generate and the new
multi-dimensional arrays and wires. CVC compiles to X86 machine code
and also has an old XL-style interpreter interface for debugging.

See the http://www.veripool.org/wiki/veripool/Verilog_Simulator_Benchmarks
page for one data point. Obviously Verilator will be faster since it
uses cycle-based simulation, but it is interesting that NC comes close.
I think NC is so fast on the published Verilator benchmark because NC
collapses ports, while VCS and CVC use non-strength-reducing continuous
assignments for ports. We think port collapsing is usually not a good
idea, but it works for this benchmark.

We think the comment that even old Cver does not scale to large
designs is not true - although if "not scaling" meant that CVC's
interpreted simulation was slow, then the comment is correct. Compiled
CVC can run a 38 million gate, reasonably realistic, SDF-annotated
simulation with accurate path delays and timing checks in about 8 GB.

CVC does not support multi-core simulation because we are not
convinced that design-level parallel simulation helps much. First, the
event queue in Verilog simulators can't be parallelized (in general)
because the communication overhead is way too high. For example,
running the event queue on one processor and event evaluation on
another runs much slower than just running the simulation on one
processor, because of frequent blocking on event expression
synchronization.

Multi-core licensing makes sense, i.e. buy one 4-core system and run 4
separate regression simulations without the need to buy 4 separate
systems. There may be a slight slowdown from memory contention, but
the CPU vendors are constantly improving that.

I think the most obvious design-level parallel simulation approach is
to turn a design into what are effectively multiple old Valid Logic
Swift (now called SmartModel) style components. Communication at the
port level between large sections of a design can then use the VPI
value-change callbacks and the vpi_put_value/vpi_get_value mechanism
(or some other internal method). Since design-section port
communication rates are normally low, this can work. Automatic
partitioning can be a problem, but it can be solved with VCS-style tab
files. The user can either hand-code the partitioning instruction tab
files, or a tool can be written to create template files that the user
can then hand-edit if needed.
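
A hedged sketch of that mechanism using the standard VPI calls (the
helper is mine, not CVC's; port-handle lookup, error handling, and
startup registration are omitted):

  #include "vpi_user.h"

  // Forward a partition output to the matching input on the other side.
  static PLI_INT32 forward_value(p_cb_data cb) {
      vpiHandle dst = (vpiHandle)cb->user_data;     // destination port
      s_vpi_value val;
      val.format = vpiVectorVal;
      vpi_get_value(cb->obj, &val);                 // read the changed port
      vpi_put_value(dst, &val, nullptr, vpiNoDelay);
      return 0;
  }

  // Register a value-change callback on src that drives dst.
  static void link_ports(vpiHandle src, vpiHandle dst) {
      s_vpi_time  tim; tim.type   = vpiSuppressTime;
      s_vpi_value val; val.format = vpiSuppressVal;
      s_cb_data   cb  = {};
      cb.reason    = cbValueChange;
      cb.cb_rtn    = forward_value;
      cb.obj       = src;
      cb.time      = &tim;
      cb.value     = &val;
      cb.user_data = (PLI_BYTE8*)dst;
      vpi_register_cb(&cb);
  }

With one such link per cross-partition port, callback traffic stays
proportional to the (normally low) port activity described above.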

The reason CVC does not implement this is that we do not see much
improvement over the approach of running separate regression
simulations for the various parts of a design, using (possibly
sampled) tester values for each part of the simulation. It is also
much harder to debug a design where there is some type of mechanism
(VPI?) between the design sections and across cores (even though the
cores are just running different processes).
 
On Tuesday, October 6, 2009 1:32:05 PM UTC-7, Jason Zheng wrote:
Hi,

I have read in several EDA trade magazines that Synopsys, Mentor, and
Cadence are offering multicore-capable EDA tools. I have asked before
on this forum when true multi-threaded simulation engines would be
available, and the answer back then was that it would be very
difficult to do. Have the engineers at the big 3 already figured out
how to do multi-threaded simulation, or are those claims about being
multicore-capable really just marketing gimmicks?
Anyone?

~Jason Zheng
Take a look at dynetix.com, and email founder twc@dynetix.com

RNB333/at/live.com
 
