question about the Cadence software

Martin Lefebvre

Greetings,

I have a question about the software solutions provided by Cadence. I work
with a company that uses the following pieces of software:

NC-Verilog
Assura DRC
Assura LVS
Assura RCX
Buildgates

as well as several Virtuoso tools.

My question is: can those tools be made to use both CPUs on SMP machines?
We have 10 dual-Opteron 1.6 GHz machines running SMP Linux, and the
second CPU sits idle during simulations...

Thanks
 
Martin,

Assura certainly has a multiprocessing option (read the manuals for more
details).

I don't think NC does. Buildgates has a way to distribute jobs (I think;
it's not really my area), so it could probably take advantage of two
processors (although that's not really any different from using two machines).

And of course you could always run two jobs at once ;-)

An application has to be written to take advantage of parallel processing -
and not all applications are really suitable for it. Some (such as some of the
routers, like wroute and nanoroute) are obvious candidates for parallelism.
Some (like spectre) can take advantage in certain situations (with spectre, the
model equation evaluation may be done in parallel on a multiprocessor machine).
Some tasks, however, are essentially serial.
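As a rough sketch of how that split works (illustrative Python only, with made-up device records - not anything from spectre itself): the per-device model evaluations are independent of each other, so a pool can evaluate them concurrently, while the matrix solve that uses the results stays serial.

```python
# Illustrative only: parallel model evaluation ahead of a serial solve.
from concurrent.futures import ThreadPoolExecutor

def eval_model(device):
    # stand-in for an expensive BSIM-style model equation evaluation
    v = device["v"]
    return device["name"], device["gain"] * v * v

def newton_step(devices, workers=2):
    # the per-device evaluations are independent, so they parallelize cleanly
    with ThreadPoolExecutor(max_workers=workers) as pool:
        currents = dict(pool.map(eval_model, devices))
    # ... the matrix solve that consumes these values remains serial ...
    return currents

devices = [{"name": "m1", "v": 1.0, "gain": 2.0},
           {"name": "m2", "v": 0.5, "gain": 4.0}]
print(newton_step(devices))  # {'m1': 2.0, 'm2': 1.0}
```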

Andrew.

On Mon, 10 May 2004 23:15:46 -0400, Martin Lefebvre
<dadexter@enterprise.diginex.net> wrote:

--
Andrew Beckett
Senior Technical Leader
Custom IC Solutions
Cadence Design Systems Ltd
 
I'd be surprised to find a true CAD algorithm that cannot be parallelized.
You can almost always create a parallel algorithm that divides the layout or
schematic into multiple pieces to be simulated, and then works on the parts
in parallel. The I/O often can't be parallelized, but the simulation can.
It's not easy, though, so it's likely that the Cadence developers would
rather spend the time adding features and fixing bugs than doing all the
work to make the more complex simulators/algorithms parallel. Do you have any
examples of simulations or algorithms that are essentially serial?

Frank

"Andrew Beckett" <andrewb@DELETETHISBITcadence.com> wrote in message
news:8sl0a0ddlsvlfh93sfc67bj3ls870c0p5t@4ax.com...
 
Frank,

From an academic point of view, you're right. From a practical point of view,
it's a question of whether the overhead of the communication between the
parallel pieces is worth it or not.

For example, circuit simulation is done (generally) by solving a set of
simultaneous nonlinear differential equations, which is done using iterative
matrix techniques. Clearly time is sequential, so you can't have different time
slots being evaluated in parallel. So that means breaking the matrix down into
smaller pieces and solving those in parallel. The difficulty is that in many
circuits there is some interaction across wherever you put the boundary, so to
solve that accurately you'd have to allow each independent matrix solution to
influence each other - which means you have to have several iterations whilst
each independently solved matrix solution takes into account its neighbours (put
simply). By the time you've done all this, it probably takes more effort than
you gain. Or at best is very complex to implement, and is not significantly
faster.
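A toy numerical illustration of that boundary iteration (pure Python, nothing from a real simulator): split a small diagonally dominant 4x4 system into two 2x2 blocks, let each "processor" solve only its own block using the neighbour's values from the previous sweep, and iterate until the blocks agree. The extra sweeps are exactly the communication cost described above.

```python
def solve2(a, b, c, d, r0, r1):
    # direct solve of the 2x2 system [[a, b], [c, d]] x = [r0, r1]
    det = a * d - b * c
    return ((d * r0 - b * r1) / det, (a * r1 - c * r0) / det)

# A x = b, split into blocks {x0, x1} and {x2, x3}; exact solution is all 1s.
A = [[4, 1, 1, 0],
     [1, 4, 0, 1],
     [1, 0, 4, 1],
     [0, 1, 1, 4]]
b = [6.0, 6.0, 6.0, 6.0]
x = [0.0] * 4
for sweep in range(30):
    # each "processor" sees only the neighbour's values from the last sweep;
    # the coupling terms force repeated sweeps before the answer settles
    r0 = b[0] - A[0][2] * x[2] - A[0][3] * x[3]
    r1 = b[1] - A[1][2] * x[2] - A[1][3] * x[3]
    r2 = b[2] - A[2][0] * x[0] - A[2][1] * x[1]
    r3 = b[3] - A[3][0] * x[0] - A[3][1] * x[1]
    n0, n1 = solve2(4, 1, 1, 4, r0, r1)
    n2, n3 = solve2(4, 1, 1, 4, r2, r3)
    x = [n0, n1, n2, n3]
print([round(v, 6) for v in x])  # converges to [1.0, 1.0, 1.0, 1.0]
```

With weak coupling this converges quickly; with strong coupling across the cut the iteration slows down or diverges, which is the practical objection above.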

This isn't to say it can't be done - usually by compromising accuracy (at least
a bit). So fastspice-type algorithms have more scope to be parallelized than
pure spice-type algorithms. In spectre we've chosen to parallelize the
evaluation of the circuit equations prior to each matrix solution - and with
complex models such as bsim3v3 and bsim4 there is some clear benefit in doing
this.

So it's not a question of whether Cadence developers would rather spend time
adding features and fixing bugs than developing parallel algorithms. Cadence HAVE
created some parallel/distributed algorithms for certain tools which clearly
benefit from it, but if it's a huge amount of work and you only gain 20%
speedup, then there are usually other things you can optimise and get a better
result quicker.

For example, Dracula has had a distributed mode for many, many years. This
allows you to parallelize a DRC job across the network. Assura has a couple of
multiprocessing modes - being hierarchical in nature, it can farm off different
cells to different processors, as well as independent layer-processing tasks to
different processors. As I mentioned, some of the routers have parallel options,
because each detailed route works within a global routing cell which has been
determined earlier, so the detailed routes are somewhat independent.

Clearly the focus in applying these algorithms will be on cases where jobs take
reasonable CPU time - but they've got to be cases where the cost of the
interactions between the parallel pieces doesn't outweigh the benefit of solving
smaller pieces of work at the same time.

Andrew.

On Tue, 11 May 2004 15:52:27 -0700, "gennari" <gennari@eecs.berkeley.edu> wrote:

I'd be surprised to find a true CAD algorithm that cannot be parallelized.
You can almost always create a parallel algorithm that divides the layout or
schematic into multiple pieces to be simulated, and then works on the parts
in parallel. The I/O often can't be parallelized, but the simulation can.
It's not easy though, so it's probably that the Cadence developers would
rather spend the time adding features and fixing bugs than spending all the
work making the more complex simulators/algorithms parallel. Do you have any
examples of simulations or algorithms that are essentially serial?

Frank

"Andrew Beckett" <andrewb@DELETETHISBITcadence.com> wrote in message
news:8sl0a0ddlsvlfh93sfc67bj3ls870c0p5t@4ax.com...
Martin,

Assura certainly has a multiprocessing option (read the manuals for more
details).

I don't think NC does. Buildgates has a means of being able to distribute
jobs
(I think, it's not really my area), and so that could probably take
advantage of
two processors (although it's not really any different than two machines).

And of course you could always run two jobs at once ;-)

An application has to be written to take account of parallel processing -
and
not all applications are really suitable for this. Some (such as some of
the
routers, like wroute and nanoroute) are obvious applications for
parallelism.
Some (like spectre) can take advantage for some situations (with spectre,
the
model equation evaluation may be done in parallel on multiprocessor
machine).
Some tasks however are essentially serial.

Andrew.

On Mon, 10 May 2004 23:15:46 -0400, Martin Lefebvre
dadexter@enterprise.diginex.net> wrote:

Greetings,

I have a question about the software solutions provided by Cadence. I
work
with a company that uses the following pieces of software:

NC-Verilog
Assura DRC
Assura LVS
Assura RCX
Buildgates

as well as several Virtuso tools

My question is: can those tools be made to use both CPU's on SMP
machines?
We have 10 dual Opteron 1.6G machines running Linux smp, and the
2nd CPU is idling during simulations...

Thanks

--
Andrew Beckett
Senior Technical Leader
Custom IC Solutions
Cadence Design Systems Ltd
--
Andrew Beckett
Senior Technical Leader
Custom IC Solutions
Cadence Design Systems Ltd
 
I agree you can't parallelize along the time scale. I believe there are ways
to use graph partitioning on the circuit so that the matrix can be
reordered, reducing the amount of communication due to connected circuit
components. I've seen a lecture on this, but I think I was asleep half the
time so I don't remember the details. You don't get perfect parallelism, but
you do get a decent speedup, especially on an SMP machine where the matrix
is in shared memory. It is very complex to implement though, and might not
fit into the current simulation database. I suspect it would require
rewriting at least part of the simulator.

I have been working on a pattern-matching system for searching through a full
chip layout. It used to run inside of Cadence, but I found that the Cadence
database was not optimal for the types of queries I was performing - I
needed something more special-cased. I eventually wrote my own tool, and just
recently I parallelized the code by spatially splitting the layout into many
small pieces. I got a speedup of 29X on 32 processors in a distributed
system. That's why this parallel processing stuff is in my head right now. I
think the same kind of parallel algorithm could be applied to things like
layout synthesis, place & route, DRC, LVS, etc. - anything that involves
processing the entire layout without global dependencies. Some OPC
algorithms work like this. So that's why I think any of these types of
algorithms can be parallelized. I don't think you can use the Cadence
database though. It seems like some more specialized database might be
necessary. It might not fit into the Cadence tools then, so maybe that's why
some of the Cadence products aren't parallel. Well, you probably know more
about the reasons than I do. I'm just trying to figure out how the CAD tools
work and why they are the way they are.
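The spatial split described above can be sketched minimally (hypothetical code, not the actual pattern matcher; a real run would farm tiles out to separate processes or machines rather than threads): cut the layout bounding box into a grid of tiles and map each tile's work over a worker pool.

```python
# Hypothetical tile-based parallelism for layout processing.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def tiles(bbox, nx, ny):
    # split the layout bounding box into an nx-by-ny grid of tiles
    x0, y0, x1, y1 = bbox
    dx, dy = (x1 - x0) / nx, (y1 - y0) / ny
    return [(x0 + i * dx, y0 + j * dy, x0 + (i + 1) * dx, y0 + (j + 1) * dy)
            for i, j in product(range(nx), range(ny))]

def work_on_tile(args):
    # stand-in for real pattern matching: count shape centres inside the tile
    (tx0, ty0, tx1, ty1), shapes = args
    return sum(1 for x, y in shapes if tx0 <= x < tx1 and ty0 <= y < ty1)

shapes = [(1, 1), (3, 1), (1, 3), (3, 3), (2, 2)]
grid = tiles((0, 0, 4, 4), 2, 2)
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(work_on_tile, [(t, shapes) for t in grid]))
print(sum(counts))  # 5 - each shape centre lands in exactly one tile
```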

Frank

"Andrew Beckett" <andrewb@DELETETHISBITcadence.com> wrote in message
news:cea3a09682c1vqj4lfh2fev9pfn1pu3r97@4ax.com...
 
"gennari" <gennari@eecs.berkeley.edu> wrote in message
news:c7shsn$sfa$1@agate.berkeley.edu...
> I agree you can't parallelize along the time scale.

Well, it can be done, since the real world seems to manage it, with c as the
speed limit making limiting cones.

> I believe there are ways to use graph partitioning on the circuit so that
> the matrix can be reordered, reducing the amount of communication due to
> connected circuit components.

Sure. Split up clock domains into their own simulators and only fret about
the FIFOs between them.

> I have been working on a pattern matcher system for searching through a
> full chip layout. It used to run inside of Cadence, but I found that the
> Cadence database was not optimal for the types of queries I was
> performing - I needed something special-cased.

There is only one thread possible with the CDBA/skill database
infrastructure.

OpenAccess 2.0 improved this somewhat by allowing a thread per cell view,
but if a skill evaluation took place anywhere in there then the whole thing
would have to be reduced back to a single thread. OpenAccess 2.2 is rumored
to allow more threads per cell view, but with the skill bottleneck one might
not get anywhere (unless, of course, skill is totally avoided in one's
database, but that's not likely in common databases).

> I eventually wrote my own tool, and just recently I parallelized the code
> by spatially splitting the layout into many small pieces. I got a speedup
> of 29X on 32 processors in a distributed system.

I suspect that speedup ignores the tedious linear translation of the data
into the many pieces and back again.

> I think the same kind of parallel algorithm could be applied to things
> like layout synthesis, place & route, DRC, LVS, etc. - anything that
> involves processing the entire layout without global dependencies. Some
> OPC algorithms work like this.

Still got to worry about boundary conditions between the pieces.
Non-trivial.

> So that's why I think any of these types of algorithms can be
> parallelized.

I agree. But it's still hard to do, and most programmers on this planet are
clueless about how to do threading or setting up patching conditions between
the parallelized pieces.

> I don't think you can use the Cadence database though.

Yep. Or at least not directly. Typically you translate the data into your
specialized data structures and then piece it back together again into the
database. Fork and fold.

> It seems like some more specialized database might be necessary.

Well, one could try OpenAccess 2.2 someday, but there's still that skill
issue in the way.

> It might not fit into the Cadence tools then, so maybe that's why some of
> the Cadence products aren't parallel. Well, you probably know more about
> the reasons than I do.

Heh. Reminds me of an old hardware boss of mine ruminating about software:
"Why speed up software when you can wait for faster hardware?"

spaller
 
"Spaller" <spaller@prodigy.take.this.out.net> wrote in message
news:pWEoc.47940$n%5.1568@newssvr29.news.prodigy.com...
> "gennari" <gennari@eecs.berkeley.edu> wrote in message
> news:c7shsn$sfa$1@agate.berkeley.edu...
>> I agree you can't parallelize along the time scale.
>
> Well, it can be done, since the real world seems to manage it, with c as
> the speed limit making limiting cones.

I don't understand this. Can you please explain? Limiting cones?


> There is only one thread possible with the CDBA/skill database
> infrastructure.

See, that's the problem. Back when SKILL was created, I guess no one thought
about multiple threads. It's probably not going to change any time soon. At
least you can create multiple processes in SKILL if you're able to run other
executables from the command line to do the work.


> I suspect that speedup ignores the tedious linear translation of the data
> into the many pieces and back again.
No, this includes the load from disk (zip files as well). That's why it's a
speedup of 29 and not 32. The I/O time can be fast compared to the
simulation/runtime, especially when the file is cached in memory. The
partitioning time is faster than the load time and can be pipelined with the
load. The file can also be preprocessed to improve loading by using a table
of contents (for GDSII). In some cases the load time is significant, and
that will limit performance. If the runtime is limited by disk I/O then it's
damn fast.
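The pipelining of partitioning with the load can be sketched like this (a hypothetical structure with stand-in load/process steps, not the real tool): a single loader thread prefetches the next chunk of the file while the main thread processes the current one.

```python
# Overlap I/O with computation: prefetch chunk i+1 while processing chunk i.
from concurrent.futures import ThreadPoolExecutor

def load(chunk_id):
    # stand-in for reading/parsing one slice of the layout file
    return list(range(chunk_id * 3, chunk_id * 3 + 3))

def process(data):
    # stand-in for the per-partition work
    return sum(data)

def pipelined(n_chunks):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load, 0)
        for i in range(n_chunks):
            data = pending.result()
            if i + 1 < n_chunks:
                pending = loader.submit(load, i + 1)  # prefetch next chunk
            total += process(data)                    # overlaps with the load
    return total

print(pipelined(4))  # same answer as a serial pass: sum(range(12)) = 66
```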


> Still got to worry about boundary conditions between the pieces.
> Non-trivial.
I use overlapping partitions but the method of dealing with boundary
conditions depends on the application. In some cases it's easy and in other
cases it's hard. It depends on whether the algorithm only needs to read data
from adjacent areas or if it has to modify the data as well.
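One way to picture the read-only case of those overlapping partitions (a sketch under assumed conventions, not the actual code): grow each tile by a halo wide enough to see neighbouring geometry, but let only the owner tile report a result, so the overlaps never double-count.

```python
# Halo/overlap bookkeeping for tiled layout processing (illustrative).
def with_halo(tile, halo, clip):
    # grow a tile by `halo` on every side, clipped to the full layout extent;
    # the worker *reads* this larger window so boundary checks have context
    tx0, ty0, tx1, ty1 = tile
    cx0, cy0, cx1, cy1 = clip
    return (max(tx0 - halo, cx0), max(ty0 - halo, cy0),
            min(tx1 + halo, cx1), min(ty1 + halo, cy1))

def owns(tile, point):
    # results are *reported* only by the tile owning the shape's centre,
    # so the overlapping read regions never produce duplicate results
    x, y = point
    tx0, ty0, tx1, ty1 = tile
    return tx0 <= x < tx1 and ty0 <= y < ty1

print(with_halo((2, 2, 4, 4), 0.5, (0, 0, 4, 4)))  # (1.5, 1.5, 4, 4)
```

The hard case Frank mentions, where workers must also modify data near the cut, needs an extra reconciliation pass that this read-only scheme avoids.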


> I agree. But it's still hard to do, and most programmers on this planet
> are clueless about how to do threading or setting up patching conditions
> between the parallelized pieces.
Yes, threads are especially difficult to port to some OSes (e.g. Windows).
There are many machine/architecture-specific bugs, and shared memory is
messy. Distributed communication is even harder in some sense.


> Yep. Or at least not directly. Typically you translate the data into your
> specialized data structures and then piece it back together again into
> the database. Fork and fold.

I wish Cadence GDSII streaming was faster. That would speed up the
translation.


> Well, one could try OpenAccess 2.2 someday, but there's still that skill
> issue in the way.
I've never used OpenAccess, but it sounds interesting to me.


> Heh. Reminds me of an old hardware boss of mine ruminating about
> software: "Why speed up software when you can wait for faster hardware?"
The great thing about CAD tools is that hardware does get faster - but the
designs you're working with also get larger by at least the same scaling
factor!


Frank


 
"gennari" <gennari@eecs.berkeley.edu> wrote in message
news:c81at2$2hrr$1@agate.berkeley.edu...
> "Spaller" <spaller@prodigy.take.this.out.net> wrote in message
> news:pWEoc.47940$n%5.1568@newssvr29.news.prodigy.com...
>> Well, it can be done, since the real world seems to manage it, with c as
>> the speed limit making limiting cones.
>
> I don't understand this. Can you please explain? Limiting cones?
Baryons are limited in their interactions as determined by the speed of
light. Hence, circuits, which are comprised of baryons, are constrained as
well by the same speed limit. In special relativity, these future
interactions are sometimes referred to as cones where the cones expand down
a timeline from a vertex at each baryon. So baryons can only interact once
their cones intersect (in that 4th dimension of time).

>> There is only one thread possible with the CDBA/skill database
>> infrastructure.
>
> See, that's the problem. Back when SKILL was created, I guess no one
> thought about multiple threads. It's probably not going to change any
> time soon. At least you can create multiple processes in SKILL if you're
> able to run other executables from the command line to do the work.
Yeah, you could get multiprocessing to work in skill, but it's ugly, and
slow, and probably coughs up blood on lock files. A mathematician who once
worked for me (he now works on molecular simulators) created a scheme
implementation (scheme being a relative of the lisp-like skill) that was
thread-safe, so it can be done with the proper motivation and talent.

>> I suspect that speedup ignores the tedious linear translation of the
>> data into the many pieces and back again.
>
> No, this includes the load from disk (zip files as well). That's why it's
> a speedup of 29 and not 32. The I/O time can be fast compared to the
> simulation/runtime, especially when the file is cached in memory. The
> partitioning time is faster than the load time and can be pipelined with
> the load. The file can also be preprocessed to improve loading by using a
> table of contents (for GDSII). In some cases the load time is
> significant, and that will limit performance. If the runtime is limited
> by disk I/O then it's damn fast.
That's better than what I would have normally expected. Good work.

> I use overlapping partitions, but the method of dealing with boundary
> conditions depends on the application. In some cases it's easy and in
> other cases it's hard. It depends on whether the algorithm only needs to
> read data from adjacent areas or if it has to modify the data as well.
Yes, it's the updates along the boundary patches that are usually difficult
to deal with. Sheaf theory doesn't help much there. It's this additional
pass through the boundary area that often proves daunting, since the update
cycle can propagate out beyond the originally chosen boundary area.

> Yes, threads are especially difficult to port to some OSes (e.g.
> Windows). There are many machine/architecture-specific bugs and shared
> memory is messy. Distributed communication is even harder in some sense.
Time spent in mutexes is hideous, and bogs things down. Too fine a
granularity and no speedup is achieved.

> I wish Cadence GDSII streaming was faster. That would speed up the
> translation.
It's gotten much faster with the 5.X series of software. The PIPO guys seem
to have done well there. Running truss on the process is enlightening to
see what's going on ;-)

> I've never used OpenAccess, but it sounds interesting to me.
Unfortunately, openness is a matter of money. You have to pay to download
the sources from SI2, and you never want to download OA2.1 since it is buggy
and simply doesn't work, so don't waste your money there (startups beware!).
Must be that axiom about every other release being worthwhile. The bad news
is that OA2.2, which is not yet available for download (when?), is rumored
to be completely alien to OA2.0, so using it will be like writing to a
completely different database.

There are also some strangenesses: the OA techfile is extended by DFII, so
you can't really get very far without understanding what the bloody hell
those extensions are. If you stick inside your own OA universe and don't
communicate with OA databases generated by DFII you should be OK, though.
God, and maybe Cadence, knows how that will play out in OA2.2.

> The great thing about CAD tools is that hardware does get faster - but
> the designs you're working with also get larger by at least the same
> scaling factor!
Heh. Reimplementing the same old tired linear solutions are still linear.

However, that 4GB wall of RAM for 32b processes is still the same as it ever
was (to paraphrase the Talking Heads)

spaller

 
Martin Lefebvre <dadexter@enterprise.diginex.net> wrote in message news:<pan.2004.05.11.03.15.44.969359@enterprise.diginex.net>...
can tools be made to use both CPU's on SMP machines?
We have 10 dual Opteron 1.6G machines running Linux smp, and the
2nd CPU is idling during simulations...
One dual-cpu quagmire I've not resolved is performance when we
have HYPERTHREADED CPUs (as evidenced by the HT flag
in the RedHat Enterprise 3.0 /proc/cpuinfo file).

What that means to me (a Linux amateur) is that I can't tell
for the life of me whether I have multiple CPUs or just one CPU
that is 'tricked' into something called hyperthreading (which
seems to be NOT the same as multiple CPUs) yet which appears
to SOME software to speed up operations (slowing down others).

The reason this hyperthreading conundrum matters to me is I
found that ASSURA 3.12 ran twelve times SLOWER on my hyperthreaded
machines (versus similar non-hyperthreaded controls), while
SOC Encounter 3.1 ran more than two times faster on the SAME
hyperthreaded machines (versus the same controls!).

Can somebody please unconfuse me as to...
a) How to tell hyperthreading from REALLY multiple CPUs?
b) Why might I repeatedly see the results above?

Simon
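Question (a) has at least a partial answer in /proc/cpuinfo, which the post
above already mentions: the number of "processor" entries gives logical CPUs,
while the number of distinct "physical id" values gives populated sockets.
A minimal Python sketch of that bookkeeping, run here over illustrative
sample text rather than a real machine's file:

```python
# Distinguish logical from physical CPUs given /proc/cpuinfo-style text.
# SAMPLE is illustrative, not taken from a real machine.
SAMPLE = """\
processor : 0
physical id : 0
processor : 1
physical id : 0
"""

def count_cpus(cpuinfo):
    """Return (logical, physical) CPU counts from cpuinfo text."""
    logical = 0
    physical_ids = set()
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "processor":
            logical += 1
        elif key.strip() == "physical id":
            physical_ids.add(value.strip())
    # Older kernels may omit "physical id"; fall back to the logical count.
    return logical, len(physical_ids) or logical

print(count_cpus(SAMPLE))  # (2, 1): two logical CPUs on a single socket
```

On a real box one would feed it open("/proc/cpuinfo").read(); when the
logical count exceeds the physical count, hyperthreading (or a multi-core
package) is splitting one physical CPU into several logical ones.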
 
On 16 May 2004 10:23:04 -0700, cmos_nand_gate@yahoo.com (Simon S. IBM) wrote:

Can somebody please unconfuse me as to...
a) How to tell hyperthreading from REALLY multiple CPUs?
b) Why might I repeatedly see the results above?

Simon
I don't use Linux, so I can't provide a hint where to look for "physical
processors" vs "logical processors".

As to the second question: the "logical processors" are sharing the higher
level caches, which raises the potential for "cache thrash".

If simultaneously executing threads are well-behaved wrt memory access you may
actually see a throughput increase; but if they are constantly colliding at
the cache - each causing the other to "miss" - throughput actually decreases
vs single-threaded execution.

/daytripper
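The cache collision daytripper describes is easy to reproduce in a toy
model. The sketch below is a hypothetical direct-mapped cache simulator
(the geometry is made up, not any real CPU's): two interleaved access
streams whose addresses map to the same cache set evict each other on every
single access, while either stream alone misses only once.

```python
# Toy direct-mapped cache: num_sets sets holding one line each.
# Purely illustrative geometry, not a real processor's cache.
def misses(addresses, num_sets=4, line_size=64):
    cache = {}  # set index -> tag currently resident
    count = 0
    for addr in addresses:
        index = (addr // line_size) % num_sets
        tag = addr // (line_size * num_sets)
        if cache.get(index) != tag:  # miss: fetch line, evict the old tag
            count += 1
            cache[index] = tag
    return count

solo_a = [0] * 8    # "thread" A hammers one line (set 0, tag 0)
solo_b = [256] * 8  # "thread" B hammers another line (also set 0, tag 1)
interleaved = [a for pair in zip(solo_a, solo_b) for a in pair]

print(misses(solo_a), misses(solo_b), misses(interleaved))  # 1 1 16
```

Since hyperthreading shares the physical caches between the two logical
CPUs, two threads with clashing working sets behave like the interleaved
stream - one plausible way to lose throughput rather than gain it.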
 
"Simon S. IBM" <cmos_nand_gate@yahoo.com> wrote in message
news:3520d403.0405160923.5a1be81b@posting.google.com...

One dual-cpu quagmire I've not resolved is performance when we
have HYPERTHREADED CPUs (as evidenced by the HT flag
in the RedHat Enterprise 3.0 /proc/cpuinfo file).

What that means to me (a Linux amateur) is that I can't tell
for the life of me whether I have multiple CPUs or just one CPU
that is 'tricked' into something called hyperthreading (which
seems to be NOT the same as multiple CPUs) yet which appears
to SOME software to speed up operations (slowing down others).
Yes, the first time I noticed this was when the uptime command never got
below 1 on my first HT box, and top reported two CPUs on it. So I actually
cracked open the case to see what on earth IT had delivered. Sure enough,
only one CPU socket was populated.

The reason this hyperthreading conundrum matters to me is I
found that ASSURA 3.12 ran twelve times SLOWER on my hyperthreaded
machines (versus similar non-hyperthreaded controls), while
SOC Encounter 3.1 ran more than two times faster on the SAME
hyperthreaded machines (versus the same controls!).
Assura has some form of env var for the number of threads, yes? Might this
be set higher than one, giving the slow results? Or was there some sort of
BIOS-level HT control turned off/on that led to the results above?

Can somebody please un confuse me as to ...
a) How to tell between hyperthreading vs REALLY multiple CPUs?
Open the box like I did? Look at BIOS options to see if there is an HT
thing in there? I was, er, cough, running on IBM boxes.

b) Why I could possibly see the results above I repeatedly see?
Probably there are Xeons in your box, and historically they have tiny
caches. Assura would seem to be more main-bus bound. HT should work better
when the same code is shared across the threads and the data sizes are
small; probably not what's happening with the large volumes of data Assura
has to deal with. I didn't know SoC Enflounder was multithreaded, or is
that the Plato/NanoRoute stuff that you're seeing the speedup on?

spaller

 
I did some experimenting with hyperthreaded Xeons while working on a project
to parallelize my software. In my case, the program ran nearly twice as fast
with two real processors and about 1.4 times as fast when running with two
threads on a hyperthreaded Xeon. I don't know how to actually tell whether
you have two CPUs or a single hyperthreaded CPU. It seems like
hyperthreading is Intel's way of tricking people into thinking they are
getting a dual processor system for the price of 1.5 processors.

A factor of 12 slower is difficult to believe. If it really was that much
slower, then it's likely a cache-conflict or memory-allocation issue. It
could be that the threads are constantly invalidating each other's cache
entries, so that all memory accesses result in cache misses. Some programs
run almost twice as fast with hyperthreading because they are limited by
the processor instruction rate and share the other system resources more
efficiently.

Frank
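Frank's figures can be pushed through Amdahl's law to estimate how much of
the program actually runs in parallel. This is a standard rearrangement of
Amdahl's formula, plugging in the 1.9x and 1.4x speedups reported above:

```python
# Amdahl's law: speedup S on n processors with parallel fraction p is
#   S = 1 / ((1 - p) + p / n)
# Solving for p given an observed speedup:
#   p = (1/S - 1) / (1/n - 1)
def parallel_fraction(speedup, n):
    return (1 / speedup - 1) / (1 / n - 1)

# ~1.9x on two real CPUs implies roughly 95% of the work runs in parallel...
print(round(parallel_fraction(1.9, 2), 3))  # 0.947
# ...while 1.4x on one hyperthreaded CPU *looks* like only ~57%, because the
# two logical CPUs are sharing a single core's execution resources.
print(round(parallel_fraction(1.4, 2), 3))  # 0.571
```

The second number is not a real parallel fraction, of course; it just shows
that a hyperthreaded "second CPU" is worth a good deal less than a real one
for this workload.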

 
Spaller wrote:

Baryons are limited in their interactions as determined by the speed of
light. Hence, circuits, which are comprised of baryons, are constrained as
well by the same speed limit. In special relativity, these future
interactions are sometimes referred to as cones where the cones expand down
a timeline from a vertex at each baryon. So baryons can only interact once
their cones intersect (in that 4th dimension of time).

Ah, that explains it then. And I thought Baryons were people
who sat in the House of Lords, while cones sat on motorways.

- Keith
 
