Simulator framework


Andreas Ehliar

I'm wondering if there are any good frameworks available that allow you
to parallelize RTL simulations over both multicore computers and clusters.

Bonus points if it is also easy to plug in models written in C or SystemC,
etc.

I have tried Google but haven't been able to come up with any relevant
search terms to locate such a beast.

Does anyone on this group have any good examples of this?

/Andreas
 
On Jun 25, 11:37 am, Andreas Ehliar <ehliar-nos...@isy.liu.se> wrote:
> I'm wondering if there are any good frameworks available that allow you
> to parallelize RTL simulations over both multicore computers and clusters.

There have been many attempts to write event-driven simulators that
take advantage of the parallelism available. As far as I am aware,
none have been successful for general and realistic designs
(occasional marketing blurbs notwithstanding). The parallelism is too
fine-grained, and the extra cost of synchronization exceeds the
savings from parallel execution.

You might be able to find university research papers claiming
success. What that really means is that they are using home-grown
simulators that are orders of magnitude slower than state-of-the-art
commercial simulators. That makes the synchronization overhead
relatively smaller by those orders of magnitude, and makes it appear
to work.
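
The effect can be illustrated with a toy cost model. This is only a sketch: the function name and all the numbers are made up for illustration, and it assumes perfectly balanced partitions that pay a fixed synchronization cost on every time step.

```python
def parallel_speedup(compute_us, sync_us, workers):
    """Speedup of one time step under a toy model.

    compute_us: serial compute time per time step (assumed value)
    sync_us:    fixed synchronization cost per time step (assumed value)
    workers:    number of perfectly balanced partitions
    """
    parallel_us = compute_us / workers + sync_us
    return compute_us / parallel_us


# Hypothetical fast simulator: 1 us of work per step, 5 us to synchronize.
fast = parallel_speedup(compute_us=1.0, sync_us=5.0, workers=4)

# Hypothetical simulator that is 1000x slower, with the same sync cost.
slow = parallel_speedup(compute_us=1000.0, sync_us=5.0, workers=4)

print(f"fast simulator: {fast:.2f}x")
print(f"slow simulator: {slow:.2f}x")
```

With these invented numbers the fast simulator becomes slower when parallelized (about 0.19x), while the slow one appears to scale well (about 3.92x on 4 workers).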
 
On 2008-06-28, sharp@cadence.com <sharp@cadence.com> wrote:
> There have been many attempts to write event-driven simulators that
> take advantage of the parallelism available. As far as I am aware,
> none have been successful for general and realistic designs
> (occasional marketing blurbs notwithstanding). The parallelism is too
> fine-grained, and the extra cost of synchronization exceeds the
> savings from parallel execution.
>
> You might be able to find university research papers claiming
> success. What that really means is that they are using home-grown
> simulators that are orders of magnitude slower than state-of-the-art
> commercial simulators. That makes the synchronization overhead
> relatively smaller by those orders of magnitude, and makes it appear
> to work.

It seems quite counter-intuitive, since there obviously is a massive
amount of parallelism in an RTL description.

I did a quick experiment to see how feasible it would be to parallelize
a hypothetical SoC simulation with several complex components connected
to a simple bus. Assuming the design is partitioned so that the bus is the
only way to send messages between the various computers, it seems like it
should actually be feasible to parallelize this kind of simulation.

This is based on the following numbers:
* The OpenRISC processor was used as an example of a complex RTL core.
Using a commercial RTL simulator on a modern PC I could simulate it
at roughly 10K cycles / second.
* I can send roughly 100,000 messages / second between tasks while using
Open MPI (local messages; I don't have access to any high-speed interconnect
on the machine I tested this on).

Also, I talked to some friends who deal with supercomputers on a
daily basis and they say that a latency of 3 µs from an MPI_Send to an
MPI_Recv is doable if you have a good interconnect.
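
Put together, those numbers give the following back-of-envelope estimate. It is a sketch only: it assumes one message exchange per simulated clock cycle for each partition and treats the measured message rate as a per-message cost, neither of which the measurements above establish. The function name is just for illustration.

```python
def sync_overhead(cycles_per_sec, seconds_per_message, messages_per_cycle=1):
    """Messaging time as a fraction of the time needed to simulate one cycle."""
    seconds_per_cycle = 1.0 / cycles_per_sec
    return messages_per_cycle * seconds_per_message / seconds_per_cycle


SIM_RATE = 10_000            # cycles/second measured for the OpenRISC core
LOCAL_MSG = 1.0 / 100_000    # seconds per message, from 100,000 messages/second
FAST_LINK = 3e-6             # seconds, the 3 us latency quoted for a good interconnect

local = sync_overhead(SIM_RATE, LOCAL_MSG)
cluster = sync_overhead(SIM_RATE, FAST_LINK)
print(f"local Open MPI:    {local:.0%} overhead per cycle")
print(f"fast interconnect: {cluster:.0%} overhead per cycle")
```

Under that assumption the messaging cost is roughly 10% of a simulated cycle locally and roughly 3% over a fast interconnect. More than one bus transaction per cycle scales the figure up accordingly.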


So, as long as you are simulating an SoC system which is easy to partition into
different parts due to a bus-based interconnect, it seems like it should be
relatively easy to parallelize it. But I may of course be missing something
here. Perhaps most people are interested in simulating large monolithic cores,
which are very hard to partition, as opposed to complete SoC systems?

/Andreas
 
Hi Andreas

Andreas Ehliar <ehliar-nospam@isy.liu.se> writes:
> It seems quite counter-intuitive, since there obviously is a massive
> amount of parallelism in an RTL description.

What you say is true; however, a very important property of RTL
simulation is that it is repeatable -- even if there are race
conditions in the RTL code. Using common multi-threading libraries
alone does not help. Essentially, the simulator has to implement its
own thread management and synchronization layer on top of that
(perhaps even replacing the OS layer).

Automatic partitioning is also difficult. Where would you start?
Always blocks? Continuous assignments? The complexity of each thread
is hard to guess and potentially very different. Thus I suspect there
would be a lot of waiting involved.
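
A rough way to see the cost of uneven partitions: if all threads must meet at a barrier on each time step, the step takes as long as the slowest partition. The sketch below uses invented per-partition costs; nothing in it comes from a real design.

```python
def barrier_speedup(partition_costs):
    """Speedup over serial execution when every step ends in a barrier.

    Serial time is the sum of all partition costs; parallel time is the
    cost of the slowest partition, since the others wait for it.
    """
    return sum(partition_costs) / max(partition_costs)


def idle_fraction(partition_costs):
    """Share of total thread time spent waiting at the barrier."""
    busy = sum(partition_costs)
    available = len(partition_costs) * max(partition_costs)
    return 1.0 - busy / available


# Hypothetical costs: one heavy always block and three trivial assignments.
costs = [100, 5, 5, 5]
print(f"speedup: {barrier_speedup(costs):.2f}x on {len(costs)} threads")
print(f"idle:    {idle_fraction(costs):.0%}")
```

With these made-up costs, four threads give a speedup of only 1.15x and spend about 71% of their time waiting.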

In the past, I used multiple parallel simulations of the entire system
to keep the computing farm busy. I guess this would be more efficient
as, apart from license and queue management, no synchronization is
needed.
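
That approach is easy to script. The following is a minimal sketch, assuming each run is an independent command distinguished only by a seed. The command here is a stand-in that just echoes the seed through the Python interpreter; a real flow would substitute the simulator invocation and a proper queueing system.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(seed):
    """Launch one independent simulation run and return (seed, passed)."""
    # Stand-in command (assumption): replace with the real simulator invocation.
    cmd = [sys.executable, "-c", f"print('seed', {seed})"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return seed, result.returncode == 0


def run_regression(seeds, max_jobs=4):
    """Run all seeds, at most max_jobs at a time, with no communication between runs."""
    # Threads are enough here: each one only waits on its own child process.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return dict(pool.map(run_one, seeds))


if __name__ == "__main__":
    results = run_regression(range(8))
    failed = [seed for seed, ok in results.items() if not ok]
    print(f"{len(results) - len(failed)} passed, {len(failed)} failed")
```

Because the runs never talk to each other, the only shared resources are the job slots (and, in practice, the simulator licenses).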

Regards
Marcus

--
note that "property" can also be used as syntaxtic sugar to reference
a property, breaking the clean design of verilog; [...]

(seen on http://www.veripool.com/verilog-mode_news.html)
 
On 30 Jun, 17:56, Marcus Harnisch <marcus.harni...@gmx.net> wrote:
> Hi Andreas
>
> Andreas Ehliar <ehliar-nos...@isy.liu.se> writes:
> > It seems quite counter-intuitive, since there obviously is a massive
> > amount of parallelism in an RTL description.
>
> What you say is true; however, a very important property of RTL
> simulation is that it is repeatable -- even if there are race
> conditions in the RTL code. Using common multi-threading libraries
> alone does not help. Essentially, the simulator has to implement its
> own thread management and synchronization layer on top of that
> (perhaps even replacing the OS layer).
>
> Automatic partitioning is also difficult. Where would you start?
> Always blocks? Continuous assignments? The complexity of each thread
> is hard to guess and potentially very different. Thus I suspect there
> would be a lot of waiting involved.
>
> In the past, I used multiple parallel simulations of the entire system
> to keep the computing farm busy. I guess this would be more efficient
> as, apart from license and queue management, no synchronization is
> needed.
>
> Regards
> Marcus

Certain aspects of simulation can already take advantage of multiple
cores. ModelSim and Questa can utilise more than one core to accelerate
compilation and result logging during simulation today.

However, the easiest way to utilise a compute farm is to accelerate
the execution of your complete simulation suite rather than focus on
an individual simulation run.

There are tools that facilitate this, achieving coverage closure in
the shortest possible time:

http://www.embedded-computing.com/news/db/?12304

- Nigel

Mentor Graphics
 
On Jun 25, 8:37 am, Andreas Ehliar <ehliar-nos...@isy.liu.se> wrote:
> I'm wondering if there are any good frameworks available that allow you
> to parallelize RTL simulations over both multicore computers and clusters.
>
> Bonus points if it is also easy to plug in models written in C or SystemC,
> etc.
>
> I have tried Google but haven't been able to come up with any relevant
> search terms to locate such a beast.
>
> Does anyone on this group have any good examples of this?
>
> /Andreas

You may want to take a look at:

http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel3/4480/12714/00588535..pdf?temp=x

However, there are other research works as well.

- Swapnajit

--
SystemVerilog, DPI, Verilog PLI and all other good stuffs.
Project VeriPage: http://www.project-veripage.com
For subscribing to the mailing list:
<URL: http://www.project-veripage.com/list/?p=subscribe&id=1>
 
On Mon, 30 Jun 2008 08:07:20 +0000 (UTC), Andreas Ehliar
<ehliar-nospam@isy.liu.se> wrote:


> It seems quite counter-intuitive, since there obviously is a massive
> amount of parallelism in an RTL description.
> ...
> So, as long as you are simulating an SoC system which is easy to partition into
> different parts due to a bus-based interconnect, it seems like it should be
> relatively easy to parallelize it. But I may of course be missing something
> here. Perhaps most people are interested in simulating large monolithic cores,
> which are very hard to partition, as opposed to complete SoC systems?

I've written multi-threaded code for processor simulation, where the
"RTL" was actually SystemC. I forget the exact numbers, but I think I
normally had no more than 4 to 6 threads running simultaneously,
on two processor cores. I could have done better, with lots of $$, but
not much better.

Normal event-driven RTL is a very different problem. What you need to
parallelise is (to borrow VHDL terminology) a simulation network of
interconnected processes. You could have thousands of these processes
even for very simple designs, with many/most of them carrying out
trivial amounts of computing (a single continuous/concurrent
assignment, for example). A single delta change could potentially
ripple through the entire network. To handle this, you need vast
amounts of interconnect, with tiny amounts of computing power at the
'processes', or nodes, which is the opposite of what you get with a
general purpose computer.
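
A toy delta-cycle loop shows the shape of the problem. This is a minimal sketch, not how any real simulator is built: the `Simulator` class, its methods, and the buffer chain are invented for illustration, and it only models zero-delay continuous assignments.

```python
from collections import defaultdict


class Simulator:
    """Toy event-driven kernel for zero-delay continuous assignments."""

    def __init__(self):
        self.values = defaultdict(int)    # signal name -> current value
        self.fanout = defaultdict(list)   # signal name -> processes reading it
        self.deltas = 0                   # delta cycles executed
        self.evaluations = 0              # process evaluations executed

    def assign(self, target, func, *sources):
        """Register the process `target = func(*sources)`."""
        process = (target, func, sources)
        for source in sources:
            self.fanout[source].append(process)

    def run(self, stimulus):
        """Apply {signal: value} updates and iterate until the network settles."""
        pending = dict(stimulus)
        while pending:
            self.deltas += 1
            # Update phase: commit scheduled values, noting which signals changed.
            changed = [s for s, v in pending.items() if self.values[s] != v]
            self.values.update(pending)
            # Evaluate phase: wake every process sensitive to a changed signal.
            pending = {}
            for signal in changed:
                for target, func, sources in self.fanout[signal]:
                    self.evaluations += 1
                    pending[target] = func(*(self.values[x] for x in sources))


# A chain of five buffers: n0 -> n1 -> ... -> n5.
sim = Simulator()
for i in range(5):
    sim.assign(f"n{i + 1}", lambda x: x, f"n{i}")

sim.run({"n0": 1})
print(sim.values["n5"], sim.deltas, sim.evaluations)
```

Here a single input change takes six delta cycles to settle: five that each evaluate exactly one trivial process, plus a final one that finds nothing left to wake. A parallel kernel would need a synchronization point at each of those delta boundaries while having only one tiny evaluation to hand out in between.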

To parallelise RTL you basically need special-purpose hardware. Taken
to its extreme, the best (and probably only practical) way to do this
is to actually synthesise the RTL onto an FPGA, since this gives you
exactly what you want - a process might end up as only a few LUTs
(instead of an entire processor), with lots of connectivity between
the processes.

-Evan
 
