Element update within records

Matthias Alles

Hi,

I have a problem which I would like to optimize for simulation speed. I
have a signal "a", which is a record with many different elements and
element types. Now I would like to substitute single elements within this
record asynchronously. Currently I do the following:

process(a, b, c) is
begin

out <= a;

out.record_element_x <= b;
out.record_element_y <= c;

end process;

"out" is of the same type as "a", so I first copy all record elements
and then overwrite the ones I would like to substitute. Like that I
don't have to copy each record element individually. The problem is that
"b" and "c" might be asynchronously calculated as well, which means the
process can be triggered several times per rising clock edge, slowing
down the simulation (copying "a" to "out" seems quite time consuming).

Is there a better solution for this problem, which prevents me from
copying "a" to "out" a couple of times per clock? Of course, the
solution should be synthesizable.

Thanks,
Matthias
 
On Fri, 05 Nov 2010 23:58:49 +0100, Matthias Alles
<REMOVEallesCAPITALS@rhrk.uni-kl.de> wrote:

Hi,

I have a problem which I would like to optimize for simulation speed. I
have a signal "a", which is a record with many different elements and
element types.

out <= a;

out.record_element_x <= b;
out.record_element_y <= c;

"out" is of the same type as "a", so I first copy all record elements
and then overwrite the ones I would like to substitute. Like that I
don't have to copy each record element individually. The problem is that
"b" and "c" might be asynchronously calculated as well, which means the
process can be triggered several times per rising clock edge, slowing
down the simulation (copying "a" to "out" seems quite time consuming).

Write a simple function returning that record type, accepting arguments a, b and c.

A sketch:

function combine(a : in <type>; b, c : in std_logic) return <type> is
  variable temp : <type> := a;
begin
  temp.x := b;
  temp.y := c;
  return temp;
end combine;
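For example, with a made-up record type (the type and signal names below are placeholders, not from the original post):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- hypothetical record type standing in for the OP's large record
type bus_t is record
  x    : std_logic;
  y    : std_logic;
  data : std_logic_vector(7 downto 0);
end record;

function combine(a : in bus_t; b, c : in std_logic) return bus_t is
  variable temp : bus_t := a;  -- one copy into a local variable
begin
  temp.x := b;
  temp.y := c;
  return temp;
end combine;

-- Single concurrent assignment: the target is driven once per
-- event, instead of element-by-element inside a triggered process.
result <= combine(a, b, c);
```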

- Brian
 
On Sat, 06 Nov 2010 08:53:37 +0000, Brian Drummond wrote:

function combine(a : in <type>; b, c : in std_logic) return <type> is
  variable temp : <type> := a;
begin
  temp.x := b;
  temp.y := c;
  return temp;
end combine;

Right, but I don't think that solves the OP's problem, which
was to try to avoid the performance hit of repeated copying
of a large record when his process is repeatedly triggered
in successive deltas. Indeed, your function potentially
causes THREE copy operations:
- one to put the actual argument value into the formal "a"
- one to copy "a" to "temp"
- one to copy "temp" to the return target
although any half-decent compiler will collapse those
on to only one or two, I imagine.

The real solution is to avoid retriggering of the process,
which is best done by NOT writing a combinational process
with multiple signals in its sensitivity list.

Matthias, is there any way you could fold the record
processing into your clocked process, so you know it's
executed only once per cycle?
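For instance (a sketch only; the record type and names are invented), computing the merged record with a variable inside the clocked process means the expensive copy happens exactly once per clock edge, not once per delta:

```vhdl
process(clk) is
  variable v : bus_t;  -- bus_t is a placeholder record type
begin
  if rising_edge(clk) then
    v := a;            -- the copy now runs once per clock edge
    v.element_x := b;
    v.element_y := c;
    merged <= v;       -- a single signal update per cycle
  end if;
end process;
```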
--
Jonathan Bromley
 
On Nov 5, 6:58 pm, Matthias Alles <REMOVEallesCAPIT...@rhrk.uni-kl.de>
wrote:
Is there a better solution for this problem, which prevents me from
copying "a" to "out" a couple of times per clock? Of course, the
solution should be synthesizable.

Yes, either use multiple concurrent assignment statements...

out.record_element_a <= a.a; -- Copy
out.record_element_b <= a.b; -- Copy
out.record_element_x <= b;
out.record_element_y <= c;

Or, a single assignment
out <= (
  record_element_a => a.a, -- Copy
  record_element_b => a.b, -- Copy
  record_element_x => b,
  record_element_y => c
);

The second method has the advantage that if you add/remove record
elements, the assignment to 'out' will fail when you compile it,
forcing you to explicitly fix it at that time (that's a good thing).
The first method of independent assignments would compile correctly
without complaint (which can be nice), but might not represent what
you intend to do. The second method is usually better.
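A sketch of the second method with a full (invented) record declaration, showing where the compile-time check comes from:

```vhdl
-- hypothetical record type for illustration
type bus_t is record
  record_element_a : std_logic_vector(7 downto 0);
  record_element_b : std_logic;
  record_element_x : std_logic;
  record_element_y : std_logic;
end record;

-- If a new element is later added to bus_t, this aggregate no
-- longer covers every element, so compilation fails until the
-- assignment is updated -- the safety net described above.
out_sig <= (
  record_element_a => a.record_element_a,
  record_element_b => a.record_element_b,
  record_element_x => b,
  record_element_y => c
);
```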

Kevin Jennings
 
On Sat, 06 Nov 2010 09:09:13 +0000, Jonathan Bromley
<spam@oxfordbromley.plus.com> wrote:

On Sat, 06 Nov 2010 08:53:37 +0000, Brian Drummond wrote:

function combine(a : in <type>; b, c : in std_logic) return <type> is
  variable temp : <type> := a;
begin
  temp.x := b;
  temp.y := c;
  return temp;
end combine;

Right, but I don't think that solves the OP's problem, which
was to try to avoid the performance hit of repeated copying
of a large record when his process is repeatedly triggered
in successive deltas. Indeed, your function potentially
causes THREE copy operations:
- one to put the actual argument value into the formal "a"
- one to copy "a" to "temp"
- one to copy "temp" to the return target
although any half-decent compiler will collapse those
on to only one or two, I imagine.

Could be. I thought the slowness he found was when copying "a" to "out", i.e. when a
potentially tricky resolution took place.

This solution should at least reduce the number of assignments to "out" to one.
Which - I would guess - would give most of the benefit of whatever the optimum
approach is.

Confession time: I haven't spent an afternoon benchmarking different approaches
with a few million iterations each, so I could be wrong here.

---

Is it just me that pays (almost) no attention to simulation efficiency?
It seems to me that my thinking time dwarfs sim runtime anyway.
And corrupting the design for more "efficient" simulation seems to be a move in
the wrong direction - away from efficient use of my time - to me. Am I so far
out on a limb?

I try to get the heavy lifting done by fairly well focussed test cases rather
than by distorting the most natural (to me) - and therefore most likely to be
bug-free - design approach in the cause of pandering to simulators.

Even when I run a long simulation (e.g. error mapping a 32-bit floating point
square root) testing every one of 16 million mantissa values only took an hour
or two, with no concession to fast simulation coding. (About 200 of them had an
error just over 0.5 LSB, so it wasn't 100% IEEE 754 compliant.)

Of course by that stage I'm not interactively debugging, but if I were, after
lunch, I could focus on the specific values that were giving trouble.

The only time simulation speed REALLY bugs me is when vendors insist on
supplying only a gate-level representation of their core (and sometimes even
their testbench - I'm not kidding here!!! *cough* Xilinx PCI express ) - that
is too large to run on their basic pay-for simulator!

Then I restrict the simulation to a few basic reads and writes, and drop in a
higher level interface (approximating the PCIe core's local bus) for the real
system-level simulations.

- Brian
 
I had a similar situation on a project where I noticed an assignment
to a signal record significantly increased sim runtime. My solution
was to keep the record as a variable inside a process. It appeared to
be the assignment to the signal that caused the simulation slow down.
If I was debugging I would assign the variable to a signal to see what
was going on.
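A sketch of that pattern (names invented): keep the record as a process variable, and mirror it to a signal only while debugging:

```vhdl
process(clk) is
  variable r : bus_t;  -- variable: no per-delta signal update cost
begin
  if rising_edge(clk) then
    r.element_x := b;
    r.element_y := c;
    -- ... use r in the rest of the clocked logic ...
    -- dbg_r <= r;  -- enable only when waveform visibility is needed
  end if;
end process;
```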

If the record has to be a signal, then reducing the number of members,
or the size of the members, can also help. So you could try splitting
the record if some elements get 'touched' more often than others.

Darrin


On Nov 6, 9:58 am, Matthias Alles <REMOVEallesCAPIT...@rhrk.uni-kl.de>
wrote:
Hi,

I have a problem which I would like to optimize for simulation speed. I
have a signal "a", which is a record with many different elements and
element types. Now I would like to substitute single elements within this
record asynchronously. Currently I do the following:

process(a, b, c) is
begin

out <= a;

out.record_element_x <= b;
out.record_element_y <= c;

end process;

"out" is of the same type as "a", so I first copy all record elements
and then overwrite the ones I would like to substitute. Like that I
don't have to copy each record element individually. The problem is that
"b" and "c" might be asynchronously calculated as well, which means the
process can be triggered several times per rising clock edge, slowing
down the simulation (copying "a" to "out" seems quite time consuming).

Is there a better solution for this problem, which prevents me from
copying "a" to "out" a couple of times per clock? Of course, the
solution should be synthesizable.

Thanks,
Matthias
 
I tried both approaches, and both result in a significant simulation
speed-up.

out.record_element_a <= a.a; -- Copy
out.record_element_b <= a.b; -- Copy
out.record_element_x <= b;
out.record_element_y <= c;

Or, a single assignment
out <= (
  record_element_a => a.a, -- Copy
  record_element_b => a.b, -- Copy
  record_element_x => b,
  record_element_y => c
);

The first approach is slower than the second one. With some other
improvements I gained 15% speed-up with multiple concurrent assignments
and 30% speed-up with a single assignment. So I'm using a single
assignment now. The code is still readable, but the problem remains that
the copied signals are written in every delta-cycle in which the signals
b or c change. Thus, I also tried to make sure that b and c are
calculated at the same delta.

Thanks,
Matthias
 
The real solution is to avoid retriggering of the process,
which is best done by NOT writing a combinational process
with multiple signals in its sensitivity list.

Matthias, is there any way you could fold the record
processing into your clocked process, so you know it's
executed only once per cycle?

Unfortunately not, but I reduced the number of deltas in which the
process (or now concurrent assignment, see other post) is triggered.

Matthias
 
Is it just me that pays (almost) no attention to simulation efficiency?
It seems to me that my thinking time dwarfs sim runtime anyway.
And corrupting the design for more "efficient" simulation seems to be a move in
the wrong direction - away from efficient use of my time - to me. Am I so far
out on a limb?

Usually I don't care that much either. But think of regression tests
that you might want to perform after fixing a bug. Those can be quite
time consuming, since you usually run them for every single change
you make to a design. In this particular case, spending one hour on
profiling the VHDL code can easily save me many, many hours of simulation
time later on.

Matthias



 
