Microblaze instruction timings

Jon Beniston

Hi,

Can anyone (Goran?) fill in some details of the Microblaze's pipeline
for me? Do multi-cycle instructions always take multiple cycles? For
example, if a load or shift is followed by an instruction that doesn't
use the result of the load or shift, will the load or shift still cost
two cycles? What is the branch penalty?

Also, what does the 950 logic-cell figure include? Does it include the
caches as well as all of the optional instructions / debug logic?

Cheers,
JonB
 
Hi,

Multicycle instructions always take multiple cycles.
This is due to the pipeline of MicroBlaze.
MicroBlaze has only 3 pipestages: Instruction Fetch (IF), Operand Fetch
(OF) and Execution Stage (EX).

When a multicycle instruction is executing (is in EX), the next
instruction is in the OF stage.
The pipeline can't move since the EX stage is occupied.
A more complex handling of the EX stage, allowing more than 1 instruction
in it at the same time, may be possible but would increase the control
complexity quite a lot.
All pipeline hazards are resolved in hardware, and an increase in
complexity might result in lower overall performance since the clock
frequency might be lower.
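
As a rough illustration of the stall behaviour described above, here is a minimal Python sketch of a 3-stage in-order pipeline in which the pipeline only advances when EX is free. It is a toy model, not the MicroBlaze RTL: the function name, the per-instruction latency list, and the 2-cycle latency used for a load (taken from the question, not from a datasheet) are assumptions for illustration, and branches are not modelled at all.

```python
def count_cycles(ex_latencies):
    """Count clock cycles for a 3-stage in-order pipeline (IF, OF, EX).

    ex_latencies: number of cycles each instruction occupies EX (>= 1).
    Toy model: no branches, no memory wait states, and no dependency
    tracking, because the stall comes from EX being busy, not from operands.
    """
    if any(lat < 1 for lat in ex_latencies):
        raise ValueError("every instruction needs at least 1 EX cycle")

    queue = list(ex_latencies)  # instructions not yet fetched
    if_stage = None             # latency of the instruction in IF
    of_stage = None             # latency of the instruction in OF
    ex_left = 0                 # remaining EX cycles (0 means EX is free)
    retired = 0
    cycles = 0
    total = len(queue)

    while retired < total:
        # The pipeline advances only when EX is free.
        if ex_left == 0:
            if of_stage is not None:
                ex_left = of_stage
            of_stage = if_stage
            if_stage = queue.pop(0) if queue else None
        cycles += 1
        # EX works on its instruction for one cycle.
        if ex_left > 0:
            ex_left -= 1
            if ex_left == 0:
                retired += 1
    return cycles


# Three single-cycle instructions: 2 cycles of pipeline fill + 3.
print(count_cycles([1, 1, 1]))
# Same sequence with an assumed 2-cycle load in the middle: one extra cycle,
# whether or not the following instruction uses the loaded value.
print(count_cycles([1, 2, 1]))
```

In this model the extra cycle is paid purely because EX is occupied, which is why the model needs no notion of which instruction consumes the result.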

The best way to handle multicycle instructions is to increase the number
of pipeline stages, but that will increase the area.
You always pay for higher performance by using more resources.
The current MicroBlaze is a good tradeoff between area and performance.
It's smaller and at the same time faster than any other soft
processor.

The 950 LUT figure includes only the basic features, no caches or debug.
The caches are quite cheap on LUTs, around 50 LUTs for the instruction cache.
The cost is that BRAM is needed to implement the caches.

Göran
 
Goran Bilski <goran@xilinx.com> wrote in message news:<c12r85$l611@cliff.xsj.xilinx.com>...
> Hi,
>
> Multicycle instructions always take multiple cycles.
> This is due to the pipeline of MicroBlaze.
> MicroBlaze has only 3 pipestages: Instruction Fetch (IF), Operand Fetch
> (OF) and Execution Stage (EX).

Thanks for the explanation.

> The current MicroBlaze is a good tradeoff between area and performance.

Sure.

> The 950 LUT figure includes only the basic features, no caches or debug.
> The caches are quite cheap on LUTs, around 50 LUTs for the instruction cache.
> The cost is that BRAM is needed to implement the caches.

Does "basic features" include the h/w divider? I've been trying to
reproduce the quoted Dhrystone figures on the simulator, and only get
0.63 MIPS/MHz without it. If I add it, I can get 0.77.
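
For readers comparing such figures, here is a small Python sketch of how a simulator cycle count is conventionally turned into Dhrystone MIPS per MHz. It assumes the usual normalization of 1757 Dhrystones per second for 1 DMIPS (the VAX 11/780 reference); the thread does not say which normalization the quoted figures use, and the function names are hypothetical.

```python
# Conventional reference: 1 DMIPS = 1757 Dhrystones per second (VAX 11/780).
VAX_DHRYSTONES_PER_SEC = 1757


def dmips_per_mhz(cycles_per_iteration):
    """DMIPS/MHz from the average clock cycles per Dhrystone iteration."""
    if cycles_per_iteration <= 0:
        raise ValueError("cycles_per_iteration must be positive")
    # At 1 MHz there are 1,000,000 cycles per second.
    iterations_per_sec_at_1mhz = 1_000_000 / cycles_per_iteration
    return iterations_per_sec_at_1mhz / VAX_DHRYSTONES_PER_SEC


def cycles_per_iteration(dmips_mhz):
    """Inverse: cycles per Dhrystone iteration implied by a DMIPS/MHz figure."""
    if dmips_mhz <= 0:
        raise ValueError("dmips_mhz must be positive")
    return 1_000_000 / (VAX_DHRYSTONES_PER_SEC * dmips_mhz)


# Under this normalization, the two figures above correspond to roughly:
print(round(cycles_per_iteration(0.63)))  # about 903 cycles per iteration
print(round(cycles_per_iteration(0.77)))  # about 739 cycles per iteration
```

Working in cycles per iteration makes it easier to see how many cycles an option such as the hardware divider saves per pass of the benchmark.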

It seems strange that on the Web page
(http://www.xilinx.com/ipcenter/processor_central/microblaze/performance.htm),
the Spartan 3 is rated at 0.8 and the Spartan II is rated at 0.65, yet
they are both listed as requiring the same number of logic cells. I
would presume that either the performance figure for the Spartan II is
too low, or the number of logic cells required by the Spartan 3 and
Virtex II to achieve the quoted figure is actually higher.

Incidentally, I've been trying to get the Dhrystone numbers for NIOS
as well. Can anybody clarify if their instruction set simulator is
cycle accurate? If it is, the figures appear to be 0.64 for a 32-bit
implementation and 0.15 for a 16-bit implementation, but I have a
feeling that this should be lower.

Cheers,
JonB
 
> Incidentally, I've been trying to get the Dhrystone numbers for NIOS
> as well. Can anybody clarify if their instruction set simulator is
> cycle accurate? If it is, the figures appear to be 0.64 for a 32-bit
> implementation and 0.15 for a 16-bit implementation, but I have a
> feeling that this should be lower.
>
> Cheers,
> JonB

Jon,

RTL simulation in Nios of instruction execution (using ModelSim or
similar) is cycle accurate. This is true whether you're executing
out of on-chip memory, SRAM (via simulation model), or SDRAM (we
include a simulation model in the latest Nios kit). For Dhrystone, you
can just run in hardware (much faster than running a long simulation)
to compare slight changes you make to Nios.

That said, I agree that what you're seeing is a bit high: we've seen 0.4
(SDRAM + cache) to 0.5 DMIPS/MHz (on-chip mem) for 32-bit "classic"
Nios. It makes me wonder if there is some difference in code?

I would also recommend that, in whatever benchmark you run, you set up
the memory (program/data/cache) as you will have it in your final
application, to get the most realistic results possible.

Finally, while Dhrystone is pretty popular, the biggest advantage of
going with a soft-core CPU (regardless of whose it is) is that you're
in an environment where things can be tweaked to make your application
much faster. Custom instructions & peripherals can do wonders
depending on what your code looks like. One of my colleagues has a
cover article in Embedded Systems Programming this month that you may
find useful (sorry for the shameless plug..):
http://www.embedded.com/showArticle.jhtml?articleID=17500157

...of course, you can also wait a bit for Nios II :)

Jesse Kempa
Altera Corp.
jkempa at altera dot com
 
> That said, I agree that what you're seeing is a bit high: we've seen 0.4
> (SDRAM + cache) to 0.5 DMIPS/MHz (on-chip mem) for 32-bit "classic"
> Nios. It makes me wonder if there is some difference in code?

When I said simulator, I meant the software simulator that comes as
part of the GNUPro tools. I don't have access to the RTL.

Do you have any idea what the performance is for a 16-bit core?

> I would also recommend that, in whatever benchmark you run, you set up
> the memory (program/data/cache) as you will have it in your final
> application, to get the most realistic results possible.

Sure.

Cheers,
JonB
 
Hi,

Sorry, I sent the answers as HTML only, so I am resending this as text only.

See below.

Jon Beniston wrote:

> Goran Bilski <goran@xilinx.com> wrote in message news:<c12r85$l611@cliff.xsj.xilinx.com>...
>
>> Hi,
>>
>> Multicycle instructions always take multiple cycles.
>> This is due to the pipeline of MicroBlaze.
>> MicroBlaze has only 3 pipestages: Instruction Fetch (IF), Operand Fetch
>> (OF) and Execution Stage (EX).
>
> Thanks for the explanation.
>
>> The current MicroBlaze is a good tradeoff between area and performance.
>
> Sure.
>
>> The 950 LUT figure includes only the basic features, no caches or debug.
>> The caches are quite cheap on LUTs, around 50 LUTs for the instruction cache.
>> The cost is that BRAM is needed to implement the caches.
>
> Does "basic features" include the h/w divider? I've been trying to
> reproduce the quoted Dhrystone figures on the simulator, and only get
> 0.63 MIPS/MHz without it. If I add it, I can get 0.77.

To get 0.8 MIPS/MHz, you need to enable the HW divider. The HW divider
takes around 60-80 LUTs.
I can't remember exactly, but the implementation is a basic
shift-compare design which only needs a compare block and a shift block.
The divide takes 35 clock cycles: 2 clock cycles to set up the
operands, 32 clock cycles for the division and 1 clock cycle for writing
the result.
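
To make the 2 + 32 + 1 cycle breakdown concrete, here is a minimal Python sketch of a bit-serial shift-and-compare (restoring) divider that processes one quotient bit per modelled clock cycle. It is only an illustration of the general technique under the cycle counts quoted above; the actual MicroBlaze divider logic is not shown in this thread, and the function name and the unsigned-only behaviour are assumptions.

```python
def shift_compare_divide(dividend, divisor, width=32):
    """Unsigned restoring division, one quotient bit per modelled cycle.

    Returns (quotient, remainder, cycles). The cycle count follows the
    breakdown quoted above: 2 setup + `width` iterations + 1 writeback.
    """
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise ZeroDivisionError("division by zero")

    cycles = 2          # assumed: 2 cycles to set up the operands
    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # Shift block: bring the next dividend bit into the remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Compare block: subtract the divisor if it fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        cycles += 1     # one cycle per quotient bit
    cycles += 1         # assumed: 1 cycle to write the result
    return quotient, remainder, cycles


print(shift_compare_divide(100, 7))  # (14, 2, 35)
```

The loop always runs for the full width, so the cycle count in this model does not depend on the operand values.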

> It seems strange that on the Web page
> (http://www.xilinx.com/ipcenter/processor_central/microblaze/performance.htm),
> the Spartan 3 is rated at 0.8 and the Spartan II is rated at 0.65, yet
> they are both listed as requiring the same number of logic cells. I
> would presume that either the performance figure for the Spartan II is
> too low, or the number of logic cells required by the Spartan 3 and
> Virtex II to achieve the quoted figure is actually higher.

The difference is that S3 and VII have embedded multipliers, so MicroBlaze
gets a HW multiplier there, while the S2 has no embedded multiplier, so
multiplication is done in SW (which takes many more clock cycles).
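
To show why a software multiply costs many more cycles, here is a minimal Python sketch of a generic shift-and-add multiply loop of the kind used when there is no hardware multiplier. This is not the routine shipped with the MicroBlaze tools, which is not shown in this thread; the function name and the iteration counter are hypothetical and only illustrate that the work grows with the bit length of one operand.

```python
def shift_add_multiply(a, b, width=32):
    """Multiply two unsigned values with shifts and adds only.

    Returns (product truncated to `width` bits, loop iterations used).
    Each iteration stands for several instructions (test, add, two shifts,
    branch), so the real cycle cost is a multiple of the iteration count.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    iterations = 0
    while b:
        if b & 1:
            product = (product + a) & mask  # add the shifted multiplicand
        a = (a << 1) & mask                 # next power-of-two multiple
        b >>= 1                             # consume one multiplier bit
        iterations += 1
    return product, iterations


print(shift_add_multiply(6, 7))  # (42, 3)
```

A multiplier with its top bit set needs all 32 iterations in this sketch, which is why a benchmark that multiplies often is sensitive to the presence of an embedded multiplier.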

> Incidentally, I've been trying to get the Dhrystone numbers for NIOS
> as well. Can anybody clarify if their instruction set simulator is
> cycle accurate? If it is, the figures appear to be 0.64 for a 32-bit
> implementation and 0.15 for a 16-bit implementation, but I have a
> feeling that this should be lower.
>
> Cheers,
> JonB
 
