improving code

M

mysticlol

Guest
Hi,

I am having trouble with this code. It is reducing the design max clock
frequency considerably. How can I improve this?

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
begin

process(clock, reset)
begin
if reset='1' then
ontime <= DEFAULT_TIME;
elsif rising_edge(clock) then
if go='1' then
ontime <= dataWord(11 downto 1)-'1'; -- dataWord - 12 bit i/p
end if;
end if;
end process;
..
..
..
end architecture rtl;
 
mysticlol wrote:
Hi,

I am having trouble with this code. It is reducing the design max clock
frequency considerably. How can I improve this?

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
begin

process(clock, reset)
begin
if reset='1' then
ontime <= DEFAULT_TIME;
elsif rising_edge(clock) then
if go='1' then
ontime <= dataWord(11 downto 1)-'1'; -- dataWord - 12 bit i/p
end if;
end if;
end process;
.
.
.
end architecture rtl;
Probably everyone will say go with a synchronous clock approach. Also
you can compute the next data word in combinatorial logic:

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
signal next_dataWord : std_logic_vector (10 downto 0);

begin
next_dataWord <= dataWord(11 downto 1) - '1';

process(clock)
begin
if rising_edge(clock) then
if reset='1' then
ontime <= DEFAULT_TIME;
elsif go='1' then
ontime <= next_dataWord;
end if;
end if;
end process;

end architecture rtl;


--
David Ashley http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
 
Hi David,

I already tried that. It is not making any difference..

Actually I have a 100MHz clock. dataWord is user input.
100 MHz/dataWord(11 downto 0) = required Clock Period.

I am taking dataWord(11 downto 1) - so that I can get ontime which is
half of this dataWord. As I am counting from 0, I am substracting 1
from this ontime.

regards,
mysticlol


David Ashley wrote:
mysticlol wrote:
Hi,

I am having trouble with this code. It is reducing the design max clock
frequency considerably. How can I improve this?

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
begin

process(clock, reset)
begin
if reset='1' then
ontime <= DEFAULT_TIME;
elsif rising_edge(clock) then
if go='1' then
ontime <= dataWord(11 downto 1)-'1'; -- dataWord - 12 bit i/p
end if;
end if;
end process;
.
.
.
end architecture rtl;


Probably everyone will say go with a synchronous clock approach. Also
you can compute the next data word in combinatorial logic:

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
signal next_dataWord : std_logic_vector (10 downto 0);

begin
next_dataWord <= dataWord(11 downto 1) - '1';

process(clock)
begin
if rising_edge(clock) then
if reset='1' then
ontime <= DEFAULT_TIME;
elsif go='1' then
ontime <= next_dataWord;
end if;
end if;
end process;

end architecture rtl;


--
David Ashley http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
 
mysticlol wrote:
Hi David,

I already tried that. It is not making any difference..

Actually I have a 100MHz clock. dataWord is user input.
100 MHz/dataWord(11 downto 0) = required Clock Period.

I am taking dataWord(11 downto 1) - so that I can get ontime which is
half of this dataWord. As I am counting from 0, I am substracting 1
from this ontime.

regards,
mysticlol
If you take out the subtracting 1 and cause the user input to
just be one less than it otherwise would be, does it meet timing?

How about an intermediate register that already has the -1 applied --
so ontime gets reloaded from that?

Maybe add 1 to the DEFAULT_TIME value and change the down
counter to stop at one more than it currently stops at? Then you
don't need the minus 1.

-Dave




--
David Ashley http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
 
mysticlol wrote:
Hi,

I am having trouble with this code. It is reducing the design max clock
frequency considerably. How can I improve this?

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
begin

process(clock, reset)
begin
if reset='1' then
ontime <= DEFAULT_TIME;
elsif rising_edge(clock) then
if go='1' then
ontime <= dataWord(11 downto 1)-'1'; -- dataWord - 12 bit i/p
end if;
end if;
end process;
.
.
.
end architecture rtl;
First take a look at the critical timing path that is being reported by
the fitter and try to understand what is going on in that path.
Understanding what is going on in that path is the key to improving
performance.

The synthesis/fitter tools are all pretty smart and have probably
already performed the optomization that Dave suggested about moving the
calculation of 'ontime' outside the process. That's why when you tried
it, it didn't help...it probably synthesized to the exact same thing.
For that reason, you probably shouldn't write your code in that manner,
but keep it inside the logic block where it belongs. Sometimes there
are exceptions to this general guideline but in this case there isn't.

I'm guessing that the worst case path is the setup time from the input
'dataword' to the flip flops that compute 'ontime'. Since setup times
that start from external pins are almost always longer than ones that
start from an internal flip flop the way to improve clock cycle
performance is to simply register 'dataword' right at the input and
then feed the 'dataword_delayed' into your computation of 'ontime'.
The disadvantage is that you have added one clock cycle latency but in
this case it looks like you can probably change your code to compensate
for that.

KJ
 
Strange. Your critical path is only the carry propagation along the
substraction and your target clock frequency is easy to reach for the
current ASIC/FPGA technology (100MHz according to your reply?). Could
you give some more infos about

- Which technology do you target?
- Which frequency would you like to reach?
- What is the dataWord input delay wrt clock?

If your technology really prevent you to compute the 11-bit
substraction using a basic ripple carry substractor, then you'll have
to use some carry anticipation technics (such as carry-save adders and
so on). Try google and "arithmetic operator implementation".

Eric
 
KJ wrote:
The synthesis/fitter tools are all pretty smart and have probably
already performed the optomization that Dave suggested about moving the
calculation of 'ontime' outside the process. That's why when you tried
it, it didn't help...it probably synthesized to the exact same thing.
For that reason, you probably shouldn't write your code in that manner,
but keep it inside the logic block where it belongs. Sometimes there
are exceptions to this general guideline but in this case there isn't.
KJ,

Can you be specific about:
For that reason, you probably shouldn't write your code in that manner,
but keep it inside the logic block where it belongs.
I want to make sure I'm following. Is it a question of

#1:

process(clk)
begin
if clk'event and clk = '1' then
state <= next_state;
end if;
end process;

process(state, ...)
begin
case state is
when state1 => next_state <= state2;blah;
when state2 => next_state <= state3;blah;
when others => next_state <= state1;blah;
end case;
end process;

VS
#2:

process(clk)
begin
if clk'event and clk = '1' then
case state is
when state1 => state <= state2;blah;
when state2 => state <= state3;blah;
when state3 => state <= state1;blah;
end case;
end if;
end process;

I just moved a block of code that had been doing #2 to
#1, but are you saying #2 is the way to go? I'm trying
to develop the right habits...

-Dave

-
David Ashley http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
 
David Ashley wrote:
KJ wrote:
The synthesis/fitter tools are all pretty smart and have probably
already performed the optomization that Dave suggested about moving the
calculation of 'ontime' outside the process. That's why when you tried
it, it didn't help...it probably synthesized to the exact same thing.
For that reason, you probably shouldn't write your code in that manner,
but keep it inside the logic block where it belongs. Sometimes there
are exceptions to this general guideline but in this case there isn't.

KJ,

Can you be specific about:
For that reason, you probably shouldn't write your code in that manner,
but keep it inside the logic block where it belongs.

I want to make sure I'm following. Is it a question of

#1:

process(clk)
begin
if clk'event and clk = '1' then
state <= next_state;
end if;
end process;

process(state, ...)
begin
case state is
when state1 => next_state <= state2;blah;
when state2 => next_state <= state3;blah;
when others => next_state <= state1;blah;
end case;
end process;

VS
#2:

process(clk)
begin
if clk'event and clk = '1' then
case state is
when state1 => state <= state2;blah;
when state2 => state <= state3;blah;
when state3 => state <= state1;blah;
end case;
end if;
end process;

I just moved a block of code that had been doing #2 to
#1, but are you saying #2 is the way to go? I'm trying
to develop the right habits...
This point almost always ignites a ton of responses from people with
some taking one side, others taking the other....so we best be careful
here lest wind get out of this :-0

The example you called #1 would be considered by many to be an example
of 'two process' coding style where you have one clocked process and
another combinatorial process and where some or all of the outputs of
the combinatorial process feeds into the clocked process. What you
called #2 is an example of what many would call 'single process' coding
style where there is simply one clocked process. The point that people
argue about is whether or not the 'one process' approach is better than
the 'two process'...and like I said, if word of this conversation gets
around there will no doubt be a number of people on both sides of the
fence stating their opinion....so I'll go first ;)

- The 'one process' coding style will always take fewer lines of code.
Even your simple example takes 8 lines for the two process version
versus 7 for the one process (not counting the begin/end
process...which really should be counted also but just trying to strip
down to the absolute indisputable basics here. Typically what you'll
find is that the two process style bloats it up a bit more....the
number of lines of code difference between the two typically grows a
bit for each additional output signal of the process that is added.
- The 'two process' coding style is prone to two types of errors
* Incomplete sensitivity list (you had "process(state, ...)" ).
Maintenance of that "..." is not trivial in a real design, will
simulate just fine with no errors or warnings but synthesize to
something functionally different. There is no shortcut that I know of
to aid in that maintenance other than to run the code through a
synthesis engine and see where it warns you about an incomplete
sensitivity list and then fix it...and re-simulate....wasted time if
you ask me.
* Default values for signals inside the combinatorial process. Your
example covers every possible path assigning the one signal that is
inside the process but let's say that process assigned to 4 different
signals. You'd need to make sure that every possible path always got
assigned regardless of the logic. Typically what is done is right at
the begining of the combinatorial process you would have 'default'
assignments to those 4 signals and then head down into the meat of what
the process is attempting to do. If every possible path does not
assign a value then you've created a combinatorial latch...definite
'no,no' in FPGA/CPLDs. The method of always having a 'default'
assignment right at the begining of the process is at least a way to
insure that you cover this....and is also a contributor to why I
mentioned that the difference between the number of lines of code
written using the two process approach versus using the one process
approach grows with the number of signals being assigned within that
process. Using the one process coding style these additional
statements just aren't necessary.
- Whether one uses the 'one process' coding style or the 'two process'
coding style you will get the exact same synthesis results.
- More lines of code translates into more possible bugs.

Those are pretty much the indisputable facts about 'one process' versus
'two process' coding style that even the two process people will
begrudingly acknowledge. Pick up a textbook and you'll often see the
two process being waxed poetically and how it has some esoteric
advantage but when you cut through the cr-p and dig down to how it is
in any way better you'll come up empty. A lot of times the style that
people seem to use has more to do with
- How they were first taught it.
- How their first 'mentor' taught them.
- How the company they work for demands it.

From that they now develop a personal preference and tend to stick with
it.

So take a look at the above mentioned disadvantages and I'll bet that
they are probably fairly closely aligned to how your personal
productivity will be measured in your job and decide for yourself.
Better yet, take some stripped down but somewhat realistic code and
implement it both ways and see firsthand for yourself.

As for the specific code of the original post, I simply think it
belongs inside the clocked process since ultimately this is a clocked
output and having two lines of code that are physically separated in
the file to implement what can be done with one is not a method I would
encourage.

The optomization that you mentioned as a suggestion would have been a
good one if the synthesis tool hadn't (most likely) already implemented
it. Based on the original poster's comment that he had tried your
suggestion and got the same performance would suggest that the
synthesis process had indeed sniffed this one out. It can be
enlightening sometimes to take a look at the synthesized output code
and see how it has implemented functionally the same thing but in a
much better (space, performance) way than an inspection of your source
code would seem to suggest that it was intended to be implemented.

KJ
 
KJ wrote:
...
- Whether one uses the 'one process' coding style or the 'two process'
coding style you will get the exact same synthesis results.
- More lines of code translates into more possible bugs.

Those are pretty much the indisputable facts about 'one process' versus
'two process' coding style that even the two process people will
begrudingly acknowledge. Pick up a textbook and you'll often see the
two process being waxed poetically and how it has some esoteric
advantage but when you cut through the cr-p and dig down to how it is
in any way better you'll come up empty. A lot of times the style that
people seem to use has more to do with
- How they were first taught it.
- How their first 'mentor' taught them.
- How the company they work for demands it.
...

KJ
KJ,

I think I'm convinced. I had a module that was using the single clocked
process style, and I converted it to the 2 process style. So for every
clocked signal I had to declare a next_<whatever> signal. During
simulation I wasted time because my non-clocked process didn't have
all the sensitivities listed. Then during synthesis, I got bit because
I missed one or to next_<whatever> signals that didn't have a default
setting (as in starting out with next_cmd <= cmd;) So in this one
exercise I experienced pretty much everything you listed. Especially:
works fine in simulation, fails on actual hardware -- what a headache
to track down.

Now for the 2 coding style I had been attracted because it eliminates
one level of if -- the non-clocked region doesn't need to be inside
a "if clk'event and clk = '1' then". This makes for less indented code.

Another possible benefit to clocked/non-clocked processes: the non-clocked
can be divided up into different processes which can be kept small.
However that craps out when you want 2 processes to define the same
signal depending on state -- which means your state dependent code
has to be in one process anyway.

Trying both on a real world, but non-complex circuit, was a very good
exercise. Actually the design was for a simple memory test unit using
the open cores ddr controller -- it fills memory up with a non-repetitive
pattern, then reads it out and verifies it. If all is ok an LED stays lit.

All in all I think I lean towards the single clocked process. Thanks for
taking the time to elaborate all that!

-Dave

--
David Ashley http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
 
Hi KJ,

The critical path is between 1. ontime FF to Count FF (6.36ns) and 2.
dataword_2 to ontime_8 (5.08 ns)

Actually it is meeting 100 MHz clock requirement. But in overall
project I need this module should reach 200 MHz. Now it is giving upto
150 MHz.

I know this might be annoying like asking questions without full code.
So let me give code....


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use IEEE.NUMERIC_STD.ALL;

---- Uncomment the following library declaration if instantiating
---- any Xilinx primitives in this code.
--library UNISIM;
--use UNISIM.VComponents.all;

entity clock_gen is
generic (default_div : std_logic_vector(10 downto 0) :=
"00000110001");
Port (
clk100 : in std_logic;
reset : in std_logic;
enable : in std_logic;
go : in std_logic;
dataWord : in std_logic_vector(11 downto 0);
clock_out : out std_logic;
done : out std_logic
);
end entity clock_gen;

architecture Behavioral of clock_gen is
signal count : std_logic_vector(10 downto 0);
signal ontime : std_logic_vector(10 downto 0);
signal clkout : std_logic;
signal go_en : std_logic;
begin

go_en <= go and enable;

process(clk100, reset)
begin
if reset='1' then
ontime <= default_div;
elsif rising_edge(clk100) then
if go_en='1' then
ontime <= dataWord(11 downto 1)-'1';
end if;
end if;
end process;

process(clk100, reset)
begin
if reset='1' then
done <= '1';
elsif rising_edge(clk100) then
if go_en='1' then
done <= '0';
else
done <= '1';
end if;
end if;
end process;

process(clk100, reset)
begin
if reset='1' then
count <= (others => '0');
elsif rising_edge(clk100) then
if go_en='1' or count=ontime then
count <= (others => '0');
else
count <= count + '1';
end if;
end if;
end process;

process(clk100, reset)
begin
if reset='1' then
clkout <= '0';
elsif rising_edge(clk100) then
if count=ontime then
clkout <= not clkout;
end if;
end if;
end process;

clock_out <= clkout;

end architecture Behavioral;


KJ, I am speachless .. what a guess.
You are an inspiration to programmers like me..


Regards,
Krishna
krishna.janumanchi@wipro.com



KJ wrote:
mysticlol wrote:
Hi,

I am having trouble with this code. It is reducing the design max clock
frequency considerably. How can I improve this?

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
begin

process(clock, reset)
begin
if reset='1' then
ontime <= DEFAULT_TIME;
elsif rising_edge(clock) then
if go='1' then
ontime <= dataWord(11 downto 1)-'1'; -- dataWord - 12 bit i/p
end if;
end if;
end process;
.
.
.
end architecture rtl;

First take a look at the critical timing path that is being reported by
the fitter and try to understand what is going on in that path.
Understanding what is going on in that path is the key to improving
performance.

The synthesis/fitter tools are all pretty smart and have probably
already performed the optomization that Dave suggested about moving the
calculation of 'ontime' outside the process. That's why when you tried
it, it didn't help...it probably synthesized to the exact same thing.
For that reason, you probably shouldn't write your code in that manner,
but keep it inside the logic block where it belongs. Sometimes there
are exceptions to this general guideline but in this case there isn't.

I'm guessing that the worst case path is the setup time from the input
'dataword' to the flip flops that compute 'ontime'. Since setup times
that start from external pins are almost always longer than ones that
start from an internal flip flop the way to improve clock cycle
performance is to simply register 'dataword' right at the input and
then feed the 'dataword_delayed' into your computation of 'ontime'.
The disadvantage is that you have added one clock cycle latency but in
this case it looks like you can probably change your code to compensate
for that.

KJ
 
Hi,

David Ashley schrieb:
mysticlol wrote:
process(clock, reset)
begin
if reset='1' then
ontime <= DEFAULT_TIME;
elsif rising_edge(clock) then
if go='1' then
ontime <= dataWord(11 downto 1)-'1'; -- dataWord - 12 bit i/p
end if;
end if;
end process;
[..]
Probably everyone will say go with a synchronous clock approach. Also
you can compute the next data word in combinatorial logic:
You can anyway calculate the data only combinatorical and store the
result sequentiell.

architecture rtl of clock_gen is
constant DEFAULT_TIME : std_logic_vector := "1111011010";
signal ontime : std_logic_vector( 10 downto 0);
signal next_dataWord : std_logic_vector (10 downto 0);

begin
next_dataWord <= dataWord(11 downto 1) - '1';

process(clock)
begin
if rising_edge(clock) then
if reset='1' then
ontime <= DEFAULT_TIME;
elsif go='1' then
ontime <= next_dataWord;
end if;
end if;
end process;
The most interessting difference I see, is the change of async reset to
a sync reset.
Your code will be better for real stupid tools, but I think there is no
commercial tool making any difference between
<clocked>
A=B+1;
</clocked>
and
temp=B+1;
<clocked>
A=temp;
</clocked>

What the OP wants is a faster '+ (-1)' than the one he gets with using
just '-1'.
Fist step is to investigate long paths with respect to layout and
netload.
Then have a look at the available technology, maybe you have a fast
carry chain outrunning even CLA (Carry Lookahead Adder) with a RCA
(Ripple-Carry Adder).

Last chance would be the use of pipeline stages. Or use double HW, if
your adder runs 100 MHz but not 200 MHz you could possibly use two
adders parallel running at 100 MHz to get the needed number of results
per second

bye Thomas
 
mysticlol wrote:
Hi KJ,

The critical path is between 1. ontime FF to Count FF (6.36ns) and 2.
dataword_2 to ontime_8 (5.08 ns)

Actually it is meeting 100 MHz clock requirement. But in overall
project I need this module should reach 200 MHz. Now it is giving upto
150 MHz.

So let me give code....


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use IEEE.NUMERIC_STD.ALL;
Get rid of the std_logic_arith and std_logic_unsigned....numeric_std is
the preferred approach and is a standard. The other two (despite being
in the 'ieee' library) are not standards. This won't affect your
timing, just a comment.

The path from dataword_2 to ontime_8 (5.08 ns) might improve by adding
the previously mentioned flip flop delay. Many parts have I/O buffers
that include flip flops. Making use of those will get the best
possible timing, whether it will get you up to 150-200 MHz depends on
the specific part.

In order to 'make use of those flip flops' you need to make sure that
the code you write get implemented in the way that you want it to.
Some parts with the flip flops in the I/O buffer do not have any sort
of clock enable or reset....they are just flip flops. So if the code
that you write infers something more than that (as your does) than it
won't be able to make use of those flops. It might cost you one clock
cycle of latency but if clock frequency is the issue, then this will
help.

The path from ontime FF to Count FF is caused by your statements "if
go_en='1' or count=ontime then". In order to see if 'count = ontime'
logic needs to be implemented to compare two 11 bit numbers. There's
no magic, a compare of two 11 bit numbers is a function of 22 inputs
producing a single output. The way to improve performance here is to
try to recast the problem into one that compares an 11 bit number to a
constant. Now logic needs to be implemented to take 11 bits of input
and produce a single output. Again, it might not end up being fast
enough to hit 150-200 MHz but it should get you a lot closer.

It appears on the surface that instead of resetting count to 0 and then
incrementing it up until you get to 'ontime' you could instead
initialize a counter to something dependent on dataword (since that is
where 'ontime' originates from) and then have that counter decrement
until it hits 0. That way the equivalent to comparing 'count=ontime'
will be to compare 'count=0' which, as I said, is comparing an 11 bit
number to a constant which should speed up that critical path.

You'll need to work through the detailed changes so that you keep the
function that you're trying to implement but that should be enough to
get you going.

As an aside, always keep in mind whenever you use any sort of multi-bit
operator ('=', '>' or '<', '+', '-', '*', '/', etc.) that these things
can be costly in terms of logic and timing. If performance is an issue
you'll almost always want to see if there are ways to write the code so
that one of the operands is a constant since that cuts the number of
inputs to that function in half. It can't always be done, and
depending on the hardware resources in the part it might not be
desirable, but it's good to keep this sort of transformation in the
back of your head for those times when you need it.

KJ
 
David Ashley wrote:
mysticlol wrote:
This:

process(clock, reset)
begin
if reset='1' then
ontime <= DEFAULT_TIME;
elsif rising_edge(clock) then
if go='1' then
ontime <= dataWord(11 downto 1)-'1'; -- dataWord - 12 bit i/p
end if;
end if;
end process;
and this:

next_dataWord <= dataWord(11 downto 1) - '1';

process(clock)
begin
if rising_edge(clock) then
if reset='1' then
ontime <= DEFAULT_TIME;
elsif go='1' then
ontime <= next_dataWord;
end if;
end if;
end process;
are equivalent. Mostly; you've used a synchronous reset and
"mysticlol" has used an async reset. But the "go" signal assignment is
otherwise identical.
-a
 
OK.
I will make these changes to my code.
Thank you all for the timely suggestions.

Regards,
Krishna Janumanchi
 

Welcome to EDABoard.com

Sponsor

Back
Top