Am I right in my VHDL code? Synopsys DC runs for ever...

walala

Guest
Dear all,

I am writing VHDL code for a digital signal processing block; basically, it's an IDCT. It
takes in some data and outputs 8 pixels per cycle, so it takes 8 cycles in total to
output an 8x8 matrix (see COUNT)...

I also maintain an internal 8x8 memory ZZ. Whenever a coefficient X is input,
a CASE statement with cases 0 to 63, selected by the position POS,
multiplies X by a different constant matrix C and then adds/accumulates
the result into ZZ.

As you can see from the code, the source is huge, about 228 KBytes, but
I made the structure very regular: each case is just a handful of
multiplications and 64 18-bit additions (one for each element of the 8x8
memory ZZ)... I hope Synopsys DC can do a good job of reusing the components:
there is only one input per clock cycle, so the 64 cases never
overlap in time... If DC shares the resources, I
will need at most 10 multipliers plus 64 18-bit adders (maybe ripple-carry
adders)...

But Synopsys DC has been running for a day and seems to run forever. I am beginning to
doubt the correctness of my code... Is there anything wrong with my code?
Is the source too large? Will Synopsys DC produce a resource-sharing
result? I am not sure DC won't be stupid enough to generate 64*64 adders.

I know there are a bunch of experts here... can you give me some
suggestions? How can I optimize this?

Thanks a lot,

-Walala

My code snippet is as follows:

---------------------------------------------------------------
ARCHITECTURE FLEX OF MYIDCT_ZERO IS
  SIGNAL ZZ : INTERNAL_WORD_ARRAY_2D;
  SIGNAL COUNT: INTEGER RANGE 0 TO 8;
BEGIN
  P1: PROCESS(RST, CLK, INPUTEND)
    VARIABLE T1, T2, T3, T4, T5, T6, T7, T8, T9, T10: INTERNAL_WORD;
    VARIABLE TEMP: INTERNAL_WORD;
  BEGIN
    IF RST = '1' THEN
      ZZ <= (OTHERS => (OTHERS => 0) ) ;
    ELSIF CLK'EVENT AND CLK = '1' THEN
      IF INPUTEND='1' THEN
        --when all input finished:
        TEMP:=CONV_STD_LOGIC_VECTOR(ZZ(CONV_INTEGER('0' & COUNT))(0), 17);
        Y0<=TEMP(15 DOWNTO 8)+TEMP(7); ...
        --output from Y0 to Y7 when count =0, output another Y0 to Y7 when count=1,
        --etc...count from 0 to 7...
        COUNT <= COUNT + '1';
      ELSE
        --when input is not finished:

        CASE POS IS
          --depending on input position, I have a 64-cases multiplexer
          WHEN 0 =>
            T1:=CONV_INTEGER(X) * 32;
            ZZ(0)(0)<=ZZ(0)(0)+T1;
            ZZ(0)(1)<=ZZ(0)(1)+T1;
            ZZ(0)(2)<=ZZ(0)(2)+T1;
            -- .... all the way to ZZ(7)(7)
            ZZ(7)(5)<=ZZ(7)(5)+T1;
            ZZ(7)(6)<=ZZ(7)(6)+T1;
            ZZ(7)(7)<=ZZ(7)(7)+T1;

          WHEN 1 =>
            T1:=CONV_INTEGER(X) * 44 ;
            T2:=CONV_INTEGER(X) * 28;
            T3:=CONV_INTEGER(X) * 15;
            T4:=CONV_INTEGER(X) * 7;
            ZZ(0)(0)<=ZZ(0)(0)+T4;
            ZZ(0)(1)<=ZZ(0)(1)+T4;
            ZZ(0)(2)<=ZZ(0)(2)+T4;
            --all the way to ZZ(7)(7)
            ZZ(7)(7)<=ZZ(2)(0)+T2;

          WHEN 2 =>
 
Dear experts,

Is it simply because the source code is too large (228 KBytes)?

If I further split the 64 additions of each case across 8 clock cycles, will
that help reduce the number of adders to 8, since only 8 additions are then
needed per clock cycle and the input is held for 8 cycles?

For instance, I can do the following:

WHEN 0 =>
  T1:=CONV_INTEGER(X) * 32;

  STEP<=STEP+1;

  CASE STEP IS
    WHEN 0 =>
      ZZ(0)(0)<=ZZ(0)(0)+T1;
      ZZ(0)(1)<=ZZ(0)(1)+T1;
      ZZ(0)(2)<=ZZ(0)(2)+T1;
      ZZ(0)(3)<=ZZ(0)(3)+T1;
      ZZ(0)(4)<=ZZ(0)(4)+T1;
      ZZ(0)(5)<=ZZ(0)(5)+T1;
      ZZ(0)(6)<=ZZ(0)(6)+T1;
      ZZ(0)(7)<=ZZ(0)(7)+T1;

    WHEN 1 =>
      ZZ(1)(0)<=ZZ(1)(0)+T1;
      ZZ(1)(1)<=ZZ(1)(1)+T1;
      ZZ(1)(2)<=ZZ(1)(2)+T1;
      ZZ(1)(3)<=ZZ(1)(3)+T1;
      ZZ(1)(4)<=ZZ(1)(4)+T1;
      ZZ(1)(5)<=ZZ(1)(5)+T1;
      ZZ(1)(6)<=ZZ(1)(6)+T1;
      ZZ(1)(7)<=ZZ(1)(7)+T1;


You may ask why I don't put everything in a "for" loop. I cannot,
because the example above is for the first case (WHEN 0), which adds T1 to
every element of ZZ; the other cases are not so regular, so I cannot put
everything in a "for" loop...

My question is: if I do the above, will Synopsys DC do a good job of sharing
resources, so that I only need a handful of multipliers and at most 8 adders?
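
A minimal sketch of the row-per-step idea with the sharing written out explicitly, instead of relying on the tool to discover it: a single loop over the 8 columns, with the row selected by STEP. Only the simplest case (every element gets T1) is shown. The entity name, the port names, the exposed output and the wrap-around of STEP are assumptions made for illustration, as is the use of numeric_std instead of the packages in the original code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical reduced example: one 8-element row is updated per clock.
entity row_step_sketch is
  port (
    clk : in  std_logic;
    rst : in  std_logic;
    t1  : in  signed(17 downto 0);   -- product to accumulate (assumed 18 bits)
    z00 : out signed(17 downto 0)    -- one element exposed for observation
  );
end entity row_step_sketch;

architecture rtl of row_step_sketch is
  type row_t is array (0 to 7) of signed(17 downto 0);
  type mat_t is array (0 to 7) of row_t;
  signal zz   : mat_t;
  signal step : integer range 0 to 7;
begin
  process (rst, clk)
  begin
    if rst = '1' then
      zz   <= (others => (others => (others => '0')));
      step <= 0;
    elsif rising_edge(clk) then
      -- 8 additions per clock; STEP selects which row is read and written
      for j in 0 to 7 loop
        zz(step)(j) <= zz(step)(j) + t1;
      end loop;
      -- assumed behaviour: STEP wraps around after the last row
      if step = 7 then
        step <= 0;
      else
        step <= step + 1;
      end if;
    end if;
  end process;

  z00 <= zz(0)(0);
end architecture rtl;
```

Described like this, the source contains 8 additions plus the logic that selects which row they read from and write to. Whether that ends up smaller overall than 64 adders depends on the cost of that selection logic, so it is worth comparing synthesis reports for both versions.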

But the code size will grow to about 300 KBytes. Will Synopsys DC have a hard
time handling it and run forever? (I now have two levels of CASE, and four
levels of nesting if the IFs are included.)

Please give me some advice on this design!

Thanks a lot,

-Walala
 
Walala,
A synthesis tool is not a magic hat. Most often
you get out a reflection of what you put in.

I suggest rather that you start by drawing a picture
of what you want and then code it. As a hint as to
where you are currently having problems, draw an
exact picture of what you have coded. Note where
the data inputs are, where your numeric computation
units are, and where your multiplexers are.

To experiment with different coding techniques, I
recommend that you reduce your problem size some.
In addition, I recommend that you do some baseline
synthesis to establish how big one of your
multipliers is and how big one of your adders is.

What is the type of Internal_Word? Be careful if it is
an integer.
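
To illustrate the concern about integers: an unconstrained INTEGER is normally synthesized as a 32-bit value, so every adder and register built from it would be 32 bits wide rather than 18. Below is a sketch of two ways to pin the width down, assuming the 18-bit word mentioned in the original post. The package name and the names `internal_word_row` and `internal_int` are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package idct_types is
  -- Assumption: 18 bits, taken from the "18-bit additions" in the original post
  constant WORD_BITS : natural := 18;

  -- Option 1: an explicitly sized signed vector
  subtype internal_word is signed(WORD_BITS - 1 downto 0);
  type internal_word_row      is array (0 to 7) of internal_word;
  type internal_word_array_2d is array (0 to 7) of internal_word_row;

  -- Option 2: keep integers, but constrain the range so the width is known
  subtype internal_int is integer
    range -2**(WORD_BITS - 1) to 2**(WORD_BITS - 1) - 1;
end package idct_types;
```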

Cheers,
Jim Lewis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jim Lewis
Director of Training mailto:Jim@SynthWorks.com
SynthWorks Design Inc. http://www.SynthWorks.com
1-503-590-4787

Expert VHDL Training for Hardware Design and Verification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


 
walala wrote:

> But Synopsys DC seems run forever after running for one day. I begin to
> suspect the correctness of my code.... Is there anything wrong with my code?

One thing at least. Take INPUTEND out of the sensitivity list.

Consider simulation before synthesis.


-- Mike Treseler
 
As my C book put it: programs of up to 10000 lines can be
made to run by brute force.

Creating an algorithm is one art; implementing it in hardware is
another. RTFM.

Chris


 
"Mike Treseler" <mike.treseler@flukenetworks.com> wrote in message
news:3F708F77.8090107@flukenetworks.com...
> walala wrote:
>
>> But Synopsys DC seems run forever after running for one day. I begin to
>> suspect the correctness of my code.... Is there anything wrong with my
>> code?
>
> One thing at least. Take INPUTEND out of the sensitivity list.
>
> Consider simulation before synthesis.
>
> -- Mike Treseler

Dear Mike,

Can you explain why? All the VHDL books say that a signal that serves as an
enable should be in the sensitivity list... and the simulation went OK: I
used ModelSim and got correct results...

--Thanks,

--Walala
 
"Jim Lewis" <Jim@SynthWorks.com> wrote in message
news:3F7072C7.2090707@SynthWorks.com...
> Walala,
> A synthesis tool is not a magic hat. Most often
> you get out a reflection of what you put in.

Dear Jim,

Thank you for your advice.

Here is my result. DC is in fact doing a good job: my original design comes
down to 80 adders, so a lot of sharing is going on across the 64 cases.

Now I am going to further split the 64 additions of each case across 8
clock cycles. Will that help reduce the number of adders to 8, since only
8 additions are then needed per clock cycle and the input is held for 8 cycles?

For instance, I can do the following:

WHEN 0 =>
  T1:=CONV_INTEGER(X) * 32;

  STEP<=STEP+1;

  CASE STEP IS
    WHEN 0 =>
      ZZ(0)(0)<=ZZ(0)(0)+T1;
      ZZ(0)(1)<=ZZ(0)(1)+T1;
      ZZ(0)(2)<=ZZ(0)(2)+T1;
      ZZ(0)(3)<=ZZ(0)(3)+T1;
      ZZ(0)(4)<=ZZ(0)(4)+T1;
      ZZ(0)(5)<=ZZ(0)(5)+T1;
      ZZ(0)(6)<=ZZ(0)(6)+T1;
      ZZ(0)(7)<=ZZ(0)(7)+T1;

    WHEN 1 =>
      ZZ(1)(0)<=ZZ(1)(0)+T1;
      ZZ(1)(1)<=ZZ(1)(1)+T1;
      ZZ(1)(2)<=ZZ(1)(2)+T1;
      ZZ(1)(3)<=ZZ(1)(3)+T1;
      ZZ(1)(4)<=ZZ(1)(4)+T1;
      ZZ(1)(5)<=ZZ(1)(5)+T1;
      ZZ(1)(6)<=ZZ(1)(6)+T1;
      ZZ(1)(7)<=ZZ(1)(7)+T1;


You may ask why I don't put everything in a "for" loop. I cannot,
because the example above is for the first case (WHEN 0), which adds T1 to
every element of ZZ; the other cases are not so regular, so I cannot put
everything in a "for" loop...

Please give me some advice on this design!
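
One possible way around the irregularity, offered as an illustrative sketch rather than anything proposed in this thread, is to move what differs between the cases into constant tables indexed by POS, so that a single loop describes the hardware. The reduced example below uses 4 positions, 4 accumulators and 2 products per position. The entity name, the table names COEF and SEL, the 8-bit width of X and all table values are placeholders, since the real IDCT constants are not given here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical reduced example: 4 positions, 4 accumulators, 2 products.
entity acc_table_sketch is
  port (
    clk : in  std_logic;
    rst : in  std_logic;
    pos : in  integer range 0 to 3;   -- reduced: 4 positions instead of 64
    x   : in  signed(7 downto 0);     -- input width is an assumption
    z0  : out signed(17 downto 0)     -- one accumulator exposed for observation
  );
end entity acc_table_sketch;

architecture rtl of acc_table_sketch is
  constant N_POS  : natural := 4;
  constant N_ELEM : natural := 4;
  constant N_PROD : natural := 2;

  type coef_list  is array (0 to N_PROD - 1) of integer range -64 to 63;
  type coef_table is array (0 to N_POS - 1)  of coef_list;
  type sel_list   is array (0 to N_ELEM - 1) of integer range 0 to N_PROD - 1;
  type sel_table  is array (0 to N_POS - 1)  of sel_list;

  -- Placeholder values only: constants used at each position
  constant COEF : coef_table := ((32, 0), (44, 28), (15, 7), (28, -28));
  -- Placeholder values only: which product each accumulator adds at each position
  constant SEL  : sel_table  :=
    ((0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 0, 1), (0, 1, 0, 1));

  type prod_array is array (0 to N_PROD - 1) of signed(17 downto 0);
  type acc_array  is array (0 to N_ELEM - 1) of signed(17 downto 0);
  signal zz : acc_array;
begin
  process (rst, clk)
    variable prod : prod_array;
  begin
    if rst = '1' then
      zz <= (others => (others => '0'));
    elsif rising_edge(clk) then
      -- N_PROD multiplications: X times the constants selected by POS
      for k in prod'range loop
        prod(k) := resize(x * to_signed(COEF(pos)(k), 8), 18);
      end loop;
      -- one addition per accumulator; SEL picks the product it uses
      for i in zz'range loop
        zz(i) <= zz(i) + prod(SEL(pos)(i));
      end loop;
    end if;
  end process;

  z0 <= zz(0);
end architecture rtl;
```

Written this way, the source describes N_PROD multiplications and one addition per accumulator, with the per-position differences confined to the two tables. How a given tool maps it is something to check in its reports on a small case first.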

Thanks a lot,

-Walala
 
walala wrote:

> Can you explain to me why? All VHDL books say that the net name serves as an
> enable should be in the sensitivity list....

Hmmm. None of my books say that.
A synchronous input should never be in the list.
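
As a sketch of what this means for a clocked process with an asynchronous reset: only the clock and the reset go in the sensitivity list, and a synchronous control such as INPUTEND is simply read inside the clocked branch. The entity name and the ports `d` and `q` below are made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical template, not taken from the original design.
entity clocked_template is
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;   -- asynchronous reset
    inputend : in  std_logic;   -- synchronous control input
    d        : in  std_logic;
    q        : out std_logic
  );
end entity clocked_template;

architecture rtl of clocked_template is
begin
  -- Only the clock and the asynchronous reset are in the sensitivity list
  p1 : process (rst, clk)
  begin
    if rst = '1' then
      q <= '0';
    elsif rising_edge(clk) then
      -- inputend is sampled on the clock edge, so it needs no list entry
      if inputend = '1' then
        q <= d;
      end if;
    end if;
  end process p1;
end architecture rtl;
```

Simulation results are the same with INPUTEND in the list, because nothing is assigned unless there is a reset or a clock edge; the extra entry only wakes the process up needlessly.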

> and the simulation went ok... I
> used Modelsim and got correct result...

Good work.

Then the problem is that your design is too complex for
your PC or for Synopsys.

Try starting with a much simplified case.

-- Mike Treseler
 
Hi "walala"!



> I am doing a VHDL code on digital signal processing.Basically, it's IDCT. It
> takes in some data, and output 8 pixels at a cycle, totally 8 cycles to
> output a 8x8 matrix(see count)...
>
> I also maintain an internal 8x8 memory ZZ, once there is a coefficient input
> X, depending on its position, I have a case 0 to case 63 multiplexer, to
> multiply the input X with a different constant matrix C(depending on
> position POS), then add/accumulate to ZZ itself.

What is your target? If it's an FPGA, take the internal block RAM. It
can be used like normal flip-flops, but is small and easy to implement.

> As you can see from the code, the size of code is huge, about 228KBytes, but

It's not very huge. And the synthesis time does not depend on the amount
of code, but on the way your hardware is described.


Hint:
Break your design into smaller parts. Synthesize them separately,
step by step, bottom-up. If you are in doubt, take the extra
work time and split processes into smaller ones. Put them into
subcomponents that can be synthesized separately.

Think about every process / signal assignment that you describe and what
it could be in real hardware. Don't program VHDL - describe hardware!
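
As a sketch of the bottom-up approach, one accumulator element could be put in its own small entity and synthesized by itself first, which also gives a size figure for a single adder plus register. The entity name `acc_cell`, its ports and the enable input are assumptions for illustration; the 18-bit default follows the word width mentioned in the original post.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical subcomponent: one registered accumulator.
entity acc_cell is
  generic (WIDTH : positive := 18);
  port (
    clk  : in  std_logic;
    rst  : in  std_logic;                       -- asynchronous reset
    en   : in  std_logic;                       -- accumulate when '1'
    term : in  signed(WIDTH - 1 downto 0);      -- value to add
    acc  : out signed(WIDTH - 1 downto 0)       -- running sum
  );
end entity acc_cell;

architecture rtl of acc_cell is
  signal acc_r : signed(WIDTH - 1 downto 0);
begin
  process (rst, clk)
  begin
    if rst = '1' then
      acc_r <= (others => '0');
    elsif rising_edge(clk) then
      if en = '1' then
        acc_r <= acc_r + term;   -- one adder feeding one register
      end if;
    end if;
  end process;

  acc <= acc_r;
end architecture rtl;
```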


Ralf
 

Welcome to EDABoard.com
