walala
Dear all,
I am writing VHDL code for a digital signal processing block; basically, it's
an IDCT. It takes in some data and outputs 8 pixels per cycle, taking 8 cycles
in total to output an 8x8 matrix (see COUNT)...
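A minimal sketch of that row-per-cycle readout, assuming numeric_std, 18-bit signed ZZ elements and 8-bit signed pixels. The names readout_pkg, row_readout, acc_mat, pix_row and the en port (standing in for INPUTEND) are hypothetical. It keeps bits 15..8 and adds bit 7 to round, the same arithmetic as the Y0 line in the code snippet, and wraps the row counter from 7 back to 0. Like the original, it does not saturate, so a value that rounds past 127 wraps around.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package readout_pkg is
  subtype acc_t is signed(17 downto 0);   -- one ZZ element (assumed 18-bit signed)
  subtype pix_t is signed(7 downto 0);    -- one output pixel (assumed 8-bit signed)
  type acc_row is array (0 to 7) of acc_t;
  type acc_mat is array (0 to 7) of acc_row;
  type pix_row is array (0 to 7) of pix_t;
end package readout_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.readout_pkg.all;

entity row_readout is
  port (
    clk : in  std_logic;
    rst : in  std_logic;
    en  : in  std_logic;   -- hypothetical: plays the role of INPUTEND
    zz  : in  acc_mat;     -- the accumulated 8x8 matrix
    y   : out pix_row      -- Y0..Y7 for the current row
  );
end entity row_readout;

architecture rtl of row_readout is
  signal count : integer range 0 to 7 := 0;   -- current output row
begin
  process (clk, rst)
  begin
    if rst = '1' then
      count <= 0;
      y     <= (others => (others => '0'));
    elsif rising_edge(clk) then
      if en = '1' then
        for k in 0 to 7 loop
          -- drop the low 8 bits; bit 7 set means round up (no saturation)
          if zz(count)(k)(7) = '1' then
            y(k) <= zz(count)(k)(15 downto 8) + 1;
          else
            y(k) <= zz(count)(k)(15 downto 8);
          end if;
        end loop;
        -- row counter wraps 7 -> 0 instead of running past the last row
        if count = 7 then
          count <= 0;
        else
          count <= count + 1;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```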
I also maintain an internal 8x8 memory ZZ. Whenever a coefficient X is input,
a case 0 to case 63 multiplexer selects, depending on its position POS, a
different constant matrix C to multiply X by, and the product is then
added/accumulated into ZZ itself.
As you can see from the code, the source is huge, about 228 KB, but I made the
structure very regular: each case is just a bunch of multiplications and 64
18-bit additions (one for each element of the 8x8 memory ZZ)... I hope
Synopsys DC can do a good job of reusing the components: since only one input
arrives per clock cycle, the 64 cases never overlap in time... If DC does
share the resources, I will need at most 10 multipliers plus 64 18-bit adders
(maybe ripple-carry adders)...
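One common way to make that sharing explicit, instead of relying on DC to discover it across 64 case branches: write exactly one adder per ZZ element, compute the few constant products once per cycle, and let POS only select which product each element adds. This is a scaled-down sketch, assuming numeric_std, an 18-bit signed accumulator, a 12-bit signed X, 4 elements instead of 64 and 2 positions instead of 64. The names share_pkg, shared_acc, K and SEL are hypothetical; the POS = 0 row mirrors the WHEN 0 branch (everything gets X*32), while the POS = 1 row is made up for illustration and is not the real IDCT constant matrix. Real IDCT constants also include negative values, which would need negative entries in K or a sign column in the table.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package share_pkg is
  constant N_ELEM : integer := 4;   -- hypothetical: 4 accumulators instead of 64
  constant N_POS  : integer := 2;   -- hypothetical: 2 input positions instead of 64
  constant N_PROD : integer := 5;   -- one shared constant multiplier per distinct constant
  subtype sum_t is signed(17 downto 0);
  type sum_array  is array (0 to N_ELEM - 1) of sum_t;
  type prod_array is array (0 to N_PROD - 1) of sum_t;
  type int_vec    is array (natural range <>) of integer;
  type sel_table  is array (0 to N_POS - 1, 0 to N_ELEM - 1) of integer range 0 to N_PROD - 1;
  -- Constants seen in the WHEN 0 and WHEN 1 branches (44 and 15 unused in this toy table).
  constant K   : int_vec(0 to N_PROD - 1) := (32, 44, 28, 15, 7);
  -- SEL(p, e): index into K that element e accumulates when POS = p.
  -- Row 0 = WHEN 0 (all X*32); row 1 is illustrative only (X*7, X*7, X*7, X*28).
  constant SEL : sel_table := ((0, 0, 0, 0), (4, 4, 4, 2));
end package share_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.share_pkg.all;

entity shared_acc is
  port (
    clk : in  std_logic;
    rst : in  std_logic;
    en  : in  std_logic;                   -- '1' while a coefficient X is valid
    pos : in  integer range 0 to N_POS - 1;
    x   : in  signed(11 downto 0);         -- assumed 12-bit signed coefficient
    zz  : out sum_array
  );
end entity shared_acc;

architecture rtl of shared_acc is
  signal acc : sum_array := (others => (others => '0'));
begin
  process (clk, rst)
    variable prod : prod_array;
  begin
    if rst = '1' then
      acc <= (others => (others => '0'));
    elsif rising_edge(clk) then
      if en = '1' then
        -- N_PROD constant multipliers, shared by every value of POS
        for m in 0 to N_PROD - 1 loop
          prod(m) := resize(x * to_signed(K(m), 7), sum_t'length);
        end loop;
        -- exactly one adder per accumulator; POS only steers the operand mux
        for e in 0 to N_ELEM - 1 loop
          acc(e) <= acc(e) + prod(SEL(pos, e));
        end loop;
      end if;
    end if;
  end process;

  zz <= acc;
end architecture rtl;
```

Scaled up, N_ELEM and N_POS would both be 64 and the SEL table would come from the IDCT constant matrix. The source then grows with the table instead of with 64 copies of 64 additions, and the datapath is written directly as one adder per ZZ element plus operand muxes, so there are never 64*64 adders for the tool to merge. Multiplications by constants such as 32 or 7 can also reduce to shifts and adds.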
But after a full day, Synopsys DC is still running and seems like it will run
forever, so I am beginning to suspect my code.... Is there anything wrong with
it? Is the code too large? Will Synopsys DC produce a resource-shared result,
or will it be naive enough to build 64*64 adders?
I know there are a bunch of experts here... can you give me some
suggestions on how to optimize this?
Thanks a lot,
-Walala
My code snippet is as follows:
---------------------------------------------------------------
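-- Assumed context, not shown here: the entity MYIDCT_ZERO (ports RST, CLK,
-- INPUTEND, POS, X, Y0..Y7) and a package declaring INTERNAL_WORD (an
-- integer subtype) and INTERNAL_WORD_ARRAY_2D (8 rows of 8 INTERNAL_WORDs).
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE IEEE.STD_LOGIC_ARITH.ALL;    -- CONV_STD_LOGIC_VECTOR
USE IEEE.STD_LOGIC_SIGNED.ALL;   -- CONV_INTEGER(X) and "+"; assumes X is two's complement
ARCHITECTURE FLEX OF MYIDCT_ZERO IS
SIGNAL ZZ : INTERNAL_WORD_ARRAY_2D;
SIGNAL COUNT : INTEGER RANGE 0 TO 7;   -- the SIGNAL keyword was missing
BEGIN
P1: PROCESS(RST, CLK)   -- async reset: only RST and CLK belong in the list
VARIABLE T1, T2, T3, T4, T5, T6, T7, T8, T9, T10: INTERNAL_WORD;
VARIABLE TEMP: STD_LOGIC_VECTOR(16 DOWNTO 0);   -- sliced below, so a vector
BEGIN
IF RST = '1' THEN
ZZ <= (OTHERS => (OTHERS => 0) ) ;
COUNT <= 0;
ELSIF CLK'EVENT AND CLK = '1' THEN
IF INPUTEND='1' THEN
--when all input finished:
TEMP:=CONV_STD_LOGIC_VECTOR(ZZ(COUNT)(0), 17);   -- COUNT is already an integer
Y0<=TEMP(15 DOWNTO 8)+TEMP(7);   -- keep bits 15..8, round with bit 7
-- ... (Y1 to Y7 elided)
--output Y0 to Y7 for row 0 when COUNT=0, another Y0 to Y7 when COUNT=1,
--etc.; COUNT runs from 0 to 7 and then wraps
IF COUNT = 7 THEN
COUNT <= 0;
ELSE
COUNT <= COUNT + 1;   -- integer 1, not the character '1'
END IF;
ELSE
--when input is not finished:
CASE POS IS
--depending on input position, I have a 64-cases multiplexer
WHEN 0 =>
T1:=CONV_INTEGER(X) * 32;
ZZ(0)(0)<=ZZ(0)(0)+T1;
ZZ(0)(1)<=ZZ(0)(1)+T1;
ZZ(0)(2)<=ZZ(0)(2)+T1;
-- .... all the way to ZZ(7)(7)
ZZ(7)(5)<=ZZ(7)(5)+T1;
ZZ(7)(6)<=ZZ(7)(6)+T1;
ZZ(7)(7)<=ZZ(7)(7)+T1;
WHEN 1 =>
T1:=CONV_INTEGER(X) * 44 ;
T2:=CONV_INTEGER(X) * 28;
T3:=CONV_INTEGER(X) * 15;
T4:=CONV_INTEGER(X) * 7;
ZZ(0)(0)<=ZZ(0)(0)+T4;
ZZ(0)(1)<=ZZ(0)(1)+T4;
ZZ(0)(2)<=ZZ(0)(2)+T4;
--all the way to ZZ(7)(7)
ZZ(7)(7)<=ZZ(7)(7)+T2;   -- was ZZ(2)(0)+T2; each element accumulates into itself
WHEN 2 =>
-- ... body of WHEN 2 and WHEN 3 to WHEN 63 elided
WHEN OTHERS =>
NULL;
END CASE;
END IF;
END IF;
END PROCESS P1;
END FLEX;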