Coding style

Guest
Hi,

Does anyone happen to know what will the following (software) code will look
like after going through a synthesis tool? I was told, back in the school,
that never use *HDL as a software language and should have a block diagram or
data path before coding. But it seams that more and more people trust the
synthesis tool rather.

// Verilog version
reg [5:0] offset; // input
reg [15:0] we; // input
reg [15:0] data_in[127:0]; // input

reg [7:0] data_out [63:0]; // output

wire [5:0] offset0; // internal logic
wire [5:0] offset1; // internal logic
....
wire [5:0] offset15; // internal logic

assign offset0 = offset;
assign offset1 = offset + 1;
....
assign offset15 = offset + 15;

always @ (posedge clk) begin
if (reset) begin
// initial code here
...
end begin
if (we[0]) data[offset0] <= data_in[7:0];
if (we[1]) data[offset1] <= data_in[15:8];
...
if (we[15]) data[offset15] <= data_in[127:120];
end
end

-- VHDL version
type data_type is array (63 downto 0) of std_logic_vector(7 downto 0);

signal offset : std_logic_vector(5 downto 0); -- input
signal we : std_logic_vector (15 downto 0); -- input
signal data_in : std_logic_vector (127 downto 0); -- input

signal data_out : data_type; -- output

signal offset0 : std_logic_vector(5 downto 0); -- internal logic
signal offset1 : std_logic_vector(5 downto 0); -- internal logic
....
signal offset15 : std_logic_vector(5 downto 0); -- internal logic

offset0 <= offset;
offset1 <= offset + 1;
....
offset15 <= offset + 15;

process (clk, reset)
if rising_edge(clk) then
if (reset = '1') then
-- initial code here
...
else
if (we(0)='1') then data(offset0) <= data_in(7 downto 0); end if;
if (we(1)='1') then data(offset0) <= data_in(15 downto 8); end if;
...
if (we(15)='1') then data(offset0) <= data_in(127 downto 120); end if;
end if;
end if;
end process;

If I was to implement the above circuit, I will not write the above codes. I
will first draw a data path diagram which show me that I can have have a
barrel shifter to map the 16-bit 'we' to 64-bit 'data_we' based on the
'offset' input. Than I will use another shifter to "expand" the data and
assign each of the 64 slots based on the 'data_we'. I may even write a simple
Perl script to generate this 64 statement.

Does anyone know any better implementation scheme or any synthesis tools that
will generate a better result base on the above code? I tried Xilinx XST and
the above code failed (out of memory ?!) I know some other synthesis tool can
generated a valid circuit with terrible performance.

The most important question is, "Should I worry about this?" Someone told me
that the code failed to compile just because XST is not good enough. But I
wonder. My teammates and me have a different point of view in CAD tools. They
believe the logic optimization (e.g. from behavioral description to netlist)
and don't trust the physical optimization (e.g. they do hand placement and
timing on every net). I believe the opposite.

There are too many *information* available when running the placement and
routing process which is NOT suitable for a human brain. But the tools can
handle this well. There some *knowledge* in the design which is hard to be
captured in the tools. And we engineerings are trained to use this knowledge
to optimize our design. So I believe that we should code more careful rather
than pumping constraints to the tools.

But the current trend is (at least observed by myself), to encourages users
code more like in a software environment. Is this the design flow for tomorrow
or I misunderstand something?

--
Regards,
Tsoi Kuen Hung (Brittle)
CSE CUHK
 
The code you've shown is rather ugly.
1) The data_in is probably improperly specified as 128 8-bit values
rather than one 128-bit value as implied by your code.
2) The data_out declaration may ne the data[] array you're using in your
code.
3) The implementation in hardware will be 512 registers, not a 64x8
memory element because the we input can have multiple valid signals in
any one cycle. Will there only be one we bit active at a time?

Do you actually want a memory for your data array or are 512 registers
the way to go?


XST will probably synthesize code successfully if it's synthesizable but
not necessarily the most optimal results. The tools improved the
Quality of Results over the years.
 
Thanks for the reply.

John_H <johnhandwork@mail.com> wrote:
The code you've shown is rather ugly.
1) The data_in is probably improperly specified as 128 8-bit values
rather than one 128-bit value as implied by your code.
no, the 'data_in' is a 128-bit vector, as required by external world
(translate, most other logic)
array you're using in your
code.
we use array for the external world reason again
3) The implementation in hardware will be 512 registers, not a 64x8
memory element because the we input can have multiple valid signals in
any one cycle. Will there only be one we bit active at a time?
no, it should be a 64x8 memory, not 512 register, consider the following
C code:

char data_out[64];
char data_in[16]; // we cannot use array in HDL code for this
for (i=0; i<16; i++)
if (we)
data_out[offset] = data_in;

I thinks this C code does the same thing as in the HDL code. So
'data_out' is a 64x8 memory, but not accessed per bit.
Do you actually want a memory for your data array or are 512 registers
the way to go?
Actually, we don't really care about that as long as the circuit is
correct. The final result will go to ASIC.
XST will probably synthesize code successfully if it's synthesizable but
not necessarily the most optimal results. The tools improved the
Quality of Results over the years.
No. we tried the Verilog code in XST (7.1i) but it exit for running out of
memory after a long period. Even though the original code is not allowed for
publish, the code I post here is almost the same thing.

--
Regards,
Tsoi Kuen Hung (Brittle)
CSE CUHK
 
1) Change to
reg [127:0] data_in;
reg [7:0] data [63:0];
so your code matches your declarations.
2) Your C code sample seems to indicate the offset is fixed, not "offset+i"
for we. Which is it?
3) Since you expect a 64-entry memory, you can have only one we active per
clock cycle. Your C code uses 16 cycles for 16 write enables and your
hardware needs to do the same if you want a memory.

To do one write per cycle, the interface is ideally a "we" and a "wrChan"
where the wrChan gives the offset (or helps define the offset) and the write
enable is applied to the memory array.

Whether explicit or implicit, the input data will be multiplexed. Your
original code makes it difficult for a synthesizer to understand what to
implement so an explicit mux or barrel shifter may be appropriate. If you
can get the memory write down to one line, synthesis should occur nicely.

One more comment: you can use "for" statements in Verilog to avoid
generating 64 lines with perl scripts. Check out the use of the for
statement with your favorite Verilog reference guide.

- John_H

<khtsoi@pc89122.cse.cuhk.edu.hk> wrote in message
news:1136302061.348840@jupiter.cse.cuhk.edu.hk...
Thanks for the reply.

John_H <johnhandwork@mail.com> wrote:
The code you've shown is rather ugly.
1) The data_in is probably improperly specified as 128 8-bit values
rather than one 128-bit value as implied by your code.
no, the 'data_in' is a 128-bit vector, as required by external world
(translate, most other logic)
2) The data_out declaration may ne the data[] array you're using in your
code.
we use array for the external world reason again
3) The implementation in hardware will be 512 registers, not a 64x8
memory element because the we input can have multiple valid signals in
any one cycle. Will there only be one we bit active at a time?
no, it should be a 64x8 memory, not 512 register, consider the following
C code:

char data_out[64];
char data_in[16]; // we cannot use array in HDL code for this
for (i=0; i<16; i++)
if (we)
data_out[offset] = data_in;

I thinks this C code does the same thing as in the HDL code. So
'data_out' is a 64x8 memory, but not accessed per bit.

Do you actually want a memory for your data array or are 512 registers
the way to go?
Actually, we don't really care about that as long as the circuit is
correct. The final result will go to ASIC.


XST will probably synthesize code successfully if it's synthesizable but
not necessarily the most optimal results. The tools improved the
Quality of Results over the years.
No. we tried the Verilog code in XST (7.1i) but it exit for running out of
memory after a long period. Even though the original code is not allowed
for
publish, the code I post here is almost the same thing.

--
Regards,
Tsoi Kuen Hung (Brittle)
CSE CUHK
 
John_H <johnhandwork@mail.com> wrote:
1) Change to
reg [127:0] data_in;
reg [7:0] data [63:0];
so your code matches your declarations.
yes, sorry that I have too much typo here. The VHDL code do this correct
by the Verilog code does not. Thanks for pointing it out.
2) Your C code sample seems to indicate the offset is fixed, not "offset+i"
for we. Which is it?
sorry, the C code has typo too, it should be "data_out[offset+i]"
3) Since you expect a 64-entry memory, you can have only one we active per
clock cycle. Your C code uses 16 cycles for 16 write enables and your
hardware needs to do the same if you want a memory.

To do one write per cycle, the interface is ideally a "we" and a "wrChan"
where the wrChan gives the offset (or helps define the offset) and the write
enable is applied to the memory array.
You are right. It should not be a memory as you stated. And this (not a

memory) is what we want since we have no control over the 'offset' and
'we'.

Whether explicit or implicit, the input data will be multiplexed. Your
original code makes it difficult for a synthesizer to understand what to
implement so an explicit mux or barrel shifter may be appropriate. If you
can get the memory write down to one line, synthesis should occur nicely.
This is the main point why I bring up the issue in my original post. I
don't think the logic optimization in synthesis tool is smart enough to
handle this efficiently.

One more comment: you can use "for" statements in Verilog to avoid
generating 64 lines with perl scripts. Check out the use of the for
statement with your favorite Verilog reference guide.
I did tried the 'for' statement. But it always reporting error as
something wrong in the MSB/LSB. I used the syntax
data_out(i) <= data_in(i*8+7:i*8);
And the same error pops out in VHDL code also.
- John_H

khtsoi@pc89122.cse.cuhk.edu.hk> wrote in message
news:1136302061.348840@jupiter.cse.cuhk.edu.hk...
Thanks for the reply.

John_H <johnhandwork@mail.com> wrote:
The code you've shown is rather ugly.
1) The data_in is probably improperly specified as 128 8-bit values
rather than one 128-bit value as implied by your code.
no, the 'data_in' is a 128-bit vector, as required by external world
(translate, most other logic)
2) The data_out declaration may ne the data[] array you're using in your
code.
we use array for the external world reason again
3) The implementation in hardware will be 512 registers, not a 64x8
memory element because the we input can have multiple valid signals in
any one cycle. Will there only be one we bit active at a time?
no, it should be a 64x8 memory, not 512 register, consider the following
C code:

char data_out[64];
char data_in[16]; // we cannot use array in HDL code for this
for (i=0; i<16; i++)
if (we)
data_out[offset] = data_in;

I thinks this C code does the same thing as in the HDL code. So
'data_out' is a 64x8 memory, but not accessed per bit.

Do you actually want a memory for your data array or are 512 registers
the way to go?
Actually, we don't really care about that as long as the circuit is
correct. The final result will go to ASIC.


XST will probably synthesize code successfully if it's synthesizable but
not necessarily the most optimal results. The tools improved the
Quality of Results over the years.
No. we tried the Verilog code in XST (7.1i) but it exit for running out of
memory after a long period. Even though the original code is not allowed
for
publish, the code I post here is almost the same thing.

--
Regards,
Tsoi Kuen Hung (Brittle)
CSE CUHK



--
Regards,
Tsoi Kuen Hung (Brittle)
CSE CUHK
 
So if all 16 we signals can be valid at once, the barrel shifter is the way
I'd suggest going. Generating 64 data_we signals from the original 16 we
signals plus the offset is a clean way to implement the enables as well. If
your target is an ASIC it probably doesn't matter but keep in mind the
approximate level of resources to implement this functionality: 512
registers for the storage, 512 LUTs for the barrrel shifter, and perhaps 160
LUTs for the we to data_we translation.

If the we and offset will be stable for at least 16 clocks and you can wait
for the data to be updated, you can make the process much simpler by cycling
through the we bits (and associated data_in bytes) one at a time.


The error that you had with the for statement can be fixed with a
Verilog2001 construct, specifically

for( i=0; i<16; i=i+1 )
data_out <= data_in[i*8 +: 8];

where the +: operator needs the "size" as the second arguement.

Happy coding!


<khtsoi@pc89122.cse.cuhk.edu.hk> wrote in message
news:1136309010.216093@jupiter.cse.cuhk.edu.hk...
John_H <johnhandwork@mail.com> wrote:
1) Change to
reg [127:0] data_in;
reg [7:0] data [63:0];
so your code matches your declarations.
yes, sorry that I have too much typo here. The VHDL code do this correct
by the Verilog code does not. Thanks for pointing it out.
2) Your C code sample seems to indicate the offset is fixed, not
"offset+i"
for we. Which is it?
sorry, the C code has typo too, it should be "data_out[offset+i]"
3) Since you expect a 64-entry memory, you can have only one we active
per
clock cycle. Your C code uses 16 cycles for 16 write enables and your
hardware needs to do the same if you want a memory.

To do one write per cycle, the interface is ideally a "we" and a "wrChan"
where the wrChan gives the offset (or helps define the offset) and the
write
enable is applied to the memory array.
You are right. It should not be a memory as you stated. And this (not a
memory) is what we want since we have no control over the 'offset' and
'we'.


Whether explicit or implicit, the input data will be multiplexed. Your
original code makes it difficult for a synthesizer to understand what to
implement so an explicit mux or barrel shifter may be appropriate. If
you
can get the memory write down to one line, synthesis should occur nicely.
This is the main point why I bring up the issue in my original post. I
don't think the logic optimization in synthesis tool is smart enough to
handle this efficiently.


One more comment: you can use "for" statements in Verilog to avoid
generating 64 lines with perl scripts. Check out the use of the for
statement with your favorite Verilog reference guide.
I did tried the 'for' statement. But it always reporting error as
something wrong in the MSB/LSB. I used the syntax
data_out(i) <= data_in(i*8+7:i*8);
And the same error pops out in VHDL code also.
 

Welcome to EDABoard.com

Sponsor

Back
Top