Smart coding for big multiplexer

Massi
Hi everyone, I'm working on a Xilinx Virtex 5 FPGA with ISE 10.1. In
my design I have to instantiate 128 ram blocks, each one of them is
1024 bytes wide. The output of my device depends on only one ram
block at a time, therefore I have to multiplex them. Which is the
smartest way to implement such a huge multiplexer?
Thanks a lot for your help.
 
On Fri, 17 Apr 2009 03:29:22 -0700 (PDT), Massi <massi_srb@msn.com>
wrote:

> Hi everyone, I'm working on a Xilinx Virtex 5 FPGA with ISE 10.1. In
> my design I have to instantiate 128 ram blocks, each one of them is
> 1024 bytes wide. The output of my device depends on only one ram
> block at a time, therefore I have to multiplex them. Which is the
> smartest way to implement such a huge multiplexer?
Funny, we just answered a query from a customer about
almost exactly the same topic.

Do you really mean 1024 bytes WIDE? That's way scary -
an 8192-bit data path :) I guess you mean that each
RAM block is in fact 1024 locations, each 8 bits wide.
That's normally known as a "depth" of 1024.

We've found that XST does a better job of optimizing
wide MUXes if you code them as an explicit AND-OR
structure. I don't know why this is, and I don't
know if it will always be true; you could imagine,
for example, that a synthesis tool might be able
to exploit carry chains to build the big OR gates.
Anyway, here's a sketch of the code:

-- useful declarations
subtype byte is std_logic_vector(7 downto 0);
type byte_array is array(natural range <>) of byte;

-- one result from each of your 128 RAM blocks
signal RAM_read_data: byte_array(0 to 127);

-- final output
signal mux_data: byte;

-- memory selector, chooses one from 128
signal which_RAM: integer range RAM_read_data'range;

...
process (RAM_read_data, which_RAM)
  variable mux_result: byte;
begin
  mux_result := (others => '0');
  for i in RAM_read_data'range loop
    if i = which_RAM then
      mux_result := mux_result OR RAM_read_data(which_RAM);
    end if;
  end loop;
  mux_data <= mux_result;
end process;

If this trick doesn't provide the improvement you need,
the next step is to consider pipelining. It won't reduce
the area, but will give you better Fmax.
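
If you do go the pipelining route, one shape it could take (purely an
untested sketch; the clock "clk" and the intermediate signals are my
own placeholders, and it replaces the combinational process above) is a
first rank of sixteen registered 8-to-1 selections followed by a final
16-to-1 stage:

  -- first-rank results and a delayed copy of the selector
  signal stage1      : byte_array(0 to 15);
  signal which_RAM_d : integer range 0 to 15;

  ...
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: sixteen 8-to-1 AND-OR selections, registered
      for j in stage1'range loop
        stage1(j) <= (others => '0');
        for k in 0 to 7 loop
          if which_RAM = j*8 + k then
            stage1(j) <= RAM_read_data(j*8 + k);
          end if;
        end loop;
      end loop;
      which_RAM_d <= which_RAM / 8;     -- keep the selector in step
      -- stage 2: final 16-to-1 selection
      mux_data <= stage1(which_RAM_d);
    end if;
  end process;

Note that mux_data now turns up two clocks after which_RAM changes, so
the rest of the design has to allow for that latency.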

I'm sure other folk will have more, better ideas.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
On Fri, 17 Apr 2009 04:10:17 -0700 (PDT), Massi wrote:

> I really appreciate your help, I'll immediately try to integrate your
> code in my design... thank you!
OOOOOH, don't do that just yet... sorry....

>         mux_result := mux_result OR RAM_read_data(which_RAM);
No, don't do that. My mistake. Instead,

        mux_result := mux_result OR RAM_read_data(i);

The difference is that, when the loop is unrolled, you
are subscripting the array with a CONSTANT (i) rather
than with a variable. It can be important for optimization,
even though the two are functionally identical.
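
To see why, here is roughly what the tool is left with after unrolling
(only the first two of the 128 iterations shown); every subscript is
now a literal:

  -- unrolled view of the loop body (illustration only)
  if 0 = which_RAM then
    mux_result := mux_result OR RAM_read_data(0);
  end if;
  if 1 = which_RAM then
    mux_result := mux_result OR RAM_read_data(1);
  end if;
  -- ... and so on, up to RAM_read_data(127)
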
--
Jonathan Bromley, Consultant

 
> Do you really mean 1024 bytes WIDE? That's way scary -
> an 8192-bit data path :) I guess you mean that each
> RAM block is in fact 1024 locations, each 8 bits wide.
> That's normally known as a "depth" of 1024.
Silly me... of course I meant depth; that's the fault of my bad English.

> We've found that XST does a better job of optimizing
> wide MUXes if you code them as an explicit AND-OR
> structure.
[snip]
> I'm sure other folk will have more, better ideas.
I really appreciate your help, I'll immediately try to integrate your
code in my design... thank you!
 
> The difference is that, when the loop is unrolled, you
> are subscripting the array with a CONSTANT (i) rather
> than with a variable. It can be important for optimization,
> even though the two are functionally identical.
Yes, that's VERY important. I ran into something like this a while ago
with Synplify, where the constant version properly instantiated a mux
and the variable version implemented some sort of variable shift
widget that was about an order of magnitude larger.

Chris
 
On Apr 17, 5:29 am, Massi <massi_...@msn.com> wrote:
> Which is the
> smartest way to implement such a huge multiplexer?
The smartest way is to let the synthesis tool do as much of the work
as possible. Don't try to outsmart it unless you have to. If the
simplest, easiest to read, understand or write description will work
(i.e. meet timing, area, etc. requirements), then use that.

Borrowing Jonathan's definitions:

-- 128-to-1, byte wide multiplexer:
mux_data <= RAM_read_data(which_RAM);

If you don't know your requirements, then you won't know whether the
implementation you used is good enough, no matter how fast/small/cool/
elegant it is.
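
For completeness, here is that one line dropped into a minimal
compilable unit (the package and entity names below are just
placeholders I made up for illustration):

library ieee;
use ieee.std_logic_1164.all;

package mux_pkg is
  subtype byte is std_logic_vector(7 downto 0);
  type byte_array is array(natural range <>) of byte;
end package;

library ieee;
use ieee.std_logic_1164.all;
use work.mux_pkg.all;

entity ram_mux is
  port (
    which_RAM     : in  integer range 0 to 127;
    RAM_read_data : in  byte_array(0 to 127);
    mux_data      : out byte
  );
end entity;

architecture rtl of ram_mux is
begin
  -- the whole 128-to-1, byte-wide multiplexer
  mux_data <= RAM_read_data(which_RAM);
end architecture;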

Andy
 
Massi wrote:
> Hi everyone, I'm working on a Xilinx Virtex 5 FPGA with ISE 10.1. In
> my design I have to instantiate 128 ram blocks, each one of them is
> 1024 bytes wide. The output of my device depends on only one ram
> block at a time, therefore I have to multiplex them. Which is the
> smartest way to implement such a huge multiplexer?
> Thanks a lot for your help.
I agree with Andy.
I don't solve a synthesis problem until I have one.
The cleanest mux description is an array selection.
Give ISE a crack at it and have a look at
the RTL viewer and static timing.

I also agree with Jonathan.
Declare register/port dimensions first.
VHDL gives us an unfair advantage here.

-- Mike
 
If you only need one RAM at a time, could you merge the RAMs into a
smaller number of larger ones? This would require that you only write
to one RAM at a time too.
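
Taken to the extreme of a single memory, something like this is what I
have in mind (just a sketch, not tested; clk, we and wr_data are
assumed to exist in the surrounding design, and the other names are
made up). The block select simply becomes the upper address bits of one
big inferred memory:

  -- needs: use ieee.numeric_std.all;
  type mem_t is array (0 to 128*1024 - 1) of std_logic_vector(7 downto 0);
  signal big_ram : mem_t;
  signal blk_sel : unsigned(6 downto 0);    -- which of the 128 "blocks"
  signal offset  : unsigned(9 downto 0);    -- address within a block
  signal rd_data : std_logic_vector(7 downto 0);

  ...
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        big_ram(to_integer(blk_sel & offset)) <= wr_data;
      end if;
      rd_data <= big_ram(to_integer(blk_sel & offset));
    end if;
  end process;

Whether the tools pack that into block RAMs as neatly as 128 separate
instances is something you would have to check in the reports.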

Also, I have used tbufs in the past to do this; however, it appears
that V5s don't have these.

Darrin
 
On Apr 20, 1:30 am, Dal <darrin.n...@gmail.com> wrote:
> If you only need one RAM at a time, could you merge the RAMs into a
> smaller number of larger ones? This would require that you only write
> to one RAM at a time too.
>
> Also, I have used tbufs in the past to do this; however, it appears
> that V5s don't have these.
>
> Darrin
Tri-state bus code is translated into equivalent multiplexer-type
circuits. The tri-state enables are assumed to be mutually exclusive
for the multiplexer implementation. This actually comes in handy in
some applications where it is difficult to convince the synthesis tool
that separate inputs are mutually exclusive.
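
For example (a sketch only, reusing Jonathan's RAM_read_data, which_RAM
and mux_data declarations in place of his process, so there is only one
set of drivers on mux_data):

  -- 128 tri-state style drivers; the unselected ones drive 'Z' and the
  -- synthesis tool folds the whole thing back into a 128-to-1 mux
  tbuf_style : for i in RAM_read_data'range generate
    mux_data <= RAM_read_data(i) when which_RAM = i else (others => 'Z');
  end generate;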

Andy
 
