Implementing a fast cache in Altera Cyclone

Vazquez

Guest
Hello,

I want to implement a cache which holds more than 30 words of 2 bytes each.
The module should be described in VHDL.

The problem I am facing right now is the following:

It should be possible to write an address into it and to read an
address out of it.
The most important function is that each new incoming address should be
checked to see whether it is already in the cache. This check should take
only a few clock cycles. The problem is that the cache has over 30 entries,
so using a FIFO or RAM structure would take over 30 clock cycles to check
all the words.

Does anyone have an idea how this problem could be solved, in principle,
to achieve a faster check time?

Thank you very much for your help.

Kind regards
Andrés Vázquez
G&D System Development - FPGA design
 
"Vazquez" <andres.vazquez@gmx.de> wrote in message
news:eee19a7a.0310080131.5a53d3f2@posting.google.com...
http://www.altera.com/support/software/eda_quartus2/glossary/def_cam.html

Gr, Hans
 
"Vazquez" <andres.vazquez@gmx.de> wrote in message
news:eee19a7a.0310080131.5a53d3f2@posting.google.com...

Does anyone have an idea how this problem could be solved in order
to achieve a faster check time?
The usual arrangement in caches is to compute some function of the
address, and use that as the index into your cache. That way, it's
very easy to locate whether an address is held in the cache.

Typically the "function" is just the bottom few bits of the address,
because this gives a reasonable distribution of cached addresses
into cached locations for many practical problems.

I don't know how big your addresses are, but let's guess they are
16 bits wide. Let's also suppose that you use the bottom 5 bits
of address to index into your 32-slot cache. Now let's suppose
you have accessed some locations...

0x1234 -- uses cache slot 0x14
0x1235 -- uses cache slot 0x15
0x9A23 -- uses cache slot 0x03

You can see that this scheme will work well for short sequential
burst accesses. However, let's now suppose that you make a
fourth access...

0x7255 -- uses cache slot 0x15

This new access will of course evict the 0x1235 entry from slot 0x15.

To make all this work, each cache slot needs to store...
* those bits of the address that were not used to index it:
in my example, the top 11 bits
* the cached data
* a "valid" bit, to say that the stored data is correct
* (for write-back caches) a "dirty" bit, saying that the
cache slot has been written with a new data value, but
the cached memory location hasn't yet been written with
this new value
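To make the scheme concrete, here is a quick software model (Python, not VHDL) of the direct-mapped cache described above: 32 slots, index = bottom 5 address bits, tag = top 11 bits, with a valid bit per slot. The class and field names are my own for illustration.

```python
# Software model of a 32-slot direct-mapped cache for 16-bit addresses.
# index = bottom 5 bits of the address, tag = remaining top 11 bits.

class DirectMappedCache:
    def __init__(self, index_bits=5):
        self.index_bits = index_bits
        self.slots = [{"valid": False, "tag": 0, "data": 0}
                      for _ in range(1 << index_bits)]

    def _split(self, addr):
        index = addr & ((1 << self.index_bits) - 1)  # bottom 5 bits
        tag = addr >> self.index_bits                # top 11 bits
        return index, tag

    def lookup(self, addr):
        """Return the cached data, or None on a miss."""
        index, tag = self._split(addr)
        slot = self.slots[index]
        if slot["valid"] and slot["tag"] == tag:
            return slot["data"]
        return None

    def fill(self, addr, data):
        """Install data; silently evicts whatever shared this index."""
        index, tag = self._split(addr)
        self.slots[index] = {"valid": True, "tag": tag, "data": data}

cache = DirectMappedCache()
cache.fill(0x1235, 0xAAAA)   # uses slot 0x15
cache.fill(0x7255, 0xBBBB)   # also slot 0x15 -> evicts the 0x1235 entry
assert cache.lookup(0x7255) == 0xBBBB   # hit
assert cache.lookup(0x1235) is None     # evicted, miss
```

In hardware the lookup is one RAM read of slot[index] plus one tag comparator, so the hit/miss decision takes a single cycle regardless of cache depth.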

To improve performance, caches can have N-way associativity - in
other words, you have more than one cache slot associated with
each cache index. This means you have to compare addresses, but
typically only 2 or 4 rather than the full 32. You can generally
do this with parallel, single-cycle logic.
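As a sketch of the N-way idea (again a Python model, with names of my own choosing): the same 32 entries arranged as 16 sets of 2 ways, so two addresses that collide on the index can coexist, at the cost of two tag comparators working in parallel.

```python
# Model of a 2-way set-associative arrangement: 16 sets x 2 ways = 32
# entries. index = bottom 4 bits; both ways are compared "in parallel".

WAYS, SETS, INDEX_BITS = 2, 16, 4

# each set holds WAYS entries of (valid, tag, data)
sets = [[(False, 0, 0)] * WAYS for _ in range(SETS)]

def lookup(addr):
    index = addr & (SETS - 1)
    tag = addr >> INDEX_BITS
    for valid, t, data in sets[index]:   # 2 comparators in hardware
        if valid and t == tag:
            return data
    return None

def fill(addr, data, way):
    index = addr & (SETS - 1)
    tag = addr >> INDEX_BITS
    sets[index][way] = (True, tag, data)

fill(0x1235, 0x1111, way=0)
fill(0x7255, 0x2222, way=1)   # same set index 0x5, different way
assert lookup(0x1235) == 0x1111   # both survive despite the collision
assert lookup(0x7255) == 0x2222
```

A real design also needs a replacement policy (e.g. LRU or round-robin) to pick which way to evict when the set is full; that is omitted here.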

If indexing on the bottom few bits of address doesn't
suit your requirement, you can use a more sophisticated
address mapping - perhaps some kind of hash function. It
doesn't matter, as long as the function is repeatable and
distributes addresses onto cache slots in a suitable way.
You would then need to store the whole address in the cache.
-- 
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * Perl * Tcl/Tk * Verification * Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, Hampshire, BH24 1AW, UK
Tel: +44 (0)1425 471223 mail: jonathan.bromley@doulos.com
Fax: +44 (0)1425 471573 Web: http://www.doulos.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
http://www.altera.com/support/software/eda_quartus2/glossary/def_cam.html
This feature is only available in APEX, Excalibur, and Mercury.

- Paul Leventis
Altera Corp.
 
Andres,

I've written two responses, one assuming you want a CAM (that is, something
you give a 16-bit entry to, and it tells you whether that entry is in one of
the 32 slots of the CAM), the other assuming you can get away with a cache.

*** If you really do need a CAM ***

CAMs are expensive to implement. You need to look at ALL the data in your
cache at once in order to find a match (in a single cycle) and hence using a
memory is not a great fit for this task. You're stuck implementing your CAM
in logic if you need single-cycle access. With 32 entries of 16 bits each,
this means 32x16 = 512 LEs just for the registers. But it's not as bad as
it seems -- the LUTs in these LEs will still be used (automatically) by
Quartus for related logic if such logic is available, or unrelated logic if
your design is getting full.
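Functionally, the register-based single-cycle CAM behaves like this Python model (my own sketch): every stored entry is compared against the search word at once, one 16-bit comparator per entry, all evaluating in the same clock.

```python
# Model of a 32-entry register-based CAM: a match checks ALL entries
# against the search word simultaneously (32 comparators in hardware).

entries = [None] * 32            # None = empty slot (the "valid" bit)

def cam_write(slot, word):
    entries[slot] = word & 0xFFFF

def cam_match(word):
    """Return the list of slot numbers whose entry equals word."""
    return [i for i, e in enumerate(entries) if e == word]

cam_write(3, 0x9A23)
cam_write(20, 0x1234)
assert cam_match(0x1234) == [20]   # single-cycle hit in hardware
assert cam_match(0xFFFF) == []     # miss
```

In VHDL this would be a generate loop of 32 equality comparators whose outputs feed a priority encoder; the cost Paul mentions is exactly those registers and comparators.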

I'd also worry about the explosion in the amount of logic required for
comparison if you are doing single-cycle matching. One technique you can
use to halve (or further reduce) the time and comparison hardware and still
use a RAM: use a wider RAM. You can get two entries per cycle just by using
a 16x32 RAM (32 bits wide), which maps into a single M4K block in Cyclone.
Or go up to an 8x64 or 4x128 RAM. This way you are retrieving multiple
entries per cycle.

Note: Just use the lpm_ram megafunction. Quartus will automatically make
use of the right combination of M4K blocks to implement the RAM you need
"behind your back".

If you expect a lot of misses and would like misses to come back quickly,
you could store your data differently. For example, use a 4x128 RAM and
instead of storing 8 entries per address, store the first 4 bits of all 32
entries in the 1st address, the next 4 bits in the 2nd address, etc. In each
clock cycle you are comparing 4 bits of every entry to 4 bits of the target
word. But this way you get an early exit on a miss.
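The bit-sliced layout can be sketched as follows (a Python model under my own naming; the real design would be a 4x128 RAM plus a 32-bit candidate mask register). Word k of the RAM holds bits [4k+3:4k] of all 32 entries side by side; each read narrows the set of candidate slots, and a miss can bail out before all 4 words are read.

```python
# Model of the bit-sliced 4x128 layout: RAM word k holds nibble k of
# all 32 entries. Each cycle narrows the candidate set; a miss exits
# early as soon as no candidates remain.

def build_slices(entries):
    """entries: 32 x 16-bit words -> 4 RAM words of 32 nibbles each."""
    return [[(e >> (4 * k)) & 0xF for e in entries] for k in range(4)]

def search(slices, target):
    candidates = set(range(32))          # in hardware: a 32-bit mask
    for k in range(4):                   # one RAM read per cycle
        nib = (target >> (4 * k)) & 0xF
        candidates = {i for i in candidates if slices[k][i] == nib}
        if not candidates:               # early exit on a miss
            return None
    return sorted(candidates)

entries = [0] * 32
entries[7] = 0x9A23
slices = build_slices(entries)
assert search(slices, 0x9A23) == [7]     # full 4-cycle hit
assert search(slices, 0x1234) is None    # rejected on the first nibble
```

The worst case is still 4 cycles, but a search word whose low nibble matches no entry is rejected after a single read.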

I'm no CAM expert and I'm sure there'll be a bunch of other ideas that come
up...

*** If you are looking for a cache ***

Jonathan gives a nice description of direct-mapped and N-way set associative
caching. I'll just add a few points and ideas.

As a rule of thumb, increasing cache size yields better results (in a
microprocessor) than increasing associativity. In selecting the cache
design for your system, you need to take into account the memory access
patterns, the speed you need to run your system clock at, tolerable cache
latency, etc. Switching from a direct-mapped (i.e. 1-way associative) to 2-
or 4-way set associative cache will mean that you require more logic for
performing reads & writes, and depending on the speed your clock is running
at, this may limit the speed of your system. Often you must run software
emulations of a system to determine what the best cache is -- I'd suggest
trying 1-way, 2-way and 4-way set associative caches of various sizes and
comparing the increase in efficiency (i.e. # of clocks required per
instruction on average) of your system across simulated, real-world data
that it will handle. Then compare this to the relative clock speed that
those different systems will run at (which will require implementing both a
direct-mapped and N-way associative cache in HDL).

Also, note that making a cache deeper is nearly free in Cyclone if you
aren't using your memory blocks for other features. You get a 128x36 RAM
for the same price as a 32x36 RAM -- so you might as well make a 128-entry
direct mapped cache. Or you can stitch multiple RAMs together
(automatically through lpm_ram) to create a deeper cache.

Regards,

Paul Leventis
Altera Corp.
 
Jonathan Bromley wrote:

To improve performance, caches can have N-way associativity - in
other words, you have more than one cache slot associated with
each cache index. This means you have to compare addresses, but
typically only 2 or 4 rather than the full 32. You can generally
do this with parallel, single-cycle logic.

An auxiliary victim cache could work well in distributed RAM.
 
