Quad-Port BlockRAM in Virtex

Kevin Neilson

I think I need a quad-port blockRAM in a Xilinx V7. Having multiple read ports is no problem, but I need two read ports and two write ports. The two write ports are the problem. I can't double the clock speed. To be clear, I need to be able to do two reads and two writes per cycle. (Not writes to the same address.)

The only idea I could come up with is to have four dual-port BRAMs and a semaphore array. Let's call the BRAMs AC, AD, BC, and BD. Writer A writes the same value to address x in AC and AD and simultaneously sets the semaphore of address x to point to 'A'. Now when reader C wants to read address x, it reads AC, BC, and the semaphore, sees that the semaphore points to the A side, and uses the value from AC, discarding BC. If writer B writes to address x, it writes the value to both BC and BD and sets semaphore x to point to side B. Reader D reads AD and BD and picks one based on the semaphore bit.

The semaphore itself is complicated. I think it would consist of 2 quad-port RAMs, one bit wide and the depth of AC, each one having 1 write and 3 read ports. This could be distributed RAM. Writer A would read the side-B semaphore bit and set its own to the same value, and writer B would read the side-A bit and set its own to the opposite. Now when reader C or D reads its two copies (A/B) of the semaphore bits using its read ports, it checks whether they are the same (use side A) or opposite (use side B).
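For illustration, here is a minimal behavioral sketch of the scheme (module and signal names are made up; clocking, read latency, and same-cycle write/read collision handling are all simplified, and the semaphore arrays are shown as plain register arrays):

// Behavioral sketch only: the four banks would be BRAMs and the semaphore
// arrays distributed RAM or FFs in a real design.
module quad_port_ram #(
  parameter AW = 11,              // 2048 entries
  parameter DW = 64
)(
  input                clk,
  // write ports A and B (never to the same address in the same cycle)
  input                wr_a_en, wr_b_en,
  input  [AW-1:0]      wr_a_addr, wr_b_addr,
  input  [DW-1:0]      wr_a_data, wr_b_data,
  // read ports C and D
  input  [AW-1:0]      rd_c_addr, rd_d_addr,
  output reg [DW-1:0]  rd_c_data, rd_d_data
);
  // data banks: writer A owns AC/AD, writer B owns BC/BD
  reg [DW-1:0] ac [0:(1<<AW)-1], ad [0:(1<<AW)-1];
  reg [DW-1:0] bc [0:(1<<AW)-1], bd [0:(1<<AW)-1];
  // per-address semaphore bits, one copy per side
  reg sem_a [0:(1<<AW)-1], sem_b [0:(1<<AW)-1];

  integer i;
  initial for (i = 0; i < (1<<AW); i = i + 1) begin
    sem_a[i] = 1'b0; sem_b[i] = 1'b0;   // equal bits = "side A is current"
  end

  always @(posedge clk) begin
    if (wr_a_en) begin
      ac[wr_a_addr] <= wr_a_data;
      ad[wr_a_addr] <= wr_a_data;
      sem_a[wr_a_addr] <= sem_b[wr_a_addr];   // copy B's bit: bits now equal
    end
    if (wr_b_en) begin
      bc[wr_b_addr] <= wr_b_data;
      bd[wr_b_addr] <= wr_b_data;
      sem_b[wr_b_addr] <= ~sem_a[wr_b_addr];  // invert A's bit: bits now differ
    end
    // readers: equal bits -> A-side bank is newest, unequal -> B-side bank
    rd_c_data <= (sem_a[rd_c_addr] == sem_b[rd_c_addr]) ? ac[rd_c_addr] : bc[rd_c_addr];
    rd_d_data <= (sem_a[rd_d_addr] == sem_b[rd_d_addr]) ? ad[rd_d_addr] : bd[rd_d_addr];
  end
endmodule

The key invariant is that equal semaphore bits for an address mean side A wrote it last, while unequal bits mean side B did.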

It's a big mess and uses 4x as many BRAMs as a dual-port. Maybe I need a different solution.
 
Update: I found a solution in the "Altera Synthesis Cookbook" and it seems to be the scheme I described above, but implementing the semaphore bits as FFs instead of distributed RAM. I'd need about 2048 semaphore bits, so implementing that in a distributed RAM would probably be advantageous. You can do a 64-bit quad port (1 wr, 3 rd) in a 4-LUT slice, so I'd need 2048/64*4*2 = 256 LUTs to do 2 2048-bit quad-port distributed RAMs. (Add in ~10 slices for 32->1 muxes.)
 

Update 2: I came up with a better solution than the Altera Cookbook. The semaphore bits are stored partly in a separate blockRAM and partly in the main data blockRAMs. Then there is very little logic out in the fabric--just the muxes for the two read ports. Too bad there isn't an app note on this.
 
On 01.11.2016 23:16, Kevin Neilson wrote:

[...]

Again, why do you need four BRAMs? Perhaps I'm stupid, but I don't see
what can be achieved with four BRAMs that cannot be achieved with two,
if it's correct that "[h]aving multiple read ports is no problem". Or is
it just how you solve the problem of having multiple read ports?

Say you have two BRAMs, A and B, and a semaphore array. Writer A
writes to A and points the semaphore of address x to A. Writer B
does the same for B. You then read A, B, and the semaphore for
address x simultaneously.

Gene
 
Again, why do you need four BRAMs?

Gene

I need 4 ports (2 wr, 2 rd). Your 2-BRAM solution allows for 2 wr ports, but only 1 rd port. In your solution you read A and B and the semaphore, then mux either A or B to your read data output based on the semaphore. But I need a second read port, so I have to have a second copy of the system you describe.

I drew up a nice diagram with a good solution for doing the semaphores, but I don't know how to post it here.
 
On 04.11.2016 3:35, Kevin Neilson wrote:

[...]

Thanks for explaining the rationale for using 4 BRAMs.

Your solution would surely be interesting to look at. To post an image,
you can just upload it to any image-hosting website like

http://imgur.com

and post the link to your image here.

My best idea to remove logic from the design would be to append a
timestamp to each write operation (instead of switching a semaphore).
During the read operation, the data word with the newest timestamp would
be selected. But it would only work for a limited time, until the
timestamp field overflows.

Gene
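For what it's worth, a minimal sketch of the timestamp idea (hypothetical module name and widths; only one read port is shown, and the banks are assumed initialized). It illustrates both the newest-wins selection and the wrap-around problem mentioned above:

// Sketch only: TW-bit timestamps eventually wrap, at which point the
// newest-wins comparison breaks, as noted above.
module ts_select_2w1r #(
  parameter AW = 11, DW = 64, TW = 16
)(
  input               clk,
  input               wr_a_en, wr_b_en,
  input  [AW-1:0]     wr_a_addr, wr_b_addr,
  input  [DW-1:0]     wr_a_data, wr_b_data,
  input  [AW-1:0]     rd_addr,
  output reg [DW-1:0] rd_data
);
  reg [TW-1:0] stamp = 0;                  // free-running write counter
  reg [DW+TW-1:0] bank_a [0:(1<<AW)-1];    // {timestamp, data} per entry
  reg [DW+TW-1:0] bank_b [0:(1<<AW)-1];

  wire [DW+TW-1:0] qa = bank_a[rd_addr];
  wire [DW+TW-1:0] qb = bank_b[rd_addr];

  always @(posedge clk) begin
    if (wr_a_en) bank_a[wr_a_addr] <= {stamp, wr_a_data};
    if (wr_b_en) bank_b[wr_b_addr] <= {stamp, wr_b_data};
    stamp <= stamp + 1'b1;
    // wide unsigned compare of the two stored timestamps -- this is the
    // long carry chain Kevin points out later in the thread
    rd_data <= (qa[DW+TW-1:DW] >= qb[DW+TW-1:DW]) ? qa[DW-1:0] : qb[DW-1:0];
  end
endmodule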
 
On Friday, October 23, 2015 at 2:40:46 PM UTC-5, Kevin Neilson wrote:

[...]

There is a literature on this subject:
http://fpgacpu.ca/multiport/TRETS2014-LaForest-Article.pdf
 
Your solution would surely be interesting to look at. [...]

My best idea to remove logic from the design would be to append a timestamp to each write operation (instead of switching a semaphore). [...]

Thanks. Here's my sketch:

http://imgur.com/a/NhNr0

The timestamp is a nice idea, but, like you said, it would overflow quickly. And you'd have a long carry chain to do the timestamp comparison.
 
There is a literature on this subject:
http://fpgacpu.ca/multiport/TRETS2014-LaForest-Article.pdf

Yes, I did actually find this yesterday when searching again. The design I ended up using (http://imgur.com/a/NhNr0) looks like what they have in Fig. 3(a), except I implemented the "live value table" in BRAMs so it's much faster. They have a faster solution in Fig. 4(c), which uses their "XOR-based" design. However, it requires a lot more RAM because you need 6 full data storage units. I used only 4, and then two much smaller RAMs for semaphores (aka Live Value Table), and I also store semaphore copies in the 4 data RAMs.
 
I find this thread very interesting; it discusses quite a few approaches I would not have thought of in the first place...

Maybe a different view-point: As most modern FPGAs support true dual port RAM, with double clock rate you could write to two ports in the first cycle and read from both ports in the second cycle. This would only require 1 BRAM compared to 4 BRAMs (assuming your content fits into 1 BRAM, of course...).

However, you wrote that you cannot double the clock rate (out of curiosity: which clock rates are we talking about?). But maybe you could increase it by 50%? Then you could make a 2/3 clock scheme with 2 BRAMs, with all the writes going to both BRAMs (taking two of the three cycles), while the reads for these two transactions (four in total) are done in the third cycle from both BRAMs. Of course this only makes sense if you can find a simple clock-domain-crossing solution at the system level...

Regards,

Thomas
www.entner-electronics.com - Home of EEBlaster and JPEG-Codec
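To make Thomas's 2/3 scheme concrete, here is one possible slot assignment (a sketch under assumptions: a 1.5x clock phase-locked to the user clock, two true-dual-port BRAMs kept as mirror copies, and the four requests of each user-clock pair already buffered into the fast domain):

// Per 3-cycle frame (the traffic of two user cycles):
//   slot 0: writes 1 and 2, each mirrored into both BRAMs (uses all 4 ports)
//   slot 1: writes 3 and 4, each mirrored into both BRAMs (uses all 4 ports)
//   slot 2: the four reads, two from each BRAM (one per port)
// Only the slot counter is shown; the port muxing follows the table above.
module frame_slot_counter (
  input            clk1p5x,
  input            frame_start,   // marks the first fast cycle of each frame
  output reg [1:0] slot = 2'd0
);
  always @(posedge clk1p5x)
    if (frame_start)       slot <= 2'd0;
    else if (slot == 2'd2) slot <= 2'd0;
    else                   slot <= slot + 2'd1;
endmodule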
 
On 05.11.2016 3:00, Kevin Neilson wrote:

Thanks. Here's my sketch:

http://imgur.com/a/NhNr0

[...]

Great design! In terms of the referenced article, it combines the good
features of both the LVT/semaphore approach (requires little memory to
store semaphores), and the XOR-based approach (no need for multiport
memory to store semaphores).

I would only suggest that, as discussed on pp. 6-7 of the LaForest
article, it's possible to give the user the impression that there is no
write delay by adding some forwarding circuitry.

Gene
 
I would only suggest that, as discussed on pp. 6-7 of the LaForest
article, it's possible to give the user the impression that there is no
write delay by adding some forwarding circuitry.

Gene

I realized that since I'm doing read-modify-writes, I don't even need the extra semaphore RAMs. Since I'm reading each address two cycles before writing, I can get the semaphores from the data RAMs. When I'm doing a write only, I can precede it by a dummy read to get the semaphores.

The Xilinx BRAMs operate at the same speed for write-first and read-first modes, so I probably wouldn't need the forwarding logic. (The setup time is a lot bigger for write-first mode, though.) However, I do need a short "local cache" for when I try to read-modify-write the same location on successive cycles. Because of the read latency, the second read would be of stale data so I have to read from the local cache instead.
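A depth-2 sketch of that local-cache/bypass idea (hypothetical names; a real version would use an SRL and be matched to the actual BRAM read latency): remember the last couple of writes and, on an address match, substitute the remembered data for the stale RAM output.

module rmw_bypass #(
  parameter AW = 11, DW = 64
)(
  input               clk,
  input               wr_en,
  input  [AW-1:0]     wr_addr,
  input  [DW-1:0]     wr_data,
  input  [AW-1:0]     rd_addr,       // address being read this cycle
  input  [DW-1:0]     ram_rd_data,   // possibly-stale data returned by the RAM
  output [DW-1:0]     rd_data        // forwarded if a recent write matches
);
  reg [AW-1:0] a0, a1;               // last two write addresses
  reg [DW-1:0] d0, d1;               // last two write data words
  reg          v0 = 1'b0, v1 = 1'b0; // valid flags

  always @(posedge clk)
    if (wr_en) begin
      a0 <= wr_addr;  d0 <= wr_data;  v0 <= 1'b1;
      a1 <= a0;       d1 <= d0;       v1 <= v0;
    end

  // newest matching entry wins; otherwise pass the RAM data through
  assign rd_data = (v0 && a0 == rd_addr) ? d0 :
                   (v1 && a1 == rd_addr) ? d1 :
                                           ram_rd_data;
endmodule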
 
There is a paper that describes your approach, published by my Ph.D. student Ameer Abdelhadi at FPGA2014. He has also extended it, at FCCM2016, to include switched ports, where some ports can dynamically switch between read and write mode.

http://ece.ubc.ca/~ameer/publications.html

He has released the designs on GitHub under a permissive open source license.

https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM

Guy
 
My Ph.D. student Ameer added forwarding paths to his version, available on GitHub. See the papers at FPGA2014 and FCCM2016.

http://ece.ubc.ca/~ameer/publications.html

https://github.com/AmeerAbdelhadi/Multiported-RAM
 
Maybe a different view-point: As most modern FPGAs support true dual port RAM, with double clock rate you could write to two ports in the first cycle and read from both ports in the second cycle. [...]
However, you wrote that you cannot double the clock rate (out of curiosity: which clock rates are we talking about?). But, maybe you could increase it by 50%? Then you could make a 2/3 clock scheme with 2 BRAMs, with all the writes going to both BRAMs (taking two of the 3 cycles), but the reads for these two transactions (4 in total) are done in the 3rd cycle from both BRAMs. Of course this makes only sense if you can find a simple clock-domain-crossing-solution on system level...

Regards,

Thomas
www.entner-electronics.com - Home of EEBlaster and JPEG-Codec

That's a great idea. It took me a few minutes to work through this, but it seems like it would work. The clock I'm using now is 360MHz, so a 1.5x clock would be 540MHz. That's pushing the edge, but Xilinx says the BRAM will run at 543MHz in a -2 part. The clock-domain crossing shouldn't be a problem. The clocks are "periodic-synchronous" so you have a known setup time. (Assuming you use DLLs to keep them phase-locked.)

Xilinx does have an old app note (https://www.xilinx.com/support/documentation/application_notes/xapp228.pdf) on using a 2x clock to make a quad-port. In my case the 2x clock would be 720MHz.
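For reference, a behavioral sketch of that 2x-clock time-slicing idea (not the XAPP228 code; names are made up, and the clock-crossing/phase details are left out): writes take one fast cycle, reads take the next, so the slow-clock side sees a 2W/2R memory.

// The single memory below would map to one true-dual-port BRAM clocked at 2x.
module timeslice_2w2r #(
  parameter AW = 11, DW = 64
)(
  input               clk2x,        // 2x the user clock, phase-aligned
  input               phase,        // 0 = write slot, 1 = read slot
  input               wr_a_en, wr_b_en,
  input  [AW-1:0]     wr_a_addr, wr_b_addr,
  input  [DW-1:0]     wr_a_data, wr_b_data,
  input  [AW-1:0]     rd_c_addr, rd_d_addr,
  output reg [DW-1:0] rd_c_data, rd_d_data
);
  reg [DW-1:0] mem [0:(1<<AW)-1];

  // port 1 of the (notional) dual-port RAM
  always @(posedge clk2x)
    if (!phase) begin
      if (wr_a_en) mem[wr_a_addr] <= wr_a_data;
    end else
      rd_c_data <= mem[rd_c_addr];

  // port 2
  always @(posedge clk2x)
    if (!phase) begin
      if (wr_b_en) mem[wr_b_addr] <= wr_b_data;
    end else
      rd_d_data <= mem[rd_d_addr];
endmodule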
 
I realized that since I'm doing read-modify-writes, I don't even need the extra semaphore RAMs. Since I'm reading each address two cycles before writing, I can get the semaphores from the data RAMs. When I'm doing a write only, I can precede it by a dummy read to get the semaphores.

I added a diagram of the simplified R-M-W quad-port to that link. http://imgur.com/a/NhNr0
 
On Saturday, November 5, 2016 at 11:35:20 AM UTC-6, Guy Lemieux wrote:

[...]

Thanks; I enjoyed looking through the papers. The idea of dynamically switching the write ports to reads is one I might need to use at some point.

The main difference in my diagram is that I implemented part of the I-LVT in the data RAMs. For example, for a 2W/2R memory, you show the I-LVT RAMs as being 1 write, 3 reads. My I-LVTs are 1 write, 1 read, with the rest of the I-LVT done in the data RAMs. In my case, I need 69-wide BRAMs, and the BRAMs are 72 bits wide, so I have an extra 3 bits. I use one of those bits as the I-LVT ("semaphore") bit. When I do a read, I don't have to access a separate I-LVT RAM.
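A read-side-only sketch of that packing trick (bit positions and names are my own guesses, and the write side plus the small 1W/1R LVT are omitted): the spare bit travels with the data, so the output mux select comes straight out of the two bank reads.

module packed_sem_read #(
  parameter DW = 69                 // payload width; bit [DW] is the spare bit
)(
  input  [71:0]   bank_a_q,         // 72-bit word read from an A-side data bank
  input  [71:0]   bank_b_q,         // 72-bit word read from a B-side data bank
  output [DW-1:0] rd_data
);
  wire sem_a = bank_a_q[DW];        // semaphore copy stored alongside the A data
  wire sem_b = bank_b_q[DW];        // semaphore copy stored alongside the B data
  wire sel_b = sem_a ^ sem_b;       // XOR extraction: 1 means side B wrote last
  assign rd_data = sel_b ? bank_b_q[DW-1:0] : bank_a_q[DW-1:0];
endmodule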
 
On Tuesday, November 8, 2016 at 12:13:55 PM UTC-5, Kevin Neilson wrote:

The main difference in my diagram is that I implemented part of the I-LVT in the data RAMs. [...]

Kevin, the method you mentioned is actually identical to the 2W/2R I-LVT (both binary-coded and thermometer-coded) from our FPGA2014 paper, with ONE modification:
You store the BRAM outputs of the LVT in the data banks. After reading the data banks, these LVT bits are also read as meta-data, and then the output selectors are extracted (the XORs in your diagram). This does indeed avoid replicating the LVT BRAMs; however, it incurs other *severe* problems:

1) Two additional cycles in the decision path!
The longest path of our I-LVT method passes through the LVT as follows:
1- Reading the I-LVT feedbacks
2- Rewriting the I-LVT
3- Reading the I-LVT to generate (through output extraction function) output mux selectors.
With these three cycles, our I-LVT required very complicated bypassing circuitry to deal with even simple hazards such as write-after-write.
Your solution adds two cycles to the selection path: one to rewrite the data banks with the I-LVT bits, and a second to read these bits back (and then extract the selectors). This solution requires caching to bypass this very long decision path, which will increase the BRAM overhead again.

In other words, the read mechanism of both methods is similar, but the output mux selectors in your method are read from the data banks instead of the LVT. Once a write happens, the output selectors will see the change after 5 cycles (LVT feedback read -> LVT rewrite -> LVT read -> data banks write (selectors) -> data bank read (selectors)), whereas ours requires only 3 cycles.

2) Modularity:
The additional bits can't accommodate bank selectors for every number of write ports. For instance, you mentioned 3 extra bits in each BRAM line. These 3 bits can encode selectors for up to 8 write ports. For more than 8 write ports, the meta-data would have to be stored in additional BRAMs, which further increases BRAM consumption.

Anyhow, the I-LVT portion is minor compared to the data banks. For instance, in your diagram, you are using 140Kbits for the data banks and only 2Kbits for the LVT. Our approach requires only 2Kbits more for the I-LVT (only +1.5%); however, it eliminates the need for caching (as your solution requires).

Ameer
http://www.ece.ubc.ca/~ameer/
 

BTW, our design is available online as an open-source library. It's modular, parameterized, optimized for high performance and optimal resource consumption, fully bypassed, and fully tested, with a run-in-batch manager for simulation and synthesis.

Just download the Verilog, add it to your project, instantiate the IP module, set your parameters (e.g. #reads, #writes, data width, RAM depth, bypassing...), and you're ready to go!

Open source libraries:
http://www.ece.ubc.ca/~ameer/opensource.html
https://github.com/AmeerAbdelhadi/

BRAM-based Multi-ported RAM from FPGA'14:
https://github.com/AmeerAbdelhadi/Multiported-RAM
Paper: http://www.ece.ubc.ca/~ameer/publications/Abdelhadi-Conference-2014Feb-FPGA2014-MultiportedRAM.pdf
Slides: http://www.ece.ubc.ca/~ameer/publications/Abdelhadi-Talk-2014Feb-FPGA2014-MultiportedRAM.pdf

Enjoy!
 
Kevin, the method you mentioned is actually identical to the 2W/2R I-LVT (both binary-coded and thermometer-coded) from our FPGA2014 paper, with ONE modification: [...]

Ameer,
Thanks for the response. Yes, there may be some latency disadvantages in my approach. For the cache that I need for the bypass logic, I use a Xilinx dynamic SRL. It's the same size and speed whether the cache depth is 2 or 32, so making the cache deeper doesn't make much difference. (There is more address-comparison logic, though.)

As for the memory usage, it just depends on what BRAM width you need. If you need a 512-deep by 64-bit-wide BRAM, you have to use a Xilinx simple dual-port BRAM with a width of 72, so you have 8 bits of each location "wasted" which you can use for I-LVT flags. But if you need a 72-bit-wide BRAM for data, then there is no advantage in trying to combine the data and the flags. In my case I just happened to need 69 and had 3 bits left over.

I finished the design that uses the quad-port and I can say it's working well and it simplified my algorithm significantly. My clock speed is 360 MHz which was too fast to use a 2x clock to time-slice the BRAMs, but the I-LVT design works just fine.
Kevin
 
