Open Source GPGPU core

Guest
I've been designing an open source Larrabee-esque GPGPU processor in SystemVerilog and I thought people might find it interesting. Full source code, documentation, tests, tools, etc. are available on GitHub:

https://github.com/jbush001/NyuziProcessor

The processor supports a wide, predicated vector floating point pipeline with 16 lanes and multiple hardware threads to hide memory and computation latency. It also supports multiple cache coherent cores. I've created an LLVM backend for this, so C/C++ code can be compiled for it. It includes support for first class vector types using the GCC vector extensions, as well as intrinsics to expose specialized instructions.
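
For readers who haven't used the GCC vector extensions, here is a minimal sketch of what a 16-lane float vector type can look like in plain C. It relies only on the generic `vector_size` attribute that GCC and Clang accept on any target. The names `vecf16`, `splat`, `scale_add` and `masked_select` are made up for this example and are not the Nyuzi intrinsics, and the mask loop merely imitates predication in scalar code rather than showing how the hardware does it.

```c
#include <stdint.h>

/* 16 lanes x 4-byte float = 64 bytes. The type name vecf16 is hypothetical. */
typedef float vecf16 __attribute__((vector_size(64)));

/* Copy one scalar into all 16 lanes (hypothetical helper). */
static vecf16 splat(float value)
{
    vecf16 v;
    for (int lane = 0; lane < 16; lane++)
        v[lane] = value;
    return v;
}

/* Ordinary arithmetic operators work element-wise on vector types. */
static vecf16 scale_add(vecf16 a, vecf16 b, float scale)
{
    return a * splat(scale) + b;
}

/* Imitates a predicated move in scalar code: lanes whose mask bit is 1 take
   their value from if_set, all other lanes keep the value from if_clear.
   Mapping bit 0 to lane 0 is an arbitrary choice for this sketch. */
static vecf16 masked_select(uint16_t mask, vecf16 if_set, vecf16 if_clear)
{
    vecf16 result = if_clear;
    for (int lane = 0; lane < 16; lane++) {
        if (mask & (1u << lane))
            result[lane] = if_set[lane];
    }
    return result;
}
```

On a host compiler without 512-bit vector support this still builds; the compiler simply splits each operation into narrower pieces, so it is handy for trying out vector code before moving to a target with a real 16-lane unit.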

I've written a 3D engine (software/librender) that is optimized to take advantage of both the vector unit and multiple cores/hardware threads. Here's a video of the standard teapot (with ~2300 triangles) rendering on a single core on an FPGA running at 50 MHz:

http://youtu.be/DsvZorBu4Uk

This image is the emulator rendering Dabrovic's Sponza atrium, ~66k triangles. Rendering it at 1024x768 took around 200 million instructions, spread across 8 virtual cores and 32 hardware threads:

http://i.imgur.com/sHAsAU5.png

My main purpose in designing this was to be able to experiment with processor architecture using real, empirical data. The neat thing about having all the source to a cycle accurate hardware design is that it is infinitely instrumentable. I've kept notes about some of my findings here:

http://latchup.blogspot.com/

Anyway, comments and suggestions are appreciated, and I'm happy to take contributions if people are interested in hacking on it.
 
I tried building your toolchain on both a 32 and 64 bit AMD Ubuntu 14.10
system and get:

Linking CXX shared library ../../../lib/liblldb.so
Python script sym-linking LLDB Python API
Program error: Invalid parameters entered, -h for help.
You entered:
['--buildConfig='
'--srcRoot=/home/johne/Desktop/Nyuzi/NyuziToolchain/tools/lldb'
'--targetDir=/home/johne/Desktop/Nyuzi/NyuziToolchain/build/tools/lldb/source/../scripts'
'--cfgBldDir=/home/johne/Desktop/Nyuzi/NyuziToolchain/build/tools/lldb/source/../scripts'
'--prefix=/home/johne/Desktop/Nyuzi/NyuziToolchain/build'
'--cmakeBuildConfiguration=.', '-m'] (-1)
tools/lldb/source/CMakeFiles/liblldb.dir/build.make:282: recipe for target 'lib/liblldb.so.3.7.0' failed
make[2]: *** [lib/liblldb.so.3.7.0] Error 255
CMakeFiles/Makefile2:12189: recipe for target 'tools/lldb/source/CMakeFiles/liblldb.dir/all' failed
make[1]: *** [tools/lldb/source/CMakeFiles/liblldb.dir/all] Error 2
Makefile:133: recipe for target 'all' failed
make: *** [all] Error 2
johne@ouabache:~/Desktop/Nyuzi/NyuziToolchain/build$


John Eaton


---------------------------------------
Posted through http://www.FPGARelated.com
 
On Thursday, February 12, 2015 at 7:04:38 PM UTC-8, jt_eaton wrote:
> I tried building your toolchain on both a 32 and 64 bit AMD Ubuntu 14.10
> system and get:
>
> Linking CXX shared library ../../../lib/liblldb.so
> Python script sym-linking LLDB Python API
> Program error: Invalid parameters entered, -h for help.
> You entered:

It looks like LLDB was not building correctly when the build type wasn't set (I normally build with Debug). I pushed a change to the cmake files that should address this. Let me know if that fixes it.

Thanks

--Jeff
 
On Friday, February 13, 2015 at 6:37:53 PM UTC-8, jt_eaton wrote:
> That fixed it. Ran all the tests and got the picture in the frame buffer.

Great!

> Do any of the tests run verilator to create a vcd dump file?

Yep. All of the cosimulation tests run in Verilator. The compiler tests can be made to run in Verilator by setting USE_VERILATOR=1 in the shell environment. The render tests have a target 'verirun' that will run them in Verilator (there are READMEs in those directories with more details).

VCD dumps aren't produced by default, but can be enabled by modifying the makefile in the rtl/ directory, uncommenting the line:

VERILATOR_OPTIONS=--trace --trace-structs

And rebuilding. A file 'trace.vcd' will be written in the same directory. The output files get big fast for non-trivial tests. :)
 
That fixed it. Ran all the tests and got the picture in the frame buffer.


Do any of the tests run verilator to create a vcd dump file?


John Eaton



OK, I found it. Are all of your tests using the same VCD dump file?

John Eaton


 
On Wed, 11 Feb 2015 09:09:18 -0800, jeffbush001 wrote:

> I've been designing an open source Larrabee-esque GPGPU processor in
> SystemVerilog and I thought people might find it interesting. Full
> source code, documentation, tests, tools, etc. are available on github:
>
> https://github.com/jbush001/NyuziProcessor
>
> The processor supports a wide, predicated vector floating point pipeline
> with 16 lanes and multiple hardware threads to hide memory and
> computation latency. It also supports multiple cache coherent cores.
> I've created an LLVM backend for this, so C/C++ code can be compiled for
> it. It includes support for first class vector types using the GCC
> vector extensions, as well as intrinsics to expose specialized
> instructions.

How many gates does it take once synthesized? Are there any
Altera-specific constructs in the code, or is it portable?
 
On Sunday, February 15, 2015 at 6:17:13 AM UTC-8, Aleksandar Kuktin wrote:

> How many gates does it take once synthesized? Are there any
> Altera-specific constructs in the code, or is it portable?

The default configuration with 1 core takes around 70k LEs on Altera. Almost all of the design is generic behavioral RTL without custom megafunctions. The exceptions are the SRAM and FIFO modules, which generally need to be tweaked for the specific target to infer properly.
 
On Sun, 15 Feb 2015 07:37:13 -0800, jeffbush001 wrote:

> On Sunday, February 15, 2015 at 6:17:13 AM UTC-8, Aleksandar Kuktin
> wrote:
>
> > How many gates does it take once synthesized? Are there any
> > Altera-specific constructs in the code, or is it portable?
>
> The default configuration with 1 core takes around 70k LEs on Altera.
> Almost all of the design is generic behavioral RTL without custom
> megafunctions. The exceptions are the SRAM and FIFO modules, which
> generally need to be tweaked for the specific target to infer properly.

Okay, so this sounds fun. Gonna clone it and see what's inside. :)
 
