I've been designing an open-source Larrabee-esque GPGPU processor in SystemVerilog, and I thought people might find it interesting. Full source code, documentation, tests, tools, etc. are available on GitHub:
https://github.com/jbush001/NyuziProcessor
The processor supports a wide, predicated vector floating-point pipeline with 16 lanes and multiple hardware threads to hide memory and computation latency. It also supports multiple cache-coherent cores. I've created an LLVM backend, so C/C++ code can be compiled for the processor. It supports first-class vector types using the GCC vector extensions, as well as intrinsics to expose specialized instructions.
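For example (a minimal sketch; the typedef and function names here are illustrative, not taken from the repository), a 16-lane float vector can be declared with the GCC vector extensions and operated on element-wise, which the compiler then maps onto the vector pipeline:

    /* 16 floats x 4 bytes = a 64-byte vector, matching the 16-lane pipeline. */
    typedef float vec16f __attribute__((vector_size(64)));

    /* Element-wise multiply-add across all 16 lanes. */
    vec16f multiply_add(vec16f a, vec16f b, vec16f c)
    {
        return a * b + c;
    }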
I've written a 3D engine (software/librender) that is optimized to take advantage of both the vector unit and multiple cores/hardware threads. Here's a video of the standard teapot (~2300 triangles) rendering on a single core on FPGA running at 50 MHz:
http://youtu.be/DsvZorBu4Uk
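To give a flavor of how a renderer can exploit the lanes (a hedged sketch with made-up names; see software/librender for what the engine actually does), one natural mapping is one lane per pixel in a 4x4 block, evaluating a triangle edge function for all 16 pixels at once and producing a coverage mask for the predicated vector instructions:

    typedef float vec16f __attribute__((vector_size(64)));
    typedef int vec16i __attribute__((vector_size(64)));

    /* Hypothetical: evaluate the edge function a*x + b*y + c for a 4x4
       pixel block in one shot, one pixel per lane. The comparison yields
       a per-lane mask (-1 inside the edge, 0 outside). */
    vec16i edge_mask(float a, float b, float c, vec16f px, vec16f py)
    {
        vec16f e = a * px + b * py + c;
        vec16f zero = {0};
        return e >= zero;
    }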
This image shows the emulator rendering Dabrovic's Sponza atrium, ~66k triangles. It took around 200 million instructions to render at 1024x768, spread across 8 virtual cores and 32 hardware threads:
http://i.imgur.com/sHAsAU5.png
My main purpose in designing this was to experiment with processor architecture using real, empirical data. The neat thing about having the full source to a cycle-accurate hardware design is that it is infinitely instrumentable. I've kept notes on some of my findings here:
http://latchup.blogspot.com/
Anyway, comments and suggestions are appreciated, and I'm happy to take contributions if people are interested in hacking on it.