Julian Kemmerer
Hi folks,
Here to talk about PipelineC.
https://github.com/JulianKemmerer/PipelineC/wiki
What is it?
- A C-like, almost-hardware-description language
- A compiler that produces VHDL for specific devices/operating frequencies
I am looking for:
- anyone who wants to help me develop it (Python, VHDL, C)
- suggestions on how to make PipelineC more useful/new features
- project ideas (heyo open source folks)
In the meantime, I am also here to share my most interesting example so far: using PipelineC with an AWS F1 instance.
https://github.com/JulianKemmerer/PipelineC/wiki/AWS-F1-DMA-Example
I have made an AMI that you can play around with. However, it cannot be made public; I can only share it with specific AWS accounts, so please message me if interested.
I want to share with you why I think PipelineC is particularly powerful:
First, it can mostly replace VHDL/Verilog for describing low-level, clock-by-clock hardware control logic. Consider the following generic VHDL:
-- Combinatorial logic with a storage register
signal the_reg : some_type_t;
signal the_wire : some_type_t;
process(input, the_reg) is -- inputs sync to clk
  variable input_variable : some_type_t;
  variable the_reg_variable : some_type_t;
begin
  input_variable := input;
  the_reg_variable := the_reg;
  -- ... Do work with 'input_variable', 'the_reg_variable'
  -- ... and other variables, functions, etc. - it kinda looks like C ...
  the_wire <= the_reg_variable;
end process;
the_reg <= the_wire when rising_edge(clk); -- register update (VHDL-2008 style)
output <= the_wire;
The equivalent PipelineC is:
some_type_t the_reg; // Global variable = a register
some_type_t some_func_name(some_type_t input)
{
  // ... Do work with 'input', 'the_reg'
  // ... and other variables, functions, etc. ...
  // Return == output
  return the_reg;
}
Using that functionality I was able to write very RTL-esque serialize+deserialize logic for the AXI4 interface that the AWS F1 shell logic provides to 'customer logic' for DMA. The AXI4 is deserialized into a stream of 4096-byte input data chunks that can be processed by a 'work' function.
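To give a flavor of that style, here is a rough sketch of what such a deserializer could look like (the type names, field names, and 64-byte bus width below are my illustrative assumptions, not the actual example code). As with 'the_reg' earlier, global variables behave like registers:

// Illustrative sketch only - type/field names and bus width assumed
uint8_t msg_buffer[4096]; // Bytes accumulated so far (registers)
uint32_t byte_pos;        // Write position within the buffer
dma_msg_s deserializer(axi4_write_t axi)
{
  dma_msg_s rv;
  uint32_t b;
  rv.valid = 0;
  if(axi.wvalid)
  {
    // Copy one 64-byte bus beat into the buffer (loop unrolls into hardware)
    for(b=0; b<64; b=b+1)
    {
      msg_buffer[byte_pos + b] = axi.wdata[b];
    }
    byte_pos = byte_pos + 64;
  }
  // Once a full 4096-byte message has arrived, present it for one cycle
  if(byte_pos == 4096)
  {
    for(b=0; b<4096; b=b+1)
    {
      rv.data[b] = msg_buffer[b];
    }
    rv.valid = 1;
    byte_pos = 0;
  }
  return rv;
}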
I find that most HLS tools have trouble giving the user this sort of low-level control, probably under the assumption that it's too low-level and not something software folks should be concerned with. Most hardware description languages are built for exactly this, though.
Second, PipelineC can replace the most basic feature of other HLS tools: auto-pipelining functions.
This AWS example sums 1024 floating point values via an N-clock-cycle pipelined binary tree of 1023 floating point adders (soft logic, not hard cores yet).
Below is the PipelineC code:
float work(float inputs[1024])
{
  // All the nodes of the tree in arrays so they can be written using loops
  // ~log2(N) levels, max of N values in parallel
  float nodes[11][1024]; // Unused elements optimize away
  // Assign inputs to level 0
  uint32_t i;
  for(i=0; i<1024; i=i+1)
  {
    nodes[0][i] = inputs[i];
  }
  // Do the computation starting at level 1
  uint32_t n_adds;
  n_adds = 1024/2;
  uint32_t level;
  for(level=1; level<11; level=level+1)
  {
    // Parallel sums at this level
    for(i=0; i<n_adds; i=i+1)
    {
      nodes[level][i] =
        nodes[level-1][i*2] + nodes[level-1][(i*2)+1];
    }
    // Each level halves the number of adders in the next level
    n_adds = n_adds / 2;
  }
  // Return the last node in the tree
  return nodes[10][0];
}
(To be clear, I am NOT claiming that this is the best way to sum floats in hardware - it's just a basic example big enough to use most of the FPGA.)
The PipelineC tool inserts pipeline registers as needed to meet timing on the particular device technology + operating frequency. I find that most HLS tools are pretty good at this (and will do a lot more than inferring pipelines too), but they often require some ugly pragmas that, in a way, can make the code undesirably device-specific. Hardware description languages can certainly describe the above hardware, but the code will almost certainly describe a pipeline specific to one device technology/operating frequency - making the code hard for others to reuse even if you are kind enough to share it.
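For a sense of what those pragmas look like, here is a schematic vendor-HLS version of the same idea (pragma spellings follow Vivado HLS conventions; this is illustrative only, not code from this project, and a real floating point reduction would likely need extra coaxing):

#include <stdint.h>
// The pragmas, not the code, carry the pipelining intent -
// and they tie the source to one tool/device flow.
float work_hls(float inputs[1024])
{
#pragma HLS PIPELINE II=1 // request a new input every clock
#pragma HLS ARRAY_PARTITION variable=inputs complete
  float sum = 0.0f;
  uint32_t i;
  for(i=0; i<1024; i=i+1)
  {
#pragma HLS UNROLL // ask for parallel adders
    sum = sum + inputs[i];
  }
  return sum;
}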
The very capable Virtex UltraScale+ AWS hardware allows the PipelineC tool to fit the work() function into a pipeline depth/latency of 15 clock cycles (it might be able to squeeze into as few as 10 clocks). Running at 125 MHz (an 8 ns cycle time), it is thus capable of summing 1024 floating point values in 15 x 8 ns = 120 nanoseconds, while accepting new inputs every cycle.
work() Pipeline:
- Frequency: 125 MHz, new inputs each cycle
- Latency: 15 clocks / 120 ns
- Resources: 322144 LUTs, 137181 registers, 16307 CARRY8s, 62664 CLBs
Here is the 'main' function / top level for the full hardware implementation:
aws_fpga_dma_outputs_t aws_fpga_dma(aws_fpga_dma_inputs_t i)
{
  // Pull messages out of incoming DMA write data
  dma_msg_s msg_in;
  msg_in = deserializer(i.pcis);
  // Convert incoming DMA message bytes to 'work' inputs
  work_inputs_t work_inputs;
  work_inputs = bytes_to_inputs(msg_in.data);
  // Do some work
  work_outputs_t work_outputs;
  work_outputs = work(work_inputs);
  // Convert 'work' outputs into outgoing DMA message bytes
  dma_msg_s msg_out;
  msg_out.data = outputs_to_bytes(work_outputs);
  msg_out.valid = msg_in.valid;
  // Put output message into outgoing DMA read data when requested
  aws_fpga_dma_outputs_t o;
  o.pcis = serializer(msg_out, i.pcis.arvalid);
  return o;
}
On the software side, using the FPGA hardware via user-space file I/O calls looks like:
// Do work() using the FPGA hardware
work_outputs_t work_fpga(work_inputs_t inputs)
{
  // Convert inputs into bytes
  dma_msg_t write_msg;
  write_msg = inputs_to_bytes(inputs);
  // Write those DMA bytes to the FPGA
  dma_write(write_msg);
  // Read DMA bytes back from the FPGA
  dma_msg_t read_msg;
  read_msg = dma_read();
  // Convert bytes to outputs and return
  work_outputs_t work_outputs;
  work_outputs = bytes_to_outputs(read_msg);
  return work_outputs;
}
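For reference, here is a minimal sketch of what those dma_write()/dma_read() helpers might look like using plain POSIX file I/O (the XDMA-style device paths and the dma_msg_t layout are assumptions on my part; error handling omitted):

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define DMA_MSG_SIZE 4096
// Assumed layout: a message is just the raw 4096 bytes
typedef struct dma_msg_t { uint8_t data[DMA_MSG_SIZE]; } dma_msg_t;

void dma_write(dma_msg_t msg)
{
  // Host-to-card DMA channel exposed by an XDMA-style driver
  int fd = open("/dev/xdma0_h2c_0", O_WRONLY);
  pwrite(fd, msg.data, DMA_MSG_SIZE, 0); // write message at offset 0
  close(fd);
}

dma_msg_t dma_read()
{
  dma_msg_t msg;
  // Card-to-host DMA channel
  int fd = open("/dev/xdma0_c2h_0", O_RDONLY);
  pread(fd, msg.data, DMA_MSG_SIZE, 0); // read message from offset 0
  close(fd);
  return msg;
}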
So there you have it: low-level, RTL-like control working right beside highly pipelined logic, all in a familiar C look that can simply be compiled with gcc for 'simulation'. For example, this demo uses the same work() function code both as the hardware description and as the 'golden C model' compiled with gcc to compare against.
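Concretely, the comparison harness can be as simple as the following (the input generator and the output field name are assumptions for illustration):

#include <stdio.h>
int main()
{
  work_inputs_t inputs = rand_work_inputs(); // hypothetical test input generator
  work_outputs_t sw = work(inputs);          // same work() code, run on the CPU
  work_outputs_t hw = work_fpga(inputs);     // same work() code, run on the FPGA via DMA
  // 'sum' field name assumed for illustration
  printf("sw=%f hw=%f %s\n", sw.sum, hw.sum,
         (sw.sum == hw.sum) ? "MATCH" : "MISMATCH");
  return 0;
}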
In the sense that C abstracts away the hardware specifics of each CPU architecture + memory model, but only at a very minimal level, I want PipelineC to be the same for digital logic. The same PipelineC code should produce computationally equivalent hardware on any FPGA/ASIC device technology through smarts in the compiler. But, like C, PipelineC obviously doesn't do everything; there isn't a whole lot of higher-level abstraction done for you. It's just the bedrock for building shareable libraries.
Some big features PipelineC lacks at the moment:
- Flow control / combinatorial feed-backward signals through N-clock pipelined logic
  - PipelineC can describe FIFOs and BRAMs (hard BRAM IP is the only IP supported right now) to work with data flows, but the equivalent of a bare combinatorial <= assignment operator feedback is missing
- Multiple clock domains / clock crossings (I have some neat ideas about this)
  - This would likely be my next big, many-month task
- The C parser I'm using doesn't let you return constant-sized arrays, though PipelineC as a language really should. If I modified the parser (oh gosh, help me?) and told people to compile this 'C code that returns arrays' with g++, I think it could work out.
Got any ideas on what you'd want to do with PipelineC? Let me know; maybe we can make something cool together. Want support for an open-source synthesis tool? I can give Yosys a try.
Thanks for your time, folks!