J
Julian Kemmerer
Guest
Hi folks looking for feedback on PipelineC. Ideas of what to implement next..
I will point you to a recent reddit post which ultimately points to GitHub.
https://www.reddit.com/r/FPGA/comments/d0x2p5/serial_8x8_dct_in_pipelinec_lower_resource_usage/
Here is the code to get you interested:
// This is the unrolled version of the original dct copy-and-pasted algorithm
// https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/
// PipelineC iterations of dctTransformUnrolled are used
// to unroll the calculation serially in O(n^4) time
// Input 'matrix' and start=1 to begin calculation
// Input 'matrix' must stay constant until return .done
// 'sum' accumulates over iterations/clocks and should be pipelined
// So 'sum' must be a volatile global variable
// Keep track of when sum is valid and be read+written
volatile uint1_t dct_volatiles_valid;
// sum will temporarily store the sum of cosine signals
volatile float dct_sum;
// dct_result will store the discrete cosine transform
// Signal that this is the iteration containing the 'done' result
typedef struct dct_done_t
{
float matrix[DCT_M][DCT_N];
uint1_t done;
} dct_done_t;
volatile dct_done_t dct_result;
dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start)
{
// Assume not done yet
dct_result.done = 0;
// Start validates volatiles
if(start)
{
dct_volatiles_valid = 1;
}
// Global func to handle getting to BRAM
// 1) Lookup constants from BRAM (using iterators)
// 2) Increment iterators
// Returns next iterators and constants and will increment when requested
dct_lookup_increment_t lookup_increment;
uint1_t do_increment;
// Only increment when volatiles valid
do_increment = dct_volatiles_valid;
lookup_increment = dct_lookup_increment(do_increment);
// Unpack struct for ease of reading calculation code below
float const_val;
const_val = lookup_increment.lookup.const_val;
float cos_val;
cos_val = lookup_increment.lookup.cos_val;
dct_iter_t i;
i = lookup_increment.incrementer.curr_iters.i;
dct_iter_t j;
j = lookup_increment.incrementer.curr_iters.j;
dct_iter_t k;
k = lookup_increment.incrementer.curr_iters.k;
dct_iter_t l;
l = lookup_increment.incrementer.curr_iters.l;
uint1_t reset_k;
reset_k = lookup_increment.incrementer.increment.reset_k;
uint1_t reset_l;
reset_l = lookup_increment.incrementer.increment.reset_l;
uint1_t done;
done = lookup_increment.incrementer.increment.done;
// Do math for this volatile iteration only when
// can safely read+write volatiles
if(dct_volatiles_valid)
{
// ~~~ The primary calculation ~~~:
// 1) Float * cosine constant from lookup table
float dct1;
dct1 = (float)matrix[k][l] * cos_val;
// 2) Increment sum
dct_sum = dct_sum + dct1;
// 3) constant * Float and assign into the output matrix
dct_result.matrix[j] = const_val * dct_sum;
// Sum accumulates during the k and l loops
// So reset when they are rolling over
if(reset_k & reset_l)
{
dct_sum = 0.0;
}
// Done yet?
dct_result.done = done;
// Reset volatiles once done
if(done)
{
dct_volatiles_valid = 0;
}
}
return dct_result;
}
What does this synthesize to?
Essentially a state machine where each state uses the same N clocks worth of logic to do work. (the body of dctTransformUnrolled).
Consider the 'execution' of the function in time order. The logic consists of:
~17% of time for getting lookup constants & incrementing the iterators (dct_lookup_increment), reading the [k][l] value out of input 'matrix'
~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier
~34% of time for the 2) Increment sum addition, a floating point adder
~21% of time for the 3) constant * Float, a floating point multiplier
~5% of time for the 3) assignment into the output matrix at [j]
That pipeline takes some fixed number of clock cycles N. That means every N clock cycles 'dct_volatiles_valid' will =1 (after being set at the start). The algorithm unrolls as O(n^4) for 4096 total iterations. So the total latency in clock cycles is N * 4096.
I will point you to a recent reddit post which ultimately points to GitHub.
https://www.reddit.com/r/FPGA/comments/d0x2p5/serial_8x8_dct_in_pipelinec_lower_resource_usage/
Here is the code to get you interested:
// This is the unrolled version of the original dct copy-and-pasted algorithm
// https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/
// PipelineC iterations of dctTransformUnrolled are used
// to unroll the calculation serially in O(n^4) time
// Input 'matrix' and start=1 to begin calculation
// Input 'matrix' must stay constant until return .done
// 'sum' accumulates over iterations/clocks and should be pipelined
// So 'sum' must be a volatile global variable
// Keep track of when sum is valid and be read+written
volatile uint1_t dct_volatiles_valid;
// sum will temporarily store the sum of cosine signals
volatile float dct_sum;
// dct_result will store the discrete cosine transform
// Signal that this is the iteration containing the 'done' result
typedef struct dct_done_t
{
float matrix[DCT_M][DCT_N];
uint1_t done;
} dct_done_t;
volatile dct_done_t dct_result;
dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start)
{
// Assume not done yet
dct_result.done = 0;
// Start validates volatiles
if(start)
{
dct_volatiles_valid = 1;
}
// Global func to handle getting to BRAM
// 1) Lookup constants from BRAM (using iterators)
// 2) Increment iterators
// Returns next iterators and constants and will increment when requested
dct_lookup_increment_t lookup_increment;
uint1_t do_increment;
// Only increment when volatiles valid
do_increment = dct_volatiles_valid;
lookup_increment = dct_lookup_increment(do_increment);
// Unpack struct for ease of reading calculation code below
float const_val;
const_val = lookup_increment.lookup.const_val;
float cos_val;
cos_val = lookup_increment.lookup.cos_val;
dct_iter_t i;
i = lookup_increment.incrementer.curr_iters.i;
dct_iter_t j;
j = lookup_increment.incrementer.curr_iters.j;
dct_iter_t k;
k = lookup_increment.incrementer.curr_iters.k;
dct_iter_t l;
l = lookup_increment.incrementer.curr_iters.l;
uint1_t reset_k;
reset_k = lookup_increment.incrementer.increment.reset_k;
uint1_t reset_l;
reset_l = lookup_increment.incrementer.increment.reset_l;
uint1_t done;
done = lookup_increment.incrementer.increment.done;
// Do math for this volatile iteration only when
// can safely read+write volatiles
if(dct_volatiles_valid)
{
// ~~~ The primary calculation ~~~:
// 1) Float * cosine constant from lookup table
float dct1;
dct1 = (float)matrix[k][l] * cos_val;
// 2) Increment sum
dct_sum = dct_sum + dct1;
// 3) constant * Float and assign into the output matrix
dct_result.matrix[j] = const_val * dct_sum;
// Sum accumulates during the k and l loops
// So reset when they are rolling over
if(reset_k & reset_l)
{
dct_sum = 0.0;
}
// Done yet?
dct_result.done = done;
// Reset volatiles once done
if(done)
{
dct_volatiles_valid = 0;
}
}
return dct_result;
}
What does this synthesize to?
Essentially a state machine where each state uses the same N clocks worth of logic to do work. (the body of dctTransformUnrolled).
Consider the 'execution' of the function in time order. The logic consists of:
~17% of time for getting lookup constants & incrementing the iterators (dct_lookup_increment), reading the [k][l] value out of input 'matrix'
~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier
~34% of time for the 2) Increment sum addition, a floating point adder
~21% of time for the 3) constant * Float, a floating point multiplier
~5% of time for the 3) assignment into the output matrix at [j]
That pipeline takes some fixed number of clock cycles N. That means every N clock cycles 'dct_volatiles_valid' will =1 (after being set at the start). The algorithm unrolls as O(n^4) for 4096 total iterations. So the total latency in clock cycles is N * 4096.