Open source Verilog BCH encoder/decoder

Russell Dill
As part of my research, I needed a BCH encoder/decoder engine. Sadly, no such thing has existed under a permissive license. Even more depressing is that many students seem to submit Verilog or VHDL engines as a project (or even as research), but never release anything usable.

Anyway, I'm releasing a BSD licensed Verilog BCH encoder/decoder. It offers:

* Parallel input/output
* Modular components that can be shared across multiple decoders
* Automatic selection of BCH parameters based on data size and errors to be corrected
* Specialized error locators for 1 error and 2 error codes
* Parallel or serial error polynomial generator for codes with 2 or more errors

https://github.com/russdill/bch_verilog

I'm releasing this under BSD because I'd like to see the code used as widely as possible, but I'd still like to get feedback and hopefully improvements merged back in.

As an example, a decoder for a 512 byte data block that corrects up to 12 errors with an 8 bit wide input and an 8 bit wide output currently occupies 1635 slices and operates at up to 191 MHz on a Virtex-6 LX240T-3. Such a decoder would take input for 532 clock cycles (512 data bytes, 20 ecc bytes), calculate for about 28 clock cycles, and then produce output for 512 clock cycles.
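For anyone wanting to sanity-check those numbers: in the usual binary BCH construction, GF(2^m) is the smallest field whose codeword length 2^m - 1 covers the data plus up to m*t parity bits. A quick Python sketch (my own back-of-the-envelope, not code from the repository) reproduces the figures above:

```python
import math

def bch_params(data_bits, t, bus_bits):
    """Pick the smallest GF(2^m) that fits the shortened code, then
    derive the ECC size and transfer cycle counts.

    Assumes the common binary BCH construction where the generator
    needs at most m*t parity bits (an upper bound; the actual degree
    can be slightly lower for some codes)."""
    m = 2
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    ecc_bits = m * t                      # upper bound on parity bits
    ecc_bytes = math.ceil(ecc_bits / 8)
    in_cycles = math.ceil((data_bits + ecc_bytes * 8) / bus_bits)
    out_cycles = math.ceil(data_bits / bus_bits)
    return m, ecc_bytes, in_cycles, out_cycles

# The decoder described above: 512-byte block, t=12, 8-bit bus
print(bch_params(512 * 8, 12, 8))   # (13, 20, 532, 512)
```

This matches the post: GF(2^13), 20 ECC bytes, 532 input cycles, 512 output cycles.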

The code currently compiles on Icarus Verilog (latest git) and Xilinx XST/ISim (tested with 14.5).
 
The link is expired. Can you share it? I am also doing research on BCH encoders and decoders.

Thank you.
 
wrote in message
news:961f3a7b-97e1-42f2-81d3-1b6092248b54@googlegroups.com...

The link is expired. Can you share it? I am also doing research on BCH encoders and decoders.


https://www.google.com/search?as_q=bch+encoder&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=&gws_rd=ssl
 
On Friday, January 2, 2015 8:19:22 PM UTC-8, manoj...@gmail.com wrote:
The link is expired. Can you share it? I am also doing research on BCH encoders and decoders.

Which link is expired? The only link in the post is to github, which is fine.
 
On Monday, June 23, 2014 at 1:35:52 AM UTC-7, Russell Dill wrote:

Awesome code, thanks for making it available!

Simulations run great on Icarus.

However, I'm having execution-time trouble on XST.

On a ThinkPad T540p with 16 GB DDR3, a DATA_BITS=1024, T=8, BITS=8 configuration is still synthesizing after 12 hours (note that 1 of 4 CPUs is fully utilized, so it's not a case of continual memory swapping).

What compute resources did you use for DATA_BITS=4096, T=12, BITS=16 on XST? How long did it take?

I am using -loop_iteration_limit 2048 (for my case, 1024 barfs out) and -opt_level 2.

Thanks!
 
On Sunday, May 10, 2015 at 9:33:19 AM UTC-7, pau...@gmail.com wrote:

I've pushed some updates related to corner cases and syndrome computation; go ahead and pull and give it another try.

The main thing that will make XST run "forever" is swapping due to lack of RAM. For synthesizing single channel decoders, I'd recommend at least 16GB. For multi-channel, 32GB.
 
On Sunday, May 10, 2015 at 2:42:36 PM UTC-7, Russell Dill wrote:

Thanks Russell.

I believe you've only changed the Makefile, yes?

What's your approximate synthesis time for sim.v with DATA_BITS=1024, T=8, BITS=8?

I'm not sure what you mean by single / multiple channels.

Cheers,

-Paul
 
On Sunday, May 10, 2015 at 3:23:32 PM UTC-7, pau...@gmail.com wrote:
I believe you've only changed the Makefile, yes?

No, you can see the full list of changes here:

https://github.com/russdill/bch_verilog/compare/cfd444733f...cee257ae47

What's your approximate synthesis time for sim.v with DATA_BITS=1024, T=8, BITS=8?

tb_sim.v and sim.v were not intended to be synthesizable.

I'm not sure what you mean by single / multiple channels.

Running multiple decoders in parallel.
 
I believe you've only changed the Makefile, yes?

No, you can see the full list of changes here:

https://github.com/russdill/bch_verilog/compare/cfd444733f...cee257ae47

Sorry for the imprecise question; my first pull was a week ago, so I meant since then. I believe only the Makefile has changed, but I may be wrong again :)

What's your approximate synthesis time for sim.v with DATA_BITS=1024, T=8, BITS=8?

tb_sim.v and sim.v were not intended to be synthesizable.

Sure, bad question again. I hacked your sim.v (and unfortunately kept the same name ...) to contain bch_syndrome, bch_errors_present, bch_sigma_bma_parallel, and bch_error_tmec (plus hook-ups, etc.).

If I try to synthesize the whole thing with T=12, DATA_BITS=4096, I hit a wall (on a 16 GB machine, 1 BCH channel only).

So I narrowed it down:

bch_syndrome, bch_errors_present, and bch_sigma_bma_parallel all synthesize individually in 10 minutes or fewer and use < 2 GB of DDR, even with T=64, DATA_BITS=8192 (more than 5x the T and 2x the DATA_BITS above).

However, bch_error_tmec ALONE with just T=12, DATA_BITS=4096 takes an hour and a half and reaches 7 GB of DDR utilization, with a funny pattern of slow ramp-ups and sharp declines. It doesn't use up all the available DDR, though; there are 5 or more GB free at all times.

T=64, DATA_BITS=8192 barfs.

I tried chien separately and got the same result as with bch_error_tmec.

Do you expect chien to be so much harder to synthesize than all the rest? From my past experience with Reed-Solomon, I sort of expected Berlekamp-Massey and Chien to be of somewhat comparable complexity.

Thanks!

-Paul
 
On Monday, May 11, 2015 at 6:13:58 PM UTC-7, pau...@gmail.com wrote:
Sorry for the imprecise question; my first pull was a week ago, so I meant since then. I believe only the Makefile has changed, but I may be wrong again :)

You are looking at the dates of the commits. The commits happened a while ago, but were only recently pushed.

If I try to synthesize the whole thing with T=12, DATA_BITS=4096, I hit a wall (on a 16 GB machine, 1 BCH channel only).

Here's xilinx_error_tmec with PIPELINE_STAGES=2, DATA_BITS=4096, T=12, BITS=8, REG_RATIO=8

xst:
524.43user 4.14system 8:42.02elapsed 101%CPU (0avgtext+0avgdata 13612104maxresident)k
0inputs+26104outputs (0major+4229163minor)pagefaults 0swaps

map:
11.68user 0.12system 0:11.81elapsed 100%CPU (0avgtext+0avgdata 581040maxresident)k
0inputs+184outputs (0major+155622minor)pagefaults 0swaps

So XST is using nearly 14 GB. Depending on your machine's configuration, 16 GB of physical memory would not have been enough. Incidentally, the decoder uses 612 slices and runs at 200 MHz or better.

Do you expect chien to be so much harder to synthesize than all the rest? From my past experience with Reed-Solomon, I sort of expected Berlekamp-Massey and Chien to be of somewhat comparable complexity.

The way I'm dynamically compiling the chien modules is giving XST a hard time. Also, going bit-parallel, you do need a lot of parallel multipliers for T=64 (some 512 of them). I've created quite a few variants to get the most out of XST, each variant with its own strengths and weaknesses. If you change the multiplier in bch_chien_expand from a parallel_standard_multiplier_const1 to a parallel_standard_multiplier, you can save some memory, but not the orders of magnitude required.
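For reference, the 512 figure falls out of the Chien structure: a bit-serial search keeps one register per sigma coefficient and multiplies register j by the constant alpha^j every cycle, so an 8-bit-parallel search over a degree-64 sigma needs roughly 64 x 8 = 512 constant multipliers. A toy software sketch over GF(2^4) showing the register/constant-multiplier structure (my own illustration with an assumed primitive polynomial, not how the repo organizes it):

```python
M = 4
PRIM = 0b10011  # x^4 + x + 1, a primitive polynomial for GF(2^4)

def gf_mul(a, b):
    """Carry-less multiplication reduced modulo PRIM."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
    return r

def make_sigma(roots):
    """Build sigma(x) as the product of (x + r) over the given roots."""
    sigma = [1]
    for r in roots:
        nxt = [0] * (len(sigma) + 1)
        for j, c in enumerate(sigma):
            nxt[j] ^= gf_mul(c, r)   # contribution r * c * x^j
            nxt[j + 1] ^= c          # contribution c * x^(j+1)
        sigma = nxt
    return sigma

def chien_search(sigma):
    """Bit-serial Chien search: after i steps, register j holds
    sigma[j] * alpha^(j*i), so XORing all registers evaluates
    sigma(alpha^i). Hardware replicates each per-register constant
    multiplier once per bit of output parallelism."""
    alpha = 2                                  # x is primitive here
    alpha_pow = [1]
    for _ in range(1, len(sigma)):
        alpha_pow.append(gf_mul(alpha_pow[-1], alpha))
    regs = list(sigma)
    roots, cur = [], 1
    for _ in range(1, 1 << M):                 # alpha^1 .. alpha^(2^M - 1)
        regs = [gf_mul(r, p) for r, p in zip(regs, alpha_pow)]
        cur = gf_mul(cur, alpha)
        acc = 0
        for r in regs:
            acc ^= r
        if acc == 0:
            roots.append(cur)
    return roots

print(sorted(chien_search(make_sigma([5, 8]))))  # [5, 8]
```

Each list element in `regs` corresponds to one hardware register with its own constant multiplier; that per-coefficient constant is what multiplies out to hundreds of multipliers at T=64.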

Here's an example at: PIPELINE_STAGES=1,DATA_BITS=4096,T=32,BITS=8,REG_RATIO=8

xst:
1925.94user 9.58system 31:49.83elapsed 101%CPU (0avgtext+0avgdata 28960864maxresident)k
565608inputs+50552outputs (353major+8315051minor)pagefaults 0swaps

map:
13.37user 0.15system 0:13.60elapsed 99%CPU (0avgtext+0avgdata 613264maxresident)k
2104inputs+208outputs (3major+163868minor)pagefaults 0swaps

So you can see that in this case XST is using about 29 GB of memory. I originally targeted the code for around T=3 to T=16. If you want to get up to T=64, you'll likely need to figure out why XST is taking so much memory synthesizing the multipliers, likely by paring things down bit by bit.

Additionally, the number of pipeline stages in error tmec is limited to 2 right now. You might need to go higher to get 64 13-bit terms summed together and compared with zero.
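To put a rough number on that summing step: addition in GF(2^m) is bitwise XOR, so summing 64 terms is a 64-input XOR tree per output bit, followed by a 13-bit zero compare. A back-of-the-envelope sketch (the 6-input-LUT capacity and levels-per-stage figures are my assumptions, not measured from the toolchain):

```python
import math

def xor_tree_stages(terms, lut_inputs=6, levels_per_stage=2):
    """Estimate LUT levels and pipeline stages for XOR-reducing
    `terms` inputs: a k-input LUT can absorb a k-way XOR, so the
    tree needs ceil(log_k(terms)) levels; the stage count then
    depends on how many LUT levels fit in one clock period."""
    levels = math.ceil(math.log(terms, lut_inputs))
    return levels, math.ceil(levels / levels_per_stage)

# 64 terms, 6-input LUTs, assuming ~2 LUT levels per pipeline stage
print(xor_tree_stages(64))   # (3, 2)
```

Under these assumptions the XOR tree itself is shallow; in practice the constant multipliers feeding it contribute their own logic levels, which is where the extra stages would likely go.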

 
