GZIP algorithm in FPGA

lenz
Hi,

I tried to find some references for lossless compression
algorithms implemented in FPGAs. I found a lot about
bitstream compression but (nearly) nothing about FPGA
implementations of lossless compression algorithms.

Which lossless compression algorithms are suited for FPGA
implementation?


Thank you.
 
lenz wrote:

Hi,

I tried to find some references for lossless compression
algorithms implemented in FPGAs. I found a lot about
bitstream compression but (nearly) nothing about FPGA
implementations of lossless compression algorithms.

Which lossless compression algorithms are suited for FPGA
implementation?


Some years ago I had a need for a losslessly compressed database,
and found some source code for the Lempel-Ziv-Welch (LZW) compression
program. I extracted the core code from it, inserted it into the
database read-write program, and got it working. I didn't dwell on the
details, but it looked pretty simple to me. It built a 'dictionary' of
multi-byte patterns it had seen, and assigned codes (9, then 10, 11,
12, etc. bits wide) to these patterns. These functions seem like they
could be done with an FPGA and an external RAM to hold the dictionary.
(The dictionary is implicit in the output stream, built one entry at a
time during compression and then rebuilt the same way when
decompressing.)
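
For illustration, here is a minimal LZW encoder sketch in C (this is not
the database code described above; the 4096-entry/12-bit code limit and
the decimal output are assumptions to keep it short - a real coder would
pack variable-width 9..12-bit codes into the output):

#include <stdio.h>

#define MAXCODES 4096                 /* codes grow from 9 up to 12 bits */

static int prefix[MAXCODES];          /* entry = (prefix code, appended byte) */
static int append[MAXCODES];
static int ncodes;

/* Look for the string (w followed by byte k) in the dictionary; -1 if absent. */
static int find(int w, int k)
{
    for (int i = 256; i < ncodes; i++)
        if (prefix[i] == w && append[i] == k)
            return i;
    return -1;
}

int main(void)
{
    int k, w, c;
    ncodes = 256;                     /* codes 0..255 are the single bytes */
    if ((w = getchar()) == EOF)
        return 0;
    while ((k = getchar()) != EOF) {
        c = find(w, k);
        if (c >= 0) {
            w = c;                    /* current match extends by one byte */
        } else {
            printf("%d\n", w);        /* emit code for the longest match */
            if (ncodes < MAXCODES) {  /* grow the dictionary by one entry */
                prefix[ncodes] = w;
                append[ncodes] = k;
                ncodes++;
            }
            w = k;                    /* restart matching at the new byte */
        }
    }
    printf("%d\n", w);                /* flush the final code */
    return 0;
}

The find() lookup is the piece that maps naturally onto a CAM or a hashed
RAM in an FPGA, with the external RAM holding the prefix/append tables.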

You should be able to find some source code for LZW somewhere.

Jon
 
lenz19@gmx.de (lenz) wrote in message news:<7f673150.0402171059.2cf41c85@posting.google.com>...
I tried to find some references for lossless compression
algorithms implemented in FPGAs. I found a lot about
bitstream compression but (nearly) nothing about FPGA
implementations of lossless compression algorithms.

Which lossless compression algorithms are suited for FPGA
implementation?

Since you mentioned GZIP, I will start with Lempel-Ziv (LZ) based
compression schemes:

Huang, Saxena and McCluskey "A Reliable Compressor on
Reconfigurable Coprocessors"
http://csdl.computer.org/comp/proceedings/fccm/2000/0871/00/08710249abs.htm

Jung and Burleson. "Efficient VLSI for Lempel-Ziv Compression in
Wireless Data Communication Networks"
http://citeseer.nj.nec.com/359751.html

It should be noted that the systolic array implementation described in
these papers would violate the Stac patents. Basically, Stac has
patented the notion of using a shift register to store the history
buffer. Though I personally believe the concept of storing the
dictionary in a shift register to be rather obvious... we have opted
not to use a shift register implementation so as not to risk
litigation.

We are in the process of completing an LZ-based compression core that
is very similar to LZS, but does not violate the patents. The
performance in a Xilinx V2Pro is expected to be on the order of 100M
bytes/second, with a 512-byte dictionary. We expect to release this
core for Xilinx V2/V2Pro/S3 and Altera Stratix by the end of March.

These papers should at least get you started. Large CAMs are not a
particularly FPGA-friendly structure, but similar structures can be
implemented effectively.

For decompression, sequential implementations will be just as fast as
parallel ones, as on decompression only one location from the
dictionary needs to be read per output. For compression, a parallel
implementation will allow for processing up to 100M tokens per second.
A sequential approach will be much slower.
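
To make that asymmetry concrete, here is a toy C sketch of the
decompression side (the token format and the 512-byte history are
illustrative assumptions, not the actual LZS format). Each output byte
costs exactly one history read, so a sequential decompressor loses
nothing; the compressor, by contrast, has to compare each incoming byte
against many history positions at once to find matches.

#include <stdio.h>
#include <stdint.h>

#define HIST 512

static uint8_t history[HIST];
static unsigned pos;                       /* write position, wraps mod HIST */

static void emit(uint8_t b)                /* one history write per output byte */
{
    putchar(b);
    history[pos++ & (HIST - 1)] = b;
}

static void literal(uint8_t b) { emit(b); }

static void match(unsigned offset, unsigned length)
{
    for (unsigned i = 0; i < length; i++)  /* one history read per output byte */
        emit(history[(pos - offset) & (HIST - 1)]);
}

int main(void)
{
    /* hand-made token stream: literals "abab", then copy(offset 4, length 4) */
    literal('a'); literal('b'); literal('a'); literal('b');
    match(4, 4);                           /* prints "abababab" in total */
    putchar('\n');
    return 0;
}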

LZ compression is a very good approach for "generic" data, in areas
such as storage or networking. Since you did not mention what sort of
data you were going to compress, I will make brief mention of non-LZ
approaches.

IBM's ALDC is something worth studying to see an alternative approach
to compression. As far as patents go... the adaptive nature of the
algorithm is really quite novel, and dare I say mathematically quite
cool.

If it is images, one is much better off going with an algorithm that
is tailored to this sort of data set, for example LOCO-I/JPEG-LS for
continuous-tone images. We are wrapping up an implementation of this
algorithm, and our first hardware implementation is running at 35M
12-bit pixels/sec using ~65% of an XC2V1000-4. The challenge with this
algorithm is in the pipelining, as each pixel's prediction, error
calculation, and statistics update must be completed prior to
processing the next pixel. We expect the final implementation to be
quite a bit faster and slightly smaller.
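
For reference, the prediction step of LOCO-I/JPEG-LS is the small median
edge detector shown below (a C sketch of just the predictor; the context
modelling, bias correction and Golomb coding around it are omitted). It
is the per-pixel feedback of the prediction error into the context
statistics that forms the critical loop mentioned above.

#include <stdio.h>

/* JPEG-LS / LOCO-I median edge detector (MED) predictor for one pixel,
 * given the left (a), above (b), and above-left (c) neighbours. */
static int med_predict(int a, int b, int c)
{
    int mn = a < b ? a : b;
    int mx = a < b ? b : a;
    if (c >= mx) return mn;     /* edge above or to the left */
    if (c <= mn) return mx;
    return a + b - c;           /* smooth region: planar prediction */
}

int main(void)
{
    /* x = 100 with neighbours a=98, b=102, c=99 -> prediction 101, error -1 */
    int x = 100, pred = med_predict(98, 102, 99);
    printf("prediction %d, error %d\n", pred, x - pred);
    return 0;
}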

One of the challenges faced when implementing compression algorithms
in hardware is that the algorithms are most often specified in the
standards documents as C pseudocode. So implementation often requires
translating a substantially serial description into a pipelineable
design.

The more significant challenge with implementing these algorithms in
hardware is that each pixel/token is completely dependent upon the
error and updated statistics from the previous pixel/token. If you can
identify what this minimal loop is, and put all other logic into
pipelines before and after this loop, your performance will be limited
solely by the minimal loop. Most of these algorithms look at
minimizing the total compute required, rather than minimizing the
complexity of this minimal loop. The former will dictate software
performance, the latter hardware performance, in optimal
implementations. As an example, JPEG-LS in near-lossless mode is only
marginally slower in software, but only about half as fast when
implemented in hardware, in each case compared to JPEG-LS in lossless
mode.

I hope these comments at least offer a brief introduction. I also hope
that my comments were not too commercial in nature, having mentioned
two cores that we are getting ready to release.


Regards,
Erik Widding.

---
Birger Engineering, Inc. -------------------------------- 617.695.9233
100 Boylston St #1070; Boston, MA 02116 -------- http://www.birger.com
 
Hello,

lenz19@gmx.de (lenz) writes:

Which lossless compression algorithms are suited for FPGA
implementation?

If you know something in advance about the type of data (e.g. English
language text), then Huffman coding could be a quick and small
solution. It could, for example, easily be done with a lookup table.
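
A sketch of the lookup-table approach in C (the code table contents here
are made up for illustration; a real table would be derived from the
expected symbol statistics):

#include <stdio.h>
#include <stdint.h>

struct code { uint16_t bits; uint8_t len; };

/* Toy static Huffman table for a 4-symbol alphabet ('e' most frequent). */
static const struct code table[4] = {
    /* 'e' */ { 0x0, 1 },   /* 0   */
    /* 't' */ { 0x2, 2 },   /* 10  */
    /* 'a' */ { 0x6, 3 },   /* 110 */
    /* 'q' */ { 0x7, 3 },   /* 111 */
};

static uint32_t acc;        /* bit accumulator, codes appended MSB first */
static int      nbits;

static void emit(struct code c)
{
    acc = (acc << c.len) | c.bits;
    nbits += c.len;
    while (nbits >= 8) {                       /* flush whole output bytes */
        printf("%02x ", (unsigned)((acc >> (nbits - 8)) & 0xff));
        nbits -= 8;
    }
}

int main(void)
{
    int msg[] = { 0, 1, 0, 2, 3, 0 };          /* symbol indices: e t e a q e */
    for (unsigned i = 0; i < sizeof msg / sizeof msg[0]; i++)
        emit(table[msg[i]]);
    if (nbits)                                 /* pad the final partial byte */
        printf("%02x ", (unsigned)((acc << (8 - nbits)) & 0xff));
    printf("\n");
    return 0;
}

In hardware the table is just a small ROM or LUT RAM indexed by the input
symbol, plus a shifter to pack the variable-length codes.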

Florian

--
int m,u,e=0;float l,_,I;main(){for(;1840-e;putchar((++e>907&&942>e?61-m:u)
["\t#*fg-pa.vwCh`lwp-e+#h`lwP##mbjqloE"]^3))for(u=_=l=0;79-(m=e%80)&&
I*l+_*_<6&&26-++u;_=2*l*_+e/80*.09-1,l=I)I=l*l-_*_-2+m/27.;}
 
"Florian-Wolfgang Stock" <f.stock@tu-bs.de> wrote in message
news:40334504$0$17577$9b4e6d93@newsread4.arcor-online.net...

int m,u,e=0;float l,_,I;main(){for(;1840-e;putchar((++e>907&&942>e?61-m:u)
["\t#*fg-pa.vwCh`lwp-e+#h`lwP##mbjqloE"]^3))for(u=_=l=0;79-(m=e%80)&&
I*l+_*_<6&&26-++u;_=2*l*_+e/80*.09-1,l=I)I=l*l-_*_-2+m/27.;}

Nice piece of code and it even compiles and works :) I am not sure I fully
understand the output though.. :)


/Mikhail

--
To reply directly:
matusov at square peg ca
(join the domain name in one word and add a dot before "ca")
 
"MM" <mbmsv@yahoo.com> writes:

"Florian-Wolfgang Stock" <f.stock@tu-bs.de> wrote in message
news:40334504$0$17577$9b4e6d93@newsread4.arcor-online.net...

int m,u,e=0;float l,_,I;main(){for(;1840-e;putchar((++e>907&&942>e?61-m:u)
["\t#*fg-pa.vwCh`lwp-e+#h`lwP##mbjqloE"]^3))for(u=_=l=0;79-(m=e%80)&&
I*l+_*_<6&&26-++u;_=2*l*_+e/80*.09-1,l=I)I=l*l-_*_-2+m/27.;}

Nice piece of code and it even compiles and works :) I am not sure I fully
understand the output though.. :)

The part with the email address is obvious; the thing around it is the
Mandelbrot set (in German also called "Apfelmaennchen"). And no, I
haven't implemented it on an FPGA yet (but it's just some complex
calculation, so it should not be difficult :)

Florian

 
We are in the process of completing an LZ-based compression core that
is very similar to LZS, but does not violate the patents. The
performance in a Xilinx V2Pro is expected to be on the order of 100M
bytes/second, with a 512-byte dictionary. We expect to release this
core for Xilinx V2/V2Pro/S3 and Altera Stratix by the end of March.

Good info, Erik. How does the above performance compare to a 2 or 3 GHz
64-bit processor?


Steve
 
Florian-Wolfgang Stock <f.stock@tu-bs.de> wrote in message news:<40334504$0$17577$9b4e6d93@newsread4.arcor-online.net>...
Hello,

lenz19@gmx.de (lenz) writes:

Which lossless compression algorithms are suited for FPGA
implementation?

If you know something in advance about the type of data (e.g. English
language text), then Huffman coding could be a quick and small
solution. It could, for example, easily be done with a lookup table.

Here is an implementation from 2000.
http://www.sulimma.de/prak/ss00/projekte/huffman/Huffman.html
It was a student project for my lab course.

Kolja Sulimma
 
news@sulimma.de (Kolja Sulimma) writes:

Florian-Wolfgang Stock <f.stock@tu-bs.de> wrote in message
news:<40334504$0$17577$9b4e6d93@newsread4.arcor-online.net>...
Hello,

lenz19@gmx.de (lenz) writes:

Which lossless compression algorithms are suited for FPGA
implementation?

If you know something in advance about the type of data (e.g. English
language text), then Huffman coding could be a quick and small
solution. It could, for example, easily be done with a lookup table.

Here is an implementation from 2000.
http://www.sulimma.de/prak/ss00/projekte/huffman/Huffman.html
It was a student project for my lab course.

That is exactly the sort of thing I had in mind when I mentioned it.
The restriction of needing advance information could be circumvented by
generating the lookup table dynamically (the drawback being that you
then need two passes over the data you want to code).

Florian
 
Florian-Wolfgang Stock wrote:
news@sulimma.de (Kolja Sulimma) writes:

Florian-Wolfgang Stock <f.stock@tu-bs.de> wrote in message
news:<40334504$0$17577$9b4e6d93@newsread4.arcor-online.net>...
Hello,

lenz19@gmx.de (lenz) writes:

Which lossless compression algorithms are suited for FPGA
implementation?

If you know something in advance about the type of data (e.g. English
language text), then Huffman coding could be a quick and small
solution. It could, for example, easily be done with a lookup table.

Here is an implementation from 2000.
http://www.sulimma.de/prak/ss00/projekte/huffman/Huffman.html
It was a student project for my lab course.

That is exactly the sort of thing I had in mind when I mentioned it.
The restriction of needing advance information could be circumvented by
generating the lookup table dynamically (the drawback being that you
then need two passes over the data you want to code).

Isn't there a variation that generates the table on the fly? I don't
know any details, but wouldn't they have this same problem in modem
compression, you can only see the data once? Or do they buffer up a
block before compressing? Back in the early days, an engineer talked to
me about the possibility of patenting a method that worked like this.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX
 
Hello,

rickman <spamgoeshere4@yahoo.com> writes:

Florian-Wolfgang Stock wrote:

news@sulimma.de (Kolja Sulimma) writes:

Here is an implementation from 2000.
http://www.sulimma.de/prak/ss00/projekte/huffman/Huffman.html
It was a student project for my lab course.

That is exactly the sort of thing I had in mind when I mentioned it.
The restriction of needing advance information could be circumvented by
generating the lookup table dynamically (the drawback being that you
then need two passes over the data you want to code).

Isn't there a variation that generates the table on the fly? I don't
know any details, but wouldn't they have this same problem in modem
compression, you can only see the data once? Or do they buffer up a
block before compressing? Back in the early days, an engineer talked to
me about the possibility of patenting a method that worked like this.

That isn't possible, at least not with normal Huffman coding. With the
ZIP algorithm it is possible to build a dictionary on the fly, so you
need only one pass.
Huffman coding is based on replacing frequently occurring characters
with short bit strings and rare ones with long bit strings. To know
which characters are frequent or rare you need to count them. There are
some ways to reduce this problem - the most common is to use a
precomputed dictionary based on the character distribution of average
English text. This works well if you want to encode normal text (or, if
you know what to expect, run a distribution analysis on some sample
data and build your dictionary from that).

Another approach would be to use an internal buffer of a given size and
build the dictionary from the distribution within that buffer alone.
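
For concreteness, a sketch of the two-pass version in C (hypothetical
code, not from the lab project linked above): pass one counts byte
frequencies; the code tree is then built by repeatedly merging the two
least frequent live nodes, and a second pass over the data would emit
the codes read off that tree.

#include <stdio.h>

#define NSYM 256

struct node { unsigned long freq; int left, right; };   /* -1 = leaf */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;

    /* Pass 1: count how often each byte value occurs. */
    unsigned long freq[NSYM] = {0};
    int c;
    while ((c = fgetc(f)) != EOF)
        freq[c]++;
    rewind(f);                     /* pass 2 would re-read and emit codes */

    /* Build the tree: repeatedly merge the two least frequent live nodes
     * (O(n^2) selection, which is fine for 256 symbols). */
    struct node nodes[2 * NSYM];
    int alive[2 * NSYM], nnodes = 0, nalive = 0;
    for (int s = 0; s < NSYM; s++)
        if (freq[s]) {
            nodes[nnodes] = (struct node){ freq[s], -1, -1 };
            alive[nalive++] = nnodes++;
        }
    while (nalive > 1) {
        int a = 0, b = 1;                       /* indices of the two smallest */
        if (nodes[alive[b]].freq < nodes[alive[a]].freq) { int t = a; a = b; b = t; }
        for (int i = 2; i < nalive; i++) {
            if (nodes[alive[i]].freq < nodes[alive[a]].freq) { b = a; a = i; }
            else if (nodes[alive[i]].freq < nodes[alive[b]].freq) b = i;
        }
        nodes[nnodes] = (struct node){ nodes[alive[a]].freq + nodes[alive[b]].freq,
                                       alive[a], alive[b] };
        int ia = a < b ? a : b, ib = a < b ? b : a;
        alive[ia] = nnodes++;                   /* parent replaces one child ... */
        alive[ib] = alive[--nalive];            /* ... the other is removed      */
    }
    /* Walking the tree from the root assigns a code length to every symbol;
     * the root frequency equals the total number of input bytes. */
    printf("root frequency = %lu\n", nalive ? nodes[alive[0]].freq : 0);
    fclose(f);
    return 0;
}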

Florian
 
You can also use a dynamic dictionary that uses only past history to determine
the contents. Here you would start with a default dictionary, and over time
new entries for frequently occurring phrases would replace ones in the
dictionary that are used less frequently. The first occurrences of such
phrases in the text won't get replaced (unless that phrase is already in the
dictionary), but once enough of them occur any subsequent ones do get
replaced.

Florian-Wolfgang Stock wrote:

Hello,

rickman <spamgoeshere4@yahoo.com> writes:

Florian-Wolfgang Stock wrote:

news@sulimma.de (Kolja Sulimma) writes:

Here is an implementation from 2000.
http://www.sulimma.de/prak/ss00/projekte/huffman/Huffman.html
It was a student project for my lab course.

That is exactly the sort of thing I had in mind when I mentioned it.
The restriction of needing advance information could be circumvented by
generating the lookup table dynamically (the drawback being that you
then need two passes over the data you want to code).

Isn't there a variation that generates the table on the fly? I don't
know any details, but wouldn't they have this same problem in modem
compression, you can only see the data once? Or do they buffer up a
block before compressing? Back in the early days, an engineer talked to
me about the possibility of patenting a method that worked like this.

That isn't possible, at least not with normal Huffman coding. With the
ZIP algorithm it is possible to build a dictionary on the fly, so you
need only one pass.
Huffman coding is based on replacing frequently occurring characters
with short bit strings and rare ones with long bit strings. To know
which characters are frequent or rare you need to count them. There are
some ways to reduce this problem - the most common is to use a
precomputed dictionary based on the character distribution of average
English text. This works well if you want to encode normal text (or, if
you know what to expect, run a distribution analysis on some sample
data and build your dictionary from that).

Another approach would be to use an internal buffer of a given size and
build the dictionary from the distribution within that buffer alone.

Florian
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
Steve Casselman <sc.nospam@vcc.com> wrote:
We are in the process of completing an LZ-based compression core that
is very similar to LZS, but does not violate the patents. The
performance in a Xilinx V2Pro is expected to be on the order of 100M
bytes/second, with a 512-byte dictionary. We expect to release this
core for Xilinx V2/V2Pro/S3 and Altera Stratix by the end of March.

Good info, Erik. How does the above performance compare to a 2 or 3 GHz
64-bit processor?

I find this to be a very stupid - and irrelevant - question. Not only
is this heavily affected by the bitness of the CPU, but a 2 GHz general
purpose CPU is going to have (compared to an FPGA) very large amounts
of very fast "blockram". CPUs do most things faster than FPGAs, and
that includes most kinds of compression. This doesn't mean these things
aren't interesting in FPGAs, or that FPGAs couldn't do even those
things alongside CPUs - but it does reduce the number of things for
which FPGAs are practical if they sit alongside CPUs.

Now of course, if you are power limited then said CPUs go missing from
the picture (though not entirely - look at the performance and power
needs of, say, the VIA C3).

--
Sander

+++ Out of cheese error +++
 
Sander Vesik wrote:
Steve Casselman <sc.nospam@vcc.com> wrote:
We are in the process of completing an LZ-based compression core that
is very similar to LZS, but does not violate the patents. The
performance in a Xilinx V2Pro is expected to be on the order of 100M
bytes/second, with a 512-byte dictionary. We expect to release this
core for Xilinx V2/V2Pro/S3 and Altera Stratix by the end of March.

Good info, Erik. How does the above performance compare to a 2 or 3 GHz
64-bit processor?


I find this to be a very stupid - and irrelevant - question. Not only
is this heavily affected by the bitness of the CPU, but a 2 GHz general
purpose CPU is going to have (compared to an FPGA) very large amounts
of very fast "blockram". CPUs do most things faster than FPGAs, and
that includes most kinds of compression. This doesn't mean these things
aren't interesting in FPGAs, or that FPGAs couldn't do even those
things alongside CPUs - but it does reduce the number of things for
which FPGAs are practical if they sit alongside CPUs.

Now of course, if you are power limited then said CPUs go missing from
the picture (though not entirely - look at the performance and power
needs of, say, the VIA C3).

Your comments seem odd. You have no understanding of why Steve was
asking the question. So how can you presume to judge it to be
"stupid"? Your analysis is just one way of looking at the comparison.
If nothing else, perhaps he wanted the comparison just as a way of
judging the speed in less technical terms, for presentation to
management, for example.

--

Rick "rickman" Collins

 
Isn't there a variation that generates the table on the fly?

Yes there is ... it is called Golomb-Rice coding.

It's a Huffman variation that assumes a geometric distribution. At the
moment I'm still simulating in C to estimate the coding performance -
the FPGA implementation should be easy ... it seems to work quite well
on DPCM images ...
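
A minimal sketch of the encoder side in C (hypothetical helper names):
with Rice parameter k, the quotient value >> k is sent in unary and the
low k bits follow in binary. Signed DPCM residuals would first be mapped
to non-negative values (e.g. a zig-zag mapping).

#include <stdio.h>
#include <stdint.h>

/* Emit the Golomb-Rice code for one non-negative value. */
static void rice_encode(uint32_t value, unsigned k, void (*put_bit)(int))
{
    uint32_t q = value >> k;
    for (uint32_t i = 0; i < q; i++)        /* quotient in unary ... */
        put_bit(1);
    put_bit(0);                             /* ... with a 0 terminator */
    for (int i = (int)k - 1; i >= 0; i--)   /* remainder, MSB first */
        put_bit((value >> i) & 1);
}

static void print_bit(int bit) { putchar(bit ? '1' : '0'); }   /* toy bit sink */

int main(void)
{
    rice_encode(9, 2, print_bit);           /* 9 with k=2 -> prints 11001 */
    putchar('\n');
    return 0;
}

There is no code table at all, just a comparison, a shift and a small
counter, which is part of what makes it FPGA-friendly.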


bye,
Michael
 
Florian-Wolfgang Stock wrote:
Isn't there a variation that generates the table on the fly? I don't
know any details, but wouldn't they have this same problem in modem
compression, you can only see the data once? Or do they buffer up a
block before compressing? Back in the early days, an engineer talked to
me about the possibility of patenting a method that worked like this.


That isn't possible, at least not with normal Huffman coding. With the
ZIP algorithm it is possible to build a dictionary on the fly, so you
need only one pass.
Huffman coding is based on replacing frequently occurring characters
with short bit strings and rare ones with long bit strings. To know
which characters are frequent or rare you need to count them.

I've never played with this problem before, but I fail to see why you
couldn't do it adaptively - at the cost of more memory, of course. Just
maintain a heap of the occurrence counts of the codes seen up to this
point. To emit a word, search for it in the heap (hashing would do);
its position in the heap directly translates into its Huffman code.
Then update its count, which could cause the code to move up in the
heap and thus update a bunch of word -> code mappings.

To decode you'd do almost exactly the same, except you're looking up
Huffman codes in the heap and translating them to words. (Of course I
assume that previously unseen words are emitted escaped in some way,
e.g. prefixed with a reserved code.)

Improvements include starting with a good default dictionary and
heuristics for resetting the dictionary when the attained compression
is too low.
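
A sketch of the rank-maintenance part of this idea in C (illustrative
only: it tracks symbol ranks rather than maintaining a real code tree
the way the FGK/Vitter adaptive Huffman algorithms do). The encoder
emits the symbol's current rank and then updates; the decoder looks up
the symbol at the received rank and applies the identical update, so
the two stay in sync.

#include <stdio.h>

#define NSYM 256

static int sym_at_rank[NSYM];     /* rank -> symbol */
static int rank_of[NSYM];         /* symbol -> rank */
static unsigned count[NSYM];

static void init(void)
{
    for (int s = 0; s < NSYM; s++) {
        sym_at_rank[s] = s;
        rank_of[s] = s;
        count[s] = 0;
    }
}

/* Return the rank to emit for symbol s, then bump its count and bubble
 * it toward the front past any symbol with a smaller count. */
static int use_symbol(int s)
{
    int emitted = rank_of[s];
    count[s]++;
    for (int r = emitted; r > 0 && count[sym_at_rank[r - 1]] < count[s]; r--) {
        int other = sym_at_rank[r - 1];
        sym_at_rank[r - 1] = s;     rank_of[s] = r - 1;
        sym_at_rank[r]     = other; rank_of[other] = r;
    }
    return emitted;
}

int main(void)
{
    init();
    const char *msg = "abracadabra";
    for (int i = 0; msg[i]; i++)
        printf("'%c' -> rank %d\n", msg[i], use_symbol((unsigned char)msg[i]));
    return 0;
}

The emitted ranks would then be fed to a variable-length code (short
codes for low ranks); frequent symbols quickly migrate to the front and
end up with the short codes.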

Is anything wrong with this obvious approach?

/Tommy
 
user@domain.invalid wrote:
I've never played with this problem before, but I fail to see why you
couldn't do it adaptively - at the cost of more memory, of course. Just
maintain a heap of the occurrence counts of the codes seen up to this
point. To emit a word, search for it in the heap (hashing would do);
its position in the heap directly translates into its Huffman code.
Then update its count, which could cause the code to move up in the
heap and thus update a bunch of word -> code mappings.

Adaptive Huffman codes have been around since at least the early 1980s.

To decode you'd do almost exactly the same, except you're looking up
Huffman codes in the heap and translating them to words. (Of course I
assume that previously unseen words are emitted escaped in some way, e.g.
prefixed with a reserved code.)

Improvements include starting with a good default dictionary and
heuristics for resetting the dictionary when the attained compression
is too low.

Is anything wrong with this obvious approach?
You may want to read the following:

igm.univmlv.fr/~mac/DOC/C10.ps
http://www.cs.duke.edu/~jsv/Papers/Vit87.jacm.ps.gz

If you are interested in (adaptive) Huffman codes.

--
Sander

+++ Out of cheese error +++
 
user@domain.invalid wrote:
Florian-Wolfgang Stock wrote:
Isn't there a variation that generates the table on the fly? I don't
know any details, but wouldn't they have this same problem in modem
compression, you can only see the data once? Or do they buffer up a
block before compressing? Back in the early days, an engineer talked to
me about the possibility of patenting a method that worked like this.


That isn't possible, at least not with normal Huffman coding. With the
ZIP algorithm it is possible to build a dictionary on the fly, so you
need only one pass.
Huffman coding is based on replacing frequently occurring characters
with short bit strings and rare ones with long bit strings. To know
which characters are frequent or rare you need to count them.

I've never played with this problem before, but I fail to see why you
couldn't do it adaptively - at the cost of more memory, of course. Just
maintain a heap of the occurrence counts of the codes seen up to this
point. To emit a word, search for it in the heap (hashing would do);
its position in the heap directly translates into its Huffman code.
Then update its count, which could cause the code to move up in the
heap and thus update a bunch of word -> code mappings.

To decode you'd do almost exactly the same, except you're looking up
Huffman codes in the heap and translating them to words. (Of course I
assume that previously unseen words are emitted escaped in some way,
e.g. prefixed with a reserved code.)

Improvements include starting with a good default dictionary and
heuristics for resetting the dictionary when the attained compression
is too low.

Is anything wrong with this obvious approach?
Maybe. If I understand what you are saying, initially a sequence will
be sent unencoded until enough information is saved up to assign codes
to various strings that occur. But the xmitting end sees the raw stream
while the rxing end sees the encoded stream. How does the rxing end know
when the first code is sent and what it stands for? Each time a new
code is sent, the rxing end needs to know what is happening. Or does
the decision to send a code instead of the data depend only on the
history and not on the present data?

--

Rick "rickman" Collins

 
Maybe. If I understand what you are saying, initially a sequence will
be sent unencoded until enough information is saved up to assign codes
to various strings that occur. But the xmitting end sees the raw stream
while the rxing end sees the encoded stream. How does the rxing end know
when the first code is sent and what it stands for? Each time a new
code is sent, the rxing end needs to know what is happening. Or does
the decision to send a code instead of the data depend only on the
history and not on the present data?
Some compression schemes have a way to indicate that the next N bytes
are raw data, don't look in the dictionary.

You can also use the data stream (the part you have sent already)
as a dictionary, encoding things like copy X bytes starting -Y bytes
back from here.

In general, most good compression techniques are a tradeoff between
memory/CPU/cache usage and compression efficiency.

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.
 
Sander Vesik <sander@haldjas.folklore.ee> wrote in message news:<1077324833.209845@haldjas.folklore.ee>...
Steve Casselman <sc.nospam@vcc.com> wrote:
We are in the process of completing an LZ-based compression core that
is very similar to LZS, but does not violate the patents. The
performance in a Xilinx V2Pro is expected to be on the order of 100M
bytes/second, with a 512-byte dictionary. We expect to release this
core for Xilinx V2/V2Pro/S3 and Altera Stratix by the end of March.

Good info, Erik. How does the above performance compare to a 2 or 3 GHz
64-bit processor?

Steve,

We have not optimized the reference encoder or run it on such a fast
machine yet, but I can make a guess as to the performance based on the
instruction flow. Best case, one might get 2-3M bytes/second
throughput. I am being very generous with that estimate, unless I have
overlooked an optimization.

I am generally wary of giving performance numbers for our reference
encoders, as we generally write them to prove an algorithm "works",
i.e. for test bench data generation and verification of compression
ratios. It would be rather misleading for us to say "our hardware goes
x times faster than our software implementation", as we invest a great
deal of effort in the hardware optimizations and the algorithms, and
almost none in the software beyond demonstrating the correctness of the
implementation and algorithm. Our reference encoder probably won't get
anywhere near 2 MB/sec.


Regards,
Erik Widding.


P.S. I grouped this response with one below, as they are closely
related.

*******************************************

Sander,

I find this to be a very stupid - and irrelevant - question.
Actually I found Steve's question to be right on target, and quite
typical of those asked by our customers. I had not yet had a chance
to reply. As for the rest of your comments... just about everything
in your post is incorrect.

Not only is
this heavily affected by the bitness of the CPU, but a 2 GHz general
purpose CPU is going to have (compared to an FPGA) very large amounts
of very fast "blockram".

Most of these algorithms access memory in a very predetermined fashion
globally, with random access only on a much more localized basis. So
the FPGA's lack of large amounts of block RAM does not hand any
advantage to the processor implementation. If more memory is needed,
one generally adds SDRAM next to the FPGA.

CPUs do most things faster than FPGAs, and that includes
most kinds of compression.

If this were true, the compression engines for tape drives would not
be located in the drive, but would rather run on the host processors
themselves. As an example, for minimally compressible data in our LZ
derivative (or LZS, for that matter) with a 512-byte
history-buffer/dictionary, there will be 512 comparisons and 512
address increments per input byte, requiring 1024 instructions minimum.
In an FPGA these 512 comparisons can be executed in a single clock
period. The FPGA clock rate may only be 100 MHz to the processor's
3 GHz, a ratio of 1:30, whereas the FPGA only requires one clock cycle
to the processor's 1000+.
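
To make the instruction count concrete, here is a toy C sketch of the
per-byte work on a CPU (byte-wise candidate matching only, not the
actual LZS match extension): the inner loop below is what an FPGA
comparator array collapses into a single clock.

#include <stdio.h>
#include <stdint.h>

#define HIST 512

int main(void)
{
    uint8_t history[HIST] = {0};
    const char *input = "abcabcabd";

    for (int i = 0; input[i]; i++) {
        uint8_t b = (uint8_t)input[i];
        int hits = 0;
        for (int j = 0; j < HIST; j++)      /* 512 compares ... */
            hits += (history[j] == b);      /* ... and 512 index increments */
        printf("byte '%c': %d candidate match positions\n", b, hits);
        history[i % HIST] = b;              /* shift the byte into the history */
    }
    return 0;
}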

I do not know of a single compression algorithm that will run faster
on a general purpose processor than it will run on an FPGA with
external SDRAM. Many won't even need the external memory to beat the
processor. Assume both implementations are done by experts in the
respective areas.

Feel free to propose an algorithm where I am wrong. It has been my
experience that the hardware implementations are generally one to
three orders of magnitude faster than the software, with the
performance normalized for the cost of the respective hardware.


Regards,
Erik Widding.

---
Birger Engineering, Inc. -------------------------------- 617.695.9233
100 Boylston St #1070; Boston, MA 02116 -------- http://www.birger.com
 
