Where are Huffman encoding applications?


Weng Tianxiang

Hi,
I want to know in how many fields Huffman encoding theory is used.

As far as I can count:
1. In the Windows operating system, a similar encoding is used to compress
text files;
2. Text compression in ZIP files;
3. JPEG frame compression;
4. MPEG format compression? Which one uses it?

Wireless communication?

Any other applications?

Thank you.

Weng
 
I am sure it is used in JPEG and MPEG.
 
On 1 Aug 2006 04:02:31 -0700, "Weng Tianxiang" <wtxwtx@gmail.com>
wrote in comp.programming:

FAX transmission, although the Huffman codes are predefined and not
generated dynamically from the data.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
 
Hi Jack,
The fax application cannot be counted as full Huffman encoding, because
the character statistics are not collected dynamically.

Thank you.

Weng

 
Also, for JPEG etc., in most cases the predefined default Huffman tables
are used (although the file format would support custom tables). I think
the term "Huffman" is also widely used for such "static" applications.

Regards,

Thomas

P.S.: I am wondering what the reason for your question is?

www.entner-electronics.com
 
Hi Thomas,
When the Huffman frequency table is available, the decoding procedure is
straightforward.

I know many electronic books use the Huffman encoding method to compress
the book's full text.

Usually, in the first pass over the full text, the Huffman frequency
table is established; in the second pass, a hash function is used to
locate each word's entry, which is then encoded, based on the Huffman
frequency table, and output.

I want to know how the hash function is used in the second pass.

Any books or tips?

Thank you.

Weng


 

I think you are asking: how do I generate the Huffman codes given the
frequency of each symbol? If so, look at:

http://en.wikipedia.org/wiki/Huffman_coding

in the section "basic technique".
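For reference, the construction that page describes (repeatedly merge the two
lowest-frequency subtrees, prefixing "0" on one side and "1" on the other) can
be sketched in a few lines of Python. The symbol frequencies below are made-up
examples, not anything from this thread:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from {symbol: frequency} by repeatedly
    merging the two lowest-frequency subtrees."""
    # Each heap entry: (frequency, tiebreak, [(symbol, code), ...]).
    # The tiebreak integer keeps the heap from ever comparing the lists.
    heap = [(f, i, [(sym, "")]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix "0" onto every code in the left subtree, "1" on the right.
        merged = [(s, "0" + c) for s, c in left] + \
                 [(s, "1" + c) for s, c in right]
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return dict(heap[0][2])

codes = huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
```

Higher-frequency symbols end up with shorter codes (here "a" gets a 1-bit
code), and no code is a prefix of another, which is what makes the result
decodable.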

Cheers,

Andy.
 
Andy,
No. I want to know the hash function.

For example, I have a Huffman encoding table as follows:

name   freq   code
...
Andy   15     "011"
Ray    20     "110"
...

Now a text string "Andy Ray wrote" comes in, and I want to know how to
match the incoming "Andy" with the table entry "Andy".

One must establish a 1-to-1 relationship between the incoming word and
the table entry.

What I can imagine doing in software is to create a hash function; then,
if several words map to one entry, a further string match must be made.

I am wondering whether or not a hardware design can do it.
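In software the second pass is exactly the hash-then-compare scheme described
above, and a hash table does it for free. A minimal sketch, using the
hypothetical "Andy"/"Ray" entries from the table plus an assumed "00" code for
"wrote":

```python
# Pass 1 (assumed already done): frequency counting produced a code table.
# Pass 2: look each incoming word up and emit its code.
code_table = {"Andy": "011", "Ray": "110", "wrote": "00"}  # hypothetical codes

def encode(text):
    # A dict lookup is hash-then-compare: the word is hashed to a bucket,
    # and an exact string comparison resolves any collision, all handled
    # internally by Python's hash table.
    return "".join(code_table[word] for word in text.split())

print(encode("Andy Ray wrote"))   # -> 01111000
```

The collision handling the post asks about never surfaces in the code; it is
what the hash table's probing/chaining does internally.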

Thank you.

Weng

 
A trie might be an option. I once had a conversation with someone who
worked at a company that needed to match strings quickly in hardware
and I believe they might've used a trie for that, although I could be
remembering wrong since it was a few years ago.

If you haven't encountered tries, think of them as deterministic
finite-state automata shaped like a tree, where the root is the
start state, and transitions (edges) from one node to the next
correspond to symbols on your input.

For example, if you want to recognize the strings "car", "cat",
and "cup", your trie would look like this:


(start) --c--> ( ) --u--> ( ) --p--> (end)
                |
                +--a--> ( ) --r--> (end)
                         |
                         +--t--> (end)

Basically, you start at the start state, absorb a token and follow
the edge associated with it, and repeat until you either don't have
a matching edge or hit an end state.

The nice thing about a trie is that you don't have to worry about
collisions or worst-case behavior. The time to match is a linear
function of the length of the string you're matching.

A straightforward trie where every input symbol is a character (hence,
probably 8 bits or maybe even larger) might be a little wasteful of
space. But there are ways around that. The most obvious one is to
say that the input is a string of bits and each bit is an input
symbol. This makes your tree 8 times as deep as with 8-bit characters
as input symbols, but it has the advantage that each node can have
no more than 2 children (as compared to 256). A nice middle ground
might be to make your input symbols pairs of bits or sequences of
4 bits.
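The trie above can be sketched in Python as nested dicts, one edge per input
symbol (a hardware version would instead store each node in a memory word):

```python
# Build a trie over "car", "cat", "cup": nested dicts, one edge per character.
END = object()          # marker for "a complete word ends here"

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})  # follow or create the edge
        node[END] = True
    return root

def matches(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False     # no edge for this symbol: reject
        node = node[ch]
    return END in node       # accept only if we stopped on an end state

trie = build_trie(["car", "cat", "cup"])
print(matches(trie, "cat"), matches(trie, "cab"))   # -> True False
```

Matching cost is linear in the length of the word, independent of how many
words are in the trie, which is the property described above.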

- Logan
 
Hi Logan,
Your method is interesting, but it is far from what I am looking for.

It is too slow. My target is a 64-bit input on every clock, and one must
find its entry within 1-2 clocks and resolve any entry conflicts at the
same time.

This design, if it succeeds, can be widely used for any dynamic Huffman
encoding.

I think there is no such hardware design so far in the world, which is
why I find that almost all Huffman encoding applications are static, as
in FAX, electronic books, ZIP files, JPEG, MPEG and so on.

Thank you.

Weng

 

Ok - you're asking about decoding, not encoding.

I don't think there's a hash function for Huffman codes, though I'd love
to be proven wrong.

You can, however, decode Huffman codes with a binary tree encoded
into a ROM/RAM. This works because no code is a prefix of another. You
can alter the tree to take more than one input bit at a time at the
expense of hardware.

A fully parallel hardware solution needs CAM (with don't cares), but
that's really expensive to implement.

Cheers,

Andy.
 
Hi Andy,
"Ok - you're asking about decoding, not encoding."

No, it is encoding!

Weng
 
Weng Tianxiang wrote:
Your method is interesting, but is far from what I am looking for.
It is too slow.
I found that almost all Huffman encoding applications are static as in
FAX, electronic books, ZIP files, JPEG, MPEG and so on.
You are wrong there. ZIP files and most EBooks use some variant of LZ
encoding using a suffix tree (which might be feasible in hardware),
followed by Huffman coding - static *or* dynamic - on the resultant
symbols. Zip in particular uses the deflate algorithm, which can emit
either static or dynamically-generated Huffman data on each block.
But AFAIK, the Huffman table is *not* updated with each symbol. There
was a kid on sci.crypt a while back banging on about a per-symbol
dynamic Huffman technique he'd invented, perhaps look there.
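That deflate pipeline is easy to poke at from software: Python's zlib module
is a deflate implementation, so the following shows the LZ-plus-Huffman
combination end to end (though the API does not expose whether a given block
used the static or a dynamically generated table):

```python
import zlib

# Deflate = LZ77 match-finding followed by Huffman coding of the
# resulting literal/length/distance symbols.
data = b"Andy Ray wrote " * 100        # highly repetitive input
packed = zlib.compress(data, 9)        # level 9 = strongest compression
assert zlib.decompress(packed) == data # round-trips exactly
print(len(data), "->", len(packed))    # repetitive data shrinks dramatically
```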

JPEG and MPEG use a DCT (and optionally in MPEG, motion compensation)
before applying further compression. The DCT is *not* Huffman by
another name, it's a discrete cosine transform.

Also, look for information on Suffix Trees (used in Lempel-Ziv
encoding), and you might find it can be done in a limited way in
hardware. The tree is built incrementally, and looks kinda like
Logan's trie, except that the first symbol of a sequence is at
the leaf, with the last symbol at the root. Kinda...

In general, though Huffman coding was thought to be "optimal" two
decades ago, newer techniques have been shown to be better. For
example, arithmetic coding (needs two passes over the data though),
and the Burrows-Wheeler transform (used in bzip).

There should be enough terms for you to go googling and perhaps find
a more suitable algorithm than Huffman. Don't forget - compression is
simply the process of removing whatever is predictable from a data
stream. The better you understand your data, the better an algorithm
can predict it, and the better your compressor can be.

Good luck!

Clifford Heath.
 
Hi Clifford,
I appreciate your response very much; you are an experienced expert
in the field.

I am a newbie in the field and really would like to learn more. I am
determined to learn all those things you have mentioned.

What books can you suggest that I buy and read about these fields?

I have the book "The Data Compression Book" by Mark Nelson.

I want to learn the arithmetic coding algorithm too.

Thank you.

Weng



 
Weng Tianxiang wrote:
I am a newbie in the field and really like to learn more. But I am
determined to learn all those things you have mentioned better.
What books can you suggest for me to buy and read about the fields?
Hmm. Not sure I'd recommend books - they tend to be too theoretical.
If I were you I'd enter some of the search words into wikipedia and
google and read any technical papers you find that look authoritative.
<http://en.wikipedia.org/wiki/Data_compression> is a good start.
Code can also be useful, though often very cryptic - people who write
compressors are often very bad at writing readable code.

What's the source of your data? Are the 64 bits broken into fields,
say 4x16-bit fields? What patterns might you expect? Do the values
tend to cluster around a certain range, or tend to be close to
previous readings? All these things affect the predicability of your
data and so are intimately connected with the compression approach
you must choose.
 
Please do not top-post. Your reply belongs after, or intermixed
with, the *snipped* material to which you reply.

Huffman encoding is rarely used today. Adaptive Huffman is more
common. For a complete review see: "The Compression Book" by Mark
Nelson and Jean-Loup Gailly, M & T Books, ISBN 1-55851-434-1.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@maineline.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE maineline address!
 
Hi Chuck,
Thank you for your advice.

I have bought the book "The Compression Book".

Weng
 
