On 12/2/2021 2:24 AM, Martin Brown wrote:
On 01/12/2021 22:31, Don Y wrote:
On 12/1/2021 11:52 AM, Dave Platt wrote:
In article <so764l$r9v$1@gonzo.revmaps.no-ip.org>,
Jasen Betts <usenet@revmaps.no-ip.org> wrote:
On 2021-12-01, Don Y <blockedofcourse@foo.invalid> wrote:
I'm looking for "established" archive formats and/or compression
formats (the thinking being that an archive can always be subsequently
compressed).
If by compressed you mean made smaller, that's obviously false.
If we interpret "compressed" to mean "compressed without information
loss", Jasen is correct. This can't be done.
No, you are assuming there is no other (implicit) source of
information that the compressor can rely upon.
He is stating a well-known and general result.
That only applies in the general case. The fact that most compressors
achieve *some* compression means the general case is RARE in the wild;
typically encountered when someone tries to compress already compressed
content.
That used to be true, but most content these days apart from HTML and
flat text files is already compressed with something like a crude ZIP.
MS Office 2007-onwards files are thinly disguised ZIP files.
My compressor obviously relies on the fact that 0xFE does not
occur in ASCII text. (If it did, I'd have to encode *it* in some
other manner)
[Inapplicable "proof" elided]
His general point is true though.
It isn't important to the issues I'm addressing.
It isn't clear to me what issue you are trying to address. Recognising
the header info for each compression method and then using it will get
you a long way, but ISTR it has already been done. There was a derivative
of the IFL DOS utility programme that used to do that, for example.
AV programs that can see into (some) archive files also use this method.
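For the header-recognition route, a handful of magic numbers covers a
surprising amount of ground. A minimal sketch (not an exhaustive signature
table; note that the Office-as-ZIP case above falls out of the PK entry):

#include <stdio.h>
#include <string.h>

/* Identify a handful of common container/compression formats from their
   leading "magic" bytes.  Returns a short description, or NULL if unknown. */
static const char *sniff(const unsigned char *buf, size_t len)
{
    if (len >= 4 && memcmp(buf, "PK\x03\x04", 4) == 0)
        return "ZIP (also .docx/.xlsx/.odt, JAR, ...)";
    if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
        return "gzip";
    if (len >= 8 && memcmp(buf, "\x89PNG\r\n\x1a\n", 8) == 0)
        return "PNG";
    if (len >= 3 && buf[0] == 0xff && buf[1] == 0xd8 && buf[2] == 0xff)
        return "JPEG";
    if (len >= 6 && memcmp(buf, "7z\xbc\xaf\x27\x1c", 6) == 0)
        return "7-Zip";
    return NULL;  /* unknown -- fall back to entropy or other heuristics */
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        FILE *f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); continue; }
        unsigned char buf[8];
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);
        const char *kind = sniff(buf, n);
        printf("%s: %s\n", argv[i], kind ? kind : "unrecognised");
    }
    return 0;
}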
*If* compression is used, IT WILL ALREADY HAVE BEEN APPLIED BEFORE I
ENCOUNTER THE (compressed) FILE(s). Any increase or decrease in file
size will already have been "baked in". There is no value to my being
able to "lecture" the content creator that his compression actually
INCREASED the size of his content. (caps are for emphasis, not shouting).
But if you have control of what is being done, you can detect files where
the method you are using cannot make any improvement and just copy them.
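In code that decision can be as crude as "try it, keep whichever is
smaller". A sketch assuming zlib is available (real archivers do the same
thing per entry with a "stored" method):

#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Compress src into a freshly malloc'd buffer, but if deflate does not
   actually shrink it (already-compressed or high-entropy data), store the
   bytes verbatim instead.  *stored is set to 1 in that case. */
unsigned char *pack(const unsigned char *src, uLong srclen,
                    uLong *outlen, int *stored)
{
    uLong bound = compressBound(srclen);
    unsigned char *dst = malloc(bound > srclen ? bound : srclen);
    if (!dst) return NULL;

    uLongf dlen = bound;
    if (compress2(dst, &dlen, src, srclen, Z_BEST_COMPRESSION) == Z_OK
        && dlen < srclen) {
        *outlen = dlen;
        *stored = 0;          /* compression paid off */
        return dst;
    }

    memcpy(dst, src, srclen); /* no gain: just copy the input */
    *outlen = srclen;
    *stored = 1;
    return dst;
}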
[Compression also affords other features that are absent without it.
In particular, most compressors include checksums -- either implied
or explicit -- that further act to vouch for the integrity of the
content. Can you tell me if "foo.txt" is corrupted? What about
"foo.zip"?]
*My* concern is being able to recover the original file(s). REGARDLESS
OF THE COMPRESSORS AND ARCHIVERS USED TO GET THEM INTO THEIR CURRENT FORM.
You definitely want to use the byte entropy to classify them then. That
will tell you fairly quickly whether or not a given block of disk is
JPEG, HTML or EXE without doing much work at all.
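The entropy estimate itself is only a dozen lines; something along these
lines (the classification hints in the comment are illustrative, not
calibrated):

#include <math.h>
#include <stddef.h>

/* Shannon entropy of a buffer in bits per byte: close to 8.0 for compressed
   or encrypted data, mid-range (roughly 4-6) for text/HTML, lumpier and
   lower for typical executables.  (Thresholds are indicative only.) */
double byte_entropy(const unsigned char *buf, size_t len)
{
    size_t hist[256] = {0};
    for (size_t i = 0; i < len; i++)
        hist[buf[i]]++;

    double h = 0.0;
    for (int b = 0; b < 256; b++) {
        if (hist[b] == 0)
            continue;
        double p = (double)hist[b] / (double)len;
        h -= p * log2(p);
    }
    return h;    /* 0 .. 8 bits per byte */
}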
A user can use an off-the-shelf archiver to "bundle" multiple files
into a single "archive file". So, I need to be able to "unbundle"
them, regardless of the archiver he happened to choose -- hence my
interest in "archive formats".
There were utilities of this sort back in the 1990's; why reinvent the
wheel? It is harder now since there are even more rival formats.
A user can often opt to "compress" that resulting archive (or, the archive
program may offer that as an option applied while the archive is built).
(Or, an individual file without "bundling")
So, in order to unbundle the archive (or recover the singleton), I need
to be able to UNcompress it. Hence my interest in compressors.
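For the "unbundle it whatever the user happened to use" part, a library
such as libarchive (my suggestion, not something mentioned above) already
does the format and compression auto-detection. A minimal listing sketch,
error handling trimmed:

#include <stdio.h>
#include <archive.h>
#include <archive_entry.h>

/* List the members of an archive, letting libarchive auto-detect both the
   archive format (zip, tar, 7z, ...) and any compression wrapped around it
   (gzip, bzip2, xz, ...). */
int list_archive(const char *path)
{
    struct archive *a = archive_read_new();
    archive_read_support_filter_all(a);   /* gzip, bzip2, xz, ... */
    archive_read_support_format_all(a);   /* zip, tar, 7z, cpio, ... */

    if (archive_read_open_filename(a, path, 64 * 1024) != ARCHIVE_OK) {
        fprintf(stderr, "%s: %s\n", path, archive_error_string(a));
        archive_read_free(a);
        return -1;
    }

    struct archive_entry *entry;
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
        printf("%s\n", archive_entry_pathname(entry));

    archive_read_free(a);
    return 0;
}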
A user *could* opt to encrypt the contents. If so, I won't even attempt
to access the original files. I have no desire to expend resource
"guessing" secrets!
He can also opt to apply some other (wacky, home-baked) encoding or
compression scheme (e.g., when sending executables through mail, I
routinely change the file extension to "xex" and prepend some gibberish
at the front of the file to obscure its signature -- because some mail
scanners will attempt to decompress compressed files to "protect" the
recipients; otherwise, wrapping it in a ZIP would suffice). If so, I
won't even attempt to access the original file(s).
That was part of what I used to do to numeric data files way back,
exploiting the fact that spreadsheets treat a blank cell as 0. Then
compress by RLE and then hit that with ZIP. It got quite close to what a
modern compression algorithm can do.
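A reconstruction of roughly that idea (not the original code): escape runs
of zero bytes before handing the result to a general-purpose compressor.

#include <stddef.h>

/* Non-zero bytes pass through; a run of zeros becomes the pair
   {0x00, run_length}, with runs longer than 255 split.  The caller must
   provide an output buffer of at least 2*inlen bytes (worst case).  The
   output would then be fed to ZIP/deflate. */
size_t rle_zeros(const unsigned char *in, size_t inlen, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < inlen; ) {
        if (in[i] != 0x00) {
            out[o++] = in[i++];
            continue;
        }
        size_t run = 0;
        while (i < inlen && in[i] == 0x00 && run < 255) {
            run++;
            i++;
        }
        out[o++] = 0x00;
        out[o++] = (unsigned char)run;
    }
    return o;
}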
One can argue that a user might do some other "silly" transform (ROT13?)
so I could cover those bases with (equally silly) inversions. I want to
identify the sorts of *likely* "processes" to which some (other!) user
could have subjected a file's (or group of files') content and be able
to reverse them.
Transforms that only alter the value of the symbols but not their
frequency should make no difference at all to the compressibility with
the entropy-based variable-bit-length encoding methods being used today.
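A quick way to convince yourself of that: a value-only transform is just a
permutation of the 256 byte values, so it relabels the histogram bins
without changing the counts, and the entropy coder sees exactly the same
symbol statistics. A small sketch:

#include <assert.h>
#include <stddef.h>

/* A value-only transform (here: XOR with a constant, a permutation of the
   256 byte values) merely relabels the histogram bins; the counts -- and
   therefore the entropy a Huffman/arithmetic coder works from -- are
   unchanged. */
void histogram_unchanged(const unsigned char *buf, size_t len)
{
    size_t before[256] = {0}, after[256] = {0};

    for (size_t i = 0; i < len; i++) {
        before[buf[i]]++;
        after[buf[i] ^ 0x5A]++;        /* the "transformed" stream */
    }

    for (int b = 0; b < 256; b++)
        assert(after[b ^ 0x5A] == before[b]);
}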
[I recently encountered some dictionaries that were poorly disguised ZIP
archives]
If the user *chose* to encode his content in BNPF, then I want to be able
to *decode* that content. (as long as I don't have to "guess secrets"
or try to reverse engineer some wacky coding/packing scheme)
It's a relatively simple problem to solve -- once you've identified the
range of *common* archivers/encoders/compressors that might be used!
(e.g., SIT is/was common on Macs)
Trying every one is a bit too brute force for my taste. Trying the ones
that might be appropriate for the dataset would be my preference.
Unless there is some other redundant structure in the file you cannot
compress a file where the bytewise entropy is ln(256) or nearly so.
You also have to work much harder to get that very last 1% of
additional compression too - most algorithms don't even try.
PNG is one of the better lossless image ones and gets ~ln(190);
ZIP on larger files gets very close indeed, ~ln(255.7).
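(Putting rough numbers on those figures, reading ~ln(N) as the entropy of N
equally likely byte values: the ceiling is log2(256) = 8 bits/byte,
~ln(190) works out to log2(190) ≈ 7.57 bits/byte, and ~ln(255.7) to
log2(255.7) ≈ 7.998 bits/byte -- within about 0.02% of the maximum, i.e.
essentially nothing left for a second pass to find.)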
I think you will find this gets complicated and slow.
I do something not unlike what you are proposing to recognise fragments
of damaged JPEG files and then splice a plausible header on the front to
see what the result looks like. Enough systems use default Huffman
tables that there is a fair chance of getting a piece of an image back.
Most modern lossless compressors optimise their symbol tree heavily and
so you have no such crib to deal with a general archive file. The
impossible ones are the backups using proprietary unpublished algorithms
- I can tell from byte entropy that they are pretty good though.
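For what it's worth, one way to approximate the "plausible header" splice
(this is my sketch of the general recovery trick, not the code described
above) is to borrow everything up to and including the Start-of-Scan
segment from a donor JPEG, append the orphaned entropy-coded fragment, and
close with an EOI marker; dimensions and quantisation only line up if the
donor is similar, but with default Huffman tables the result is often
viewable:

#include <stdio.h>
#include <stdlib.h>

/* Copy a donor JPEG up to the end of its Start-of-Scan (SOS) header, so
   the caller can append an orphaned entropy-coded fragment after it. */
static long copy_through_sos(FILE *donor, FILE *out)
{
    int prev = 0, c;
    while ((c = fgetc(donor)) != EOF) {
        fputc(c, out);
        if (prev == 0xFF && c == 0xDA) {          /* SOS marker found */
            int hi = fgetc(donor), lo = fgetc(donor);
            if (hi == EOF || lo == EOF) return -1;
            int seglen = (hi << 8) | lo;          /* includes these 2 bytes */
            fputc(hi, out); fputc(lo, out);
            for (int i = 0; i < seglen - 2; i++) {
                int b = fgetc(donor);
                if (b == EOF) return -1;
                fputc(b, out);
            }
            return 0;             /* donor's scan data starts here; stop */
        }
        prev = c;
    }
    return -1;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s donor.jpg fragment.bin out.jpg\n", argv[0]);
        return 1;
    }
    FILE *donor = fopen(argv[1], "rb"), *frag = fopen(argv[2], "rb"),
         *out = fopen(argv[3], "wb");
    if (!donor || !frag || !out) { perror("fopen"); return 1; }

    if (copy_through_sos(donor, out) != 0) {
        fprintf(stderr, "no SOS marker in donor\n");
        return 1;
    }
    int b;
    while ((b = fgetc(frag)) != EOF)              /* append the fragment */
        fputc(b, out);
    fputc(0xFF, out); fputc(0xD9, out);           /* EOI */

    fclose(donor); fclose(frag); fclose(out);
    return 0;
}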
BTW do you have time to run that (now rather large) benchmark through
the Intel C/C++ compiler? I have just got it to compile with Clang for
the M1 and am hoping to run the benchmarks on Friday. My new target
Intel CPU to test is the 12600K, which looks like it could be a real winner.
(i.e. slam it through the compiler and send me back the error msgs)
I'm sure there will be some, since every port of notionally "Portable"
software to a new compiler uncovers coding defects (or compiler
defects). Clang, for instance, doesn't honour the I64 length modifier in printf.
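(Assuming the construct being tripped over is the MSVC-style %I64d, the
portable route is the <inttypes.h> macros:)

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    int64_t n = 1234567890123LL;

    /* MSVC/Windows-CRT specific, not understood by other C libraries:
     *     printf("%I64d\n", n);
     * C99-portable equivalent: */
    printf("%" PRId64 "\n", n);
    return 0;
}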
--
Regards,
Martin Brown