OT: card storage

On Sun, 16 May 2010 18:46:14 -0700, D Yuniskis <not.going.to.be@seen.com>
wrote:

Hi Joseph,

JosephKK wrote:
It's too bad someone hasn't hacked together a laptop that
is *just* a monitor + keyboard! I imagine there would be
a market for such a beast.

It not only has been done, but is still available, mouse included.
But most of such currently are just repurposed laptops. Probably more
user hacks than commercial sales. A great excuse to buy a laptop that is
otherwise relatively CPU underpowered.

This is something that *looks* like a laptop but, in
reality, is just the laptop's *screen* with a video
INput and keyboard with a PS/2 or USB OUTput?

Pointers, please?
Sounds a lot like an X-terminal. Due to volume issues the laptop may yet
be lower cost.
 
Hi Joseph,

JosephKK wrote:
On Sun, 16 May 2010 18:46:14 -0700, D Yuniskis <not.going.to.be@seen.com>
wrote:

JosephKK wrote:
It's too bad someone hasn't hacked together a laptop that
is *just* a monitor + keyboard! I imagine there would be
a market for such a beast.
It not only has been done, but is still available, mouse included.
But most of such currently are just repurposed laptops. Probably more
user hacks than commercial sales. A great excuse to buy a laptop that is
otherwise relatively CPU underpowered.
This is something that *looks* like a laptop but, in
reality, is just the laptop's *screen* with a video
INput and keyboard with a PS/2 or USB OUTput?

Pointers, please?

Sounds a lot like an X-terminal. Due to volume issues the laptop may yet
be lower cost.
No. An X terminal has a processor in it, understands the
X protocol, has a network interface, etc.

I.e., if I gave you an LCD monitor and a keyboard, you could
never run xdm -- unless you added a processor and a NIC.
The device I am describing could be used "as a TV" (with
an NTSC-VGA adapter) -- something you aren't going to do
with an X Terminal.
 
On Mon, 17 May 2010 07:37:06 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

Hi Paul,

Paul Keinanen wrote:
On Sun, 16 May 2010 10:08:23 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

The most tedious are the print manuals that were never
available (publicly) in electronic form. In addition
to the actual *scanning*, there is a lot of work
getting the manuals "disassembled" to the point where
they *can* be scanned (e.g., ripping "perfect binding").

Have you tried photographing those pages using a digital camera ?

It's too labor intensive. You have to arrange for the book
to be held open "enough", even lighting, transfer photos
to PC, convert to TIFF, trim them, import them (in the correct
order) to a PDF, etc.
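The "in the correct order" step is easy to get wrong with camera files, because a plain alphabetical sort puts IMG_10 ahead of IMG_9. A minimal sketch of a natural sort in Python; the file names are made up for illustration and `natural_key` is a hypothetical helper, not part of any tool mentioned in this thread:

```python
import re


def natural_key(filename):
    """Sort key that compares digit runs as numbers, so IMG_9 precedes IMG_10."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so every odd index holds a run of digits.
    parts = re.split(r"(\d+)", filename)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


photos = ["IMG_10.tif", "IMG_9.tif", "IMG_100.tif"]  # hypothetical names
print(sorted(photos, key=natural_key))
```

Sorting with this key before assembling the PDF keeps the pages in shooting order regardless of how many digits the camera uses.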
I have seen video footage of a machine used by a library (sorry, I
did not record the details), which opened the book about 120 degrees.
One arm took the next page down to horizontal level, a horizontal
glass sheet was put on the page to make sure the page was truly
horizontal, the flash light was activated and the glass sheet was
removed and the sequence restarted.

The sequence took about 2-3 seconds. Apparently some auto-focus was
used, since the distance between the lens and the paper changed each
time a new page was added.

The odd pages could be processed in one run and a separate run would
be required to process the even pages (including flipping the page and
inverting the picture order).
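The two-pass scheme described above (odd pages first, then the even pages in inverted order) can be merged back into reading order with a few lines of Python. This is only a sketch: `interleave_passes` is a hypothetical helper, and it assumes the even pass really does arrive last page first.

```python
def interleave_passes(odd_pass, even_pass):
    """Merge two single-sided scan passes into reading order.

    odd_pass:  images of pages 1, 3, 5, ... in the order scanned.
    even_pass: images of the even pages, scanned after flipping the
               stack, so they arrive last page first.
    """
    evens = list(reversed(even_pass))  # undo the inverted order
    if not 0 <= len(odd_pass) - len(evens) <= 1:
        raise ValueError("passes do not belong to the same document")
    merged = []
    for i, odd in enumerate(odd_pass):
        merged.append(odd)
        if i < len(evens):
            merged.append(evens[i])
    return merged


print(interleave_passes(["p1", "p3", "p5"], ["p6", "p4", "p2"]))
```

The length check catches the common mistake of a page sticking in the feeder during one of the two passes.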


A lot of time, you end up with crappy image quality "in the
binding edge" as the paper curls and you can't get a clear
view of stuff at that edge, etc.
Put a heavy glass sheet on the page you are photographing, this will
flatten out the page and you will get equal focus across the page.

Instead, I bring the manuals to a print shop and have them
*cut* the binding edge off of the pages. They have large,
electric stack paper cutters (do ~1000 pages at a time
*without* the inevitable skew that a manual/guillotine paper
cutter imparts to the cut!). Then, I can just feed the
"individual pages" through the document feeder (instead of
having to manually flip pages, etc.).
If you go to the trouble of carrying the manuals to the print shop, why
not let them scan the pages ?

In a law-abiding print shop you may have to prove that you have the
copyright to make your own copies.

----------------

Then there is the question how to store the scanned pages and also how
to distribute the (web) pages in a bandwidth efficient way.

I previously thought that storing and displaying the scanned pages as
simply bilevel (1 bit/pixel) bitmaps (typically run length encoded as
in faxes) would be sufficient, however, such page pictures look
horrible and the OCR software does not reliably make sense of the
text.
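For readers unfamiliar with the run-length idea mentioned above, here is a small Python sketch that stores one bilevel scan line as alternating white/black run lengths. It is not real fax coding: the CCITT fax schemes additionally replace the run lengths with variable-length code words, which this illustration leaves out. Function names are hypothetical.

```python
from itertools import groupby


def rle_encode_row(bits):
    """Encode one scan line of 0/1 pixels as alternating run lengths.

    As in fax coding, every row is taken to start with a white (0) run,
    so a row that begins with black gets a leading run of length 0.
    """
    runs = [(value, len(list(group))) for value, group in groupby(bits)]
    lengths = [n for _, n in runs]
    if runs and runs[0][0] == 1:
        lengths.insert(0, 0)
    return lengths


def rle_decode_row(lengths):
    """Rebuild the pixel row from alternating white/black run lengths."""
    row, value = [], 0
    for n in lengths:
        row.extend([value] * n)
        value ^= 1  # runs alternate between white and black
    return row


row = [0] * 10 + [1] * 3 + [0] * 7
print(rle_encode_row(row))
```

Text pages are mostly long white runs, which is why this kind of encoding shrinks 1 bit/pixel scans so well.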

1 bit/pixel is really too little and 8 bits/pixel would be excessive.
How many bits/pixel would be sufficient for pleasant visual rendering
or required by OCR software ?
 
Hi Paul,

Paul Keinanen wrote:
On Mon, 17 May 2010 07:37:06 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

The most tedious are the print manuals that were never
available (publicly) in electronic form. In addition
to the actual *scanning*, there is a lot of work
getting the manuals "disassembled" to the point where
they *can* be scanned (e.g., ripping "perfect binding").
Have you tried photographing those pages using a digital camera ?
It's too labor intensive. You have to arrange for the book
to be held open "enough", even lighting, transfer photos
to PC, convert to TIFF, trim them, import them (in the correct
order) to a PDF, etc.

I have seen video footage of a machine used by a library (sorry, I
did not record the details), which opened the book about 120 degrees.
One arm took the next page down to horizontal level, a horizontal
glass sheet was put on the page to make sure the page was truly
horizontal, the flash light was activated and the glass sheet was
removed and the sequence restarted.
I suspect such a device is considerably beyond my *practical*
budget! :>

The sequence took about 2-3 seconds. Apparently some auto-focus was
used, since the distance between the lens and the paper changed each
time a new page was added.

The odd pages could be processed in one run and a separate run would
be required to process the even pages (including flipping the page and
inverting the picture order).
Yes. I do similarly when running the sheets through the
document feeder. Once "prepared", I can do 5 or 6 pages
a minute -- not too bad but, when you have tens of thousands
of pages... :<
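To put a number on that, a quick Python estimate of pure feeder time at the quoted 5 to 6 pages a minute. The 20,000-page total is an assumed stand-in for "tens of thousands", and the estimate ignores preparation and cleanup.

```python
def scan_hours(pages, pages_per_minute):
    """Hours of feeder time needed for a given page count and scan rate."""
    return pages / pages_per_minute / 60


for ppm in (5, 6):
    hours = scan_hours(20_000, ppm)  # 20,000 pages is an assumed figure
    print(f"{ppm} pages/min: about {hours:.0f} hours for 20,000 pages")
```

Under those assumptions the feeder alone is busy for roughly 56 to 67 hours.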

Instead, I bring the manuals to a print shop and have them
*cut* the binding edge off of the pages. They have large,
electric stack paper cutters (do ~1000 pages at a time
*without* the inevitable skew that a manual/guillotine paper
cutter imparts to the cut!). Then, I can just feed the
"individual pages" through the document feeder (instead of
having to manually flip pages, etc.).

If you go to the trouble of carrying the manuals to the print shop, why
not let them scan the pages ?
1) I didn't realize they could do this
2) it's probably not inexpensive
3) copyright issues

In a law-abiding print shop you may have to prove that you have the
copyright to make your own copies.
Exactly. It seems like the attitude towards this waxes and wanes.
And, no doubt, varies based on who's working on that day, etc.

Then there is the question how to store the scanned pages and also how
to distribute the (web) pages in a bandwidth efficient way.

I previously thought that storing and displaying the scanned pages as
simply bilevel (1 bit/pixel) bitmaps (typically run length encoded as
in faxes) would be sufficient, however, such page pictures look
horrible and the OCR software does not reliably make sense of the
text.
That is where the manual aspects come into play. You need to review
the results of the scan to decide how best to proceed. I've not found
any "magic bullet" -- unless you don't care about size (or quality).

1 bit/pixel is really too little and 8 bits/pixel would be excessive.
How many bits/pixel would be sufficient for pleasant visual rendering
or required by OCR software ?
It depends on the sizes of the typefaces used. Note that this
can vary within a document.

And, whether there are illustrations, etc.

Sometimes, you get really grainy images -- as if there was
dust on the scanner (though it is NOT the scanner that is
the source of the problem).

For decent typeface sizes, I will use 1bpp at 400-600dpi.
This is readable *and* OCR-able (not to be confused with
ocre-able -- which is the ability to turn something into ocre!)
Other times, I will use 8bpp and drop down to 300dpi
(trying to balance the added image depth against the
decreased resolution).
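The trade-off between depth and resolution is easier to see with raw (uncompressed) sizes. The sketch below assumes a US Letter page (8.5 x 11 inches), which is not stated in the thread, and rows padded to whole bytes; compressed files will of course be much smaller.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page (Letter size assumed)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8  # pad rows to whole bytes
    return row_bytes * height_px


for dpi, bpp in ((400, 1), (600, 1), (300, 8)):
    size_mb = raw_page_bytes(dpi, bpp) / 1_000_000
    print(f"{dpi} dpi at {bpp} bpp: {size_mb:.1f} MB raw")
```

Before compression, an 8bpp page at 300dpi is about twice the size of a 1bpp page at 600dpi, so dropping the resolution only partly pays for the extra depth.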

I wrote some utilities to create *4* bit TIFFs but very few
programs will recognize this encoding (despite adhering to the
letter of the spec).
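How those utilities worked is not described, so the following is only an illustration of the packing that 4-bit samples imply: two pixels per byte, first pixel in the high nibble, rows padded to a byte boundary. That matches my understanding of the default TIFF bit order, but treat it as an assumption; the function names are hypothetical.

```python
def pack_4bit_row(samples):
    """Pack 4-bit gray samples (0..15) two per byte, first sample in the high nibble."""
    if any(not 0 <= s <= 15 for s in samples):
        raise ValueError("samples must fit in 4 bits")
    padded = list(samples) + [0] * (len(samples) % 2)  # pad odd-length rows
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_4bit_row(packed, width):
    """Recover the first `width` 4-bit samples from a packed row."""
    samples = []
    for byte in packed:
        samples.extend((byte >> 4, byte & 0x0F))
    return samples[:width]


print(pack_4bit_row([1, 2, 3]).hex())
```

The `width` argument is needed on unpacking because the padding nibble of an odd-length row is indistinguishable from a real black pixel.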

I generally avoid the OCR stage as it requires *lots* of
proofreading. Images often get mishandled. Text often
gets misrecognized (remember, these are "computer manuals"
so "pigx" and "pigy" might be real "words" despite the OCR
packages attempts to "fix" them into "pigs" and "piggy", etc.).
I figure just creating the (electronic) documents is
enough of a "donation" so if folks want to grumble, they
can go find better versions (hint: most of this stuff is
simply NOT AVAILABLE). :>

"If you don't like what I'm serving for dinner, you're welcome
to eat elsewhere..."
 
D Yuniskis wrote:
Paul Keinanen wrote:
In a law-abiding print shop you may have to prove that
you have the copyright to make your own copies.

Exactly. It seems like the attitude towards this waxes
and wanes. And, no doubt, varies based on who's working
on that day, etc.
My standard comment to those that "make the decision" at
places like Kinko's is: "These copies are being made because
the technicians at my shop are complete assh**es and destroy
original manuals every time they pick one up. Now if I can
keep them from writing on the monitor screens with a Sharpie
I'd be a happy camper."

Usually I get the eye roll, but after they stop laughing,
they authorize the full copying and/or scanning of the
documents.

Jeff


--
“Egotism is the anesthetic that dulls the pain of stupidity.”
Frank Leahy, Head coach, Notre Dame 1941-1954

http://www.stay-connect.com
 
Hi Jeff,

Jeffrey D Angus wrote:
D Yuniskis wrote:
Paul Keinanen wrote:
In a law-abiding print shop you may have to prove that
you have the copyright to make your own copies.

Exactly. It seems like the attitude towards this waxes
and wanes. And, no doubt, varies based on who's working
on that day, etc.

My standard comment to those that "make the decision" at
places like Kinko's is: "These copies are being made because
the technicians at my shop are complete assh**es and destroy
original manuals every time they pick one up. Now if I can
keep them from writing on the monitor screens with a Sharpie
I'd be a happy camper."

Usually I get the eye roll, but after they stop laughing,
they authorize the full copying and/or scanning of the
documents.
Ha! I'm not sure I want to rely on that sort of
response... :<
 
D Yuniskis wrote:
Hi Jeff,

Jeffrey D Angus wrote:
D Yuniskis wrote:
Paul Keinanen wrote:
In a law-abiding print shop you may have to prove that
you have the copyright to make your own copies.

Exactly. It seems like the attitude towards this waxes
and wanes. And, no doubt, varies based on who's working
on that day, etc.

My standard comment to those that "make the decision" at
places like Kinko's is: "These copies are being made because
the technicians at my shop are complete assh**es and destroy
original manuals every time they pick one up. Now if I can
keep them from writing on the monitor screens with a Sharpie
I'd be a happy camper."

Usually I get the eye roll, but after they stop laughing,
they authorize the full copying and/or scanning of the
documents.

Ha! I'm not sure I want to rely on that sort of
response... :<

That's just a sample of Jeff's 'VERY' warped sense of humor. You'll
get used to it. ;-)


--
Anyone wanting to run for any political office in the US should have to
have a DD214, and a honorable discharge.
 
On Mon, 17 May 2010 08:08:26 -0700, D Yuniskis <not.going.to.be@seen.com>
wrote:

Hi Joseph,

JosephKK wrote:
On Sun, 16 May 2010 18:46:14 -0700, D Yuniskis <not.going.to.be@seen.com>
wrote:

JosephKK wrote:
It's too bad someone hasn't hacked together a laptop that
is *just* a monitor + keyboard! I imagine there would be
a market for such a beast.
It not only has been done, but is still available, mouse included.
But most of such currently are just repurposed laptops. Probably more
user hacks than commercial sales. A great excuse to buy a laptop that is
otherwise relatively CPU underpowered.
This is something that *looks* like a laptop but, in
reality, is just the laptop's *screen* with a video
INput and keyboard with a PS/2 or USB OUTput?

Pointers, please?

Sounds a lot like an X-terminal. Due to volume issues the laptop may yet
be lower cost.

No. An X terminal has a processor in it, understands the
X protocol, has a network interface, etc.

I.e., if I gave you an LCD monitor and a keyboard, you could
never run xdm -- unless you added a processor and a NIC.
The device I am describing could be used "as a TV" (with
an NTSC-VGA adapter) -- something you aren't going to do
with an X Terminal.
All read up. Now I see what you want. The best approach still looks
like a really serious hack of a laptop. The monitor part is going to be
really tough. Keyboard and one or more pointing devices should be pretty
easy.
You may have to hack the power brick, or the batteries, or both.
Once USB 3.0 (5 Gb/s) becomes common you only need the one interface.
 
On Mon, 17 May 2010 14:45:04 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

Hi Paul,

Paul Keinanen wrote:
On Mon, 17 May 2010 07:37:06 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

Then there is the question how to store the scanned pages and also how
to distribute the (web) pages in a bandwidth efficient way.

I previously thought that storing and displaying the scanned pages as
simply bilevel (1 bit/pixel) bitmaps (typically run length encoded as
in faxes) would be sufficient, however, such page pictures look
horrible and the OCR software does not reliably make sense of the
text.

That is where the manual aspects come into play. You need to review
the results of the scan to decide how best to proceed. I've not found
any "magic bullet" -- unless you don't care about size (or quality).
I think it is important to keep the distinction between
scanning/storage format and on the other hand the publishing format.

These days 1 TB of storage costs practically nothing (and another TB
for backup). IMHO the source should be scanned and stored with the
best available resolution and bit planes, possibly with some very mild
compression.

You can then make some 1 bit/pixel encoding for publishing and heavy
compression.

After a few years, you can reprocess your digital source archives,
without rescanning the original documents when better software is
available, in order to produce smaller or higher quality publishing
formats.

1 bit/pixel is really too little and 8 bits/pixel would be excessive.
How many bits/pixel would be sufficient for pleasant visual rendering
or required by OCR software ?

It depends on the sizes of the typefaces used. Note that this
can vary within a document.

And, whether there are illustrations, etc.

Sometimes, you get really grainy images -- as if there was
dust on the scanner (though it is NOT the scanner that is
the source of the problem).

For decent typeface sizes, I will use 1bpp at 400-600dpi.
This is readable *and* OCR-able (not to be confused with
ocre-able -- which is the ability to turn something into ocre!)
Other times, I will use 8bpp and drop down to 300dpi
(trying to balance the added image depth against the
decreased resolution).

I wrote some utilities to create *4* bit TIFFs but very few
programs will recognize this encoding (despite adhering to the
letter of the spec).
4 bit/pixel might be a usable format for _storage_, since this can
register the varying illumination, the whiteness of the paper and how
black the ink is. This might be usable information when postprocessing
to 1 bit/pixel.
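One way the extra gray levels could help when reducing to 1 bit/pixel is sketched below in Python: the threshold is placed midway between the darkest and lightest values actually found on the page, so grayish paper or faded ink shifts it. This is a deliberately simple global threshold chosen for illustration, not a method anyone in the thread describes; it assumes 0 is black and 15 is white.

```python
def to_bilevel(gray_rows):
    """Reduce 4-bit gray rows to 1 bit/pixel (1 = ink, 0 = paper).

    Assumed convention: 0 is black, 15 is white. The threshold sits
    midway between the darkest and lightest values present on the page.
    """
    values = [v for row in gray_rows for v in row]
    if not values:
        return []
    threshold = (min(values) + max(values)) / 2
    return [[1 if v < threshold else 0 for v in row] for row in gray_rows]


# Hypothetical page: paper reads 11-12 (not pure white), ink reads 2-3.
page = [[12, 12, 3, 12],
        [11, 2, 2, 12]]
print(to_bilevel(page))
```

With a 1 bit/pixel scan this decision is made once inside the scanner; keeping the 4-bit source lets it be revisited later.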

I generally avoid the OCR stage as it requires *lots* of
proofreading. Images often get mishandled. Text often
gets misrecognized (remember, these are "computer manuals"
so "pigx" and "pigy" might be real "words" despite the OCR
packages attempts to "fix" them into "pigs" and "piggy", etc.).
As a compromise, you might publish the scans as bit maps, however, it
might be a good idea to run your original scans through some OCR
software and use the result to build an index. While a "pig" might be
a bit unexpected in a computer manual index, there is much less manual
proofreading.

IMHO the worst problem with scanned documents is that they do not
usually contain a searchable index, so including even a somewhat flaky
index would be a great service.
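A rough Python sketch of such an index: each word found by OCR maps to the pages it appears on, with no attempt at correcting the text. The per-page OCR output is assumed to be available already as plain strings; `build_index` and the sample text are hypothetical.

```python
import re
from collections import defaultdict


def build_index(page_texts, min_length=3):
    """Map each word to the sorted list of page numbers on which OCR found it.

    page_texts: dict of page number -> raw OCR text for that page.
    """
    index = defaultdict(set)
    for page, text in page_texts.items():
        for word in re.findall(r"[A-Za-z0-9_]+", text.lower()):
            if len(word) >= min_length:
                index[word].add(page)
    return {word: sorted(pages) for word, pages in sorted(index.items())}


ocr_output = {  # made-up OCR text for two pages
    1: "Set pigx before calling pigy.",
    2: "The pigx register is read-only.",
}
print(build_index(ocr_output)["pigx"])
```

Because the words are indexed exactly as recognized, identifiers such as "pigx" survive as long as the OCR package is not allowed to "fix" them first.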

I figure just creating the (electronic) documents is
enough of a "donation" so if folks want to grumble, they
can go find better versions (hint: most of this stuff is
simply NOT AVAILABLE). :>

"If you don't like what I'm serving for dinner, you're welcome
to eat elsewhere..."
Scanning fragile (and often disintegrating) paper documents is a way
of preserving our cultural heritage.

Unfortunately, intellectual property laws (with protection times
decades after the IP holders death), may in fact cause a loss of the
human intellectual heritage.
 
In article <hssd1b$fcn$1@speranza.aioe.org>,
D Yuniskis <not.going.to.be@seen.com> wrote:
Hi Paul,

Paul Keinanen wrote:
On Mon, 17 May 2010 07:37:06 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

The most tedious are the print manuals that were never
available (publicly) in electronic form. In addition
to the actual *scanning*, there is a lot of work
getting the manuals "disassembled" to the point where
they *can* be scanned (e.g., ripping "perfect binding").
Have you tried photographing those pages using a digital camera ?
It's too labor intensive. You have to arrange for the book
to be held open "enough", even lighting, transfer photos
to PC, convert to TIFF, trim them, import them (in the correct
order) to a PDF, etc.

I have seen video footage of a machine used by a library (sorry, I
did not record the details), which opened the book about 120 degrees.
One arm took the next page down to horizontal level, a horizontal
glass sheet was put on the page to make sure the page was truly
horizontal, the flash light was activated and the glass sheet was
removed and the sequence restarted.

I suspect such a device is considerably beyond my *practical*
budget! :>

The sequence took about 2-3 seconds. Apparently some auto-focus was
used, since the distance between the lens and the paper changed each
time a new page was added.

The odd pages could be processed in one run and a separate run would
be required to process the even pages (including flipping the page and
inverting the picture order).

Yes. I do similarly when running the sheets through the
document feeder. Once "prepared", I can do 5 or 6 pages
a minute -- not too bad but, when you have tens of thousands
of pages... :<
At the Dutch tax office I fell in love with some Fujitsu scanner,
capable of two sided scanning.
(It is on sale, *refurbished*, second hand... Where else do you hear
of PC equipment that is worth refurbishing and then still fetches
hundreds of euros?)

There are machines like the Brother MFX-8860DN.
This scans sheets two sided, dozens at a time, with the sheet
feeder. (It prints two-sided too. It copies. It faxes.)
It is not cheap, but seems like a good deal and well supported
by Linux.

Groetjes Albert.

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
 
Michael A. Terrell wrote:
D Yuniskis wrote:
Ha! I'm not sure I want to rely on that sort of
response... :<


That's just a sample of Jeff's 'VERY' warped sense of humor. You'll
get used to it. ;-)
Aye, but it works. That's the bottom line.

Jeff


 
On Tue, 18 May 2010 00:07:29 +0300, Paul Keinanen <keinanen@sci.fi>
wrote:

I have seen video footage of a machine used by a library (sorry, I
did not record the details), which opened the book about 120 degrees.
Ooooh... I want one of those for plagiarism. If I'm going to break the
law, I might as well go first class.

I copy a few service manuals, where the original is in ring binder
format. One of my customers has a Canon ImageRunner 5000 copier,
scanner, printer, etc conglomeration. Here's a video clip of it
scanning both sides of service manual:
<http://802.11junk.com/jeffl/crud/CanonImageRunner5000.wmv> (4MBytes)
Unfortunately, scanning large size foldout pages had to be done by
hand and usually in pieces. Some of the results are here:
<http://802.11junk.com/jeffl/AN-SRD-21/>
<http://802.11junk.com/jeffl/AN-SRD-22/>
Bottom line is that it's a HUGE waste of time trying to scan anything
on a typical home bed scanner. The 180 page AN/SRD-22 manual took
about 45 minutes (including screwups) on the Canon ImageRunner. I
once did a similar manual at home on my HP bed scanner which took a
total of about 6 hours to scan, cleanup, make searchable, and assemble
into a document.

--
Jeff Liebermann jeffl@cruzio.com
150 Felker St #D http://www.LearnByDestroying.com
Santa Cruz CA 95060 http://802.11junk.com
Skype: JeffLiebermann AE6KS 831-336-2558
 
Hi Paul,

Paul Keinanen wrote:
Then there is the question how to store the scanned pages and also how
to distribute the (web) pages in a bandwidth efficient way.

I previously thought that storing and displaying the scanned pages as
simply bilevel (1 bit/pixel) bitmaps (typically run length encoded as
in faxes) would be sufficient, however, such page pictures look
horrible and the OCR software does not reliably make sense of the
text.
That is where the manual aspects come into play. You need to review
the results of the scan to decide how best to proceed. I've not found
any "magic bullet" -- unless you don't care about size (or quality).

I think it is important to keep the distinction between
scanning/storage format and on the other hand the publishing format.
In my case, they are one and the same. I'm not in this as a "business"
(I am uncompensated for the *many* hours it takes to convert the
documents).

These days 1 TB of storage costs practically nothing (and another TB
for backup). IMHO the source should be scanned and stored with the
best available resolution and bit planes, possibly with some very mild
compression.
You'd be amazed at how quickly that eats up disk space! I scanned
a disintegrating book on origami a few years ago seeking to
preserve color, etc. It was over 100MB compressed. You can't
store very many books if you preserve that much detail. :<

You can then make some 1 bit/pixel encoding for publishing and heavy
compression.

After a few years, you can reprocess your digital source archives,
without rescanning the original documents when better software is
available, in order to produce smaller or higher quality publishing
formats.
<grin> I don't know about *you*, Paul, but I don't get enough
sleep as it is! :> I want things over and done with *now*. :-/

1 bit/pixel is really too little and 8 bits/pixel would be excessive.
How many bits/pixel would be sufficient for pleasant visual rendering
or required by OCR software ?
It depends on the sizes of the typefaces used. Note that this
can vary within a document.

And, whether there are illustrations, etc.

Sometimes, you get really grainy images -- as if there was
dust on the scanner (though it is NOT the scanner that is
the source of the problem).

For decent typeface sizes, I will use 1bpp at 400-600dpi.
This is readable *and* OCR-able (not to be confused with
ocre-able -- which is the ability to turn something into ocre!)
Other times, I will use 8bpp and drop down to 300dpi
(trying to balance the added image depth against the
decreased resolution).

I wrote some utilities to create *4* bit TIFFs but very few
programs will recognize this encoding (despite adhering to the
letter of the spec).

4 bit/pixel might be a usable format for _storage_, since this can
register the varying illumination, the whiteness of the paper and how
black the ink is. This might be usable information when postprocessing
to 1 bit/pixel.
But, it's a "proprietary format", then. I used this on a manual
I produced and it was nothing but trouble since I had to explicitly
"unpack" each image before I could create the final artwork...
then, repack everything to conserve space on disk.

I generally avoid the OCR stage as it requires *lots* of
proofreading. Images often get mishandled. Text often
gets misrecognized (remember, these are "computer manuals"
so "pigx" and "pigy" might be real "words" despite the OCR
packages attempts to "fix" them into "pigs" and "piggy", etc.).

As a compromise, you might publish the scans as bit maps, however, it
might be a good idea to run your original scans through some OCR
software and use the result to build an index. While a "pig" might be
a bit unexpected in a computer manual index, there is much less manual
proofreading.

IMHO the worst problem with scanned documents is that they do not
usually contain a searchable index, so including even a somewhat flaky
index would be a great service.
I guess I look at it differently. The original PAPER document
didn't have a (electronic) searchable index and "somehow" seemed
to work. So, if the electronic document doesn't have that
searchable index, it's no *loss* (it's just not a *gain*!).

E.g., I have lots of novels that I would love to preserve
in this way. I don't care if they are available as text.
I just want to be able to re-read them after the paper
versions have disintegrated (paperbacks being notoriously
short-lived). So, an "image" of a page that my brain
can process -- even if it doesn't have enough fidelity for
an OCR package to handle -- is quite adequate.

I figure just creating the (electronic) documents is
enough of a "donation" so if folks want to grumble, they
can go find better versions (hint: most of this stuff is
simply NOT AVAILABLE). :>

"If you don't like what I'm serving for dinner, you're welcome
to eat elsewhere..."

Scanning fragile (and often disintegrating) paper documents is a way
of preserving our cultural heritage.

Unfortunately, intellectual property laws (with protection times
decades after the IP holders death), may in fact cause a loss of the
human intellectual heritage.
See AEK's work at bitsavers.org. Be prepared to be blown away!
(be friendly to the server as I think it's his personal expense)
 
Jeffrey D Angus wrote:
Michael A. Terrell wrote:
D Yuniskis wrote:
Ha! I'm not sure I want to rely on that sort of
response... :<


That's just a sample of Jeff's 'VERY' warped sense of humor. You'll
get used to it. ;-)

Aye, but it works. That's the bottom line.

I know it works. :)

I just thought it was only fair to issue the standard warning: 'This
individual is classed as "Mostly Harmless!"' Do not look him directly
in his good eye, or take his last doughnut and your chances of survival
will be 93%. ;-)


 
D Yuniskis wrote:
You'd be amazed at how quickly that eats up disk space! I scanned
a disintegrating book on origami a few years ago seeking to
preserve color, etc. It was over 100MB compressed. You can't
store very many books if you preserve that much detail. :<

Have you tried 'Paperport'? Its compression is impressive. Its .max
file format makes small files, and you can drag the individual pages
into chapters or whole documents. The basic version was shipped with a
lot of flatbed scanners a few years ago, and includes a stand alone
reader.

 
Michael A. Terrell wrote:
D Yuniskis wrote:
You'd be amazed at how quickly that eats up disk space! I scanned
a disintegrating book on origami a few years ago seeking to
preserve color, etc. It was over 100MB compressed. You can't
store very many books if you preserve that much detail. :<

Have you tried 'Paperport'? Its compression is impressive. Its .max
file format makes small files, and you can drag the individual pages
into chapters or whole documents. The basic version was shipped with a
lot of flatbed scanners a few years ago, and includes a stand alone
reader.
I have to use formats that are "open" and/or widely
accepted (which often ends up with them being "open").
I don't live in *just* the "Windows World".
 
D Yuniskis wrote:
Michael A. Terrell wrote:
D Yuniskis wrote:
You'd be amazed at how quickly that eats up disk space! I scanned
a disintegrating book on origami a few years ago seeking to
preserve color, etc. It was over 100MB compressed. You can't
store very many books if you preserve that much detail. :<

Have you tried 'Paperport'? Its compression is impressive. Its .max
file format makes small files, and you can drag the individual pages
into chapters or whole documents. The basic version was shipped with a
lot of flatbed scanners a few years ago, and includes a stand alone
reader.

I have to use formats that are "open" and/or widely
accepted (which often ends up with them being "open").
I don't live in *just* the "Windows World".

Paperport will print to a PDF driver program if you want. I like it for
storing the raw scans because the file size VS image quality is great. I
don't know if it works with other OS or not.


 
On May 18, 9:43 am, Jeff Liebermann <je...@cruzio.com> wrote:
scanner, printer, etc conglomeration.  Here's a video clip of it
scanning both sides of service manual:
http://802.11junk.com/jeffl/crud/CanonImageRunner5000.wmv
Man, that's a looooooong download for 15 seconds of video.

Thanks,
Rich
 
On Tue, 18 May 2010 10:33:20 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

Hi Paul,

Paul Keinanen wrote:


These days 1 TB of storage costs practically nothing (and another TB
for backup). IMHO the source should be scanned and stored with the
best available resolution and bit planes, possibly with some very mild
compression.

You'd be amazed at how quickly that eats up disk space! I scanned
a disintegrating book on origami a few years ago seeking to
preserve color, etc. It was over 100MB compressed. You can't
store very many books if you preserve that much detail. :<
The storage cost for those 100 MB would be about one cent with current
1 TB drives.
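The arithmetic behind that figure, as a small Python sketch. The drive price of 100 (in whatever currency) is an assumption chosen to match "about one cent"; it is not a number given in the thread, and backup copies are not counted.

```python
def storage_cost(size_mb, drive_price, drive_capacity_tb=1.0):
    """Share of a drive's purchase price taken up by a file of size_mb megabytes."""
    capacity_mb = drive_capacity_tb * 1_000_000  # decimal units, as drives are sold
    return drive_price * size_mb / capacity_mb


cost = storage_cost(size_mb=100, drive_price=100.0)  # assumed drive price
print(f"100 MB on a 1 TB drive: {cost:.2f}")
```

By the same arithmetic a 1 TB drive holds on the order of 10,000 scans of 100 MB each, or half that if a second drive is kept as backup.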
 
In article <so84v5pq3ghe4udl10q2nokrhh6g7dgam5@4ax.com>,
Paul Keinanen <keinanen@sci.fi> wrote:
On Mon, 17 May 2010 14:45:04 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:

Hi Paul,

Paul Keinanen wrote:
On Mon, 17 May 2010 07:37:06 -0700, D Yuniskis
<not.going.to.be@seen.com> wrote:


Then there is the question how to store the scanned pages and also how
to distribute the (web) pages in a bandwidth efficient way.

I previously thought that storing and displaying the scanned pages as
simply bilevel (1 bit/pixel) bitmaps (typically run length encoded as
in faxes) would be sufficient, however, such page pictures look
horrible and the OCR software does not reliably make sense of the
text.

That is where the manual aspects come into play. You need to review
the results of the scan to decide how best to proceed. I've not found
any "magic bullet" -- unless you don't care about size (or quality).

I think it is important to keep the distinction between
scanning/storage format and on the other hand the publishing format.

These days 1 TB of storage costs practically nothing (and another TB
for backup). IMHO the source should be scanned and stored with the
best available resolution and bit planes, possibly with some very mild
compression.

You can then make some 1 bit/pixel encoding for publishing and heavy
compression.

After a few years, you can reprocess your digital source archives,
without rescanning the original documents when better software is
available, in order to produce smaller or higher quality publishing
formats.
Good advice.

1 bit/pixel is really too little and 8 bits/pixel would be excessive.
How many bits/pixel would be sufficient for pleasant visual rendering
or required by OCR software ?

It depends on the sizes of the typefaces used. Note that this
can vary within a document.

And, whether there are illustrations, etc.

Sometimes, you get really grainy images -- as if there was
dust on the scanner (though it is NOT the scanner that is
the source of the problem).

For decent typeface sizes, I will use 1bpp at 400-600dpi.
This is readable *and* OCR-able (not to be confused with
ocre-able -- which is the ability to turn something into ocre!)
Other times, I will use 8bpp and drop down to 300dpi
(trying to balance the added image depth against the
decreased resolution).
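
To put numbers on that trade-off, here is a rough sketch of the *uncompressed* size of one page at the settings mentioned. The page size (US letter, 8.5 x 11 inches) and the byte-padded rows are assumptions; compressed sizes will differ a lot, especially for 1bpp fax-style encodings.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page.

    The page dimensions default to US letter, which is an assumption.
    Each pixel row is padded up to a whole number of bytes.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# (dpi, bits per pixel) combinations discussed in the thread
for dpi, bpp in [(400, 1), (600, 1), (300, 4), (300, 8)]:
    size_mb = raw_page_bytes(dpi, bpp) / 1_000_000
    print(f"{dpi} dpi at {bpp} bpp: {size_mb:.2f} MB per page (raw)")
```

Before compression, 8bpp at 300dpi takes about twice the space of 1bpp at 600dpi, so dropping the resolution only partly offsets the added depth.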

I wrote some utilities to create *4* bit TIFFs but very few
programs will recognize this encoding (despite adhering to the
letter of the spec).

4 bit/pixel might be a usable format for _storage_, since this can
register the varying illumination, the whiteness of the paper and how
black the ink is. This might be usable information when postprocessing
to 1 bit/pixel.
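
One way this could work, as a minimal sketch only: estimate the ink and paper levels from the 4-bit data and threshold halfway between them, instead of at a fixed mid-scale value. The percentile choice and the function names are illustrative assumptions, not anything a poster in this thread described using.

```python
def estimate_levels(pixels):
    """Return (ink, paper) gray levels, taken as the 5th and 95th percentiles.

    pixels is a list of rows of 4-bit values (0 = black, 15 = white).
    """
    flat = sorted(v for row in pixels for v in row)
    ink = flat[int(0.05 * (len(flat) - 1))]
    paper = flat[int(0.95 * (len(flat) - 1))]
    return ink, paper


def to_bilevel(pixels):
    """Convert 4-bit grayscale to 1 bit/pixel (1 = ink, 0 = paper)."""
    ink, paper = estimate_levels(pixels)
    threshold = (ink + paper) / 2
    return [[1 if v < threshold else 0 for v in row] for row in pixels]


# A dim, yellowed page: paper reads as 6 and ink as 1 on the 0..15 scale.
page = [
    [6, 6, 1, 6],
    [6, 1, 1, 6],
    [6, 6, 6, 6],
]
for row in to_bilevel(page):
    print(row)
```

A fixed threshold of 8 would turn this whole page black, since even the paper reads below mid-scale. The stored gray levels are what make the per-page adjustment possible.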

I generally avoid the OCR stage as it requires *lots* of
proofreading. Images often get mishandled. Text often
gets misrecognized (remember, these are "computer manuals"
so "pigx" and "pigy" might be real "words" despite the OCR
package's attempts to "fix" them into "pigs" and "piggy", etc.).

As a compromise, you might publish the scans as bit maps, however, it
might be a good idea to run your original scans through some OCR
software and use the result to build an index. While a "pig" might be
a bit unexpected in a computer manual index, there is much less manual
proofreading.
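
A minimal sketch of such an index, assuming the OCR text is already available per page. The page texts below are made up for illustration, and no dictionary correction is applied, so odd identifiers survive as they were recognized.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted page numbers it appears on.

    pages is a dict of page number -> raw OCR text for that page.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[A-Za-z0-9_]+", text):
            index[word.lower()].add(page_no)
    return {word: sorted(nums) for word, nums in index.items()}


# Hypothetical OCR output keyed by page number (invented text).
pages = {
    1: "Call pigx before pigy to initialise the device.",
    2: "The DROP word removes the top stack item.",
    3: "pigy returns the status; DROP it if unused.",
}
index = build_index(pages)
print("drop:", index["drop"])
print("pigy:", index["pigy"])
```

The published pages stay as bitmaps; only the index needs the OCR text, so a misrecognized word costs a missed or spurious index entry rather than a corrupted page.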
It seems that Adobe has software to add OCR to a bitmap document.
That means text is searchable. For an example see the old issues
of Forth Dimensions (http://www.forth.org) under the heading
Forth Online documentation. So although you're looking at
a scan you can search for e.g. DROP and get it right most of the
time.

(But I'm convinced that there will be a time that you OCR
a 19th-century book, and the result will be better than
the original.)

IMHO the worst problem with scanned documents is that they do not
usually contain a searchable index, so including even a somewhat flaky
index would be a great service.
See above.
I figure just creating the (electronic) documents is
enough of a "donation" so if folks want to grumble, they
can go find better versions (hint: most of this stuff is
simply NOT AVAILABLE).

"If you don't like what I'm serving for dinner, you're welcome
to eat elsewhere..."

Scanning fragile (and often disintegrating) paper documents is a way
of preserving our cultural heritage.

Unfortunately, intellectual property laws (with protection terms lasting
decades after the IP holder's death) may in fact cause a loss of the
human intellectual heritage.
This is one of my grave concerns. The program SchoonSchip of
(Nobel Prize winner) Veltman has a nice manual, that is free.
The original manual (197x, mainly of historic interest) sits behind a
(ca) 30 Euro fee. (I'm involved with this, trying to port SchoonSchip
from 68K assembler to Intel.)
It is not hard to imagine a hardcore Elsevier executive deciding to
drop all papers not downloaded for 5 years.
(This has been a seminal activity for the "standard model"
in physics, but what do they know ...)

Throughout history it has been a fight to keep libraries in shape.
We don't need another destructive force, besides wars and
ignorance.

Please note that IP laws give protection. We are under no obligation
to exercise these rights to the full. A movement that establishes
the habit of pushing all legacy documentation into the public domain
would get my backing.

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
 
