[Blindmath] InftyReader problem: output contains image tags/commands rather than text

Tue May 26 17:41:24 UTC 2009

Hi All,

I'm having a rather perplexing InftyReader issue, and I'm hoping that
someone has an idea of what's going on and how I could fix it.

I'm using a demo of the latest version of InftyReader, version
2.7.9.0. I was sent a scanned image of a file, and the person scanning made
sure that it was binary, black-and-white, and they gave me versions scanned
at both 400 and 600 DPI (i.e., one of each). They sent me the scan as a PDF
(image, not OCR-ed), TIFF, and BMP, and the results are all the same. When I
run them through InftyReader to produce LaTeX output, the InftyReader log
claims that both character and math recognition have taken place; however,
the output file contains, following the preamble, only IncludeGraphics
commands to include EPS files. I assume these are the images of the pages,
because there are as many EPS files as there are pages in the scan. In
other words, all that seems to have happened is that the page images were
converted to EPS and referenced from the LaTeX file via the IncludeGraphics
command; there is absolutely no OCR-ed text in my file. I tried using
XHTML+MathML as the output format, too, with pretty much the same result:
when I looked at the file in Notepad, it contained, after the header, only
paragraph tags (i.e., <p>), each wrapping a self-closing image tag (i.e.,
<img/>), so of course there was no OCR-ed text for IE+MathPlayer to display
when I tried opening the XHTML file in the web browser. The
person sent me two copies of the TIFF files, one scanned in CapturePerfect
(the software package which is included with the Canon scanner that is
being used) and the other in Kurzweil 3000, both of which produced the same
results, and the PDF and BMP files were scanned via CapturePerfect.
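
For anyone who wants to confirm the same symptom on their own output, here
is a minimal Python sketch (not part of InftyReader) that counts the
IncludeGraphics commands and the leftover words in the body of a LaTeX
file. The function names and the sample file names (page1.eps, page2.eps)
are made up for illustration, and the exact layout InftyReader writes is an
assumption based on the description above.

```python
import re

# Matches \includegraphics{...} with an optional [...] argument.
GRAPHICS = r"\\includegraphics(?:\[[^\]]*\])?\{[^}]*\}"


def summarize_latex_body(tex):
    """Count includegraphics commands and leftover words in the document body."""
    # Keep only what lies between \begin{document} and \end{document}.
    match = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", tex, re.DOTALL)
    body = match.group(1) if match else tex
    # Drop LaTeX comments (an unescaped % to the end of the line).
    body = re.sub(r"(?<!\\)%.*", "", body)
    graphics = re.findall(GRAPHICS, body)
    # Remove graphics commands, environment markers, and other bare commands,
    # so that only ordinary text (and math content) remains.
    rest = re.sub(GRAPHICS, " ", body)
    rest = re.sub(r"\\(?:begin|end)\{[^}]*\}", " ", rest)
    rest = re.sub(r"\\[A-Za-z]+\*?", " ", rest)
    words = re.findall(r"[A-Za-z0-9]+", rest)
    return {"graphics": len(graphics), "words": len(words)}


def is_image_only(tex):
    """True when the body has page images but no recognizable text at all."""
    summary = summarize_latex_body(tex)
    return summary["graphics"] > 0 and summary["words"] == 0
```

To use it, read the TEX file into a string and pass it to is_image_only; a
True result means the body holds nothing but image references, which is the
situation described above.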

What stumps me is that I tried scanning a page of something else myself with
the required settings in Kurzweil 1000 version 11.03 (i.e., black-and-white,
static thresholding to produce the binary image, scanned at 400 DPI because
the Optimize Scanning feature determined that 300 was the best resolution,
using the Brightness setting that Optimize found to be the best, using Image
Scanning Only mode to produce the TIFF file), and it worked fine in that the
LaTeX and XHTML output files contained both math and text. Also, I tried
running InftyReader on a publisher file of a third book (this PDF wasn't
just an un-OCR-ed image, as I could access its text with Adobe Reader, so it
had already been OCR-ed when I'd gotten it from the publisher), and again,
no problems at all. So I'm mystified: what is it about the files scanned by
the other person that causes such drastic differences in the output?

I just found out today from the person scanning the files that some of the
pages contain photographs and certain words highlighted (not like highlights
where a person has highlighted something important as they've read, but
highlights that have been printed in the text itself to emphasize points),
and that these are in color in the book, but I'm told that everything is
black-and-white in the scan. I know that the InftyReader About file states
that the program erases noise and photos before recognizing, but that better
recognition results will occur if one does this manually before running it
through the program. Would these photos and highlights, which the software
is probably treating as noise, really cause a problem so drastic that no
text at all appears in the output file? The About document states that
recognition won't happen if any part of the scan is color or grayscale, but
this shouldn't be the issue if the image was scanned in
binary/black-and-white, and if it were an issue, wouldn't the log file say
somewhere that recognition wasn't successful?
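
One way to test whether a TIFF really is binary, rather than grayscale or
color that merely looks black-and-white, is to read its header tags. Below
is a rough Python sketch using only the standard library. It assumes a
classic (non-BigTIFF) file and looks only at the first page; the function
names are hypothetical, and the tag numbers come from the TIFF 6.0
specification, not from anything in the InftyReader documentation.

```python
import struct

# Baseline TIFF tag numbers (TIFF 6.0 specification).
TAGS = {258: "bits_per_sample", 277: "samples_per_pixel",
        282: "x_resolution", 283: "y_resolution", 296: "resolution_unit"}
TYPE_FORMATS = {3: "H", 4: "I", 5: "II"}  # SHORT, LONG, RATIONAL


def read_tiff_tags(data):
    """Return selected tags from the first IFD (page) of a TIFF byte string."""
    if data[:2] == b"II":
        order = "<"   # little-endian
    elif data[:2] == b"MM":
        order = ">"   # big-endian
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    # Defaults that apply when a writer omits the tag.
    info = {"bits_per_sample": 1, "samples_per_pixel": 1, "resolution_unit": 2}
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        tag, typ, n = struct.unpack_from(order + "HHI", data, entry)
        if tag not in TAGS or typ not in TYPE_FORMATS:
            continue
        fmt = order + TYPE_FORMATS[typ]
        # Values of 4 bytes or fewer sit inline; otherwise the field is an offset.
        if struct.calcsize(fmt) * n <= 4:
            pos = entry + 8
        else:
            (pos,) = struct.unpack_from(order + "I", data, entry + 8)
        value = struct.unpack_from(fmt, data, pos)  # first value only
        if typ == 5:
            info[TAGS[tag]] = value[0] / value[1] if value[1] else 0.0
        else:
            info[TAGS[tag]] = value[0]
    return info


def looks_bilevel(info):
    """True for 1 bit per sample and 1 sample per pixel (a binary image)."""
    return info["bits_per_sample"] == 1 and info["samples_per_pixel"] == 1


def dpi(info):
    """Return (x, y) resolution in dots per inch; unit 3 means per centimeter."""
    factor = 2.54 if info["resolution_unit"] == 3 else 1.0
    return (info.get("x_resolution", 0) * factor,
            info.get("y_resolution", 0) * factor)
```

Reading a file with open(path, "rb").read() and passing the bytes to
read_tiff_tags gives a dictionary that looks_bilevel and dpi can inspect.
If looks_bilevel returns False, or the two DPI values differ or are not 400
or 600, the scan does not match the settings that were intended.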

The scanned images sent to me took an incredibly small amount of time to be
recognized after I hit the Start OCR button, if that matters: I think only
1 or 2 minutes for a 10-page file. The Status Line of the window did show
the initial progress messages (i.e., initializing OCR dictionary, image
pre-processing, etc.), but shouldn't recognition of 10 pages have taken
longer? I know that the publisher PDF, which worked fine, contained 12
pages, and it definitely took longer than 2 minutes for the Infty OCR to
complete.

For the TIFF file which I scanned, I unfortunately have no idea whether the
single page contained any photos or highlights. I didn't bookmark the page
in the book, no longer have that TIFF, and my scanner is in for repair at
the moment, so I can't try scanning any page myself that I know contains
highlights/photos.

I also have Version 2.44 of InftyReader (the free, command-line version),
and I get the same results with that one, except that the IncludeGraphics
commands aren't even present in the LaTeX file, so it contains only the
preamble and Center commands (i.e., the ones that would center content in
the PDF/DVI/whatever if the LaTeX were compiled/typeset). The output file
produced with the demo also included the Center commands, so that the
images would have been centered in the compiled output. Since the
command-line version can't produce XHTML and can only produce IML+MathML, I
installed the latest version of ChattyInfty as well (as a demo) to be able
to look at the IML in the way it was meant to be viewed, and got the same
result as with the XHTML (i.e., the P and IMG tags only). According to the
log file, again,
no errors, and the progress messages displayed in the Command Prompt all
indicate that things went well. For the publisher PDF I mentioned above (the
12-page book chapter), I ran it through the UDC (Universal Document
Converter; this was a demo of the latest version which apparently can
perform unlimited and unrestricted conversions with only some watermarks
left in the file that are not present with the paid product) to create a
TIFF before running it through v2.44, since the older version can't handle
PDFs directly. Despite ensuring that the vertical and horizontal DPI of
the TIFF were either 400 or 600 (I tried both) and that it was B/W, I kept
being told that I needed to check the resolution of the image because it
wasn't one supported by the program. Perhaps those watermarks caused
issues?

Since the publisher PDF and TIFF I scanned myself worked in the demo, I'm
assuming there's something wrong with the files sent by the other person?
Does anyone have any clue what the issue could be? Is there some other OCR
setting that's being missed? Are the photos and highlights causing the
issue, so that they need to be removed manually before running the file
through InftyReader? Am I doing anything wrong? I've selected the Input
File, selected the input and output formats, selected English as the
language and All Math Symbols for the level, set the Newline Code option to
At The End Of Each Paragraph, and set the resolution correctly. I've tried
both with neither Recognition Mode checked and with the Accuracy mode
checked. I left the LaTeX preamble at its default; even if, after looking
at the file, I realized I should have included some extra packages, that
still should have produced at least some text in the TEX file, I would
think.

Any ideas are most welcome. Apologies for the length of this message and
its somewhat scattered nature, but I tried to include everything I could
think of that I've done and that might help. If I didn't provide something I
should have, let me know; I still have all the scans that were sent to me
and can re-run them. Thanks in advance. Any suggestions at all are really
appreciated!

Regards,
Maria