[Blindmath] InftyReader problem: output contains image tags/commands rather than text

Wed Jun 3 18:52:16 UTC 2009

Hello Maria.  It has taken a while to get the right people to look at
your PDF files, but the reason they did not convert is now clear.
These scans all have a dark border around the page, and InftyReader
takes that border to indicate a graphic.  We have seen this problem
before, and the Infty group is working on a fix for a future release.
It is not as easy as it might seem to distinguish graphics from text!
In the meantime, the solution is to remove those dark bands (which
probably were introduced when the paper was scanned).  I am sure that a
person knowledgeable in graphics file editing can find a way to edit
out those black bands, but one needs to be careful not to reduce the
resolution.  InftyReader works great at 600 dpi but less well at 400 dpi.
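
For anyone who would rather script the band removal than do it by hand,
here is a minimal Python sketch of the idea.  It is only an illustration
under stated assumptions: the page is assumed to be already loaded as
rows of 0/1 pixels (1 = black), and the names clear_dark_borders and
dark_band_width and the 0.9 threshold are hypothetical choices, not part
of InftyReader.  It paints the dark edge bands white instead of cropping
or resampling, so the pixel dimensions (and hence the dpi) stay the same.

```python
def dark_band_width(lines, dark_fraction):
    """Count consecutive mostly-black lines, starting from the first one."""
    count = 0
    for line in lines:
        if not line or sum(line) < dark_fraction * len(line):
            break
        count += 1
    return count


def clear_dark_borders(page, dark_fraction=0.9):
    """Return a copy of a binary page with dark edge bands painted white.

    page is a list of rows; each pixel is 1 (black) or 0 (white).
    dark_fraction is an assumed threshold: a row or column counts as part
    of a border band when at least that share of its pixels is black.
    """
    if not page or not page[0]:
        return [row[:] for row in page]

    height, width = len(page), len(page[0])
    columns = [list(col) for col in zip(*page)]

    # Measure each band on the untouched page, working inward from each edge.
    top = dark_band_width(page, dark_fraction)
    bottom = dark_band_width(page[::-1], dark_fraction)
    left = dark_band_width(columns, dark_fraction)
    right = dark_band_width(columns[::-1], dark_fraction)

    # Paint the bands white; the image size is left unchanged.
    cleaned = [row[:] for row in page]
    for y in range(height):
        for x in range(width):
            if (y < top or y >= height - bottom
                    or x < left or x >= width - right):
                cleaned[y][x] = 0
    return cleaned
```

With a real scan, the pixels would have to be read and written with an
image library, and the file saved again as a binary image at the
original resolution.  A band that is broken up by white gaps would need
a lower threshold than the one assumed here.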

My advice to InftyReader users is to avoid noise around the page when
scanning.  If it can't be avoided in any other way, use a large piece of
white paper behind the page being scanned.

Hope this is helpful!

John Gardner

On 5/26/2009 10:41 AM, Maria Kristic wrote:
> Hi All,
>
> I'm having a rather perplexing InftyReader issue, and I'm hoping that
> perhaps, someone has an idea of what's going on and how I could fix it.
>
> I'm using a demo version of the latest version of InftyReader, version
> 2.7.9.0. I was sent a scanned image of a file, and the person scanning made
> sure that it was binary, black-and-white, and they gave me versions scanned
> at both 400 and 600 DPI (i.e., one of each). They sent me the scan as a PDF
> (image, not OCR-ed), TIFF, and BMP, and the results are all the same. When I
> run them through InftyReader to produce LaTeX output, the InftyReader log
> claims that both character and math recognition have taken place; however,
> the output file contains, following the preamble, only IncludeGraphics
> commands to include EPS files that I assume are the images of the pages
> because there are as many EPS files as there are pages of the scan (in other
> words, all that seems to have happened is that the page images have been
> converted to EPS and references to them placed in the LaTeX file via the
> IncludeGraphics command, rather than any OCR having occurred because there
> is absolutely no OCR-ed text in my file). I tried using XHTML+MathML as the
> output format, too, and got pretty much the same results. When I looked at it in
> Notepad, the files contained, after the header, only self-closing paragraph
> (i.e., <p/>) tags in which were nested self-closing image tags (i.e.,
> <img/>) and so of course there was no OCR-ed text for IE+MathPlayer to
> display either when I tried opening the XHTML file in the web browser. The
> person sent me two copies of the TIFF files, one scanned in CapturePerfect
> (the software package which is included with the Canon scanner that is
> being used) and the other in Kurzweil 3000, both of which produced the same
> results, and the PDF and BMP files were scanned via CapturePerfect.
>
> What stumps me is that I tried scanning a page of something else myself with
> the required settings in Kurzweil 1000 version 11.03 (i.e., black-and-white,
> static thresholding to produce the binary image, scanned at 400 DPI because
> the Optimize Scanning feature determined that 300 was the best resolution,
> using the Brightness setting that Optimize found to be the best, using Image
> Scanning Only mode to produce the TIFF file), and it worked fine in that the
> LaTeX and XHTML output files contained both math and text. Also, I tried
> running InftyReader on a publisher file of a third book (this PDF wasn't
> just an un-OCR-ed image, as I could access its text with Adobe Reader, so it
> had already been OCR-ed when I'd gotten it from the publisher), and again,
> no problems at all. So I'm mystified: what's wrong with the files that I'm
> receiving scanned from the other person that are causing such drastic
> differences in output results?
>
> I just found out today from the person scanning the files that some of the
> pages contain photographs and certain words highlighted (not like highlights
> where a person has highlighted something important as they've read, but
> highlights that have been printed in the text itself to emphasize points),
> and that these are in color in the book, but I'm told that everything is
> black-and-white in the scan. I know that the InftyReader About file states
> that the program erases noise and photos before recognizing, but that better
> recognition results will occur if one does this manually before running it
> through the program. Would these photos and the highlights that are probably
> being considered as noise by the software really cause such a drastic
> problem that would produce no text at all in the output file? The About
> document states that recognition wouldn't happen if there were any parts of
> the scan that were color or grayscale, but this shouldn't be the issue if
> the image were scanned in binary/black-and-white, and if it was an issue,
> wouldn't the log file say somewhere that recognition wasn't successful?
>
> The scanned images sent to me from the person took an incredibly small
> amount of time to be recognized after I hit the Start OCR button, if that
> matters. I think that it perhaps only took maybe 1 or 2 minutes to recognize
> a 10-page file. The Status Line of the window did indicate the initial
> progress messages (i.e., initializing OCR dictionary, image pre-processing,
> etc.), but shouldn't recognition of 10 pages possibly have taken longer? I
> know that the publisher PDF, which worked fine, contained 12 pages, and I
> know it definitely took longer than 2 minutes for the Infty OCR to complete.
>
> For the TIFF file which I scanned, I unfortunately have no idea whether the
> single page contained any photos or highlights. I didn't bookmark the page
> in the book, no longer have that TIFF, and my scanner is in for repair at
> the moment, so I can't try scanning any page myself that I know contains
> highlights/photos.
>
> I also have Version 2.44 of the InftyReader (free, command-line version),
> and I get the same results with that one, except that IncludeGraphics
> commands aren't even present in the LaTeX file and so it only contains the
> preamble and Center (i.e., to produce text formatted center in the
> PDF/DVI/whatever if the LaTeX were to be compiled/typeset) commands (the
> output file produced with the demo also included the Center commands, so
> that the images would have been centered in the compiled output). Since the
> command-line version can't produce XHTML and can only produce IML+MathML, I
> installed the latest version of ChattyInfty as well (as a demo) to be able
> to look at the IML in the way it was meant to be viewed, and same result as
> the XHTML (i.e., the P and IMG tags only). According to the log file, again,
> no errors, and the progress messages displayed in the Command Prompt all
> indicate that things went well. For the publisher PDF I mentioned above (the
> 12-page book chapter), I ran it through the UDC (Universal Document
> Converter; this was a demo of the latest version which apparently can
> perform unlimited and unrestricted conversions with only some watermarks
> left in the file that are not present with the paid product) to create a
> TIFF before running it through v2.44, since the older version can't handle
> PDFs directly, and despite ensuring that the vertical and horizontal DPI of
> the TIFF was either 400 or 600 (I tried both) and ensuring that it was B/W,
> I kept getting told that I needed to check the resolution of the image
> because it wasn't one supported by the program. Perhaps those watermarks
> caused issues?
>
> Since the publisher PDF and TIFF I scanned myself worked in the demo, I'm
> assuming there's something wrong with the files sent by the other person?
> Does anyone have any clue what the issue could be? Is there some other OCR
> setting that's being missed? Are the photos and highlights causing the
> issue, and they need to manually be removed before running through
> InftyReader? Am I doing anything wrong (i.e., I've selected the Input File,
> selected the input and output formats, selected English as the language and
> All Math Symbols for the level, set the Newline Code option to be At The End
> Of Each Paragraph, set the resolution correctly, and I've tried both with
> neither Recognition Mode checked and with the Accuracy mode checked; I left the
> LaTeX preamble to default which, even if I realized I should have included
> some extra packages after looking at the file, still should have produced at
> least some text in the TEX file, I would think)?
>
> Any ideas are most welcome. Apologies for the length of this message, and
> its somewhat scattered nature, but I tried to include everything I could
> think of that I've done and that might help. If I didn't provide something I
> should have, let me know; I have all the scans that were sent to me still and
> can re-run them through and such. Thanks in advance; any suggestions at all
> are really appreciated!
>
> Regards,
> Maria