[Blindmath] InftyReader problem: output contains image tags/commands rather than text

Alastair Irving alastair.irving at sjc.ox.ac.uk
Tue May 26 20:23:14 UTC 2009


Hi

I've experienced similar problems myself with certain images, but have never 
been able to determine what it is about the image that causes them.


Alastair
Maria Kristic wrote:
> Hi All,
> 
> I'm having a rather perplexing InftyReader issue, and I'm hoping that
> perhaps, someone has an idea of what's going on and how I could fix it.
> 
> I'm using a demo version of the latest version of InftyReader, version
> 2.7.9.0. I was sent a scanned image of a file, and the person scanning made
> sure that it was binary, black-and-white, and they gave me versions scanned
> at both 400 and 600 DPI (i.e., one of each). They sent me the scan as a PDF
> (image, not OCR-ed), TIFF, and BMP, and the results are all the same. When I
> run them through InftyReader to produce LaTeX output, the InftyReader log
> claims that both character and math recognition have taken place; however,
> the output file contains, following the preamble, only IncludeGraphics
> commands to include EPS files that I assume are the images of the pages
> because there are as many EPS files as there are pages of the scan (in other
> words, all that seems to have happened is that the page images have been
> converted to EPS and references to them placed in the LaTeX file via the
> IncludeGraphics command, rather than any OCR having occurred because there
> is absolutely no OCR-ed text in my file). I tried using XHTML+MathML as the
> output format, too, and got pretty much the same results: when I looked at it in
> Notepad, the files contained, after the header, only self-closing paragraph
> (i.e., <p/>) tags in which were nested self-closing image tags (i.e.,
> <img/>) and so of course there was no OCR-ed text for IE+MathPlayer to
> display either when I tried opening the XHTML file in the web browser. The
> person sent me two copies of the TIFF files, one scanned in CapturePerfect
> (the software package which is included with the Canon scanner that is
> being used) and the other in Kurzweil 3000, both of which produced the same
> results, and the PDF and BMP files were scanned via CapturePerfect.
> 
> What stumps me is that I tried scanning a page of something else myself with
> the required settings in Kurzweil 1000 version 11.03 (i.e., black-and-white,
> static thresholding to produce the binary image, scanned at 400 DPI because
> the Optimize Scanning feature determined that 300 was the best resolution,
> using the Brightness setting that Optimize found to be the best, using Image
> Scanning Only mode to produce the TIFF file), and it worked fine in that the
> LaTeX and XHTML output files contained both math and text. Also, I tried
> running InftyReader on a publisher file of a third book (this PDF wasn't
> just an un-OCR-ed image, as I could access its text with Adobe Reader, so it
> had already been OCR-ed when I'd gotten it from the publisher), and again,
> no problems at all. So I'm mystified: what's wrong with the files that I'm
> receiving scanned from the other person that are causing such drastic
> differences in output results?
> 
> I just found out today from the person scanning the files that some of the
> pages contain photographs and certain words highlighted (not like highlights
> where a person has highlighted something important as they've read, but
> highlights that have been printed in the text itself to emphasize points),
> and that these are in color in the book, but I'm told that everything is
> black-and-white in the scan. I know that the InftyReader About file states
> that the program erases noise and photos before recognizing, but that better
> recognition results will occur if one does this manually before running it
> through the program. Would these photos and the highlights that are probably
> being considered as noise by the software really cause such a drastic
> problem that would produce no text at all in the output file? The About
> document states that recognition wouldn't happen if there were any parts of
> the scan that were color or grayscale, but this shouldn't be the issue if
> the image were scanned in binary/black-and-white, and if it was an issue,
> wouldn't the log file say somewhere that recognition wasn't successful?
> 
> The scanned images sent to me from the person took an incredibly small
> amount of time to be recognized after I hit the Start OCR button, if that
> matters. I think that it perhaps only took maybe 1 or 2 minutes to recognize
> a 10-page file. The Status Line of the window did indicate the initial
> progress messages (i.e., initializing OCR dictionary, image pre-processing,
> etc.), but shouldn't recognition of 10 pages possibly have taken longer? I
> know that the publisher PDF, which worked fine, contained 12 pages, and I
> know it definitely took longer than 2 minutes for the Infty OCR to complete.
> 
> For the TIFF file which I scanned, I unfortunately have no idea whether the
> single page contained any photos or highlights. I didn't bookmark the page
> in the book, no longer have that TIFF, and my scanner is in for repair at
> the moment, so I can't try scanning any page myself that I know contains
> highlights/photos.
> 
> I also have Version 2.44 of the InftyReader (free, command-line version),
> and I get the same results with that one, except that IncludeGraphics
> commands aren't even present in the LaTeX file and so it only contains the
> preamble and Center (i.e., to produce text formatted center in the
> PDF/DVI/whatever if the LaTeX were to be compiled/typeset) commands (the
> output file produced with the demo also included the Center commands, so
> that the images would have been centered in the compiled output). Since the
> command-line version can't produce XHTML and can only produce IML+MathML, I
> installed the latest version of ChattyInfty as well (as a demo) to be able
> to look at the IML in the way it was meant to be viewed, and same result as
> the XHTML (i.e., the P and IMG tags only). According to the log file, again,
> no errors, and the progress messages displayed in the Command Prompt all
> indicate that things went well. For the publisher PDF I mentioned above (the
> 12-page book chapter), I ran it through the UDC (Universal Document
> Converter; this was a demo of the latest version which apparently can
> perform unlimited and unrestricted conversions with only some watermarks
> left in the file that are not present with the paid product) to create a
> TIFF before running it through v2.44, since the older version can't handle
> PDFs directly, and despite ensuring that the vertical and horizontal DPI of
> the TIFF was either 400 or 600 (I tried both) and ensuring that it was B/W,
> I kept getting told that I needed to check the resolution of the image
> because it wasn't one supported by the program. Perhaps those watermarks
> caused issues?
> 
> Since the publisher PDF and TIFF I scanned myself worked in the demo, I'm
> assuming there's something wrong with the files sent by the other person?
> Does anyone have any clue what the issue could be? Is there some other OCR
> setting that's being missed? Are the photos and highlights causing the
> issue, and do they need to be removed manually before running the files through
> InftyReader? Am I doing anything wrong (i.e., I've selected the Input File,
> selected the input and output formats, selected English as the language and
> All Math Symbols for the level, set the Newline Code option to be At The End
> Of Each Paragraph, set the resolution correctly, and tried it both with
> neither Recognition Mode checked and with the Accuracy mode checked; I left the
> LaTeX preamble to default which, even if I realized I should have included
> some extra packages after looking at the file, still should have produced at
> least some text in the TEX file, I would think)?
> 
> Any ideas are most welcome. Apologies for the length of this message, and
> its somewhat scattered nature, but I tried to include everything I could
> think of that I've done and that might help. If I didn't provide something I
> should have, let me know; I still have all the scans that were sent to me and
> can re-run them through and such. Thanks in advance; any suggestions at all
> are really appreciated!
> 
> Regards,
> Maria
> 
> _______________________________________________
> Blindmath mailing list
> Blindmath at nfbnet.org
> http://www.nfbnet.org/mailman/listinfo/blindmath_nfbnet.org
> To unsubscribe, change your list options or get your account info for Blindmath:
> http://www.nfbnet.org/mailman/options/blindmath_nfbnet.org/alastair.irving%40sjc.ox.ac.uk
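
One way to narrow down what differs between the scans that work and the ones that don't might be to check what the files themselves claim about bit depth and resolution, rather than relying on what the scanning software was set to. The following is only a rough sketch, assuming Python is available (nothing in this thread mentions it) and that the BMP uses the common 40-byte info header; read_bmp_header is a hypothetical helper name, not part of InftyReader or any scanner package. It cannot say why InftyReader skips recognition; it only shows whether a file is stored as 1 bit per pixel and at what DPI.

```python
import struct

def read_bmp_header(data: bytes) -> dict:
    """Read bit depth and resolution from a BMP with a 40-byte-or-larger info header."""
    if len(data) < 54 or data[:2] != b"BM":
        raise ValueError("not a BMP file, or file too short to hold the headers")
    (header_size,) = struct.unpack_from("<I", data, 14)
    if header_size < 40:
        raise ValueError("older BMP header variant; resolution fields are absent")
    # Width, height, colour planes, bits per pixel start at byte offset 18.
    width, height, _planes, bits = struct.unpack_from("<iiHH", data, 18)
    # Horizontal and vertical resolution are stored in pixels per metre at offset 38.
    x_ppm, y_ppm = struct.unpack_from("<ii", data, 38)
    return {
        "width": width,
        "height": abs(height),          # height is negative for top-down bitmaps
        "bits_per_pixel": bits,
        "is_bilevel": bits == 1,        # 1 bit per pixel: two-colour image
        "dpi": (round(x_ppm * 0.0254), round(y_ppm * 0.0254)),
    }
```

It could be called as read_bmp_header(open("page.bmp", "rb").read()), where page.bmp is a placeholder file name. A scan that looks black-and-white on screen but reports 8 or 24 bits per pixel is stored as grayscale or color, which the About document says would prevent recognition.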




