[Blindmath] pdf intrigue

Jason White jason at jasonjgw.net
Fri Mar 20 04:47:08 UTC 2009


Jonathan Godfrey <a.j.godfrey at massey.ac.nz> wrote:
> 3. Many PDF files do not convert to text cleanly, as spaces are
> inserted in odd places and omitted in others. Line breaks between
> words are another frustration.

I think this is the result of proportional spacing. Basically, PostScript and
PDF use layout operators to control the position of each character precisely,
often without any explicit space characters in between. Software that converts
the PDF to text has to examine the gaps between characters and decide where to
place spaces in the output file.
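
To see why this is error-prone, here is a rough sketch in Python of the kind
of decision a converter has to make - purely illustrative, not any real
converter's algorithm, and the positions are made up:

    def glyphs_to_text(glyphs, gap_threshold=0.25):
        # glyphs: a list of (character, x_position, width) tuples in
        # arbitrary font-relative units.
        out = []
        prev_end = None
        for char, x, width in glyphs:
            # Emit a space whenever the gap since the previous glyph
            # is wider than the threshold.
            if prev_end is not None and x - prev_end > gap_threshold:
                out.append(" ")
            out.append(char)
            prev_end = x + width
        return "".join(out)

    # Hypothetical positions for the two words "To be":
    line = [("T", 0.0, 0.6), ("o", 0.62, 0.5), ("b", 1.4, 0.5), ("e", 1.92, 0.45)]
    print(glyphs_to_text(line))  # prints "To be"

The hard part is choosing the threshold, because the gaps are not constant.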

Good typesetting software such as TeX adjusts the spacing between characters
and between words so as to align both the left and right edges of the printed
text, which makes the print easier to read when it is done well. TeX has a
reputation for being particularly good in this regard.

Thus a side effect of these high-quality justification algorithms is that the
spacing between words varies from line to line, making it harder for
PDF-to-text converters to determine word boundaries correctly.
> 4. If I make a PDF straight from the source code, it is often a mess
> (point 3).
> 5. If I make the DVI file and then convert to PDF, the problems with
> point 3 remain.
> 6. When I go through the process of making the DVI file, then the
> PostScript file, and then making the PDF from the PostScript file, it
> ends up considerably easier to read the text with JAWS.
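
(For anyone trying to reproduce this, the long route in point 6 is the usual
three-step chain - the file name here is a placeholder:

    latex paper.tex    # produces paper.dvi
    dvips paper.dvi    # produces paper.ps
    ps2pdf paper.ps    # produces paper.pdf

as opposed to producing the PDF in one step with pdflatex.)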

I don't know why that is, but I would suggest running pdftotext on the file to
see whether it does a better job. It's available for Linux - I'm not a Windows
user, so I can't comment on JAWS or Adobe.
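
For example, assuming the file is called paper.pdf:

    pdftotext -layout paper.pdf paper.txt

The -layout option asks pdftotext to preserve the physical layout of the
page, which sometimes helps with the spacing problems described in point 3;
without it, pdftotext simply emits the text in reading order.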
> 7. With points 4 and 5, character strings involving an "f" are often
> not converted to text properly. This includes the strings "ff", "fi",
> and "ffi", to illustrate three different problems. The laborious
> creation of the PDF (point 6) seems to work for these character
> strings.

In some fonts, those strings are actually represented as single glyphs
rather than as two or three separate characters, and the problem is that your
text converter isn't recognizing this.

The reason for representing strings such as "fi" as a single combined
character is that they are ligatures: the letters are merged into one glyph
so that, for example, the hook of the "f" does not collide with the dot of
the "i". Ligatures, like kerning, are standard typographical techniques which
TeX applies automatically, and which improve the quality of the typeset
print.
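
If the converter at least emits the Unicode ligature characters (U+FB00 for
"ff", U+FB01 for "fi", U+FB03 for "ffi", and so on) instead of dropping them,
the text can be repaired afterwards. A minimal sketch using Python's standard
unicodedata module:

    import unicodedata

    # Text as it might come out of a PDF-to-text converter, with
    # ligature characters standing in for "ffi", "fi" and "ff":
    garbled = "the e\ufb03cient \ufb01le o\ufb00ers"

    # NFKC compatibility normalization expands each ligature character
    # back into its constituent letters.
    print(unicodedata.normalize("NFKC", garbled))
    # prints: the efficient file offers

If, on the other hand, the converter produces garbage or nothing at all for
the ligature glyph (which happens when the font has no proper Unicode
mapping), there is nothing left to normalize, and the slow route through
PostScript may be the only fix.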
