[Blindmath] Extracting bitmap images from pdf files

Richard Baldwin baldwin at dickbaldwin.com
Thu Jan 26 15:19:39 UTC 2012


Hi Pranav,

Well, this is an interesting approach from a technical viewpoint. I found a
good pdf to docx conversion site (http://www.zamzar.com/). Actually, that
site can convert between many different formats. It is free and easy to
use. I don't know whether it is accessible.

I converted a relatively small pdf file containing 30 images with figure
numbers and captions plus about 15 images without figure numbers and
captions to a docx file. Then, like you suggested, I changed the extension
from docx to zip and opened the file in WinZip. (This was new to me and is
a good thing to know. I had never taken the time to look into the structure
of a docx file.)

Just like you suggested, there was a folder in the zip file named media
that contained 32 jpg files. Those are very interesting files. It looks
like every page that contains one or more images is converted to a jpg file
after first removing all of the text. It even seems to remove the text that
is part of the original image. What you end up with is a set of jpg files,
each the size of a pdf page, containing one or more smaller images in the
same locations that they occupy on the original pdf page.
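
For anyone who would rather script the unzip step than use WinZip, here is
a minimal Python sketch of the same idea. It relies on the fact that a docx
file is an ordinary zip archive, so renaming to .zip is not required. The
function name extract_docx_media and the file names in the usage note are
hypothetical, and the "word/media/" prefix is an assumption about where the
converter places the images (it is the usual location in docx files).

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, out_dir):
    """Copy every file stored under word/media/ in a docx into out_dir.

    Assumption: the images live under "word/media/", the usual location
    in a docx archive. Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # A docx file is a zip archive, so zipfile can open it directly.
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only files in the media folder.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                extracted.append(target)
    return extracted
```

Calling extract_docx_media("converted.docx", "images") would write the
contents of the media folder into a folder named images. Note that this
only automates the unzip step. It does nothing about the page-sized jpg
files described above.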

Very interesting, but definitely not usable by a blind student. The only
way to get access to the original images is to open the jpg file in an
image editor and to crop out the individual images. As Snidely Whiplash
would say, "Curses, foiled again by Adobe."

It's beginning to look like whichever way I turn, extracting the images
from the pdf files is going to require the services of a sighted person who
can operate an image editing program to crop the individual images out of
jpg representations of the pdf pages.

Thanks again for the input,
Dick Baldwin



On Thu, Jan 26, 2012 at 8:08 AM, Richard Baldwin <baldwin at dickbaldwin.com> wrote:

> Hi Pranav,
>
> Thanks for the information. However, this might be a "solution of last
> resort" for us because both Amanda and I tend to stay as far away from MS
> Word as possible. Fortunately, when people send docx files to me, as they
> often do, I can read them in the Google viewer. I don't have anything on my
> computer that will read docx files and I doubt that Amanda has anything on
> her computer that will read them either.
>
> However, it sounds like it might work without the requirement to actually
> use MS Word by dealing strictly with the files and opening the docx file in
> WinZip. There are dozens of sites that claim to convert pdf files to word
> files, so I will take a look at some of them and give it a shot.
>
> Dick Baldwin
>
>
> On Thu, Jan 26, 2012 at 5:00 AM, Pranav Lal <pranav.lal at gmail.com> wrote:
>
>> Hi Richard,
>>
>> I have not tried this but:
>> 1. Convert the pdf file to Microsoft Word docX format.
>> 2. Unzip the docX file.
>> 3. You get all the images in one of the folders in the resulting docX
>> expansion.
>> Note:
>> The docX file behaves like a zip archive so rename to a *.zip and then
>> extract it.
>>
>> Finally, for PDF to word conversion, I use a paid program called Abbyy PDF
>> transformer.
>>
>> Send me some of the files you need to convert and I would be happy to try
>> the above approach.
>> Pranav
>>
>
>
>
> --
> Richard G. Baldwin (Dick Baldwin)
> Home of Baldwin's on-line Java Tutorials
> http://www.DickBaldwin.com
>
> Professor of Computer Information Technology
> Austin Community College
> (512) 223-4758
> mailto:Baldwin at DickBaldwin.com
> http://www.austincc.edu/baldwin/
>





