[nabs-l] question about Jaws and PDFs

Darrell Shandrow darrell.shandrow at gmail.com
Sun Jan 24 14:50:54 UTC 2010


Hello Rania,

I assume you're asking about PDF2TXT? The documentation is below:

Version 3.3
April 6, 2009
Copyright 2005 - 2009 by Jamal Mazrui
LGPL license 
Contents
Description 
Installation 
Choosing PDF Source and TXT Target 
Text Extraction Settings 
Viewing Area 
Toggling between a File and Folder List 
Configuration Check Boxes 
Action Buttons 
URL Source 
Hot Keys 
The Log File 
Command Line Operation 
File Association 
Development Notes 
Description

PDF to TXT (also written PDF2TXT) is a free program for converting files in
Portable Document Format (.pdf extension) to plain text (.txt extension). The
program lets you convert multiple files in a single, batch operation, either
from a GUI dialog or a console-mode command line. The resulting text files
can be read in almost any editing or viewing program. PDF2TXT, itself, also
includes a plain text view for reading PDF files. The program should work on
any version of Windows. 

Installation

The installation program for PDF2TXT is called p2tsetup.exe. When executed,
it prompts for an installation folder for the program. The default folder is
c:\pdf2txt. Although this is not a standard location for programs on a
Windows computer, a benefit is fewer keystrokes to type whenever you
manually enter the path to a PDF file or folder. If you want a standard
installation folder, however, respond to the prompt by entering
C:\Program Files\PDF to Text 
The installation process creates a program group for PDF2TXT on the Windows
start menu, containing choices to launch PDF2TXT, read Documentation for
PDF2TXT, and uninstall PDF2TXT. Also created is a desktop shortcut with an
associated hot key, enabling PDF2TXT to be conveniently launched by pressing
Control+Alt+Shift+P. Another shortcut is placed in the Send To folder so
that a PDF may be viewed in PDF2TXT via the context menu in Windows
Explorer. 


Choosing PDF Source and TXT Target

After PDF2TXT is installed, launching it activates a main dialog with
several capabilities and settings. First, it prompts you to select a PDF
source. This can be either a single PDF file or a folder containing multiple
PDF files (another section explains how it can also be an Internet URL). In
the initial edit box, you can type the full path to the file or folder
desired. Alternatively, you can tab to buttons that invoke different sub
dialogs depending on whether you want to choose a file or folder as the PDF
source. (Yet another option, described later, is to pass the path to the PDF
source as a parameter on the command line when pdf2txt.exe is launched.) 
By default, the PDF source is the folder c:\pdf2TXT\pdf. Any source may be
chosen, however, and the program remembers the last one used. 

Similarly, an edit box and associated button let you specify the target
folder for converted files. These will have the same base name, but an
extension of .txt instead of .pdf. The default target folder is
c:\pdf2TXT\txt. Note that the PDF source may be either a file or folder, but
the TXT target is always a folder. 
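
As a rough illustration only (this is not code from PDF2TXT), the naming rule
above can be sketched in Python. The file name report.pdf is a hypothetical
example; the folders are the defaults mentioned above.

```python
from pathlib import PureWindowsPath


def target_path(pdf_file: str, target_folder: str) -> PureWindowsPath:
    """Return the .txt path in target_folder that shares the PDF's base name."""
    source = PureWindowsPath(pdf_file)
    # Keep the base name, swap the .pdf extension for .txt.
    return PureWindowsPath(target_folder) / (source.stem + ".txt")


# report.pdf is a made-up file name used only for this example.
print(target_path(r"c:\pdf2TXT\pdf\report.pdf", r"c:\pdf2TXT\txt"))
# prints c:\pdf2TXT\txt\report.txt
```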


Text Extraction Settings

Two settings fundamentally affect how text is extracted from a PDF. If the
PDF requires a password to unlock its content, type it in the edit box
provided. If the PDF is an image format without textual characters -- e.g.,
the result of a scan -- mark the checkbox so that optical character
recognition (OCR) is performed instead of the usual techniques of extracting
text. This OCR technique was originally posted at
http://EmpowermentZone.com/pdf2ocr.zip 
OCR is a much slower and more error-prone process, but it may be the best
option when the usual methods do not work. This technique uses Google
Tesseract, the best open source OCR available, which is not as good as
commercial OCR packages. Due to technical issues, there is not a simple way
of aborting an OCR process that has already started. It is possible,
however, by launching another copy of PDF2TXT, which clears the deck during
its startup phase. 

Another checkbox lets you additionally produce a .htm target file -- in HTML
format. This uses a different conversion technique, originally posted at
http://EmpowermentZone.com/pdf2htm.zip 

This may be worth trying if the .txt result is unsatisfactory. It may also
be useful for webmasters who want to post an HTML alternative to a PDF. The
conversion translates visual aspects of the PDF such as fonts, but not
structural elements such as headings, unfortunately. To further increase
conversion options, a different technique is also used for producing the
.txt file with this checkbox, using the PDFToText.exe utility that is also
separately available at
http://www.foolabs.com/xpdf/home.html 


Viewing Area

Within the main dialog, a read-only, multi-line edit control serves as a
viewing area between the source and target controls just discussed. This
scrollable view can show one of three kinds of information: (1) the text of
a PDF, (2) a list of PDF files, or (3) the results of a batch conversion. The
label for the viewing area changes to indicate the kind of information being
shown: "View file," "View folder," or "View results." 
You can navigate the viewing area with standard Windows keystrokes, e.g.,
Control+Home or Control+End to go to the top or bottom of text. Control+F
lets you search forward for a string of characters, and Control+Shift+F lets
you search backward. F3 searches for the same string again in the forward
direction, and Shift+F3 searches again backward. Control+G lets you go to a
percent completion point through the file being viewed. Control+K sets a
bookmark for the file, Control+Shift+K clears it, and Alt+K goes to it. 

You can press Shift with arrow keys to select text or Control+A to select
all. Alternatively, you can press F8 to set the starting point of a
selection, navigate to the ending point desired, and then press Shift+F8 to
select the text between these points. 

Press Control+C to copy selected text to the clipboard. Alternatively, press
Control+Shift+C, or Alt+F8, to copy and append to the clipboard, adding to
rather than replacing its existing text. A form feed or page break character
(ANSI code 12) will separate each clip copied there. Control+F8 is a
shortcut that copies all text in the viewing area without having to select
it first, equivalent to Control+A followed by Control+C. 
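
The copy-and-append behavior can be pictured with a short Python sketch. It
is an illustration, not the program's own code, and it assumes that an empty
clipboard receives the first clip without a leading separator.

```python
FORM_FEED = chr(12)  # the form feed / page break character, ANSI code 12


def append_clip(clipboard: str, selection: str) -> str:
    """Add selection to the existing clipboard text instead of replacing it."""
    if not clipboard:
        # Assumption: nothing to separate from, so no form feed is added.
        return selection
    return clipboard + FORM_FEED + selection


clips = ""
for piece in ["first clip", "second clip"]:
    clips = append_clip(clips, piece)
print(clips.split(FORM_FEED))
```

Splitting the clipboard text on the form feed character recovers the
individual clips later.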

If you invoke the Open button and choose a PDF from its sub dialog, the text
of the PDF will be placed in the viewing area, and keyboard focus will go
there. If you invoke the Select button to choose a PDF folder instead of a
file, its list of PDFs will be shown. A status bar at the bottom of the
dialog indicates the current position in the viewing area. 


Toggling between a File and Folder List

The Look button behaves in a special way when the viewing area has focus. If
you press Alt+L when in the viewing area, PDF2TXT will toggle between a
folder and file view. If viewing a folder, PDF2TXT will switch to a view of
the file that was on the line containing the caret. If viewing a file,
PDF2TXT will switch to a view of the folder that contained the file. In
addition, PDF2TXT will automatically search for the name of the file last
viewed and place the caret just after it if found. 
This feature lets you easily explore the PDFs in a folder, one after
another. Initially, you might display a list of files by pressing Alt+L when
the PDF source is a folder. You can then arrow down through the list until
you find a PDF you want to view. At that point, press Alt+L to view the
file. When you want to continue exploring the folder list again, press Alt+L
to return to it at the position of the file you last viewed. 


Configuration Check Boxes

Four check boxes let you configure PDF2TXT. The one labeled "Include
subfolders" looks for PDF files not only in the specified folder, but also
in subfolders under it. For example, you could probably convert many PDF
files on your computer by checking this setting and specifying the c:\ root
folder as the PDF source! This setting is unchecked by default. 
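
The difference between the two settings can be sketched in Python as follows.
This is only an illustration of a flat versus recursive search, not the way
PDF2TXT itself scans folders; the function name find_pdfs is hypothetical.

```python
from pathlib import Path


def find_pdfs(folder: str, include_subfolders: bool = False) -> list[Path]:
    """List .pdf files in folder, optionally descending into subfolders."""
    root = Path(folder)
    # rglob walks the whole tree; glob looks only at the folder itself.
    candidates = root.rglob("*") if include_subfolders else root.glob("*")
    return sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling find_pdfs on a root folder with include_subfolders=True can take a
long time on a large drive, which is one reason to leave the setting off
unless it is needed.
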
The check box labeled "Move PDF when done" will transfer a PDF to a
subfolder called "Done" after a successful conversion. This is a subfolder
of the PDF2TXT program folder, with a default location of c:\pdf2TXT\done.
The benefit of this check box is that PDF files are stored away for backup
after they have been converted to text. This setting is unchecked by
default. 

The checkbox labeled "Replace TXT if found" determines whether to skip a
conversion if a corresponding target file already exists. If you do not
check the setting to move source files when done, you may want to check this
setting so that unnecessary time is not spent on repeatedly converting PDF
files left in the source folder, since they then will be skipped if
corresponding target files already exist. This setting is checked by
default. 
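
Reading the setting as its label suggests, a conversion is skipped only when
the target file already exists and replacement is turned off. A minimal
Python sketch of that decision, offered as an interpretation rather than the
program's actual logic:

```python
def should_convert(target_exists: bool, replace_if_found: bool) -> bool:
    """Decide whether a PDF is converted or skipped."""
    # Skip only when a .txt target is already there and may not be replaced.
    return replace_if_found or not target_exists


for exists in (False, True):
    for replace in (False, True):
        print(exists, replace, should_convert(exists, replace))
```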

The Append check box determines whether a detailed conversion log file is
newly created each time a conversion is run. This setting is checked by
default so that previous information is not lost. A section below further
describes the log file. 
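
In programming terms, the Append setting corresponds to opening the log in
append mode rather than write mode. The Python sketch below illustrates the
idea; the function name and the one-entry-per-line layout are assumptions,
not the real log format.

```python
def write_log(path: str, entries: list[str], append: bool = True) -> None:
    """Write log entries, either adding to the file or starting it over."""
    mode = "a" if append else "w"  # "a" keeps earlier text, "w" discards it
    with open(path, mode, encoding="utf-8") as log:
        for entry in entries:
            log.write(entry + "\n")
```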


Action Buttons

The remaining controls of the main dialog are buttons that perform various
actions. The Convert button is the default: the one that will be activated
by pressing Enter on any control except another button. The viewing area
will show the results of a batch conversion. This information includes the
number of pages in each PDF converted. It also indicates when a conversion
was either not possible or was skipped because the target file already
existed and you chose not to replace files. 
Press Escape if you need to abort a batch conversion of many files that is
taking too long! Note that this program is relatively quick, however,
compared to other available methods of converting PDF files to text.
Moreover, its batch mode feature lets you run conversions unattended. 

The source for a conversion is treated differently if the viewing area has
focus. If viewing a list of PDFs in a folder or on a web page, then PDF2TXT
regards the source as the file name on the current line (the one containing
the caret). Thus, you can cursor to a PDF of interest and press Enter to
convert it to text. If successfully converted, PDF2TXT assumes you may also
want to examine its content in the viewing area, so a Look command is
automatically performed as well (see below). If there is a conversion error,
however, PDF2TXT leaves the error message in the viewing area. If you have
been examining a list of PDFs and decide you want to convert them all rather
than a single file, navigate to the top line of the viewing area that lists
the number of PDFs in the list, and then press Enter. 

If the source edit box already specifies what you want to view, or a path is
easy to type into it, then the Look button is quicker to use than the Open
or Select sub dialog. Activating the Look button takes the current source
specification and goes to a view of either the text of a source file or the
list of a source folder, putting focus in the view area so you can read the
information. 

The Defaults button restores the default configuration settings of PDF2TXT.
You can use it to return to the initial folders and checkbox settings. 

The Explorer button lets you browse the source, target, or done folder with
Windows Explorer. It allows you to examine files that either have been
converted or would not convert--thus needing other approaches to access
their content. 

The Quit button closes PDF2TXT. Alt+F4 does the same thing. 

The Help button displays this complete documentation in the default web
browser. For context-sensitive help on a particular control, press F1 when
it has focus. Hence, you can tab through the dialog and press F1 on each
control to learn how to use it. 


URL Source

If you are connected to the Internet, you can specify a URL as a PDF source
instead of a file or folder on your local computer. The URL can be the
complete download path to a PDF on the Internet. Alternatively, the URL can
be the path to a web page containing one or more links to PDF files. You can
use Internet Explorer to navigate to such a web page and then invoke the
"Grab URL" button to put its URL into the source edit box of PDF2TXT. 
The Look button works with a URL source similarly to a local file or folder.
For example, you can press Alt+L to view a list of PDFs on a web page. The
toggling feature, described above, is also supported, allowing you to
consecutively examine the PDFs linked to a web page. If you view a PDF on
the Internet, PDF2TXT will automatically download a copy to the PDF
subfolder of the program folder, e.g., to
c:\pdf2txt\pdf 

The Convert button also works with a URL source. Thus, you can easily
convert all PDFs on a web page with a single batch operation! 
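
For readers curious how a list of PDFs might be gathered from a web page,
here is a minimal Python sketch using only the standard library. It is not
taken from PDF2TXT; it assumes a PDF link is any anchor whose URL path ends
in .pdf, and the example.com addresses are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of links that appear to point at PDF files."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page address.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_urls.append(absolute)


def pdf_links(html: str, base_url: str) -> list[str]:
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls


page = '<a href="docs/a.pdf">A</a> <a href="page.html">Other</a>'
print(pdf_links(page, "http://example.com/dir/index.html"))
```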


Hot Keys

Almost all controls of PDF2TXT are directly usable with unique, mnemonic Alt
key combinations based on the initial letter of the control's label. Thus,
as you become familiar with the controls, you can operate them more quickly
with hot keys rather than navigating to them with the tab key or mouse. For
example, press Alt+P to go to the edit box for typing a PDF source, or Alt+S
to select a source folder from a tree view of your computer. Press Alt+L to
look at a file or folder, or Alt+V to read what is already in the viewing
area. Press Alt+I to toggle the "Include subfolders" setting, or Alt+D to
restore all defaults. The text extraction settings in the second row of
controls use a letter corresponding to the second syllable or word, i.e.,
Alt+W for the Password edit box and Alt+F for the Image Format checkbox. 

The Log File

The conversion log file is named log.txt and located in the Done subfolder
of the PDF2TXT program folder. It records information about each attempt to
convert a PDF to a TXT file. It indicates whether the conversion succeeded
(meaning any resulting text), and then lists many attributes of the PDF,
including security settings that could explain a failed conversion. 
There is a choice to view the log file in the PDF2TXT program group off the
Start Menu. You can also get to the file via the Explore button of the
PDF2TXT program, choosing the Done folder to navigate with Windows Explorer.
Additionally, you can open the file in another application through its
direct path (default settings):
c:\pdf2txt\done\log.txt 

If the log file grows larger than you want, simply delete it or uncheck the
setting that configures PDF2TXT to append to an existing log file. Each use
of the Convert button would then generate a new log file. This information
is more detailed than the results placed in the viewing area. 


Command Line Operation

The pdf2txt.exe executable may be run with various command line parameters.
The parameters can set values for controls in the main dialog. Parameters
can also cause PDF2TXT to run in an automatic, console mode--without a
dialog box or further user intervention involved. 
When the .pdf extension is associated with the PDF2TXT program (explained in
another section), Windows Explorer or Internet Explorer will open a PDF file
by launching PDF2TXT with the name of the PDF passed as a parameter on the
command line. If PDF2TXT is launched with more than one command line
parameter, however, the program will assume you want to run it in console
rather than GUI mode. The syntax for parameters is described as follows. If
a PDF source file, folder, or URL is specified, it must be the first
parameter. If a TXT target folder is specified, it must be the second
parameter. The source or target must be enclosed in quotes if its name
contains spaces. 

All parameters besides source and target names begin with a space and
forward slash (/), followed by the hot key letter in the dialog
corresponding to the setting affected. A trailing plus (+) sign in the
parameter indicates a status of On, and a minus (-) sign indicates Off. The
plus sign can also be omitted to indicate On. Capitalization does not
matter. Here is a list of parameters: 

a = Automatic, console mode (use /a- to force GUI mode with multiple
parameters)
i = Include subfolders
m = Move PDF when done
r = Replace TXT if found
d = Default settings (no /d- is defined)
g = Grab URL as source from Internet Explorer (no /g- is defined) 
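
The syntax rules above can be summarized in a small Python sketch. It is an
interpretation of this documentation, not the parser inside pdf2txt.exe, and
it does not validate letters or reject undefined forms such as /d-.

```python
def parse_params(params: list[str]) -> dict:
    """Split parameters into source, target, switches, and run mode."""
    names: list[str] = []
    switches: dict[str, bool] = {}
    for param in params:
        if param.startswith("/"):
            body = param[1:].lower()  # capitalization does not matter
            if not body:
                continue
            # A trailing minus means Off; a plus or no sign means On.
            switches[body[0]] = not body.endswith("-")
        else:
            names.append(param)
    # More than one parameter implies console mode unless /a says otherwise.
    console = switches.get("a", len(params) > 1)
    return {
        "source": names[0] if names else None,
        "target": names[1] if len(names) > 1 else None,
        "switches": switches,
        "console": console,
    }


print(parse_params([r"c:\temp files", ".", "/a-"]))
```

Quoting of names that contain spaces is handled by the command shell before
the program sees its parameters, so the sketch receives each name as a
single item.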

For example, to convert all files using default settings except for the Move
setting, you could enter:
pdf2txt /d /m 

To use current settings except grab a URL as source, enter:
pdf2txt /a /g 

To convert files from a temporary folder to the current folder, enter:
pdf2txt "c:\temp files" . 

To do the same, but in GUI rather than console mode, enter:
pdf2txt "c:\temp files" . /a- 

For greater console mode convenience, another version of PDF2TXT, having the
abbreviated name p2t.exe, is also available in the program folder. This
version only runs in console mode, whether zero, one, or more parameters are
specified. It uses "standard output" to display conversion results. The
shorter executable name means fewer characters to type on the command line.
For example, to run a batch conversion in console mode using the current
settings of PDF2TXT, you could simply enter
p2t 

Like DOS commands generally, the above assumes that you have either made
c:\pdf2txt the current directory or included it in a PATH statement. 
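
If you want to launch a conversion from a script rather than by typing, the
parameters can be assembled as a list. The Python sketch below only builds
the list; the function name build_command is hypothetical, and the switch
letters are the ones documented above.

```python
def build_command(source=None, target=None, program="pdf2txt", **switches):
    """Assemble a parameter list: program, source, target, then switches."""
    if target is not None and source is None:
        raise ValueError("a target folder requires a source as first parameter")
    args = [program]
    if source is not None:
        args.append(source)
    if target is not None:
        args.append(target)
    for letter, on in switches.items():
        # On is written without a sign; Off gets a trailing minus.
        args.append("/" + letter + ("" if on else "-"))
    return args


print(build_command(r"c:\temp files", ".", a=False))
```

The resulting list can be passed to Python's subprocess.run, which takes
care of quoting names that contain spaces on Windows. As with the typed
commands, this assumes the program folder is the current directory or is
included in the PATH.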


File Association

The PDF2TXT group on the Start Menu contains shortcuts for changing what
program automatically opens a file with a .pdf extension in Windows
Explorer. If you decide that you like the interface of PDF2TXT enough to
make it the default program for PDF files, you can set the file association
accordingly. Later, if you decide you want to return to the conventional
association, you can do that, too. 
When the .pdf extension is associated with PDF2TXT, an application such as
Windows Explorer when opening a file, or Internet Explorer after downloading
a file, will pass the name of the PDF as a command-line parameter to
pdf2txt.exe. When the program is launched in this way, it automatically
invokes the Look button, placing text of the PDF in the viewing area and
putting keyboard focus there. 


Development Notes

I welcome comments and suggestions on PDF to TXT. For the technically
curious, I developed it with the PowerBASIC programming language from
http://PowerBASIC.com
and a couple of third party libraries: EZGUI from
http://EZGUI.com
and QuickPDF from
http://QuickPDF.com 
An alternate text extraction technique is tried if the first one fails,
using the GetText.exe utility that is also available separately at
http://www.kryltech.com
The file GetText.txt in the PDF2TXT program folder contains the license for
this utility. 

The OCR is done by incorporating the open source PDF2OCR package, available
at
http://EmpowermentZone.com/pdf2ocr.zip 

Some status messages are spoken with the JAWS, System Access, or Window-Eyes
screen reader if currently active. These direct speech messages are produced
with APIs via a component of the SayTools library, which is also available
separately at
http://EmpowermentZone.com/saysetup.exe 

The PowerBASIC code to PDF2TXT, itself (but not commercial libraries used),
is open source under the Lesser General Public License (LGPL), documented at
http://gnu.org 

This Windows program is the successor to my first version of PDF2TXT,
developed several years ago as a DOS-based, command-line only utility. Ideas
and feedback from the discussion list
ProgrammingBlind at FreeLists.org
have aided the design and testing of PDF2TXT. The latest version is
available at the same address,
http://EmpowermentZone.com/p2tsetup.exe 

You can download it with the Elevate Version hotkey, F11. This checks
whether a newer version is available, and offers to install it. 

Jamal Mazrui
jamal at EmpowermentZone.com  

-----Original Message-----
From: nabs-l-bounces at nfbnet.org [mailto:nabs-l-bounces at nfbnet.org] On Behalf
Of Rania
Sent: Sunday, January 24, 2010 7:21 AM
To: National Association of Blind Students mailing list
Subject: Re: [nabs-l] question about Jaws and PDFs

I have never heard about this!
Can you give me more information?
Rania,
"For everyone who thought I couldn't do it.
For everyone who thought I shouldn't do it.
For everyone who said, 'It's impossible.'
See you at the finish line."
~Christopher Reeve

----- Original Message -----
From: "Darrell Shandrow" <darrell.shandrow at gmail.com>
To: "'National Association of Blind Students mailing list'" 
<nabs-l at nfbnet.org>
Sent: Sunday, January 24, 2010 6:31 AM
Subject: Re: [nabs-l] question about Jaws and PDFs


> Hello Rachel,
>
> In order to help you in the most effective way possible, let me start by
> asking you some questions. Don't worry if you can't answer them all. Just
> do your best and I'll guide you to the rest of the needed answers.
>
> What version of JAWS are you running? Do you have Adobe Reader on your
> computer? If so, which version? Do you have any scanning and reading
> products like Kurzweil K1000 or OpenBook installed on your computer? If
> so, what versions?
>
> There are a number of ways to read PDF documents. Some PDFs are fully
> accessible, many can be read with some difficulty and far too many remain
> completely out of our reach without a significant amount of expensive
> assistive technology.
>
> There is one free solution that can read many PDF documents. It is called
> PDF2TXT, and it has been developed by a blind computer programmer. Visit
> http://www.empowermentzone.com/p2tsetup.exe to install the program.
>
> Regards,
>
> Darrell
>
>
>
> _______________________________________________
> nabs-l mailing list
> nabs-l at nfbnet.org
> http://www.nfbnet.org/mailman/listinfo/nabs-l_nfbnet.org
> To unsubscribe, change your list options or get your account info for
> nabs-l:
> http://www.nfbnet.org/mailman/options/nabs-l_nfbnet.org/raniaismail04%40gmail.com

