[blindLaw] accessible solution for splitting large pdfs
tim at timeldermusic.com
Tue Feb 7 21:47:38 UTC 2023
Sai,
I'm interested in your FOIA access case. Please keep me posted. Also consider ABBYY FineReader Pro (perpetual license).
-----Original Message-----
From: Sai <sai at fiatfiendum.org>
Sent: Tuesday, February 7, 2023 8:22 AM
To: JJ Johnston <jeffjayjohnston at gmail.com>
Cc: Blind Law Mailing List <blindlaw at nfbnet.org>
Subject: Re: [blindLaw] accessible solution for splitting large pdfs
(copying to list since others may be interested and it took me two hours to write)
I tested this using Acrobat Pro 2017 on Windows 11 both sighted and with NVDA (I don't have JAWS).
1. Acrobat Pro's accessibility
> I was told that it is inaccessible to JAWS; is this true?
I can't give a very reliable answer on this.
I'm only in blind mode when I'm not at home. I only use Windows at home on my desktop, sighted (because I can control lighting conditions); I have NVDA installed but use it only very rarely, so I am not proficient. I've never used JAWS.
If I'm out of my home, and need to use a computer, I use my Macbook Air with VoiceOver or Android phone using TalkBack, and am proficient in those.
On Windows, you get to export by opening the file, pressing alt f for the file menu, arrowing down until you reach the 'export to' submenu, pressing right, pressing right again on Word (it's the first option in the submenu), selecting the version you want, and pressing enter; after that it's the standard Windows save file dialog. I don't know how to trigger the menu in NVDA except using the standard (non-screenreader) alt commands, but otherwise it seems fairly normal.
There are parts of Acrobat Pro that are just weirdly designed GUI in general (very much like Microsoft Word 16), which I don't like regardless of whether I'm operating in sighted or blind mode, but I didn't notice anything that would specifically be worse in a screen reader.
My Macbook is stalling on system updates at the moment, so I haven't tested this in VoiceOver just now. But from memory, it was unremarkable — basically the same interface as any other app, and no particular issues using it.
However, I haven't tried using Acrobat extensively in blind mode — I've only needed to read things and edit notes (in text or Google Docs) or the like, not to do more technical things like this.
2. Export from PDF
> If I understand correctly, Adobe Pro has a feature to export to Word. This would be ideal for me: I'd rather work with a .docx than a PDF. My question: is this exporting an exact copy of text and formatting, or is it merely an OCR?
This was news to me, but turns out Acrobat Pro can indeed convert to Word (DOCX).
It also can export to Word 97-2003 (DOC), "accessible" text (whatever that means), plain text, rich text format (RTF), and (in theory) Excel spreadsheet (but that is effectively unusable).
I've attached the 2022 ACB Convention program in PDF (the version I downloaded) and versions I just exported from Acrobat Pro for you to compare: Word, Word 97-2003, accessible text, and RTF.
At a quick check, the Word export is lossy. Some of the images are corrupted — e.g. the Chase ad on page 5 erased parts at the bottom of the image of a braille display, probably because the OCR interpreted it as text and tried to remove "background" that was actually photo. The whole ads are images in the original, so there's the usual OCR lossiness — e.g. that same ad has a headline "Commitment to access and inclusion", which OCR interpreted as "CoII1II1itrnent to access and inclusion" in the Word version. The alt text of the first image (a river-spanning ridge) is gone, whereas the PDF had alt text. Same thing with the page 17 Microsoft ad, which has text in an inset box within the image — the inset is removed, there's erratic change to white background, the font is different (and inconsistently so), and alt text gone. Etc.
It seems OK with the parts that were text with very basic formatting in the PDF, so I believe this is mostly due to the usual problems with OCR, combined with Word not really being a layout / graphic design format where PDF is, differences in the fonts available to OCR on my system vs used in the graphic design, and unexplainable removal of accessibility metadata.
So in short, no, it is NOT an exact copy. If you intend to reuse this for export for sighted people, they won't like it, and it will often be impossible for you to tell even where things are broken if you're operating blind. If you're operating sighted, and have very high proficiency, you might be able to manually patch that up to match the original, with a lot of work.
There is nothing that can give you a non-lossy export from PDF. PDF is fundamentally designed as a layout and print/display design format, not a word or data processing format.
If you have any choice, you should only ever treat it as a final format that things go to and not return from. But if you need it for use in software or a braille display, and you don't care that visual things like formatting and images get broken, it is a workable option.
As for spreadsheet exports: they're garbage and I can't recommend using them unless you are operating sighted, only have it in PDF format, and want something marginally better than copy-and-paste to work from to recreate the spreadsheet.
My actual experience with this was when I tried converting a rasterized PDF spreadsheet I got via FOIA, and the result was completely useless.
In fairness to Acrobat, that one was almost total garbage in the government's PDF version too — they exported a large table into multiple pages (both rows and columns didn't fit), and then rasterized it (converted to image, removing all text and metadata), and removed all info about even document boundaries.
As a side note, the government's refusal to produce accessible documents in FOIA is a disputed part of an ongoing case. One decision went against me on this point (because I only told them in the FOIA that I wanted it in accessible format, but didn't say I was blind); see first part of the "analysis" section in Sai v TSA, 315 F. Supp. 3d 218, 233–35 (D.D.C. 2018),
https://scholar.google.com/scholar_case?case=16239104146207287839#p233 .
There is still a pending question about electronic/native format copies in general, which could effectively trump that loss and which seems strongly inclined in my favor (except as to TSA's practice of merging a bunch of documents into one, which went against me); see part A of analysis section in Sai v TSA, 466 F. Supp. 3d 35, 44–51 (D.D.C. 2020)
https://scholar.google.com/scholar_case?case=2576139784660925888#p44 . If these issues interest you legally and you'd like to know more or help out, please get in touch; I am represented by Sidley Austin, but they are not specialized in accessibility issues.
As a test, I also just tried converting a non-rasterized, native electronic PDF spreadsheet that I made myself for filing in that FOIA case. I created it in Google Spreadsheets. Because CM/ECF only accepts PDFs, I exported it as two PDF pages in very very small font (but digital, so it can be zoomed as much as you want), with headers repeated on each page, and sent an Excel copy directly to chambers and opposing counsel. The PDF version I created is, I believe, about as accessible and well formatted as possible for a spreadsheet to be in PDF, and therefore should be the best plausible scenario for re-export to Excel.
Unfortunately, the PDF to Excel export is nearly unusable. It does have all the text, and at least a couple rows, but it lost the column boundaries for most of the rows, and completely failed to deal with the table being split into two pages in the PDF.
I've attached the Excel file I created (as exported by Google Spreadsheets), labeled "original"; the PDF from Spreadsheets that I actually filed; and the Excel I just re-exported from that PDF in Acrobat Pro, labeled "re-export from Acrobat".
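If you can't eyeball a re-exported sheet, one rough way to find rows that lost their column boundaries is to save it as CSV and count the filled cells in each row. The Python sketch below assumes that: it is not something Acrobat or Excel provides, the function name and sample data are made up for illustration, and rows that legitimately have empty cells will show up as false alarms.

```python
import csv
import io


def rows_with_missing_cells(csv_text):
    """Return (row_number, filled_cells) for rows with fewer filled cells than the header.

    Assumption: the first row is a complete header, so its count of
    non-empty cells is the expected width for every data row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    expected = sum(1 for cell in header if cell.strip())
    suspects = []
    # Row numbers are 1-based and count the header as row 1.
    for row_number, row in enumerate(reader, start=2):
        filled = sum(1 for cell in row if cell.strip())
        if filled < expected:
            suspects.append((row_number, filled))
    return suspects


if __name__ == "__main__":
    # Made-up sample: row 3 has collapsed into a single cell.
    sample = "name,qty,price\nwidget,2,3.50\nwidget 2 3.50,,\ngadget,1,9.00\n"
    for row_number, filled in rows_with_missing_cells(sample):
        print(f"row {row_number}: only {filled} filled cell(s)")
```

This only tells you which rows to look at; it can't put the columns back.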
3. Split PDF
This is a bit hidden but very straightforward once you find it. Under tools, organize pages, there's a "split" command.
That gives you the option to split by "number of pages", "file size", or "top level bookmarks" (if it has any).
The ACB 2022 program PDF did not have bookmarks. So I've attached an example of "split by pages", 50 pages per.
The result is 4 files, with "_part1" etc at the end of the file name.
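If you'd rather script the same "split by number of pages" idea, here is a minimal Python sketch that works out the page ranges and builds matching pdftk command lines (pdftk is the free command line tool mentioned in the earlier message quoted below). The function names, the file name "program.pdf", and the 180-page count are all made up for illustration, and the "_partN" naming just mimics what Acrobat produced. It only prints the commands and doesn't run anything, so treat it as a starting point, not a tested workflow.

```python
import shlex


def chunk_ranges(total_pages, pages_per_file):
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 1 or pages_per_file < 1:
        raise ValueError("page counts must be positive")
    return [
        (first, min(first + pages_per_file - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_file)
    ]


def pdftk_split_commands(pdf_name, total_pages, pages_per_file):
    """Build one 'pdftk IN cat FIRST-LAST output OUT' command per chunk."""
    # Hypothetical naming scheme: strip ".pdf" and append "_partN.pdf".
    stem = pdf_name[:-4] if pdf_name.lower().endswith(".pdf") else pdf_name
    commands = []
    ranges = chunk_ranges(total_pages, pages_per_file)
    for part, (first, last) in enumerate(ranges, start=1):
        out_name = f"{stem}_part{part}.pdf"
        args = ["pdftk", pdf_name, "cat", f"{first}-{last}", "output", out_name]
        # Quote each argument so file names with spaces survive the shell.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


if __name__ == "__main__":
    # Assumed example: a 180-page file split into 50-page parts.
    for command in pdftk_split_commands("program.pdf", 180, 50):
        print(command)
```

To pull out just the pages you need, the same "cat" operation takes any page range, so a single command with the range you want (say 120-135) does the job.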
4. Columns to plain
> I'd frequently prefer to convert multi-columned pages to single columns. Will Adobe Pro do this--or does your answer depend whether the PDF is a scanned image or editable text?
Technically yes, practically no.
If it's editable text in Acrobat (either natively or via OCR), in Pro, you can edit text boxes. So you could edit the second column, cut all the text, edit the first column, go to the end, paste it, and resize the box so it fits within the page. That may not be possible if the lines are short, and it is not possible to reflow lines across different pages in PDF, because it's a page-based format. This is a major pain to do even sighted, and I think trying to do it blind would be hair-pullingly bad.
If it isn't editable text, you can cut the image of the second column and paste it below the first, resize both to fit, and OCR, but I expect this would be an even worse pain to do and have even worse than usual OCR output.
So that's the technical yes. It is in theory possible. If you really really had to keep the other formatting, you can, sorta. But pragmatically, no.
Save your sanity and don't do this.
If at all possible get it in non column format to start.
If not possible, and you don't care about format, and if you're lucky because the PDF metadata is structured well, then export to text or RTF might do this automatically (because they don't have columns at all, so it's forced to be serialized). This is definitely your best option if it works.
If you're not lucky, it'll do a whole line at a time (so in order it'll go column 1 line 1, column 2 line 1, column 1 line 2, etc). That's effectively unusable without a lot of editing afterwards.
In that case, it would be easier in my opinion to copy and paste the text (page by page, column by column) into a new document. That would still be extremely annoying and tedious to do, but not nearly as bad as actually editing the PDF.
I have in fact done this (or equivalent) a few times, when I needed to be able to edit or reflow the content, or just have a more usable arrangement to read through when in blind mode, and spending a few hours on this was worth the result. I can't recommend it if you have any better options, but it does work.
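For the unlucky case described above, where the text export goes a line at a time across the columns, a script can sometimes undo the interleaving. The Python sketch below makes strong assumptions: exactly two columns, every exported line holds one column's fragment, the lines alternate strictly starting with column 1, and you feed it one page at a time. Real exports often break these (blank lines, full-width headings, columns of uneven length), so check the result. The function names are made up for illustration.

```python
def deinterleave_two_columns(page_lines):
    """Reorder strictly alternating fragments into all of column 1, then all of column 2."""
    column_one = page_lines[0::2]  # fragments at positions 0, 2, 4, ...
    column_two = page_lines[1::2]  # fragments at positions 1, 3, 5, ...
    return column_one + column_two


def deinterleave_pages(pages):
    """Apply the reordering page by page, since columns restart on each page."""
    result = []
    for page_lines in pages:
        result.extend(deinterleave_two_columns(page_lines))
    return result


if __name__ == "__main__":
    # Made-up page: column 1 has three lines, column 2 has two.
    page = ["col1 line1", "col2 line1", "col1 line2", "col2 line2", "col1 line3"]
    for line in deinterleave_two_columns(page):
        print(line)
```

If the two columns have very different lengths, the strict alternation breaks down partway through the page, and copy and paste will be more reliable.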
I hope that helps. The attached files should give you a reasonably representative sample of Acrobat Pro's output when converting and splitting files.
Sincerely,
Sai
President, Fiat Fiendum, Inc., a 501(c)(3)
On Mon, Feb 6, 2023 at 11:33 PM JJ Johnston <jeffjayjohnston at gmail.com>
wrote:
> Hello Sai,
>
> This was interesting info about Adobe Pro. I was told that it is
> inaccessible to JAWS; is this true?
>
> If I understand correctly, Adobe Pro has a feature to export to Word.
> This would be ideal for me: I'd rather work with a .docx than a PDF.
> My question: is this exporting an exact copy of text and formatting, or
> is it merely an OCR?
>
> Finally, I'd frequently prefer to convert multi-columned pages to
> single columns. Will Adobe Pro do this--or does your answer depend
> whether the PDF is a scanned image or editable text?
>
> Thanks for your info. I know nothing about this software and Googling
> wasn't answering my questions.
>
> Appreciatively,
> Jay
>
> -----Original Message-----
> From: BlindLaw <blindlaw-bounces at nfbnet.org> On Behalf Of Sai via
> BlindLaw
> Sent: Saturday, August 20, 2022 2:17 AM
> To: Blind Law Mailing List <blindlaw at nfbnet.org>
> Cc: Sai <sai at fiatfiendum.org>
> Subject: Re: [blindLaw] accessible solution for splitting large pdfs
>
> 1. Acrobat Pro can do this easily.
>
> It's $60 via TechSoup if you have (or work for) a US non-profit:
> https://www.techsoup.org/adobe (There may be similar deals for non-US
> nonprofits, but I don't know.)
>
> Just be sure to get actual Acrobat Pro (current version is 2020), not
> the new "Creative Cloud" or "DC" which require a yearly subscription
> and don't work properly when offline.
>
> It's $538 for normal license:
> https://helpx.adobe.com/download-install/kb/acrobat-2020-downloads.html
> &
> https://commerce.adobe.com/checkout/email/?items%5B0%5D%5Bid%5D=58675001ACEBE288DBDA18D701134F56&cli=adobe_com&co=US&lang=en
>
>
> 2. I believe OSX Preview (which comes with OSX) can do basic
> operations like splitting PDFs. Just select a set of pages and export
> those to a new PDF.
>
>
> 3. If you're comfortable using Unix there are several totally free
> command line tools with similar functionality. Obviously they don't
> have fancy GUI, but then, do you really care about a graphical interface?
>
> For example, pdftk can split, merge, etc. There are several tools that
> can do more advanced stuff that Acrobat itself won't do, like
> pdfresurrect (unpacks hidden previous revisions in a PDF), pdfcrack
> (cracks password protected PDFs), origami (extract, modify, etc PDF contents), etc.
>
> E.g. origami is a very flexible PDF manipulation library:
> https://github.com/gdelugre/origami (which has a GTK based GUI
> available, https://rubygems.org/gems/pdfwalker ), but requires you to
> know (or learn) the programming language Ruby.
>
> You can install Ubuntu in Windows 10 & 11 via WSL, on OSX using
> BootCamp, VMware, VirtualBox, or similar, or as your primary OS using
> an installation DVD or USB drive ( https://ubuntu.org has instructions).
>
>
> 4. pdftk is also available for Windows & OSX, with both command line
> and GUI options: free for the full command line version & basic GUI
> version, $4 for full GUI version:
>
> https://www.pdflabs.com/tools/pdftk-server/
> https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
>
> I've only used the Unix CLI version so I can't comment on the GUI
> version, but I can confirm that the pdftk CLI is quite useful (even
> though I also have Acrobat Pro).
>
> Sincerely,
> Sai
> President, Fiat Fiendum, Inc., a 501(c)(3)
>
> Sent from my mobile phone; please excuse the concision and autocorrect
> errors.
>
> On Sat, 20 Aug 2022, 06:12 Justin Harford via BlindLaw, <
> blindlaw at nfbnet.org>
> wrote:
>
> > Hello
> >
> > PDF split and merge is an app for iOS which might do the trick. I
> > just took a look at a file that had about 250 pages and it looks
> > like you can split it in equal intervals among other options.
> >
> > It's not free, but not very expensive either.
> >
> > Justin Harford
> > Oregon Bell Academy Coordinator
> >
> >
> > > On Aug 19, 2022, at 9:55 PM, Rahul Bajaj via BlindLaw <
> > blindlaw at nfbnet.org> wrote:
> > >
> > > Hi all,
> > >
> > > As a practicing attorney, I often have to deal with very bulky files
> > > [300+ pages] in my work. JAWS tends to freeze when such a large file is
> > > opened in Adobe. One workaround that I have found is to split the file,
> > > such that I can extract the relevant pages from the bulky file and read
> > > them as a separate PDF. Does anyone know of any good, preferably free,
> > > solutions that do this?
> > >
> > > I'd basically just have to key in the page numbers that I would want
> > > made into a separate PDF.
> > >
> > > Warmly,
> > > Rahul
> > >
> > > --
> > > --
> > > Rahul Bajaj
> > > Attorney, Ira Law
> > > Senior Associate Fellow, Vidhi Centre for Legal Policy
> > > Rhodes Scholar (India and Linacre 2018), University of Oxford
> > > Co-Founder, Mission Accessibility
> > > Special Correspondent on the rights of persons with disabilities, Oxford Human Rights Hub
> > > Coordinator of the working group on accessibility, e-Committee, Supreme Court of India
> > > _______________________________________________
> > > BlindLaw mailing list
> > > BlindLaw at nfbnet.org
> > > http://nfbnet.org/mailman/listinfo/blindlaw_nfbnet.org
> > > To unsubscribe, change your list options or get your account info
> > > for
> > BlindLaw:
> > >
> > http://nfbnet.org/mailman/options/blindlaw_nfbnet.org/blindstein%40g
> > ma
> > il.com
> >