Using PDFTOTEXT to convert a batch of PDFs to text and splitting them by page

I can’t believe how hard it was to find this (also, I know basically nothing about bash scripting), so maybe the next person who Googles this will find this post and save themselves a few minutes:

(replace '999' with the number of pages in your longest document)

for f in *.PDF;
do
    for i in {1..999};
    do
        pdftotext -f "$i" -l "$i" -layout "$f" "${f%.PDF}_$i.txt";
    done;
done

Or:
for f in *.PDF; do for i in {1..999}; do pdftotext -f "$i" -l "$i" -layout "$f" "${f%.PDF}_$i.txt"; done; done

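The script above runs pdftotext on every .PDF file in the current directory and writes each page to its own text file, named original_file_name_pagenumber.txt. The -f and -l options set the first and last page to convert, so giving both the same number extracts a single page. The glob is case-sensitive: if your files end in lowercase .pdf, use *.pdf and ${f%.pdf} instead.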
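If you'd rather not hard-code 999, here is a rough sketch that asks pdfinfo (which ships alongside pdftotext in poppler/xpdf) for each file's page count first. It assumes pdfinfo prints a "Pages:" line, and the function name split_pdf_pages is just something I made up for illustration, so treat it as a starting point rather than a tested recipe:

```shell
# Split one PDF into per-page text files (split_pdf_pages is a made-up name).
split_pdf_pages() {
    pdf="$1"
    base="${pdf%.*}"    # file name without its extension

    # Assumption: pdfinfo prints a line such as "Pages:          12"
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    if [ -z "$pages" ]; then
        echo "could not read page count for $pdf" >&2
        return 1
    fi

    page=1
    while [ "$page" -le "$pages" ]; do
        # Same -f and -l value means "convert only this page"
        pdftotext -f "$page" -l "$page" -layout "$pdf" "${base}_${page}.txt"
        page=$((page + 1))
    done
}

for f in *.PDF; do
    [ -e "$f" ] || continue    # skip the literal pattern when nothing matches
    split_pdf_pages "$f"
done
```

This way each document only loops over the pages it actually has, so a 3-page PDF gets exactly three pdftotext calls instead of 999.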

Follow me on Twitter: @dancow. If you're interested in learning programming, check out my online-book-in-progress, The Bastards Book of Ruby.


One Response to Using PDFTOTEXT to convert a batch of PDFs to text and splitting them by page

  1. for f in *.tif; do tesseract "$f" "${f%.tif}.txt"; done;
