How to OCR and merge PDF documents using free command line utilities on Windows

The procedure described here should also work on Linux and Mac, since the same utilities are available on those platforms, but I have only tested it on Windows 10. All commands are run in the standard Windows 10 command prompt.

Tesseract is the primary utility used for OCR. However, Tesseract does not accept a PDF file as input, so we have to follow a convoluted process: convert the PDF to one PNG per page, run Tesseract on each PNG to produce an OCR'ed PDF of that page, and finally merge the individual PDF pages into a single file.

Download the utilities

  1. PDFtoPNG - https://dl.xpdfreader.com/xpdf-tools-win-4.02.zip
  2. Tesseract - https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.1.0.20190314.exe
  3. QPDF - https://sourceforge.net/projects/qpdf/files/qpdf/10.0.1/qpdf-10.0.1-bin-msvc32.zip/download

Make sure all of these are in the same folder along with the input PDF file. Alternatively, add all of them to your PATH environment variable. Then open a new command prompt window.
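
Before starting, it can help to confirm that the three executables are actually reachable. The following is a minimal Python sketch, not part of the original procedure; it assumes the executables are named pdftopng, tesseract and qpdf, and the helper name missing_tools is made up for illustration.

```python
import shutil

# Executable names assumed from the three downloads listed above.
REQUIRED_TOOLS = ["pdftopng", "tesseract", "qpdf"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that shutil.which cannot locate."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All tools found.")
```

If anything is reported as not found, fix the folder layout or PATH and open a new command prompt before continuing.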

Step 1: Convert PDF file to PNG

pdftopng input.pdf intermediate-file
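
pdftopng writes one PNG per page, named from the given root plus a zero-padded page number (for example intermediate-file-000001.png). Its -r option sets the rendering resolution in DPI; a higher value than the default often helps OCR accuracy, although the 300 used below is my assumption rather than something tested for this article. A small Python sketch of the same call, with hypothetical helper names:

```python
import subprocess


def pdftopng_command(pdf_path, root, dpi=300):
    """Build the pdftopng call; -r sets the rendering resolution in DPI."""
    return ["pdftopng", "-r", str(dpi), pdf_path, root]


def convert_pdf_to_png(pdf_path="input.pdf", root="intermediate-file", dpi=300):
    """Run pdftopng and raise CalledProcessError if it exits with an error."""
    subprocess.run(pdftopng_command(pdf_path, root, dpi), check=True)
```

Calling convert_pdf_to_png() with no arguments corresponds to the command above with -r 300 added.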

Step 2: OCR the PNG images of the PDF pages to create an OCR'ed version of each page

FORFILES /S /M *.png /C "cmd /c tesseract @fname.png @fname pdf"
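
For Linux or Mac, where FORFILES is not available, the same loop could be sketched in Python. This is an illustration under assumptions, not a tested replacement: the function names are hypothetical, and unlike FORFILES /S it only looks at one folder and does not descend into subfolders.

```python
import subprocess
from pathlib import Path


def tesseract_commands(folder="."):
    """Build one tesseract call per PNG in `folder`, in sorted (page) order."""
    commands = []
    for png in sorted(Path(folder).glob("*.png")):
        # The output base has no extension; the trailing "pdf" argument
        # makes tesseract write <base>.pdf with a searchable text layer.
        output_base = png.with_suffix("")
        commands.append(["tesseract", str(png), str(output_base), "pdf"])
    return commands


def ocr_all(folder="."):
    """Run tesseract on every PNG, stopping at the first failure."""
    for command in tesseract_commands(folder):
        subprocess.run(command, check=True)
```

Running ocr_all() in the working folder leaves one PDF next to each PNG, with the same base name.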

Step 3: Merge PDF files together

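Ensure all the per-page PDF files from Step 2 are in the same directory and run the following command. Because *.pdf matches every PDF in that directory, move the original input.pdf somewhere else first, otherwise its pages end up in the merged file as well.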

qpdf --empty --pages *.pdf -- out.pdf
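
An alternative to the wildcard is to pass qpdf an explicit list of page files. The Python sketch below does that under the assumption that the per-page PDFs are named intermediate-file-NNNNNN.pdf, as produced by the earlier steps; the function name and defaults are hypothetical. Since the page numbers are zero-padded, a plain sort keeps the pages in order.

```python
import subprocess
from pathlib import Path


def qpdf_merge_command(folder=".", root="intermediate-file", output="out.pdf"):
    """Build a qpdf call that merges only <root>-*.pdf, in sorted order."""
    pages = sorted(str(p) for p in Path(folder).glob(f"{root}-*.pdf"))
    if not pages:
        raise FileNotFoundError(f"no {root}-*.pdf files in {folder}")
    # --empty starts from a blank document; everything between --pages
    # and the closing -- is an input file whose pages are appended.
    return ["qpdf", "--empty", "--pages", *pages, "--", output]


def merge_pages(folder=".", root="intermediate-file", output="out.pdf"):
    """Run the merge and raise CalledProcessError if qpdf fails."""
    subprocess.run(qpdf_merge_command(folder, root, output), check=True)
```

Because only files matching the root are listed, input.pdf and any earlier out.pdf in the same folder are left out of the merge.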
