This procedure should also work on Linux and Mac, since the same utilities are available on those platforms, but I have only tested it on Windows 10. All commands are run in the standard Windows 10 command prompt.
Tesseract is the primary utility used for OCR. However, Tesseract does not accept a PDF file as input, so we have to follow a convoluted process: convert the PDF to one PNG per page, run Tesseract on each PNG to produce an OCR'ed PDF of that page, and finally merge the individual PDF pages into a single file.
Download the utilities
- PDFtoPNG - https://dl.xpdfreader.com/xpdf-tools-win-4.02.zip
- Tesseract - https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v126.96.36.19990314.exe
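Make sure the executables are in the same folder as the input PDF file. Alternatively, add their folders to your PATH environment variable, then open a new command prompt window so the change takes effect. Step 3 also uses qpdf, which is not in the list above and has to be available in the same way.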
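Before starting, it can help to confirm that the command prompt can actually find each tool. The following is a minimal sketch in Python, not part of the original procedure; the executable names are assumed from the commands used in the steps, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executable names assumed from the commands used in Steps 1 to 3.
REQUIRED_TOOLS = ["pdftopng", "tesseract", "qpdf"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that shutil.which cannot locate."""
    return [name for name in tools if shutil.which(name) is None]
```

Calling `missing_tools()` returns an empty list when everything is reachable; any name it returns still needs to be installed or added to PATH.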
Step 1: Convert PDF file to PNG
pdftopng input.pdf intermediate-file
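pdftopng writes one file per page, named after the root you pass with a six-digit page number appended (for example intermediate-file-000001.png). The sketch below, in Python, builds the same command and predicts those file names; the function names are hypothetical and the file names `input.pdf` and `intermediate-file` are simply the ones used in this step.

```python
import subprocess


def pdftopng_command(pdf="input.pdf", root="intermediate-file"):
    """Build the Step 1 command as an argument list."""
    return ["pdftopng", pdf, root]


def expected_png_name(root, page):
    """Name pdftopng gives the PNG for a 1-based page number."""
    return f"{root}-{page:06d}.png"


def convert(pdf="input.pdf", root="intermediate-file"):
    """Run pdftopng; raises CalledProcessError if it exits with an error."""
    subprocess.run(pdftopng_command(pdf, root), check=True)
```

The zero-padded numbering matters later: sorting the file names alphabetically also sorts them into page order.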
Step 2: OCR the PNG image of each page to create an OCR'ed PDF version of that page
FORFILES /S /M *.png /C "cmd /c tesseract @fname.png @fname pdf"
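FORFILES is specific to Windows. On Linux or Mac, or if you prefer a script, the same loop might be written in Python roughly as below. This is a sketch under assumptions: the function names are hypothetical, and unlike FORFILES /S it only looks in one folder and does not descend into subfolders.

```python
import subprocess
from pathlib import Path


def tesseract_command(png_path):
    """Build `tesseract <page>.png <page> pdf` for a single page image."""
    png = Path(png_path)
    # The output base has no extension; tesseract appends .pdf itself.
    return ["tesseract", str(png), str(png.with_suffix("")), "pdf"]


def ocr_pages(folder="."):
    """Run tesseract on every PNG in `folder`, in file-name order."""
    for png in sorted(Path(folder).glob("*.png")):
        subprocess.run(tesseract_command(png), check=True)
```

Calling `ocr_pages()` in the working folder leaves one PDF next to each PNG, with the same base name.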
Step 3: Merge PDF files together
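Ensure all the per-page PDF files from Step 2 are in the same directory and run the following command. A plain *.pdf pattern would also match input.pdf if it is still in that directory, so the pattern below matches only the intermediate files.
qpdf --empty --pages intermediate-file-*.pdf -- out.pdf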
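If you would rather not depend on wildcard expansion, the page files can be listed explicitly. The Python sketch below collects only the per-page PDFs, so the original input.pdf or an earlier out.pdf in the same folder is not merged in, and passes them to qpdf in page order. The function names and the `intermediate-file` prefix are assumptions carried over from Step 1.

```python
import subprocess
from pathlib import Path


def find_page_pdfs(folder=".", prefix="intermediate-file"):
    """Per-page PDFs only, sorted so the zero-padded numbers give page order."""
    return sorted(Path(folder).glob(f"{prefix}-*.pdf"))


def merge_command(page_pdfs, output="out.pdf"):
    """Build `qpdf --empty --pages <page files...> -- <output>`."""
    return ["qpdf", "--empty", "--pages", *[str(p) for p in page_pdfs], "--", output]


def merge(folder=".", output="out.pdf"):
    """Merge the page PDFs found in `folder` into a single output file."""
    subprocess.run(merge_command(find_page_pdfs(folder), output), check=True)
```

Calling `merge()` after Step 2 should produce out.pdf containing the OCR'ed pages in their original order.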