How to Extract Text from Scanned PDF File using Tesseract OCR

Photo from Unsplash

Originally Posted On: https://obikastanya.medium.com/how-to-extract-text-from-scanned-pdf-file-using-tesseract-ocr-88ac107c8732

Extracting text from PDF files can be challenging at times, especially when dealing with scanned documents that contain low-quality images.

Why Tesseract?

Tesseract has been around for a long time; it’s a stable and very reputable OCR engine. It was originally developed at Hewlett-Packard, released as open-source software in 2005, and later developed further with Google’s sponsorship, meaning it’s free to use.

I’ve been using it on several projects, and it works very well.

There are alternatives such as EasyOCR, but I haven’t used them, so I can’t offer a review or comparison. Maybe I’ll try one later and write about it in another article.

How to Extract Text from PDF using Tesseract OCR

This is what the extraction process looks like:

*OCR Process (Image generated by Gemini AI)*

Prerequisites

Python 3 installed (the type hints in the code below need Python 3.10 or newer).
Tesseract engine installed.

You can download and install Tesseract by following the official Tesseract documentation. I’m currently using Tesseract 5.5.

If you are using Windows, don’t forget to add the Tesseract installation folder to your PATH environment variable.
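
A quick way to confirm the PATH setup is to look for the executable from Python. This is a minimal sketch using only the standard library; the helper name find_tesseract is made up for illustration, and the Windows path in the comment is just a common default that may differ on your machine.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_tesseract()
    if location:
        print(f"Tesseract found at: {location}")
    else:
        # If you would rather not edit PATH, pytesseract can be pointed at the
        # executable directly, for example (assumed install location):
        #   pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
        print("Tesseract was not found on PATH.")
```

If this prints a path, pytesseract should be able to call the engine without further configuration.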

For .NET developers working with scanned PDFs, I recommend IronOCR. It offers a streamlined alternative that handles PDF input directly, with no separate Tesseract installation or PATH configuration needed.

It includes built-in image preprocessing for low-quality scans.

using IronOcr;

var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("scanned-document.pdf");
var result = ocr.Read(input);
Console.WriteLine(result.Text);

The library automatically enhances contrast and sharpness for scanned documents, which can improve accuracy without manual preprocessing steps.

Tutorial

Create a virtual environment and install the following libraries:

pip install pytesseract pypdfium2

‘pytesseract’ is a Python wrapper that allows you to use the Tesseract engine in your Python application.

‘pypdfium2’ is a PDF rendering library used to convert PDF pages to images.
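
pypdfium2 renders pages at a scale factor rather than a DPI value, where a scale of 1 corresponds to 72 DPI (one pixel per PDF point). The sketch below converts between the two. The helper names are hypothetical, and the 300 DPI figure is a commonly cited target for OCR, not something measured in this tutorial.

```python
PDF_POINTS_PER_INCH = 72  # one PDF canvas unit is usually 1/72 inch


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a pypdfium2 render scale factor."""
    return dpi / PDF_POINTS_PER_INCH


def scale_to_dpi(scale: float) -> float:
    """Convert a pypdfium2 render scale factor into the resulting DPI."""
    return scale * PDF_POINTS_PER_INCH


if __name__ == "__main__":
    print(scale_to_dpi(2))              # 144
    print(round(dpi_to_scale(300), 2))  # 4.17
```

The parse_pdf function in this tutorial defaults to a scale of 2, which works out to about 144 DPI. A higher scale may help with small print, at the cost of more memory and processing time per page.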

Create a file: pdf_parser.py

import pytesseract
from pypdfium2 import PdfDocument


def parse_pdf(
    path_or_io: str | bytes,
    pill_scale: int = 2,
    lang: str = "eng",
    page_sep: str = "\n\n",
    config: str = "",
) -> str:
    pdf = PdfDocument(path_or_io)
    pages = []
    total_pages = len(pdf)

    try:
        if not config:
            config = (
                f"-l {lang} --oem 1 --psm 6 "
                "-c preserve_interword_spaces=1 "
                "-c tessedit_do_invert=0 "
                "-c tosp_min_sane_kn_sp=2.8"
            )

        for page_idx in range(total_pages):
            print(f"Processing page {page_idx + 1}/{total_pages}...")

            page = pdf.get_page(page_idx)
            page_img = page.render(scale=pill_scale).to_pil()
            page.close()

            text = pytesseract.image_to_string(page_img, config=config)

            try:
                page_img.close()  # Pillow ≥10
            except AttributeError:
                pass  # Pillow <10 fallback

            del page_img  # Free memory

            pages.append(text)
    finally:
        pdf.close()

    print("OCR complete.")
    return page_sep.join(pages)

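The default config string in parse_pdf packs several Tesseract options into one line. As a rough guide, --oem 1 selects the LSTM engine, --psm 6 treats the page as a single uniform block of text, and each -c sets an engine variable. The helper below is a hypothetical convenience (it is not part of pytesseract) that assembles the same string from named arguments, so individual options are easier to change.

```python
from typing import Dict, Optional


def build_tesseract_config(
    lang: str = "eng",
    oem: int = 1,
    psm: int = 6,
    variables: Optional[Dict[str, str]] = None,
) -> str:
    """Assemble a Tesseract command-line config string from its parts."""
    parts = [f"-l {lang}", f"--oem {oem}", f"--psm {psm}"]
    # Each engine variable becomes its own "-c name=value" option.
    for name, value in (variables or {}).items():
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    config = build_tesseract_config(
        variables={
            "preserve_interword_spaces": "1",
            "tessedit_do_invert": "0",
            "tosp_min_sane_kn_sp": "2.8",
        }
    )
    print(config)
```

The resulting string can be passed to parse_pdf through its config parameter, for example to try a different page segmentation mode on documents with unusual layouts.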

Create file: main.py
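
The main.py listing is not reproduced here, so the following is only a minimal sketch of what it could look like. It assumes pdf_parser.py from the previous step sits in the same folder and that the sample file is saved as files/scanned_doc.pdf; the run function name is an illustrative choice.

```python
from pathlib import Path
from typing import Optional


def run(pdf_path: str = "files/scanned_doc.pdf") -> Optional[str]:
    """OCR the given PDF and print the text; return None if the file is missing."""
    path = Path(pdf_path)
    if not path.is_file():
        print(f"File not found: {path}")
        return None

    # Imported here so the missing-file check above works even before
    # the OCR dependencies are installed.
    from pdf_parser import parse_pdf

    text = parse_pdf(str(path))
    print(text)
    return text


if __name__ == "__main__":
    run()
```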

Create folder: files
Download the sample file, scanned_doc.pdf, and put it in the files folder.
Run the program:

python main.py

It will extract the text from each page, one by one. Here is the result preview:

Of course, the output isn’t always clean; in real-world use, you’ll need some extra text cleanup to get a proper result.
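
What that cleanup looks like depends on the documents, but a few steps are common. The function below is a minimal sketch with a hypothetical name (clean_ocr_text), using only the standard library. Note that collapsing repeated spaces undoes the effect of preserve_interword_spaces, so skip that step if column alignment matters to you.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few generic cleanup steps to raw OCR output."""
    # Rejoin words split by a hyphen at a line break: "docu-\nment" -> "document".
    # Limitation: a genuine compound broken at its hyphen is also joined.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.splitlines())
    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    raw = "Hello   world \n\n\n\ndocu-\nment\t here "
    print(repr(clean_ocr_text(raw)))  # 'Hello world\n\ndocument here'
```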

Tesseract also supports a lot of different languages, including those with unique character sets such as Chinese, Arabic, or Thai.

But you need to download a separate language model for each one. I’ll explain it in the next article. Thanks for reading!
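
For reference, Tesseract identifies languages by short codes such as eng, ara, tha, or chi_sim, and several can be combined with a plus sign in the -l option. The helper below is a hypothetical sketch that builds such a value for the lang parameter of parse_pdf. It only formats the string and does not check that the corresponding language data is installed.

```python
def build_lang_arg(*codes: str) -> str:
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # dict.fromkeys removes duplicates while keeping the original order.
    return "+".join(dict.fromkeys(cleaned))


if __name__ == "__main__":
    print(build_lang_arg("eng", "tha"))  # eng+tha
```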
