Tesseract OCR (Photo from Unsplash)

Originally Posted On: https://obikastanya.medium.com/how-to-extract-text-from-scanned-pdf-file-using-tesseract-ocr-88ac107c8732

Extracting text from PDF files can be challenging at times, especially when dealing with scanned documents that contain low-quality images.

Why Tesseract?

Tesseract has been around for a long time; it’s a stable and very reputable OCR engine. It was originally developed at Hewlett-Packard, which released it as open-source software in 2005; Google then sponsored its development for many years. Being open source, it’s free to use.

I’ve been using it on several projects, and it works very well.

There are also alternatives such as EasyOCR, but I haven’t used them, so I can’t offer a review or comparison. Maybe I’ll try one later and write a review in another article.

How to Extract Text from a PDF Using Tesseract OCR

This is what the extraction process looks like:


OCR Process (Image generated by Gemini AI)

Prerequisites

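  • Python 3.10 or newer is installed (the code below uses the ‘str | bytes’ type-hint syntax, which older versions reject).
  • The Tesseract engine is installed.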

You can download and install Tesseract from: Tesseract Official Doc. Currently, I’m using Tesseract 5.5.

If you are using Windows, don’t forget to add the Tesseract installation folder to your PATH environment variable, so that the ‘tesseract’ command can be found.

For .NET developers working with scanned PDFs, I recommend IronOCR.
It offers a streamlined alternative that handles PDF input directly, with no separate Tesseract installation or PATH configuration needed.

It includes built-in image preprocessing for low-quality scans.

using IronOcr;

var ocr = new IronTesseract();
using var input = new OcrInput();
input.LoadPdf("scanned-document.pdf");
var result = ocr.Read(input);
Console.WriteLine(result.Text);

The library automatically enhances contrast and sharpness for scanned documents, which can improve accuracy without manual preprocessing steps.

Tutorial

  • Create a virtual environment and install the following libraries:
pip install pytesseract pypdfium2

‘pytesseract’ is a Python wrapper that allows you to use the Tesseract engine in your Python application.

‘pypdfium2’ is a PDF rendering library, used here to convert PDF pages to images.

  • Create a file: pdf_parser.py
    import pytesseract
    from pypdfium2 import PdfDocument
    
    
    def parse_pdf(
        path_or_io: str | bytes,
        pill_scale: int = 2,
        lang: str = "eng",
        page_sep: str = "\n\n",
        config: str = "",
    ) -> str:
        """OCR every page of a PDF and return the text, joined by page_sep."""
        pdf = PdfDocument(path_or_io)
        pages = []
        total_pages = len(pdf)
    
        try:
            # Build the default Tesseract options only when no config is given.
            # Note: `lang` is applied through this default config, so a custom
            # config must carry its own "-l <lang>" option.
            if not config:
                config = (
                    # --oem 1: LSTM engine; --psm 6: one uniform block of text
                    f"-l {lang} --oem 1 --psm 6 "
                    "-c preserve_interword_spaces=1 "
                    "-c tessedit_do_invert=0 "
                    "-c tosp_min_sane_kn_sp=2.8"
                )
    
            for page_idx in range(total_pages):
                print(f"Processing page {page_idx + 1}/{total_pages}...")
    
                # Render the page to a PIL image; a higher scale gives more pixels.
                page = pdf.get_page(page_idx)
                page_img = page.render(scale=pill_scale).to_pil()
                page.close()
    
                text = pytesseract.image_to_string(page_img, config=config)
    
                # Release the rendered image before moving to the next page.
                page_img.close()
                del page_img
    
                pages.append(text)
        finally:
            # Always close the document, even if OCR fails midway.
            pdf.close()
    
        print("OCR complete.")
        return page_sep.join(pages)
    

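A note on the ‘pill_scale’ argument: it is passed straight to pypdfium2’s ‘render(scale=...)’. As far as I know, one PDF canvas unit is normally 1/72 inch, so a scale of 1 corresponds to roughly 72 DPI and the default of 2 to roughly 144 DPI. Tesseract’s own guidance suggests it tends to work best at around 300 DPI, so a larger scale may help with small print. The two helper functions below are hypothetical names of mine, just to make the arithmetic explicit.

```python
PDF_UNITS_PER_INCH = 72  # assumption: one PDF canvas unit is 1/72 inch


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a pypdfium2 render scale factor."""
    return dpi / PDF_UNITS_PER_INCH


def scale_to_dpi(scale: float) -> float:
    """Convert a render scale factor back into the approximate DPI."""
    return scale * PDF_UNITS_PER_INCH


print(scale_to_dpi(2))               # 144
print(round(dpi_to_scale(300), 2))   # 4.17
```

In practice, calling ‘parse_pdf’ with ‘pill_scale=4’ renders at about 288 DPI, at the cost of more memory and slower OCR per page.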

  • Create file: main.py
  • Create folder: files
    Download the sample file here: scanned_doc.pdf, and put it in the files folder.
  • Run the program
python main.py

It will extract the text from each page, one by one. Here is the result preview:


Of course, the output isn’t always clean; in practice, you’ll need some extra text cleanup to get a usable result.
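What that cleanup looks like depends heavily on your documents, so treat the following as a starting point only. It is a small sketch using the standard ‘re’ module; the function name ‘clean_ocr_text’ and the specific rules are my own assumptions, not part of Tesseract or pytesseract. It deliberately leaves runs of spaces inside a line alone, since the config above asks Tesseract to preserve them.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few conservative, generic clean-up rules to OCR output."""
    # Normalise line endings and drop form feeds, which Tesseract
    # typically emits as a page separator.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\f", "")
    # Strip trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Invoice docu-\nment   \n\n\n\nTotal:  42\x0c"
print(repr(clean_ocr_text(sample)))  # 'Invoice document\n\nTotal:  42'
```

One caveat: the hyphen rule also joins genuinely hyphenated compounds that happen to break at a line end, so you may want to drop it for text where that matters.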

Tesseract also supports a lot of different languages, including those with unique character sets such as Chinese, Arabic, or Thai.

However, each language needs its own language model, which you have to download separately. I’ll explain that in the next article. Thanks for reading!
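As a small preview, here is a sketch of how languages are selected once their models are installed. Tesseract accepts several language codes joined with ‘+’, and the ‘tesseract --list-langs’ command reports what is available locally. Both helper names below are hypothetical, and the codes in the usage note (‘eng’, ‘tha’) only work if the matching models are present on your machine.

```python
import shutil
import subprocess


def join_languages(codes: list[str]) -> str:
    """Build a multi-language string for Tesseract, e.g. ["eng", "tha"] -> "eng+tha"."""
    return "+".join(codes)


def installed_languages() -> list[str]:
    """Return the language codes reported by `tesseract --list-langs`."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []  # Tesseract is not on PATH
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    lines = (result.stdout or result.stderr).splitlines()
    # The first line is a header, so skip it and keep the non-empty codes.
    return [line.strip() for line in lines[1:] if line.strip()]


print(join_languages(["eng", "tha"]))  # eng+tha
print(installed_languages())
```

The joined string can then be passed to the function from this article, for example ‘parse_pdf("files/scanned_doc.pdf", lang=join_languages(["eng", "tha"]))’.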