Spring Boot Tesseract OCR in Kotlin with multi-stage Docker


Originally Posted On: https://thepurushoths.medium.com/spring-boot-tesseract-ocr-in-kotlin-with-multi-stage-docker-515cdd13af37

To start, here is a brief introduction to Tesseract OCR. Tesseract is an open-source optical character recognition (OCR) engine used to recognise text in images. It was originally developed at Hewlett-Packard in the 1980s, was released as open source in 2005, and its development was later sponsored by Google. Tesseract is widely used in industry because it offers good accuracy and is free to use.

I. Context

In this article, we will learn about extracting text from PDFs and images and setting up a Docker environment to perform OCR with the Tesseract library.
Tesseract supports other use cases such as text localisation, character recognition, converting scanned documents to searchable PDFs, etc.
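
To make two of these use cases concrete, here is a minimal sketch using Tess4J, the Java wrapper for Tesseract that this article adds as a dependency later on. It assumes a working native Tesseract installation, a TESSDATA_PREFIX environment variable pointing at the trained data, and an input image called tesseract.png in the working directory. The output name "searchable" is a placeholder chosen for illustration.

```kotlin
import net.sourceforge.tess4j.ITessAPI
import net.sourceforge.tess4j.ITesseract
import net.sourceforge.tess4j.Tesseract
import java.io.File
import javax.imageio.ImageIO

fun main() {
    // Assumption: TESSDATA_PREFIX points at the directory holding eng.traineddata.
    val tesseract = Tesseract().apply {
        setDatapath(System.getenv("TESSDATA_PREFIX"))
        setLanguage("eng")
    }

    // Text localisation: one entry per recognised word, with its bounding box.
    val image = ImageIO.read(File("tesseract.png"))
    val words = tesseract.getWords(image, ITessAPI.TessPageIteratorLevel.RIL_WORD)
    for (word in words) {
        val box = word.boundingBox
        println("${word.text} at (${box.x}, ${box.y}), ${box.width}x${box.height}, confidence ${word.confidence}")
    }

    // Searchable PDF: writes searchable.pdf, with a text layer over the image.
    tesseract.createDocuments(
        "tesseract.png",
        "searchable",
        listOf(ITesseract.RenderedFormat.PDF)
    )
}
```

The rest of the article focuses on plain text extraction, but the same Tesseract instance and Docker setup apply to these calls as well.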

II. Challenges and approaches

Our first implementation of the OCR feature used a single-stage Docker build, which increased both the image size and the build time. To address this, we adopted a multi-stage Docker build and JFrog Artifactory.

III. Docker setup

A multi-stage Dockerfile keeps the final image small by separating the build environment from the runtime environment: several build stages live in a single Dockerfile, and a later stage copies only what it needs from an earlier one. We will build the Leptonica and Tesseract libraries in the first stage. That stage also pulls in compilers and development packages that are not required to run the application and would only increase the size of the production image, so the final stage copies across just the required libraries.

.NET equivalent

For teams working in .NET, IronOCR can simplify this significantly. It bundles Tesseract and all dependencies into a single NuGet package, eliminating the need for multi-stage Docker builds to manage Leptonica, libtiff, libwebp, and trained data files.

FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY . .
ENTRYPOINT ["dotnet", "YourApp.dll"]

No COPY statements for individual .so files, no LD_LIBRARY_PATH configuration, and no separate build stage for OCR dependencies.

Note: The .NET equivalent is included for conceptual comparison.

a. Docker build stage for Tesseract & Leptonica

In the build environment (stage one), we install Tesseract and Leptonica. Leptonica needs some dependency libraries, such as libtiff, libwebp, libpng, OpenJPEG, etc.
We also download the trained data sets that Tesseract needs to perform OCR.

Create a Dockerfile and paste the following two code blocks in the same Dockerfile.

FROM amazoncorretto:11 AS build
RUN yum update -y && \
    yum -y -q install wget && \
    yum install -y gcc gcc-c++ autoconf automake make pkgconfig libtool gzip tar && \
    yum install -y zlib-devel libtiff-devel libwebp-devel libpng-devel openjpeg2-devel libjpeg-turbo-devel giflib-devel && \
    yum clean all && \
    rm -rf /var/cache/yum

RUN wget -q https://github.com/DanBloomberg/leptonica/archive/refs/tags/1.82.0.tar.gz \
    && tar -zxvf 1.82.0.tar.gz -C /opt \
    && rm -f 1.82.0.tar.gz

WORKDIR /opt/leptonica-1.82.0
RUN ./autogen.sh
RUN ./configure
RUN make && make install

RUN wget -q https://github.com/tesseract-ocr/tesseract/archive/5.2.0.tar.gz \
    && tar -zxvf 5.2.0.tar.gz -C /opt \
    && rm -f 5.2.0.tar.gz

WORKDIR /opt/tesseract-5.2.0
RUN ./autogen.sh
RUN ./configure
RUN make && make install

RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P /opt/
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/osd.traineddata -P /opt/

b. Docker final stage

Extract only the required libraries to the final stage of the Dockerfile.

FROM amazoncorretto:11

WORKDIR /opt

ARG LD_LIBRARY_PATH=/usr/local/lib
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}
ENV PKG_CONFIG_PATH ${LD_LIBRARY_PATH}/pkgconfig
ARG TESSDATA_PREFIX=/usr/local/share/tessdata
ENV TESSDATA_PREFIX ${TESSDATA_PREFIX}

COPY --from=build /usr/local/lib/libtesseract.so.5.0.2 ${LD_LIBRARY_PATH}/
COPY --from=build /usr/local/lib/liblept.so.5.0.4 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libjpeg.so.62.3.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libtiff.so.5.2.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libwebp.so.4.0.2 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libopenjp2.so.2.4.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libgomp.so.1.0.0 ${LD_LIBRARY_PATH}/
COPY --from=build /lib64/libjbig.so.2.0 ${LD_LIBRARY_PATH}/

COPY --from=build /opt/*.traineddata ${TESSDATA_PREFIX}/

RUN echo ${LD_LIBRARY_PATH} >> /etc/ld.so.conf
RUN ldconfig

WORKDIR /app
COPY ./src/main/resources/static/tesseract.png tesseract.png
COPY ./src/main/resources/static/tesseract.pdf tesseract.pdf

COPY ./build/libs/tesseract-ocr-0.0.1.jar tesseract-ocr.jar

EXPOSE 8080
CMD ["java", "-jar", "tesseract-ocr.jar"]

This reduces the size of the final Docker image.
The build time, however, goes up, because Tesseract and Leptonica are compiled from source in the build stage.

Note: To reduce the build time, you can use an artifact repository (e.g. JFrog Artifactory) and create a separate pipeline that pushes the Tesseract and Leptonica binaries, the trained data sets, and their dependency libraries as a one-time task. Your Docker build can then download those binaries and data sets instead of compiling them, which shortens the build.
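
Because the final stage copies the trained data by hand, a missing file only shows up when the first OCR call fails. A small startup check can catch this earlier. The sketch below uses only the Kotlin standard library; the function name missingTrainedData is hypothetical, and the language list assumes the eng and osd files downloaded in the build stage.

```kotlin
import java.io.File

// Hypothetical helper: returns the trained data file names that are absent from dataDir.
fun missingTrainedData(dataDir: File, languages: List<String>): List<String> =
    languages
        .map { "$it.traineddata" }
        .filterNot { File(dataDir, it).isFile }

fun main() {
    // TESSDATA_PREFIX is the environment variable set in the final Docker stage.
    val prefix = System.getenv("TESSDATA_PREFIX")
        ?: error("TESSDATA_PREFIX is not set")

    // Assumption: eng and osd are the only trained data files the service needs.
    val missing = missingTrainedData(File(prefix), listOf("eng", "osd"))
    check(missing.isEmpty()) { "Missing trained data in $prefix: $missing" }
    println("Trained data found in $prefix")
}
```

Running a check like this when the container starts makes a broken image fail fast with a readable message.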

IV. OCR operation

Add Tess4J, a Java wrapper for the Tesseract library, to the dependencies in the build.gradle.kts file.

import org.jetbrains.kotlin.gradle.tasks.KotlinCompile

plugins {
 id("org.springframework.boot") version "2.7.12"
 id("io.spring.dependency-management") version "1.0.15.RELEASE"
 kotlin("jvm") version "1.6.21"
 kotlin("plugin.spring") version "1.6.21"
}

group = "com.example"
version = "0.0.1"
java.sourceCompatibility = JavaVersion.VERSION_11

repositories {
 mavenCentral()
}

dependencies {
 implementation("org.springframework.boot:spring-boot-starter-web")
 implementation("com.fasterxml.jackson.module:jackson-module-kotlin")
 implementation("org.jetbrains.kotlin:kotlin-reflect")
 testImplementation("org.springframework.boot:spring-boot-starter-test")
 implementation("net.sourceforge.tess4j:tess4j:5.4.0")
}

tasks.withType<KotlinCompile> {
 kotlinOptions {
  jvmTarget = "11"
 }
}

Create an enum for the supported document types.

package com.example.ocr.pdf.enum

enum class DocumentType {
    PDF, PNG
}
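
If the document type has to be derived from a file name rather than passed in directly, a helper along these lines could be used. This is only a sketch: documentTypeOf is a hypothetical function that is not part of the original project, and the enum is repeated here so the snippet compiles on its own.

```kotlin
enum class DocumentType {
    PDF, PNG
}

// Hypothetical helper: maps a file name to a DocumentType by its extension,
// returning null when the extension is missing or unsupported.
fun documentTypeOf(fileName: String): DocumentType? {
    val extension = fileName.substringAfterLast('.', missingDelimiterValue = "")
    return DocumentType.values().firstOrNull { it.name.equals(extension, ignoreCase = true) }
}

fun main() {
    println(documentTypeOf("tesseract.pdf")) // PDF
    println(documentTypeOf("scan.PNG"))      // PNG
    println(documentTypeOf("notes.txt"))     // null
}
```

Returning null instead of throwing leaves it to the caller to decide how to handle unsupported files.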

In the following code, we are performing OCR operations for both PDF and images.
Text extraction is one of the use cases for the Tesseract library.

package com.example.ocr.pdf

import com.example.ocr.pdf.enum.DocumentType
import net.sourceforge.tess4j.Tesseract
import net.sourceforge.tess4j.TesseractException
import org.springframework.stereotype.Service
import java.io.File

@Service
class OCRService {

    fun getContent(documentType: DocumentType): String {
        val tesseract = Tesseract()

        try {
            val filePath = getFilePath(documentType)
            val image = File(filePath)

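            // Directory holding the *.traineddata files, read from the TESSDATA_PREFIX environment variable
            tesseract.setDatapath(System.getenv(TESSDATA_PREFIX))
            tesseract.setLanguage("eng")
            tesseract.setVariable("tessedit_create_hocr", "1")
            // Page segmentation mode 1: automatic segmentation with orientation and script detection (uses osd.traineddata)
            tesseract.setPageSegMode(1)
            // OCR engine mode 1: LSTM neural network engine only
            tesseract.setOcrEngineMode(1)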

            println("Document type ===> ${documentType.name}")
            return tesseract.doOCR(image)
        } catch (e: TesseractException) {
            throw Exception(e)
        }
    }

    fun getFilePath(documentType: DocumentType): String {
        return when (documentType) {
            DocumentType.PDF -> PDF_FILE
            DocumentType.PNG -> PNG_FILE
        }
    }

    companion object {
        const val TESSDATA_PREFIX = "TESSDATA_PREFIX"

        const val PDF_FILE = "tesseract.pdf"
        const val PNG_FILE = "tesseract.png"
    }
}
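
To call the service over HTTP on the port exposed in the Dockerfile, a thin controller is enough. The following is a sketch rather than the project's actual controller: the class name OCRController and the /ocr/{type} route are assumptions, and it relies on the OCRService and DocumentType classes shown above.

```kotlin
package com.example.ocr.pdf

import com.example.ocr.pdf.enum.DocumentType
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.PathVariable
import org.springframework.web.bind.annotation.RestController

// Hypothetical controller: name and route are illustrative assumptions.
@RestController
class OCRController(private val ocrService: OCRService) {

    // GET /ocr/pdf or /ocr/png runs OCR on the matching bundled sample file.
    @GetMapping("/ocr/{type}")
    fun extract(@PathVariable("type") type: String): ResponseEntity<String> {
        // Map the path segment onto the enum; unknown values get a 400 response.
        val documentType = DocumentType.values()
            .firstOrNull { it.name.equals(type, ignoreCase = true) }
            ?: return ResponseEntity.badRequest().body("Unsupported document type: $type")

        return ResponseEntity.ok(ocrService.getContent(documentType))
    }
}
```

With the container running, a request such as GET http://localhost:8080/ocr/png would then return the text recognised in tesseract.png, assuming the route above.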

V. GitHub source code

https://github.com/thepurushoths/tesseract-ocr

If you know of other, more optimised approaches, let me know in the comments. I will learn with you.

Thanks!