
How to Extract Text from Image Using Java?

We often come across a requirement to extract text from an image. This process is known as Optical Character Recognition (OCR). OCR is useful in many scenarios, such as converting physical documents into digital formats to make them searchable and editable, automating data entry to minimize errors, extracting information from checks in banking institutions, digitizing patient records in the healthcare industry, converting case files into digital formats in the legal department, digitizing textbooks, and extracting information from invoices and receipts.

This article focuses on how to extract text from an image using Java. Although many online image-to-text converter tools can extract text from an image automatically, knowing how to do it in Java is useful, especially if you are a developer. We will cover what you need to extract text from a digital image with the Java programming language.

Writing the code yourself can be a little overwhelming if you are new to programming. However, the process we are going to walk you through should simplify things. The steps are easy to follow, but they need to be performed carefully. So, let’s start with some of the basics, followed by an example.

What is the significance of extracting text from an image?

Extracting text from an image carries significant importance in various applications. Here are some key reasons why it is valuable:

1) Data Digitization: We can convert non-editable content into digital text, making it searchable, editable, and suitable for further analysis.

2) Information Retrieval: OCR offers the extraction of relevant information from images, which is mostly useful in circumstances where textual data is embedded in images, such as scanned documents, photographs, or screenshots.

3) Document Management: OCR plays a vital role in document management systems, making it easier to organize, categorize, and retrieve information from scanned documents. This improves productivity in various industries, from legal to healthcare.

4) Accessibility: Extracting text from images contributes to making content more accessible. For example, it allows visually impaired persons to access information by converting image-based text into readable text that can be read by screen readers.

5) Automation: In fields like finance and administration, OCR is used to automate data entry processes. Instead of manual data input, systems can extract and process information from images, reducing human errors and saving time.

6) Language Translation: OCR can be integrated with language translation tools to convert text from images into different languages, simplifying communication and understanding across language barriers.

7) Mobile Applications: OCR is frequently used in mobile applications for tasks like recognizing text from images captured by the device’s camera. This is used in various applications, from translating text to extracting information from business cards.

What are the typical use cases where the OCR technique is useful?

OCR (Optical Character Recognition) is in high demand across several industries and applications because of its versatility. Some prominent use cases are:

1) Document Digitization: OCR is significantly used for converting physical documents into digital formats, making them searchable and editable. This is crucial for industries dealing with large volumes of paperwork, such as legal, finance, and healthcare.

2) Data Entry and Automation: OCR is employed in automating data entry processes, reducing manual efforts, and minimizing human errors. Industries like finance and administration benefit from OCR’s ability to extract and process data from invoices, receipts, and other forms.

3) Banking and Finance: OCR plays a crucial role in the banking sector for tasks like extracting information from checks, invoices, and financial statements. It enhances accuracy and efficiency in handling financial documents.

4) Healthcare: In healthcare, OCR is used for digitizing patient records, extracting information from medical forms, and making medical documents searchable. This modernizes administrative processes and improves patient care.

5) Legal Industry: Law organizations use OCR to convert legal documents, contracts, and case files into digital formats, enabling easy retrieval and analysis. This improves the overall efficiency of the legal operations.

6) Mobile Applications: OCR is integrated into mobile applications for various purposes, such as translating text from images, recognizing business cards, and extracting information from documents using the mobile’s camera.

7) Education: OCR is utilized in educational settings for digitizing textbooks, converting printed material into accessible digital formats, and facilitating content searchability.

8) Accessibility Services: OCR is essential for creating accessible content for individuals with visual impairments. It converts image-based text into readable formats compatible with screen readers.

9) Retail and E-commerce: OCR is used in retail for tasks like extracting information from invoices, receipts, and product labels. It simplifies inventory management and order processing.

10) Government and Public Services: Government agencies use OCR for processing and digitizing various documents, from ID cards to official forms, improving efficiency and accessibility of public services.

How to Extract Text from Image Using Java?

To extract the text in an image using Java, we have to follow a few steps. Although these steps are easy, they must be performed carefully; a skipped or misconfigured step typically shows up as an error when the program runs. Below are the steps to follow one by one:

Step#1: Get Tesseract OCR

The first thing we have to do is install Tesseract OCR. Tesseract is the engine that actually performs the text recognition; our Java code only calls it. The good thing about this engine is that it is free and open source, and it is available on GitHub. After downloading it, install it on your computer following the installation instructions for your operating system. Once Tesseract is successfully installed on your system, you are ready for the next step.

Here is the direct link to download the Tesseract installer for Windows. Users on other operating systems can download it from the Tesseract download page.

During installation, the installer asks for a target directory. In our case, we installed it on the ‘D’ drive, and the path is ‘D:\Tesseract-OCR\’. We will use this path in our code later.

Note: We can download trained data files in different languages from Traineddata Files for Version 4.00 +
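A missing trained data file is a common reason for OCR to fail at runtime, so it can help to check for it before calling Tesseract. The following is a minimal sketch using only the standard library. The class and method names are hypothetical (they are not part of Tesseract or tess4j), and the path and the "eng" language code are the ones assumed in this article.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Tesseract or tess4j.
final class TessdataCheck {

    // Returns true if <tessdataDir>/<language>.traineddata exists as a regular file.
    static boolean hasTrainedData(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // Assumed install location from this article; adjust to your system.
        Path tessdata = Paths.get("D:\\Tesseract-OCR\\tessdata");
        if (hasTrainedData(tessdata, "eng")) {
            System.out.println("English trained data found.");
        } else {
            System.out.println("eng.traineddata not found in " + tessdata);
        }
    }
}
```

Running this check first turns a confusing OCR failure into a clear message about which file is missing.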

Step#2: Add tess4j Library

The next step is to add the tess4j library to your classpath. tess4j is a Java wrapper that lets Java code call the Tesseract engine. There are two ways to do this. The first is to use a build tool such as Maven or Gradle. The second is more manual: we add the tess4j jar ourselves without using a build tool.

Please note that tess4j may need some supporting jars in order to run the application successfully. Here is the direct link to download tess4j JAR file with all dependencies.

Now it is time to add the tess4j jar file to your classpath. How you do this depends on your development environment: you can configure the classpath in your IDE, pass the jar with the -cp option when compiling and running from the command line, or set the CLASSPATH environment variable.

To import jar file in your Eclipse/STS IDE, follow the steps given below.

  1. Right click on your project
  2. Select Build Path
  3. Click on Configure Build Path
  4. Click on Libraries and select Add External JARs
  5. Select the jar file from the required folder
  6. Click Apply and then OK

If you are new to Java, kindly refer to how to create a starter Project in Java using STS

Step#3: Write the Code

After you have completed the steps above, it’s time to write the code that extracts text from an image. Below is a basic example that you can use as a starting point.

Adjust the paths to match your own system, otherwise the program will fail at runtime. Here’s the example code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract; 
import net.sourceforge.tess4j.TesseractException; 

public class TestOCR { 

    public static void main(String[] args) {

       ITesseract tesseract = new Tesseract(); 

       try { 

          // path of the tessdata folder inside the Tesseract installation directory
          tesseract.setDatapath("D:\\Tesseract-OCR\\tessdata"); 

          // path of your image file 
          String text = tesseract.doOCR(new File("D:\\Tess4j\\Spring_Cloud_Annotations.jpg"));         
          System.out.print(text);   

          // Write the text to the file. try-with-resources flushes and closes
          // the BufferedWriter even if write() throws an exception.
          try (BufferedWriter bufferedWriter = new BufferedWriter(
                  new FileWriter("D:\\Tess4j\\Spring_Cloud_Annotations.txt"))) {
              bufferedWriter.write(text);
          }
          System.out.println("Text has been written to the file successfully.");

       } catch (TesseractException | IOException e) { 
            e.printStackTrace(); 
       } 
    } 
}

This is the basic code for extracting text from a digital image. As shown above, the extracted text is printed to the console and also saved in a text file. Once the code is in place, the final step is to run it.
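One detail of the file-writing part is worth a closer look. FileWriter without an explicit charset uses the JVM's default charset, which on some setups (for example, Windows with Java versions before 18) is not UTF-8, so non-ASCII characters in the recognized text may be saved incorrectly. The sketch below shows one possible alternative: it derives the .txt name from the image name and always writes UTF-8. The class and method names are hypothetical, and the OCR result is replaced by a placeholder string so that the example runs without tess4j.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of tess4j.
final class OcrOutput {

    // Maps e.g. "scan.jpg" to "scan.txt" in the same directory.
    // Assumes the path names a file (not a root directory).
    static Path textFileFor(Path image) {
        String name = image.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return image.resolveSibling(base + ".txt");
    }

    // Writes the recognized text next to the image, always encoded as UTF-8.
    static Path save(String text, Path image) throws IOException {
        Path target = textFileFor(image);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for the String returned by tesseract.doOCR(...).
        String text = "Sample OCR output";
        // A temporary file stands in for the real image path.
        Path image = Files.createTempFile("ocr-demo", ".jpg");
        System.out.println("Saved text to " + save(text, image));
    }
}
```

In the article's program, the call would be save(text, Paths.get("D:\\Tess4j\\Spring_Cloud_Annotations.jpg")) in place of the FileWriter block.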

Step#4: Run the Code

After writing the code as described above, your Java program is ready to extract the text. Run the program and check the output. When it runs, tess4j calls the Tesseract OCR engine to recognize the text in the image. The program then prints the text to the console and writes it to a text file that you can easily edit, store, and share.

This is one of the most basic methods of extracting text from an image using the Java programming language.

Conclusion

Extracting text from an image is easier than people think. Although the coding can look a little complicated, a library like tess4j keeps it short.

There are automatic tools that extract text from an image without any programming, but knowing how to do it using Java can be beneficial.

In this article, we walked through four simple steps for extracting text from digital images. Follow them one by one and you will be able to do the same.

FAQ 

What is OCR?

OCR stands for Optical Character Recognition. It is a technique that converts different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into text. The resulting text is easily editable and searchable. The basic purpose of OCR is to recognize and extract text from these documents and make it available for further processing, such as editing or searching.

Why do we have to use Tesseract OCR for Java?

Writing character recognition code from scratch is very complex. The Tesseract OCR engine already implements it, and the tess4j wrapper lets us call it from Java, so we can extract text without spending a lot of time on the recognition logic itself.

Can we use this Guide for Extracting Text in Multiple Languages?

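Yes, we can use this guide to extract text written in other languages. Tesseract supports many languages, but each one needs its trained data file (see the note in Step#1) in the tessdata folder, and the language has to be selected in the code, for example with tesseract.setLanguage("deu") for German. If no language is set, tess4j uses English ("eng").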
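For an image that mixes several languages, Tesseract accepts multiple language codes joined with a plus sign, such as "eng+deu". Here is a small sketch of building that value; the class and method names are hypothetical, and the tess4j call is shown only as a comment so that the example runs without the library.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of tess4j.
final class OcrLanguages {

    // Joins language codes into the "eng+deu" form that Tesseract expects.
    static String languageArg(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        String arg = languageArg(Arrays.asList("eng", "deu"));
        System.out.println(arg);
        // With tess4j on the classpath, the value would be passed like this:
        // tesseract.setLanguage(arg);
    }
}
```

Each code in the list needs a matching .traineddata file in the tessdata folder.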

Are there Any Alternative Methods of Extracting Text from Images?

Yes, there are various online tools that can help you with extracting text from images such as Imagetotext.info, Imagetotext.io, and Prepostseo. These tools are designed to detect text in an image automatically and extract it into an editable text file. 

Can we use any other Name than “lib” While Creating a Directory?

Yes, we can use any name other than ‘lib’ while creating a directory. ‘lib’ is generally used to indicate the word ‘library’. Any other word would also be good to use as long as we can identify and remember it easily. 

Can OCR be used on mobile devices?

Yes, OCR technology can be used on mobile devices. Many mobile applications leverage OCR for tasks like translating text from images, extracting information from business cards, and digitizing documents using the mobile’s camera.

How does OCR work?

OCR works by analyzing the shapes and patterns of characters in an image or document. The technology uses algorithms to recognize these patterns and convert them into machine-readable text. Pre-processing steps may involve image enhancement, layout analysis, and character recognition to accomplish accurate results.
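To give an idea of what image enhancement can look like in Java, here is a minimal sketch of one simple pre-processing technique: converting an image to pure black and white with a fixed global threshold. It is an illustration only, not what Tesseract does internally. The class name and the threshold value of 128 are assumptions, and real images often need a more adaptive method.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of tess4j.
final class OcrPreprocess {

    // Returns a black-and-white copy: pixels darker than the threshold
    // become black, all others become white.
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted sum approximating perceived brightness (0..255).
                int luma = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, luma < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny in-memory image: one dark gray pixel and one light gray pixel.
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0x303030);
        img.setRGB(1, 0, 0xD0D0D0);
        BufferedImage bw = binarize(img, 128); // 128 is an assumed threshold
        System.out.println(Integer.toHexString(bw.getRGB(0, 0)));
        System.out.println(Integer.toHexString(bw.getRGB(1, 0)));
    }
}
```

tess4j also provides a doOCR overload that accepts a BufferedImage, so an image loaded with ImageIO.read and cleaned up this way can be passed to it directly instead of a File.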

Sources:

Tesseract OCR: https://github.com/tesseract-ocr/tesseract
