Apache PDFBox extract text from PDF Document

by MemoryNotFound · Published February 20, 2018 · Updated February 20, 2018

This tutorial demonstrates how to use Apache PDFBox to extract text from a PDF document. The first example extracts all text from a PDF document. The second example extracts text from a specific area.

Maven Dependencies

We use Apache Maven to manage our project dependencies. Make sure the following dependency is on the classpath.

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.8</version>
</dependency>

Extract Text from PDF document

The following examples extract text from a sample PDF document, which the code loads from /tmp/example.pdf.

Apache PDFBox Extract all text from PDF

We can use the PDFTextStripper to extract all the text from the PDF document. The example below splits the extracted text into lines and prints each one.

package com.memorynotfound.pdf.pdfbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;

public class ExtractText {

    public static void main(String[] args) {

        // try-with-resources closes the document when we are done
        try (PDDocument document = PDDocument.load(new File("/tmp/example.pdf"))) {

            // skip documents that are flagged as encrypted
            if (!document.isEncrypted()) {

                // extract the text of every page as a single String
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);

                // split on Unix or Windows line endings and print line by line
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}

Output

You can see that the entire content of the PDF document is extracted.
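Once the text is available as a single String, we often want to tidy it up before using it. The following is a minimal sketch of one way to do that in plain Java: it splits on the same line-ending pattern as the example above, trims each line and drops the blank ones. The class name ExtractedTextLines and the method nonBlankLines are hypothetical helpers made up for this illustration, they are not part of PDFBox, and the sample String only stands in for real extracted text.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractedTextLines {

    // Hypothetical helper (not a PDFBox API): splits text on Unix (\n) or
    // Windows (\r\n) line endings, trims each line, and drops lines that
    // are empty after trimming.
    static List<String> nonBlankLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper.getText(document).
        String sample = "Title\r\n\r\n  First paragraph. \nSecond paragraph.\n";
        for (String line : nonBlankLines(sample)) {
            System.out.println(line);
        }
    }
}
```

In the ExtractText example, we could pass pdfFileInText to this helper instead of calling split directly.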


Apache PDFBox Extract text from PDF by area

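We can use the PDFTextStripperByArea to extract text from a specific area. A region is a named rectangle, defined by its x and y position and its width and height. The example below registers one region under the name class1 and extracts it from the first page.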

package com.memorynotfound.pdf.pdfbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.awt.Rectangle;
import java.io.File;
import java.io.IOException;

public class ExtractTextByArea {

    public static void main(String[] args) {

        // try-with-resources closes the document when we are done
        try (PDDocument document = PDDocument.load(new File("/tmp/example.pdf"))) {

            // skip documents that are flagged as encrypted
            if (!document.isEncrypted()) {

                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);

                // region to extract: x = 10, y = 60, width = 600, height = 140
                Rectangle rect = new Rectangle(10, 60, 600, 140);
                stripper.addRegion("class1", rect);

                // regions are extracted one page at a time; pages are 0-based
                PDPage firstPage = document.getPage(0);
                stripper.extractRegions(firstPage);

                System.out.println("Text in the area: " + rect);
                System.out.println(stripper.getTextForRegion("class1"));

            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}

Output

You can see that only the first paragraph of the PDF document is extracted.
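Picking the numbers for the rectangle is easier when we think in physical units. By default, PDF user space uses 72 units (points) per inch, so a measurement in millimetres can be converted before building the region. The sketch below shows that conversion in plain Java. The class RegionUnits and the method regionFromMillimetres are hypothetical names invented for this illustration, not PDFBox APIs. We also assume here that the region is measured from the top-left corner of the page with y growing downward, which is how PDFTextStripperByArea appears to match text positions; it is worth verifying this against your own document.

```java
import java.awt.geom.Rectangle2D;

class RegionUnits {

    // PDF user space defaults to 72 units (points) per inch; 1 inch = 25.4 mm.
    static final double POINTS_PER_MM = 72.0 / 25.4;

    // Hypothetical helper (not a PDFBox API): builds a rectangle in points
    // from a position and size given in millimetres.
    static Rectangle2D regionFromMillimetres(double xMm, double yMm,
                                             double widthMm, double heightMm) {
        return new Rectangle2D.Double(
                xMm * POINTS_PER_MM,
                yMm * POINTS_PER_MM,
                widthMm * POINTS_PER_MM,
                heightMm * POINTS_PER_MM);
    }

    public static void main(String[] args) {
        // A band 190 mm wide and 50 mm high, 10 mm from the left edge and
        // 20 mm from the top edge (assuming a top-left origin).
        Rectangle2D region = regionFromMillimetres(10, 20, 190, 50);
        System.out.println(region);

        // Rectangle2D.contains tells us whether a given point, in points,
        // falls inside the region.
        System.out.println(region.contains(100, 100));
    }
}
```

Because addRegion accepts a Rectangle2D, the returned rectangle could be passed to stripper.addRegion("class1", region) in place of the hard-coded Rectangle.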
