Apache PDFBox: Read a PDF Document in Java
In the previous tutorial we saw how to create a PDF document with Apache PDFBox. This tutorial demonstrates how to read a PDF document using Apache PDFBox.
Maven Dependencies
We use Apache Maven to manage our project dependencies. Make sure the following dependency is on the classpath.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.8</version>
</dependency>
Input PDF Document
In this tutorial we’ll read the following PDF document and scrape all the text from it.
Read a PDF Document with Apache PDFBox
We can use the PDDocument.load() method to read a PDF document. Next, we use the PDFTextStripper to extract the text from the document. The example only extracts text when the document is not encrypted.
package com.memorynotfound.pdf.pdfbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class ReadPdf {

    public static void main(String[] args) {
        // try-with-resources closes the document when the block exits
        try (PDDocument document = PDDocument.load(new File("/tmp/example.pdf"))) {
            // Skip encrypted documents
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);

                // Split on Unix (\n) or Windows (\r\n) line endings
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }
}
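The program above splits the extracted text with the regular expression "\\r?\\n". The following is a minimal sketch of that step on its own, using only the Java standard library. The class name LineSplitDemo, the helper splitLines and the sample string are hypothetical and are here only for illustration; they are not part of PDFBox.

```java
class LineSplitDemo {

    // Splits text on Unix (\n) or Windows (\r\n) line endings,
    // using the same pattern as the ReadPdf example.
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        // Hypothetical sample text that mixes both line-ending styles.
        String sample = "first line\r\nsecond line\nthird line";

        // Print each line separately, as ReadPdf does with the PDF text.
        for (String line : splitLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Note that String.split keeps empty lines in the middle of the text but drops trailing empty strings, so blank lines at the very end of the extracted text do not appear in the resulting array.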
Output
The previous program generates the following output.