Apache PDFBox adding meta-data to PDF document
We can change the document properties of a PDF document, such as creator, author, title, subject and creationDate. In this tutorial we demonstrate how to add meta-data to a PDF document using Apache PDFBox. Besides the basic document information, PDFBox supports XMP meta-data with schemas such as Dublin Core, XMP Basic and Adobe PDF. The first example adds meta-data to a PDF document. The second example reads the meta-data from a PDF document.
Maven Dependencies
We use Apache Maven to manage our project dependencies. Make sure the following dependencies reside on the class-path.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.8</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>xmpbox</artifactId>
    <version>2.0.8</version>
</dependency>
Adding Meta-data to PDF Document
We can change and add PDF document properties using the PDDocumentInformation class. This allows us to set properties like:
- title – will set the title of the document.
- subject – will set the subject of the document.
- author – will set the author of the document.
- creator – will set the creator of the document.
- producer – will set the producer of the document.
- keywords – will set the keywords of the document.
- creationDate – will set the creation date of the document.
- modificationDate – will set the modification date of the document.
- trapped – will set the trapped state of the document. This must be True, False, or Unknown.
- customMetadataValue – will set a custom metadata value under a key of your choice.
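Because trapped only accepts three exact names, it can help to check a value before handing it to PDFBox. As far as I can tell, PDFBox 2.0.x rejects any other string with a runtime exception, and the comparison is case-sensitive; treat both points as assumptions to verify against the Javadoc. The sketch below uses only the JDK, and the TrappedValues class with its isValid and fromBoolean methods is a hypothetical helper, not part of PDFBox.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: checks a trapped value up front.
final class TrappedValues {

    // The three names allowed for the trapped property (assumed case-sensitive).
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("True", "False", "Unknown"));

    private TrappedValues() {
    }

    // Returns true when the value is exactly one of the allowed names.
    static boolean isValid(String value) {
        return value != null && ALLOWED.contains(value);
    }

    // Maps a nullable Boolean onto the three names; null means "not known".
    static String fromBoolean(Boolean trapped) {
        if (trapped == null) {
            return "Unknown";
        }
        return trapped ? "True" : "False";
    }

    public static void main(String[] args) {
        System.out.println(isValid("Unknown"));   // true
        System.out.println(isValid("unknown"));   // false
        System.out.println(fromBoolean(null));    // Unknown
    }
}
```

A caller could then write info.setTrapped(TrappedValues.fromBoolean(flag)) and never pass an unsupported string.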
package com.memorynotfound.pdf.pdfbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.AdobePDFSchema;
import org.apache.xmpbox.schema.DublinCoreSchema;
import org.apache.xmpbox.schema.XMPBasicSchema;
import org.apache.xmpbox.xml.XmpSerializer;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.util.Calendar;

public class AddMetaData {

    public static void main(String[] args) throws Exception {
        try (final PDDocument document = new PDDocument()) {

            // basic document properties
            PDDocumentInformation info = new PDDocumentInformation();
            info.setTitle("Apache PDFBox");
            info.setSubject("Apache PDFBox adding meta-data to PDF document");
            info.setAuthor("Memorynotfound.com");
            info.setCreator("Memorynotfound.com");
            info.setProducer("Memorynotfound.com");
            info.setKeywords("Apache, PdfBox, XMP, PDF");
            info.setCreationDate(Calendar.getInstance());
            info.setModificationDate(Calendar.getInstance());
            info.setTrapped("Unknown");
            info.setCustomMetadataValue("swag", "yes");

            // mirror the same values in the XMP schemas
            XMPMetadata metadata = XMPMetadata.createXMPMetadata();

            AdobePDFSchema pdfSchema = metadata.createAndAddAdobePDFSchema();
            pdfSchema.setKeywords(info.getKeywords());
            pdfSchema.setProducer(info.getProducer());

            XMPBasicSchema basicSchema = metadata.createAndAddXMPBasicSchema();
            basicSchema.setModifyDate(info.getModificationDate());
            basicSchema.setCreateDate(info.getCreationDate());
            basicSchema.setCreatorTool(info.getCreator());
            basicSchema.setMetadataDate(info.getCreationDate());

            DublinCoreSchema dcSchema = metadata.createAndAddDublinCoreSchema();
            dcSchema.setTitle(info.getTitle());
            dcSchema.addCreator(info.getCreator());
            dcSchema.setDescription(info.getSubject());

            // attach a metadata stream to the document catalog
            PDMetadata metadataStream = new PDMetadata(document);
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            catalog.setMetadata(metadataStream);

            // serialize the XMP packet and import it into the stream
            XmpSerializer serializer = new XmpSerializer();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            serializer.serialize(metadata, out, false);
            metadataStream.importXMPMetadata(out.toByteArray());

            // add an empty page, set the document information and save
            PDPage page = new PDPage();
            document.addPage(page);
            document.setDocumentInformation(info);
            document.setVersion(1.5f);
            document.save(new File("/tmp/meta-data.pdf"));
        } catch (IOException e) {
            System.err.println("Exception while trying to create pdf document - " + e);
        }
    }
}
Output
When you run the previous example, the meta-data.pdf document is generated. When you open this document using Acrobat Reader and check the document properties, you’ll see that every property is filled in.
Reading Meta-data from a PDF document
In the previous example we added meta-data to a PDF document. In this example we demonstrate how to read meta-data from a PDF document. First, we load the document and obtain its PDDocumentInformation. This class holds the basic meta-data of the PDF document.
Next, we obtain the XMP meta-data of the PDF document from the document catalog.
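One thing to keep in mind when reading arbitrary PDF files: a field that was never set is returned as null, so calling getTime() on a missing creation or modification date fails with a NullPointerException. The sketch below shows one way to guard against that using only the JDK. The MetaDates class, its format method, the "n/a" placeholder and the date pattern are all hypothetical choices for illustration, not part of PDFBox.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of PDFBox: formats a date that may be absent.
final class MetaDates {

    private MetaDates() {
    }

    // Returns "n/a" for a missing date, otherwise a timestamp rendered
    // in the calendar's own time zone.
    static String format(Calendar date) {
        if (date == null) {
            return "n/a";
        }
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        sdf.setTimeZone(date.getTimeZone());
        return sdf.format(date.getTime());
    }

    public static void main(String[] args) {
        // A fixed sample date standing in for info.getCreationDate().
        Calendar created = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
        created.clear();
        created.set(2018, Calendar.JANUARY, 15, 10, 30, 0);

        System.out.println("Creation Date=" + format(created));   // Creation Date=2018-01-15 10:30:00
        System.out.println("Modification Date=" + format(null));  // Modification Date=n/a
    }
}
```

In a reader program, the same helper could wrap the date getters, for example MetaDates.format(info.getCreationDate()).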
package com.memorynotfound.pdf.pdfbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.common.PDMetadata;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;

public class PrintMetaData {

    private static final SimpleDateFormat SDF = new SimpleDateFormat();

    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("/tmp/meta-data.pdf"))) {

            // basic document properties
            PDDocumentInformation info = document.getDocumentInformation();
            System.out.println("Page Count=" + document.getNumberOfPages());
            System.out.println("Title=" + info.getTitle());
            System.out.println("Author=" + info.getAuthor());
            System.out.println("Subject=" + info.getSubject());
            System.out.println("Keywords=" + info.getKeywords());
            System.out.println("Creator=" + info.getCreator());
            System.out.println("Producer=" + info.getProducer());
            System.out.println("Creation Date=" + SDF.format(info.getCreationDate().getTime()));
            System.out.println("Modification Date=" + SDF.format(info.getModificationDate().getTime()));
            System.out.println("Trapped=" + info.getTrapped());

            // raw XMP packet, if the document has one
            PDDocumentCatalog cat = document.getDocumentCatalog();
            PDMetadata metadata = cat.getMetadata();
            if (metadata != null) {
                // XMP packets are usually UTF-8 encoded
                String string = new String(metadata.toByteArray(), StandardCharsets.UTF_8);
                System.out.println("Metadata=" + string);
            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }
}
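XMP packets are usually UTF-8 encoded, so the charset used to turn the raw bytes into a String matters as soon as the meta-data contains non-ASCII text. The sketch below uses only the JDK to show what a Latin-1 decode does to such text. The XmpText class, its decode method and the sample fragment are made up for illustration, and the code assumes the producer wrote UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: decodes raw XMP bytes as text.
final class XmpText {

    private XmpText() {
    }

    // Decodes the packet as UTF-8; assumes the producer wrote UTF-8.
    static String decode(byte[] xmpBytes) {
        return new String(xmpBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up fragment standing in for the bytes of a real XMP stream.
        String original = "<dc:title>Caf\u00e9</dc:title>";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(bytes);
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.equals(original));               // true
        System.out.println(asLatin1.equals(original));             // false
        System.out.println(asLatin1.length() - asUtf8.length());   // 1
    }
}
```

The Latin-1 result is one character longer because the two UTF-8 bytes of the accented letter are decoded as two separate characters.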
Output
When the previous program is executed, it prints the page count and each document property on its own line, followed by the raw XMP packet.