Monday, July 4, 2016

Apache PDFBox 2

Apache PDFBox 2 was released earlier this year and Apache PDFBox 2.0.1 and Apache PDFBox 2.0.2 have since been released. Apache PDFBox is open source (Apache License Version 2) and Java-based, and so is easy to use from a wide variety of JVM programming languages including Java, Groovy, Scala, Clojure, Kotlin, and Ceylon. Any of these or other JVM-based languages can use Apache PDFBox to read, write, and work with PDF documents.

Apache PDFBox 2 introduces numerous bug fixes in addition to completed tasks and some new features. Apache PDFBox 2 now requires Java SE 6 (J2SE 5 was the minimum for Apache PDFBox 1.x). There is a migration guide, Migration to PDFBox 2.0.0, that details many differences between PDFBox 1.8 and PDFBox 2.0, including updated dependencies (Bouncy Castle 1.53 and Apache Commons Logging 1.2) and "breaking changes to the library" in PDFBox 2.

PDFBox can be used to create PDFs. The next code listing is adapted from the Apache PDFBox 1.8 example "Create a blank PDF" in the Document Creation "Cookbook" examples. The referenced example explicitly closes the instantiated PDDocument, probably for the benefit of those using a version of Java before JDK 7. For users of Java 7 or later, however, try-with-resources is a better option for ensuring that the PDDocument instance is closed, and it is supported because PDDocument implements AutoCloseable.

Creating (Empty) PDF
/**
 * Demonstrate creation of an empty PDF.
 */
private void createEmptyDocument()
{
   try (final PDDocument document = new PDDocument())
   {
      final PDPage emptyPage = new PDPage();
      document.addPage(emptyPage);
      document.save("EmptyPage.pdf");
   }
   catch (IOException ioEx)
   {
      err.println(
         "Exception while trying to create blank document - " + ioEx);
   }
}

The next code listing is adapted from the Apache PDFBox 1.8 example "Hello World using a PDF base font" in the Document Creation "Cookbook" examples. The most significant change in this listing from that 1.8 Cookbook example is the replacement of deprecated methods PDPageContentStream.moveTextPositionByAmount(float, float) and PDPageContentStream.drawString(String) with PDPageContentStream.newLineAtOffset(float, float) and PDPageContentStream.showText(String) respectively.

Creating Simple PDF with Font
/**
 * Create simple, single-page PDF "Hello" document.
 */
private void createHelloDocument()
{
   final PDPage singlePage = new PDPage();
   final PDFont courierBoldFont = PDType1Font.COURIER_BOLD;
   final int fontSize = 12;
   try (final PDDocument document = new PDDocument())
   {
      document.addPage(singlePage);
      // try-with-resources closes the stream even if a call below throws;
      // the stream must be closed before saving the document.
      try (final PDPageContentStream contentStream = new PDPageContentStream(document, singlePage))
      {
         contentStream.beginText();
         contentStream.setFont(courierBoldFont, fontSize);
         contentStream.newLineAtOffset(150, 750);
         contentStream.showText("Hello PDFBox");
         contentStream.endText();
      }

      document.save("HelloPDFBox.pdf");
   }
   catch (IOException ioEx)
   {
      err.println(
         "Exception while trying to create simple document - " + ioEx);
   }
}

The next code listing demonstrates parsing text from a PDF using Apache PDFBox. This extremely simple implementation parses all of the text into a single String using PDFTextStripper.getText(PDDocument). In most realistic situations, I'd not want all the text from the PDF in a single String and would likely use PDFTextStripper's ability to more narrowly specify which text to parse. It's also worth noting that while this code listing retrieves the PDF from a URL (Scala by Example PDF at http://www.scala-lang.org/docu/files/ScalaByExample.pdf), PDDocument provides numerous overloaded load methods that allow one to access PDFs on file systems and via other types of streams.

Parsing Text from Online PDF

/**
 * Parse text from an online PDF.
 */
private void parseOnlinePdfText()
{
   final String address = "http://www.scala-lang.org/docu/files/ScalaByExample.pdf";
   // Both the network stream and the document are closed automatically.
   try (final InputStream pdfStream = new URL(address).openStream();
        final PDDocument documentToBeParsed = PDDocument.load(pdfStream))
   {
      final PDFTextStripper stripper = new PDFTextStripper();
      final String pdfText = stripper.getText(documentToBeParsed);
      out.println("Parsed text size is " + pdfText.length() + " characters:");
      out.println(pdfText);
   }
   catch (IOException ioEx)
   {
      err.println(
         "Exception while trying to parse text from PDF at " + address + " - " + ioEx);
   }
}
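
As a rough sketch of the narrower parsing mentioned above, PDFTextStripper can be limited to a range of pages with setStartPage(int) and setEndPage(int). The listing below assumes PDFBox 2.0.x on the classpath. The class name PageRangeTextExample and its two helper methods are my own illustrative names, not part of PDFBox, and the sample PDF is built in memory purely so that the example does not depend on a file or network connection.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

/**
 * Illustrative only: extract text from a page range rather than the whole PDF.
 * Assumes PDFBox 2.0.x; class and method names here are hypothetical.
 */
final class PageRangeTextExample
{
   /** Build a small in-memory PDF with one line of text per page. */
   static byte[] buildSamplePdf(final String... pageTexts) throws IOException
   {
      try (final PDDocument document = new PDDocument();
           final ByteArrayOutputStream bytes = new ByteArrayOutputStream())
      {
         for (final String pageText : pageTexts)
         {
            final PDPage page = new PDPage();
            document.addPage(page);
            try (final PDPageContentStream contentStream = new PDPageContentStream(document, page))
            {
               contentStream.beginText();
               contentStream.setFont(PDType1Font.COURIER_BOLD, 12);
               contentStream.newLineAtOffset(150, 750);
               contentStream.showText(pageText);
               contentStream.endText();
            }
         }
         document.save(bytes);
         return bytes.toByteArray();
      }
   }

   /** Extract text only from the given page range (1-based, inclusive). */
   static String extractPages(final byte[] pdf, final int firstPage, final int lastPage)
      throws IOException
   {
      try (final PDDocument document = PDDocument.load(pdf))
      {
         final PDFTextStripper stripper = new PDFTextStripper();
         stripper.setStartPage(firstPage);
         stripper.setEndPage(lastPage);
         return stripper.getText(document);
      }
   }

   public static void main(final String[] arguments) throws IOException
   {
      final byte[] pdf = buildSamplePdf("First page", "Second page", "Third page");
      // Only the text of the second page is extracted.
      System.out.println(extractPages(pdf, 2, 2));
   }
}
```

The same extractPages approach should apply to a document loaded from a file or stream, because the page range is configured on the stripper rather than on the document.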

The JDK 8 Issue

PDFBox 2 exposes an issue in JDK 8 that is filed under Bug JDK-8041125 ("ColorConvertOp filter much slower in JDK 8 compared to JDK7"). The Apache PDFBox "Getting Started" documentation describes the issue: "Due to the change of the java color management module towards 'LittleCMS', users can experience slow performance in color operations." This same "Getting Started" section provides the workaround: "disable LittleCMS in favour of the old KCMS (Kodak Color Management System)."

The bug appears to have been identified and filed by IDR Solutions in conjunction with their commercial Java PDF library JPedal. Their blog post Major change to Color performance in newer Java releases provides more details related to this issue.

The just-mentioned posts and documentation, including Apache PDFBox 2's "Getting Started" section, demonstrate using a Java system property to work around the issue by explicitly specifying KCMS (which could be removed at any time) instead of the default LittleCMS. As these sources state, one can either provide the system property to the Java launcher [java] with the -D option [-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider] or specify the property within the executable code itself [System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");].
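
A minimal sketch of the in-code option follows. The property name and value are the ones quoted above. The class name KcmsWorkaround and the method preferKcms() are hypothetical names of my own. I am assuming that the property needs to be set before any Java 2D color management code runs, so a call like this would belong at the very start of the application, and I have chosen to leave alone any value already supplied with -D on the command line.

```java
/**
 * Illustrative only: request KCMS instead of LittleCMS via a system property.
 * Class and method names are hypothetical; the property is unsupported and
 * may be ignored or removed in other Java releases.
 */
final class KcmsWorkaround
{
   static final String CMM_PROPERTY = "sun.java2d.cmm";
   static final String KCMS_PROVIDER = "sun.java2d.cmm.kcms.KcmsServiceProvider";

   /**
    * Set the property only when it has not already been specified
    * (for example with -D on the launcher), and return the value in effect.
    */
   static String preferKcms()
   {
      if (System.getProperty(CMM_PROPERTY) == null)
      {
         System.setProperty(CMM_PROPERTY, KCMS_PROVIDER);
      }
      return System.getProperty(CMM_PROPERTY);
   }

   public static void main(final String[] arguments)
   {
      // Call before any PDFBox rendering or other Java 2D color operations.
      System.out.println(CMM_PROPERTY + " = " + preferKcms());
   }
}
```

Respecting an existing value keeps the launcher's -D option as the final word, which seems preferable for a setting that is explicitly unsupported.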

It sounds like this issue is not exclusive to version 2 of Apache PDFBox, but is more commonly seen with Apache PDFBox 2 because version 2 uses dependent constructs more frequently and because it's more likely that someone using Java 8 is also using the newer PDFBox.

The change in JDK 8 of the default implementation associated with property sun.java2d.cmm demonstrates a point I tried to make in my recent blog post Observations From A History of Java Backwards Incompatibility. In that post, I concluded, "Beware of and use only with caution any APIs, classes, and tools advertised as experimental or subject to removal in future releases of Java." It turns out that the Java 2D system properties fall into this category. The System Properties for Java 2D Technology page provides this background and warning information regarding use of these properties:

This document describes several unsupported properties that you can use to customize how the 2D painting system operates. You might use these properties to improve performance, fix incorrect rendering, or avoid system crashes under certain configurations. ... Warning: Take care when using these properties. Some of them are unsupported for very practical reasons. ... Since these properties have the sole purpose of enabling or disabling implementation-specific behaviors, they are subject to change or removal without notification. Some properties might work only on the exact product releases for which they are documented.

Conclusion

Apache PDFBox 2 is a relatively easy way to manipulate PDF documents in Java. Its liberal Apache 2 license makes it amenable to a very large audience, and its open source nature allows developers to see how it uses its underlying libraries and to adapt it as needed.
