Saturday, April 23, 2011

Peeking at Office 2007 Document Contents with Groovy

The Microsoft Office 2007 suite of products introduced default support for documents stored in an XML-based format, the Office (2007) Open XML File Formats. Although these formats arrived with Office 2007, conversion tools were provided so that older versions of these products could also read and write them. As documented in Walkthrough: Word 2007 XML Format, there is more to the new format than just XML. Many of the XML files are compressed, and the overall format is a ZIP file (albeit typically with a .docx file extension) of numerous content files.
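One quick way to convince yourself that such a document really is a ZIP archive is to look at its first four bytes. The following is only a minimal sketch: the method name startsWithZipSignature and the temporary stand-in files are my own inventions for illustration, and the check covers only the usual case of a non-empty archive, which begins with the local file header signature "PK" followed by 0x03 0x04.

```groovy
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream

/**
 * Report whether a file starts with the ZIP local file header signature
 * ("PK" followed by bytes 0x03 0x04). Hypothetical helper for illustration.
 */
boolean startsWithZipSignature(File candidate)
{
   byte[] header = new byte[4]
   int bytesRead = candidate.withInputStream { it.read(header) }
   return bytesRead == 4 && header[0] == 0x50 && header[1] == 0x4B &&
          header[2] == 0x03 && header[3] == 0x04
}

// Stand-in for a real .docx: a one-entry ZIP written to a temporary file.
def fakeDocx = File.createTempFile('sample', '.docx')
fakeDocx.deleteOnExit()
def zipOut = new ZipOutputStream(new FileOutputStream(fakeDocx))
zipOut.putNextEntry(new ZipEntry('word/document.xml'))
zipOut.write('<document/>'.getBytes('UTF-8'))
zipOut.closeEntry()
zipOut.close()

// A plain text file for contrast.
def plainText = File.createTempFile('sample', '.txt')
plainText.deleteOnExit()
plainText.text = 'just text'

println "${fakeDocx.name} looks like a ZIP: ${startsWithZipSignature(fakeDocx)}"
println "${plainText.name} looks like a ZIP: ${startsWithZipSignature(plainText)}"
```

Pointing the same method at a real Word 2007 file should report true as well, regardless of the .docx extension.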

Because Java's JAR file is based on the ZIP format and because Java provides numerous useful constructs for dealing with JAR (and ZIP) files, it is easy to use Groovy to manipulate the contents of an Office 2007 file. In this post, I demonstrate a simple Groovy script that displays a content listing for one of these Office 2007 files.

In Walkthrough: Word 2007 XML Format, Erika Ehrli provides some steps one can take to see the contents of an Office 2007 file. These steps include creating a temporary folder, saving a Word document into that newly created temporary folder, adding a .zip extension to the saved file, and double-clicking on it to open it or extract its contents (the .zip extension makes this automatic). Today's more sophisticated ZIP-oriented tools can open the file without these steps, and I'll later show a screen snapshot of doing just that.

For my example, I'm using a draft version (originally written in Word 2003) of my Oracle Technology Network article "Add Some Spring to Your Oracle JDBC." This November 2005 article has not been available online since the merge and consolidation of Oracle articles with Sun-hosted articles, but I still had my draft that I'm using as the example here. The following screen snapshot demonstrates saving the article from Word 2003 as a Word 2007 document.


The next screen snapshot shows that the Word 2007 file is stored with a .docx extension.


As discussed previously, this is really a ZIP file, so it can be opened with ZIP-friendly tools. The next screen snapshot displays some of the contents of this Word 2007 format file via the 7-Zip tool.


In Groovy code, I can use classes from the java.util.zip package to similarly view the contents of an Office 2007 file. The next Groovy code listing shows how this might be implemented.

showContentsOfficeFile.groovy
#!/usr/bin/env groovy
// showContentsOfficeFile.groovy

import java.util.zip.ZipEntry
import java.util.zip.ZipFile

if (!args || args.length < 1)
{
   println "Please provide path/name of Office file as first argument."
   System.exit(-1)
}
def fileName = args[0]

def file = new ZipFile(fileName)
def entries = file.entries()
entries.each
{
   def datetime = Calendar.getInstance()
   datetime.setTimeInMillis(it.time)
   // Use GDK's Calendar.format(String) convenience method here!
   // Note that ZipEntry.getTime() returns the entry's last modification time.
   print it.name
   println " last modified on ${datetime.format('EEE, d MMM yyyy HH:mm:ss Z')}"
   print "\t   Sizes (bytes): ${it.size} original, ${it.compressedSize} compressed ("
   println "${convertCompressionMethodToString(it.method)})"
}
// Release the underlying file handle once the listing is complete.
file.close()


/**
 * Convert the provided integer representing ZipEntry compression method into
 * a more readable String.
 *
 * @param newCompressionMethod Integer representing compression type of a 
 *    ZipEntry as provided by ZipEntry.getMethod().
 * @return A String representation of compression method.
 */
String convertCompressionMethodToString(final int newCompressionMethod)
{
   String returnedCompressionMethodStr = "Unknown"
   if (newCompressionMethod == ZipEntry.DEFLATED)
   {
      returnedCompressionMethodStr = "Deflated"
   }
   else if (newCompressionMethod == ZipEntry.STORED)
   {
      returnedCompressionMethodStr = "Stored"
   }
   return returnedCompressionMethodStr
}

The output of the above script when run against the Word 2007 file mentioned previously is shown next.


The Groovy script shown above produces output similar to what 7-Zip displayed earlier, with details such as content names, original and compressed sizes, and modification date. I was a little concerned that my Groovy script was returning a 1980 modification date for the contents of this Office 2007 file, but then noticed that 7-Zip reports the same modification date. ZIP entries store their timestamps in the MS-DOS date format, which cannot represent anything earlier than January 1, 1980, so this looks like what an unset timestamp decodes to. It's not null, but it's not much more useful.
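To see why 1980 is the floor for a ZIP timestamp, it helps to decode the 16-bit MS-DOS date field by hand: the top seven bits hold the year as an offset from 1980, the next four bits the month, and the low five bits the day. The sketch below is illustrative only. The method name decodeDosDate is hypothetical, and I have not inspected the raw bytes of my .docx file, so the 0x0021 value is simply the smallest valid date the field can hold rather than something read from that file.

```groovy
/**
 * Decode a 16-bit MS-DOS date field into [year, month, day].
 * Hypothetical helper for illustration.
 */
List<Integer> decodeDosDate(int dosDate)
{
   int year  = ((dosDate >> 9) & 0x7F) + 1980   // top 7 bits: years since 1980
   int month = (dosDate >> 5) & 0x0F            // next 4 bits: month (1-12)
   int day   = dosDate & 0x1F                   // low 5 bits: day of month
   return [year, month, day]
}

println decodeDosDate(0x0021)   // prints [1980, 1, 1]
println decodeDosDate(0x3E97)   // prints [2011, 4, 23]
```

Because the year is stored as an unsigned offset from 1980, there is no bit pattern for an earlier date.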

The Groovy code demonstrates use of java.util.zip.ZipFile and java.util.zip.ZipEntry to access the innards of the Microsoft Office 2007 file. Another Groovyism demonstrated by the above script is the use of the GDK's Calendar.format(String) method. This convenience method is a "shortcut for SimpleDateFormat to output a String representation of this calendar instance."
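To make the "shortcut" concrete, here is a small sketch that formats the same Calendar both ways. The variable names are mine, and the fixed instant (the epoch) is an arbitrary choice for the comparison. I believe newer Groovy versions ship these date extensions in a separate groovy-dateutil module, which the standard distribution includes.

```groovy
import java.text.SimpleDateFormat

def pattern = 'EEE, d MMM yyyy HH:mm:ss Z'

// Use a fixed instant so both approaches format exactly the same moment.
def calendar = Calendar.getInstance()
calendar.setTimeInMillis(0L)

// The GDK shortcut: one call on the Calendar itself.
def viaGdk = calendar.format(pattern)

// The longer plain-JDK route using SimpleDateFormat directly.
def formatter = new SimpleDateFormat(pattern)
formatter.setTimeZone(calendar.timeZone)
def viaSimpleDateFormat = formatter.format(calendar.time)

println viaGdk
println viaSimpleDateFormat
```

Both lines should print the same text, with the exact value depending on the default time zone and locale of the machine running the script.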


Conclusion

The example in this post demonstrates a simple script for viewing the contents of a Microsoft Office 2007 file. Viewing these contents is nothing that cannot already be done via simple tools. The real potential in accessing them via Groovy is, of course, the ability to write custom scripts to programmatically manipulate these contents or to do other things based on them.
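As one small step in that direction, the following sketch pulls the text of a single named entry out of the archive. It is not part of the script above. The method name readEntryText is hypothetical, the entry is assumed to be UTF-8 text, and the tiny archive built here is only a stand-in so that the sketch runs without a real .docx file.

```groovy
import java.util.zip.ZipEntry
import java.util.zip.ZipFile
import java.util.zip.ZipOutputStream

/**
 * Return the UTF-8 text of one named entry, or null if the entry is absent.
 * Hypothetical helper for illustration.
 */
String readEntryText(File archive, String entryName)
{
   def zip = new ZipFile(archive)
   try
   {
      def entry = zip.getEntry(entryName)
      // GDK's InputStream.getText(charset) reads everything and closes the stream.
      return entry ? zip.getInputStream(entry).getText('UTF-8') : null
   }
   finally
   {
      zip.close()
   }
}

// Build a tiny stand-in archive with a single XML entry.
def demoArchive = File.createTempFile('demo', '.docx')
demoArchive.deleteOnExit()
def zipOut = new ZipOutputStream(new FileOutputStream(demoArchive))
zipOut.putNextEntry(new ZipEntry('word/document.xml'))
zipOut.write('<doc>Hello</doc>'.getBytes('UTF-8'))
zipOut.closeEntry()
zipOut.close()

println readEntryText(demoArchive, 'word/document.xml')   // prints <doc>Hello</doc>
println readEntryText(demoArchive, 'no/such/entry.xml')   // prints null
```

From there, the returned XML could be handed to whatever parser a custom script prefers.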
