Saturday, February 2, 2008

OpenXML4J: Reading Office 2007 File Properties

Happy Groundhog Day (movie)!

Apache POI is an excellent library for reading, writing, and manipulating Excel spreadsheets. Apache POI's HSSF ("Horrible Spreadsheet Format") support is particularly good and useful. The most significant disadvantage of Apache POI is that it does not support the new format used with Microsoft Office 2007 applications.

For a while, the disadvantage of Office 2007's new file format will not be as large on the effectiveness of Apache POI as it might be because of the following reasons:

  • Not everyone has migrated to Office 2007 and so there are still many earlier versions of Office products in existence and new files are still being created in the older POI-supported format.

  • Those who have migrated to Office 2007 can still save their files in the old format when necessary to share with others or for POI's use. Likewise, they canalso read the old format and thus read POI-generated files.

  • Files that are created with POI can be opened up in an earlier version of Office products and converted to Office 2007 format using the freely provided converter.



However, as time goes on, the ability to read, write, and manipulate Office 2007 files in Java will become more desirable and eventually a necessity. While the new format used in Microsoft 2007 applications (Office Open XML - ECMA-376) is simply composed of XML files zipped up in a zip file, any Java application accessing these file types would need to extract or build the zip file and would also need to understand the underlying XML format. To answer this need, work has begun on an open source library called OpenXML4J. This library recently transitioned from pre-Alpha to Alpha status (on 21 January 2008). The license for this library is currently your choice of BSD license or Apache v2 license.

In this blog entry, I will focus on using OpenXML4J to extract properties from Office 2007 application files. The code sample that will be shown below will extract properties from three types of Office 2007 files (PowerPoint, Word, and Excel files). As a lazy developer, I chose to use some existing files for this. I use Office 2003, but I was able to save these three files in the Office Open XML format in PowerPoint, Word, and Excel using the downloaded Microsoft conversion pack. For the PowerPoint and Word examples, I saved my presentation and paper (respectively) for RMOUG Training Days 2008 ("Excel with Apache POI and Oracle Database"). For the Excel example, I am using a simple spreadsheet containing important dates for submitting papers, submitting presentations, and actually presenting the materials for various conferences this spring. For this example, the contents of the various files are not all that important except for the Microsoft Office properties associated with each.

The next three screen shots show the properties window for the three application data files. The contents of these windows are important because they are what we will be extracting directly from the file using Java and OpenXML4J. As with all images in this blog, please click on any image to see a larger version of the image.

PowerPoint: marx-poi.pptx Properties



Word: marx-poi.docx Properties



Excel: 2008Conferences.xlsx Properties



Before moving onto the example that demonstrates use of OpenXML4J to extract properties from Office 2007 application data files, I will first demonstrate the XML contents of the new Office Open XML formats. In the following three examples, I have used Java's jar command (jar tvf) to display the contents of these files.

Contents of marx-poi.pptx

13756 Tue Jan 01 00:00:00 MST 1980 [Content_Types].xml
737 Tue Jan 01 00:00:00 MST 1980 _rels/.rels
598 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide11.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide12.xml.rels
598 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide13.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide15.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide10.xml.rels
598 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide16.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide17.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide14.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide8.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide18.xml.rels
909 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide1.xml.rels
462 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide2.xml.rels
462 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide3.xml.rels
598 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide4.xml.rels
462 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide5.xml.rels
462 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide6.xml.rels
641 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide7.xml.rels
462 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide9.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide20.xml.rels
5977 Tue Jan 01 00:00:00 MST 1980 ppt/_rels/presentation.xml.rels
829 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide31.xml.rels
839 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide32.xml.rels
882 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide33.xml.rels
1263 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide34.xml.rels
1080 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide35.xml.rels
893 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide36.xml.rels
1237 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide37.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide30.xml.rels
1025 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide29.xml.rels
951 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide28.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide21.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide22.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide23.xml.rels
648 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide24.xml.rels
688 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide25.xml.rels
659 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide26.xml.rels
647 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide27.xml.rels
463 Tue Jan 01 00:00:00 MST 1980 ppt/slides/_rels/slide19.xml.rels
4714 Tue Jan 01 00:00:00 MST 1980 ppt/presentation.xml
2750 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide26.xml
3103 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide30.xml
2200 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide16.xml
6513 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide15.xml
2682 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide31.xml
5865 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide14.xml
2206 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide13.xml
3046 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide25.xml
13352 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide12.xml
6083 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide17.xml
5929 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide18.xml
4080 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide29.xml
3007 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide24.xml
5206 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide23.xml
4207 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide27.xml
4579 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide22.xml
3483 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide21.xml
4177 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide28.xml
6277 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide20.xml
7659 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide19.xml
7534 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide11.xml
4180 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide32.xml
4283 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide4.xml
3439 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide35.xml
3378 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide6.xml
2118 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide2.xml
3049 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide33.xml
3335 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide5.xml
2754 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide36.xml
3992 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide37.xml
2806 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide3.xml
2955 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide7.xml
3515 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide9.xml
2836 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide10.xml
3605 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide1.xml
35794 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide8.xml
4031 Tue Jan 01 00:00:00 MST 1980 ppt/slides/slide34.xml
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide11.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide10.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout12.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide37.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide5.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide2.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide9.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide4.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide6.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide3.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide7.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide8.xml.rels
447 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide1.xml.rels
20729 Tue Jan 01 00:00:00 MST 1980 ppt/slideMasters/slideMaster1.xml
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide13.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide30.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide29.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide28.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide27.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide31.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide32.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide33.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide34.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide35.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide36.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide26.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide25.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide24.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide17.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide16.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide15.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide14.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide18.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide19.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide20.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide21.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide22.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide23.xml.rels
448 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/_rels/notesSlide12.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout11.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout9.xml.rels
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide11.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide10.xml
1616 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide9.xml
2337 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide8.xml
1616 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide7.xml
1616 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide6.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide12.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide13.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide14.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide19.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide18.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide17.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide16.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide15.xml
1696 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide5.xml
1616 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide4.xml
1616 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide3.xml
1736 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout6.xml
6603 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout5.xml
4030 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout4.xml
3076 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout3.xml
2376 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout2.xml
7618 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout1.xml
1439 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout7.xml
4188 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout8.xml
4164 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout9.xml
1616 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide2.xml
1616 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide1.xml
2203 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout12.xml
2656 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout11.xml
2431 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/slideLayout10.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide20.xml
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout10.xml.rels
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide22.xml
2284 Tue Jan 01 00:00:00 MST 1980 ppt/slideMasters/_rels/slideMaster1.xml.rels
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide37.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide36.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide35.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide34.xml
447 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout1.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout2.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout3.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout8.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout7.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout6.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout5.xml.rels
311 Tue Jan 01 00:00:00 MST 1980 ppt/slideLayouts/_rels/slideLayout4.xml.rels
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide33.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide21.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide31.xml
1734 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide27.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide26.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide25.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide24.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide23.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide28.xml
2685 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide32.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide29.xml
1617 Tue Jan 01 00:00:00 MST 1980 ppt/notesSlides/notesSlide30.xml
521299 Tue Jan 01 00:00:00 MST 1980 ppt/media/image2.jpeg
20941 Tue Jan 01 00:00:00 MST 1980 ppt/media/image7.png
1557708 Tue Jan 01 00:00:00 MST 1980 ppt/media/image4.jpeg
114888 Tue Jan 01 00:00:00 MST 1980 ppt/media/image5.png
292 Tue Jan 01 00:00:00 MST 1980 ppt/notesMasters/_rels/notesMaster1.xml.rels
9241 Tue Jan 01 00:00:00 MST 1980 ppt/theme/theme1.xml
69148 Tue Jan 01 00:00:00 MST 1980 ppt/media/image3.jpeg
43096 Tue Jan 01 00:00:00 MST 1980 docProps/thumbnail.wmf
25927 Tue Jan 01 00:00:00 MST 1980 ppt/media/image6.png
6962 Tue Jan 01 00:00:00 MST 1980 ppt/theme/theme2.xml
93711 Tue Jan 01 00:00:00 MST 1980 ppt/media/image1.jpeg
10986 Tue Jan 01 00:00:00 MST 1980 ppt/media/image8.wmf
9070 Tue Jan 01 00:00:00 MST 1980 ppt/notesMasters/notesMaster1.xml
376 Tue Jan 01 00:00:00 MST 1980 ppt/presProps.xml
1045 Tue Jan 01 00:00:00 MST 1980 ppt/viewProps.xml
182 Tue Jan 01 00:00:00 MST 1980 ppt/tableStyles.xml
986 Tue Jan 01 00:00:00 MST 1980 docProps/core.xml
3736 Tue Jan 01 00:00:00 MST 1980 docProps/app.xml


Contents of marx-poi.docx

2143 Tue Jan 01 00:00:00 MST 1980 [Content_Types].xml
590 Tue Jan 01 00:00:00 MST 1980 _rels/.rels
10765 Tue Jan 01 00:00:00 MST 1980 word/_rels/document.xml.rels
135202 Tue Jan 01 00:00:00 MST 1980 word/document.xml
319 Tue Jan 01 00:00:00 MST 1980 word/_rels/footer2.xml.rels
887 Tue Jan 01 00:00:00 MST 1980 word/footer2.xml
939 Tue Jan 01 00:00:00 MST 1980 word/footer1.xml
319 Tue Jan 01 00:00:00 MST 1980 word/_rels/footer1.xml.rels
822 Tue Jan 01 00:00:00 MST 1980 word/header1.xml
946 Tue Jan 01 00:00:00 MST 1980 word/endnotes.xml
952 Tue Jan 01 00:00:00 MST 1980 word/footnotes.xml
2751 Tue Jan 01 00:00:00 MST 1980 word/media/image1.png
6992 Tue Jan 01 00:00:00 MST 1980 word/theme/theme1.xml
122008 Tue Jan 01 00:00:00 MST 1980 word/media/image2.png
356 Tue Jan 01 00:00:00 MST 1980 word/_rels/settings.xml.rels
6519 Tue Jan 01 00:00:00 MST 1980 word/settings.xml
654 Tue Jan 01 00:00:00 MST 1980 word/webSettings.xml
1828 Tue Jan 01 00:00:00 MST 1980 word/fontTable.xml
750 Tue Jan 01 00:00:00 MST 1980 docProps/app.xml
1089 Tue Jan 01 00:00:00 MST 1980 docProps/core.xml
1680 Tue Jan 01 00:00:00 MST 1980 word/numbering.xml
23802 Tue Jan 01 00:00:00 MST 1980 word/styles.xml


Contents of 2008Conferences.xlsx

1780 Tue Jan 01 00:00:00 MST 1980 [Content_Types].xml
588 Tue Jan 01 00:00:00 MST 1980 _rels/.rels
980 Tue Jan 01 00:00:00 MST 1980 xl/_rels/workbook.xml.rels
625 Tue Jan 01 00:00:00 MST 1980 xl/workbook.xml
6995 Tue Jan 01 00:00:00 MST 1980 xl/theme/theme1.xml
605 Tue Jan 01 00:00:00 MST 1980 xl/worksheets/_rels/sheet1.xml.rels
518 Tue Jan 01 00:00:00 MST 1980 xl/worksheets/sheet2.xml
518 Tue Jan 01 00:00:00 MST 1980 xl/worksheets/sheet3.xml
1882 Tue Jan 01 00:00:00 MST 1980 xl/drawings/vmlDrawing1.vml
1748 Tue Jan 01 00:00:00 MST 1980 xl/sharedStrings.xml
3998 Tue Jan 01 00:00:00 MST 1980 xl/styles.xml
3530 Tue Jan 01 00:00:00 MST 1980 xl/worksheets/sheet1.xml
994 Tue Jan 01 00:00:00 MST 1980 xl/comments1.xml
364 Tue Jan 01 00:00:00 MST 1980 xl/printerSettings/printerSettings1.bin
1028 Tue Jan 01 00:00:00 MST 1980 docProps/core.xml
866 Tue Jan 01 00:00:00 MST 1980 docProps/app.xml


All of the above has been background information on the new Office Open XML format employed by Office 2007. Now, here is the Java code that uses OpenXML4J to extract the properties from a PowerPoint presentation, a Word document, and an Excel spreadsheet.


package marx.openxml4j;

import org.openxml4j.exceptions.InvalidFormatException;

import org.openxml4j.opc.Package;
import org.openxml4j.opc.PackageAccess;
import org.openxml4j.opc.PackageProperties;

import org.openxml4j.util.Nullable;

/**
* Example demonstrating how to use OpenXML4J to read properties from Office
* 2007 formatted files (Excel, Word, and PowerPoint).
*/
public class OpenXML4JExample
{
/**
* Read an Excel document with the Office Open XML format used for Office
* 2007 products.
*
* @param aFilePathAndName
*/
public void readOfficeDocumentProperties(final String aFilePathAndName)
{
try
{
Package pkg = Package.open(aFilePathAndName, PackageAccess.READ);
PackageProperties properties = pkg.getPackageProperties();
final Nullable<String> title = properties.getTitleProperty();
final Nullable<String> language = properties.getLanguageProperty();
final Nullable<String> category = properties.getCategoryProperty();
final Nullable<String> keywords = properties.getKeywordsProperty();
final Nullable<String> subject = properties.getSubjectProperty();
final Nullable<String> description = properties.getDescriptionProperty();
final Nullable<String> version = properties.getVersionProperty();
final Nullable<String> creator = properties.getCreatorProperty();
if ( title.hasValue() )
{
System.out.println( aFilePathAndName + " has a title of '"
+ title.getValue() + "'");
}
if ( language.hasValue() )
{
System.out.println( aFilePathAndName + " uses the language "
+ language.getValue() );
}
if ( category.hasValue() )
{
System.out.println( aFilePathAndName + " is in the category of '"
+ category.getValue() + "'");
}
if ( keywords.hasValue() )
{
System.out.println( aFilePathAndName + " has the keywords: "
+ keywords.getValue() );
}
if ( subject.hasValue() )
{
System.out.println( aFilePathAndName + "'s subject is "
+ subject.getValue() );
}
if ( description.hasValue() )
{
System.out.println( aFilePathAndName + "'s description is "
+ description.getValue() );
}
if ( version.hasValue() )
{
System.out.println( aFilePathAndName + " is version "
+ version.getValue() );
}
if ( creator.hasValue() )
{
System.out.println( aFilePathAndName + "'s creator is "
+ creator.getValue() );
}
}
catch (InvalidFormatException invalidFormatEx) // checked exception
{
System.err.println( "Invalid format (looking for Office 2007 format):"
+ invalidFormatEx.getMessage() );
}
}

/**
* Print a separator to stdout to delineate example file being run.
*
* @param aSeparatorTitle Title to place in separator header.
*/
public static void printSeparatorHeader(final String aSeparatorTitle)
{
System.out.println(
"-------------------------------------------------------------------");
System.out.println(
"-- " + aSeparatorTitle );
System.out.println(
"-------------------------------------------------------------------");
}

/**
* Run test showing how to extract Office 2007 format properties from a
* PowerPoint presentation, a Word document, and an Excel spreadsheet using
* the OpenXML4J library (currently in Alpha).
*
* @param aCommandLineArgs Command-line arguments.
*/
public static void main(final String[] aCommandLineArgs)
{
String officeFilePathAndName;
OpenXML4JExample me = new OpenXML4JExample();

// Example Microsoft Office 2007 PowerPoint presentation
officeFilePathAndName = "C:\\sample2007\\marx-poi.pptx";
printSeparatorHeader("PowerPoint Example: " + officeFilePathAndName);
me.readOfficeDocumentProperties(officeFilePathAndName);

// Example Microsoft Office 2007 Word document
officeFilePathAndName = "C:\\sample2007\\marx-poi.docx";
printSeparatorHeader("Word Example: " + officeFilePathAndName);
me.readOfficeDocumentProperties(officeFilePathAndName);

// Example Microsoft Office 2007 Excel spreadsheet
officeFilePathAndName = "C:\\sample2007\\2008Conferences.xlsx";
printSeparatorHeader("Excel Example: " + officeFilePathAndName);
me.readOfficeDocumentProperties(officeFilePathAndName);
}
}


As you can see, the OpenXML4J API for accessing properties is extremely straightforward. I did not get all properties, but did show examples of obtaining the more important Office properties. The code sample also demonstrates the need to catch the checked exception org.openxml4j.exceptions.InvalidFormatException.

The next screen shot (click on it to see larger image) shows the output from running the code above against the three Office 2007-compatible (Office Open XML format) documents described earlier in this blog.



This output demonstrates some useful things to know when using OpenXML4J. For example, I needed to place a log4j JAR file and a dom4j JAR file on the classpath when using the OpenXML4J library. In this case, I knew that the Spring framework download with dependencies contains these libraries and so I simply used those versions. The screen shot also shows that I used the Alpha version of OpenXML4J released on 21 January 2008 (openxml4j-bin-alpha-080121.jar).

There is, of course, much more to OpenXML4J than the extraction of properties from Office 2007 files and we can expect much more to come as well as it moves to production support and quality.

4 comments:

gyanesh said...

Hi Dustin,
Thanks for this example.I want to know that can we extract content(plain text),page count and other Meta-Fields (custom) through openxml4j or some other library in java.
This is urgent.


Thanks in Advance..
Gyanesh Gupta
gupta.gyanesh@gmail.com

Slim said...

Hi,
Thanks for posting this example.
i'm asking about using OpenXML4J to extract data from a .doc or .docx file and make them like a XML file
thanks
slimmakni@hotmail.com

Mohan said...

Hi

Thanks for nice example. But I would like to extract the content of the docx and style of the font. Based on the information, Need to generate the XML. Do you have any idea about it ?

Regards
Mohan
cbe.mohan@gmail.com

Denys Lazarenko said...

Thank you for a post! It is very useful, as there's not much doc about POI and OpenXML4J.