Wednesday, February 18, 2015

Determining File Types in Java

Programmatically determining the type of a file can be surprisingly tricky, and many content-based file identification approaches have been proposed and implemented. Several implementations are available in Java for detecting file types, and most of them are based largely or solely on files' extensions. This post looks at some of the most commonly available implementations of file type detection in Java.

Several approaches to identifying file types in Java are demonstrated in this post. Each approach is briefly described, illustrated with a code listing, and then associated with output that demonstrates how different common files are typed based on extensions. Some of the approaches are configurable, but all examples shown here use "default" mappings as provided out-of-the-box unless otherwise stated.

About the Examples

The screen snapshots shown in this post are of each listed code snippet run against a set of subject files created to test the different implementations of file type detection in Java. Before covering these approaches and demonstrating the type each approach detects, I list the files under test, showing what each is named and what type it really is.

File Name         File Extension   File Type   Type Matches Extension Convention?
actualXml.xml     xml              XML         Yes
blogPostPDF       (none)           PDF         No
blogPost.pdf      pdf              PDF         Yes
blogPost.gif      gif              GIF         Yes
blogPost.jpg      jpg              JPEG        Yes
blogPost.png      png              PNG         Yes
blogPostPDF.txt   txt              PDF         No
blogPostPDF.xml   xml              PDF         No
blogPostPNG.gif   gif              PNG         No
blogPostPNG.jpg   jpg              PNG         No
dustin.txt        txt              Text        Yes
dustin.xml        xml              Text        No
dustin            (none)           Text        No

Files.probeContentType(Path) [JDK 7]

Java SE 7 introduced the highly utilitarian Files class and that class's Javadoc succinctly describes its use: "This class consists exclusively of static methods that operate on files, directories, or other types of files" and, "in most cases, the methods defined here will delegate to the associated file system provider to perform the file operations."

The java.nio.file.Files class provides the method probeContentType(Path) that "probes the content type of a file" through use of "the installed FileTypeDetector implementations" (the Javadoc also notes that "a given invocation of the Java virtual machine maintains a system-wide list of file type detectors").

/**
 * Identify file type of file with provided path and name
 * using JDK 7's Files.probeContentType(Path).
 *
 * @param fileName Name of file whose type is desired.
 * @return String representing identified type of file with provided name.
 */
public String identifyFileTypeUsingFilesProbeContentType(final String fileName)
{
   String fileType = "Undetermined";
   final File file = new File(fileName);
   try
   {
      fileType = Files.probeContentType(file.toPath());
   }
   catch (IOException ioException)
   {
      out.println(
           "ERROR: Unable to determine file type for " + fileName
              + " due to exception " + ioException);
   }
   return fileType;
}

When the above Files.probeContentType(Path)-based approach is executed against the set of files previously defined, the output appears as shown in the next screen snapshot.

The screen snapshot indicates that the default behavior for Files.probeContentType(Path) on my JVM seems to be tightly coupled to the file extension. The files with no extensions show "null" for file type and the other listed file types match the files' extensions rather than their actual content. For example, all three files with names starting with "dustin" are really the same single-sentence text file, but Files.probeContentType(Path) states that they are each a different type and the listed types are tightly correlated with the different file extensions for essentially the same text file.
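Because Files.probeContentType(Path) returns null when no installed detector recognizes a file, callers may prefer a fallback value over a null. The following is a minimal sketch of that idea rather than part of the original examples: the class name ProbeWithFallback and the method probeOrDefault are hypothetical, and the type reported for any given file still depends on the detectors installed on the platform.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ProbeWithFallback
{
   /**
    * Probe the content type of the file at the given path, substituting
    * the supplied default when no installed detector recognizes the file.
    *
    * @param path Path of file whose type is desired.
    * @param defaultType Value to return when the probe yields null.
    * @return Probed content type, or defaultType if the probe returned null.
    */
   public static String probeOrDefault(final Path path, final String defaultType)
   {
      try
      {
         final String probed = Files.probeContentType(path);
         return probed != null ? probed : defaultType;
      }
      catch (IOException ioException)
      {
         // Rethrown unchecked so callers of this sketch need no try/catch.
         throw new UncheckedIOException(ioException);
      }
   }

   public static void main(final String[] arguments)
   {
      for (final String fileName : arguments)
      {
         System.out.println(
            fileName + ": " + probeOrDefault(Paths.get(fileName), "Undetermined"));
      }
   }
}
```

This only changes how an undetermined result is reported; it does not make the detection any less dependent on the file's extension.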

MimetypesFileTypeMap.getContentType(String) [JDK 6]

The class MimetypesFileTypeMap was introduced with Java SE 6 to provide "data typing of files via their file extension" using "the .mime.types format." The class's Javadoc explains where in a given system the class looks for MIME types file entries. My example uses the ones that come out-of-the-box with my JDK 8 installation. The next code listing demonstrates use of javax.activation.MimetypesFileTypeMap.

/**
 * Identify file type of file with provided name using
 * JDK 6's MimetypesFileTypeMap.
 *
 * See Javadoc documentation for MimetypesFileTypeMap class
 * (http://docs.oracle.com/javase/8/docs/api/javax/activation/MimetypesFileTypeMap.html)
 * for details on how to configure mapping of file types or extensions.
 */
public String identifyFileTypeUsingMimetypesFileTypeMap(final String fileName)
{    
   final MimetypesFileTypeMap fileTypeMap = new MimetypesFileTypeMap();
   return fileTypeMap.getContentType(fileName);
}

The next screen snapshot demonstrates the output from running this example against the set of test files.

This output indicates that the MimetypesFileTypeMap approach returns the MIME type of application/octet-stream for several files including the XML files and the text files without a .txt suffix. We see also that, like the previously discussed approach, this approach in some cases uses the file's extension to determine the file type and so incorrectly reports the file's actual file type when that type is different than what its extension conventionally implies.

URLConnection.getContentType()

I will be covering three methods in URLConnection that support file type detection. The first is URLConnection.getContentType(), a method that "returns the value of the content-type header field." Use of this instance method is demonstrated in the next code listing and the output from running that code against the common test files is shown after the code listing.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.getContentType().
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGetContentType(final String fileName)
{
   String fileType = "Undetermined";
   try
   {
      // File.toURI() builds a well-formed file: URL for relative and absolute paths alike.
      final URL url = new File(fileName).toURI().toURL();
      final URLConnection connection = url.openConnection();
      fileType = connection.getContentType();
   }
   catch (MalformedURLException badUrlEx)
   {
      out.println("ERROR: Bad URL - " + badUrlEx);
   }
   catch (IOException ioEx)
   {
      out.println("Cannot access URLConnection - " + ioEx);
   }
   return fileType;
}

The file detection approach using URLConnection.getContentType() is highly coupled to files' extensions rather than the actual file type. When there is no extension, the String returned is "content/unknown".

URLConnection.guessContentTypeFromName(String)

The second file detection approach provided by URLConnection that I'll cover here is its method guessContentTypeFromName(String). Use of this static method is demonstrated in the next code listing and associated output screen snapshot.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.guessContentTypeFromName(String).
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGuessContentTypeFromName(final String fileName)
{
   return URLConnection.guessContentTypeFromName(fileName);
}

URLConnection's guessContentTypeFromName(String) approach to file detection shows "null" for files without file extensions and otherwise returns file type String representations that closely mirror the files' extensions. These results are very similar to those provided by the Files.probeContentType(Path) approach shown earlier with the one notable difference being that URLConnection's guessContentTypeFromName(String) approach identifies files with .xml extension as being of file type "application/xml" while Files.probeContentType(Path) identifies these same files' types as "text/xml".
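Since guessContentTypeFromName(String) looks only at the name, it can be called for files that do not exist at all, which makes its dependence on the extension easy to see. The next listing is a small sketch under that assumption; the class name GuessFromNameDemo and the method guessFromName are hypothetical names introduced for illustration, and the file names are taken from the table of test files above.

```java
import java.net.URLConnection;

public class GuessFromNameDemo
{
   /**
    * Guess a content type from a file name alone.
    *
    * @param fileName Name of file; the file itself is never opened.
    * @param defaultType Value to return when the name is not recognized.
    * @return Guessed content type, or defaultType if the guess was null.
    */
   public static String guessFromName(final String fileName, final String defaultType)
   {
      final String guessed = URLConnection.guessContentTypeFromName(fileName);
      return guessed != null ? guessed : defaultType;
   }

   public static void main(final String[] arguments)
   {
      // None of these files needs to exist; only the names are examined.
      final String[] names = {"blogPost.png", "blogPostPNG.gif", "dustin"};
      for (final String name : names)
      {
         System.out.println(name + " -> " + guessFromName(name, "unknown"));
      }
   }
}
```

A PNG file that has been given a .gif extension is reported according to the .gif extension here, because the content is never read.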

URLConnection.guessContentTypeFromStream(InputStream)

The third approach I cover that is provided by URLConnection for file type detection is via the class's static method guessContentTypeFromStream(InputStream). A code listing employing this approach and associated output in a screen snapshot are shown next.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.guessContentTypeFromStream(InputStream).
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGuessContentTypeFromStream(final String fileName)
{
   String fileType;
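   // try-with-resources ensures the stream is closed once detection completes.
   try (FileInputStream inputStream = new FileInputStream(new File(fileName)))
   {
      fileType = URLConnection.guessContentTypeFromStream(inputStream);
   }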
   catch (IOException ex)
   {
      out.println("ERROR: Unable to process file type for " + fileName + " - " + ex);
      fileType = "null";
   }
   return fileType;
}

All the file types are null! The reason for this appears to be explained by the Javadoc for the InputStream parameter of the URLConnection.guessContentTypeFromStream(InputStream) method: "an input stream that supports marks." It turns out that the instances of FileInputStream in my examples do not support marks (their calls to markSupported() all return false).
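One way to satisfy the mark requirement, as a reader also notes in the comments on this post, is to wrap the FileInputStream in a BufferedInputStream, which does support marks. The following is a minimal sketch of that approach and not one of the original examples: the class name GuessFromBufferedStream and both method names are hypothetical, and the second method writes a temporary file containing only the eight-byte PNG signature, given a misleading .gif extension, purely to exercise the first.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

public class GuessFromBufferedStream
{
   /** The eight-byte signature that begins every PNG file. */
   private static final byte[] PNG_SIGNATURE =
      {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

   /**
    * Guess a file's type from its leading bytes, using a stream
    * that supports marks so the JDK can peek at the content.
    *
    * @param fileName Name of file whose type is desired.
    * @return Guessed content type, or null if the content is not recognized.
    */
   public static String guessFromContent(final String fileName)
   {
      try (InputStream stream = new BufferedInputStream(new FileInputStream(fileName)))
      {
         return URLConnection.guessContentTypeFromStream(stream);
      }
      catch (IOException ioException)
      {
         throw new UncheckedIOException(ioException);
      }
   }

   /**
    * Write a PNG signature to a temporary file named with a .gif
    * extension and report the type guessed from its content.
    */
   public static String guessTypeOfMislabeledPng()
   {
      try
      {
         final Path mislabeled = Files.createTempFile("blogPostPNG", ".gif");
         try
         {
            Files.write(mislabeled, PNG_SIGNATURE);
            return guessFromContent(mislabeled.toString());
         }
         finally
         {
            Files.deleteIfExists(mislabeled);
         }
      }
      catch (IOException ioException)
      {
         throw new UncheckedIOException(ioException);
      }
   }

   public static void main(final String[] arguments)
   {
      System.out.println("Mislabeled PNG detected as: " + guessTypeOfMislabeledPng());
   }
}
```

With a mark-capable stream the method inspects the content rather than the name, so the .gif extension no longer decides the result. The set of formats it recognizes is small, though; the same commenter reports that it picks up images and a few others but not PDFs or MS Office documents.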

Apache Tika

All of the examples of file detection covered in this post so far have been approaches provided by the JDK. There are third-party libraries that can also be used to detect file types in Java. One example is Apache Tika, a "content analysis toolkit" that "detects and extracts metadata and text from over a thousand different file types." In this post, I look at using Tika's facade class and its detect(String) method to detect file types. The instance method call is the same in the three examples I show, but the results are different because each instance of the Tika facade class is instantiated with a different Detector.

The instantiations of Tika instances with different Detectors are shown in the next code listing.

/** Instance of Tika facade class with default configuration. */
private final Tika defaultTika = new Tika();

/** Instance of Tika facade class with MimeTypes detector. */
private final Tika mimeTika = new Tika(new MimeTypes());

/** Instance of Tika facade class with Type detector. */
private final Tika typeTika = new Tika(new TypeDetector());

With these three instances of Tika instantiated with their respective Detectors, we can call the detect(String) method on each instance for the set of test files. The code for this is shown next.

/**
 * Identify file type of file with provided name using
 * Tika's default configuration.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingDefaultTika(final String fileName)
{
   return defaultTika.detect(fileName);
}

/**
 * Identify file type of file with provided name using
 * Tika with a MimeTypes detector.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingMimeTypesTika(final String fileName)
{
   return mimeTika.detect(fileName);
}

/**
 * Identify file type of file with provided name using
 * Tika with a TypeDetector.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingTypeDetectorTika(final String fileName)
{
   return typeTika.detect(fileName);
}

When the three Tika detection examples above are executed against the same set of files used in the previous examples, the output appears as shown in the next screen snapshot.

We can see from the output that the default Tika detector reports file types similarly to some of the other approaches shown earlier in this post (very tightly tied to the file's extension). The other two demonstrated detectors state that the file type is application/octet-stream in most cases. Because I called the overloaded version of detect that accepts a String, the file type detection is "based on known file name extensions."

If the overloaded detect(File) method is used instead of detect(String), the identified file type results are much better than the previous Tika examples and the previous JDK examples. In fact, the "fake" extensions don't fool the detectors as much and the default Tika detector is especially good in my examples at identifying the appropriate file type even when the extension is not the normal one associated with that file type. The code for using Tika.detect(File) and the associated output are shown next.

   /**
    * Identify file type of file with provided name using
    * Tika's default configuration.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingDefaultTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = defaultTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

   /**
    * Identify file type of file with provided name using
    * Tika with a MimeTypes detector.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingMimeTypesTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = mimeTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

   /**
    * Identify file type of file with provided name using
    * Tika with a TypeDetector.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingTypeDetectorTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = typeTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

Caveats and Customization

File type detection is not a trivial feat to pull off. The Java approaches demonstrated in this post provide basic file type detection that is often highly dependent on a file name's extension. If files are named with conventional extensions that the detection approach recognizes, these approaches are typically sufficient. However, if unconventional extensions are used, or if a file's extension implies a type other than the file's actual type, most of these approaches break down without customization. Fortunately, most of them provide the ability to customize the mapping of file extensions to file types. The Tika approach using Tika.detect(File) was generally the most accurate in the examples shown in this post when the extensions were not the conventional ones for the particular file types.
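To make the contrast with extension-based detection concrete, the next listing sketches content-based detection by comparing a file's leading bytes against a few well-known signatures ("magic numbers"). This is an illustration only and is not how Tika or any of the JDK methods above are implemented: the class name MagicNumberSniffer and its methods are hypothetical, and only the PDF, PNG, GIF, and JPEG signatures are covered.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicNumberSniffer
{
   /** Number of leading bytes needed to cover the longest signature below. */
   private static final int HEADER_LENGTH = 8;

   /**
    * Guess a MIME type from the leading bytes of a file's content.
    *
    * @param header Leading bytes of the content; may be shorter than a signature.
    * @return Guessed MIME type, or null if no known signature matches.
    */
   public static String sniff(final byte[] header)
   {
      if (startsWith(header, 0x25, 0x50, 0x44, 0x46, 0x2D)) // "%PDF-"
      {
         return "application/pdf";
      }
      if (startsWith(header, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) // PNG
      {
         return "image/png";
      }
      if (startsWith(header, 0x47, 0x49, 0x46, 0x38)) // "GIF8"
      {
         return "image/gif";
      }
      if (startsWith(header, 0xFF, 0xD8, 0xFF)) // JPEG start-of-image marker
      {
         return "image/jpeg";
      }
      return null;
   }

   /** Compare the unsigned values of the leading bytes against a signature. */
   private static boolean startsWith(final byte[] data, final int... signature)
   {
      if (data.length < signature.length)
      {
         return false;
      }
      for (int index = 0; index < signature.length; index++)
      {
         if ((data[index] & 0xFF) != signature[index])
         {
            return false;
         }
      }
      return true;
   }

   /** Read up to HEADER_LENGTH bytes from the named file and sniff them. */
   public static String sniffFile(final String fileName) throws IOException
   {
      final byte[] buffer = new byte[HEADER_LENGTH];
      int total = 0;
      try (InputStream stream = new FileInputStream(fileName))
      {
         int count;
         // A single read may return fewer bytes than requested, so loop.
         while (total < buffer.length
            && (count = stream.read(buffer, total, buffer.length - total)) != -1)
         {
            total += count;
         }
      }
      return sniff(Arrays.copyOf(buffer, total));
   }

   public static void main(final String[] arguments) throws IOException
   {
      for (final String fileName : arguments)
      {
         System.out.println(fileName + ": " + sniffFile(fileName));
      }
   }
}
```

Because only the content is examined, a file such as blogPostPDF.txt would be judged by its leading bytes rather than by its .txt extension. Plain text has no signature, so a sketch like this returns null for the "dustin" files, and a real detector needs a far larger signature table along with rules for container formats.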

Conclusion

There are numerous mechanisms available for simple file type detection in Java. This post reviewed some of the standard JDK approaches for file detection and some examples of using Tika for file detection.

4 comments:

Debjit said...

How do I detect the MIME type of a file without relying on the extension, relying completely on the content of the file?

@DustinMarx said...

Debjit,

You might want to look into DROID or using a specific implementation of Tika's Detector that only uses file content and not file extension.

Dustin

Jeremy said...

Thanks for this overview.

As a sidenote: by wrapping a FileInputStream in a BufferedInputStream I picked up the marker functionality. The guessContentTypeFromStream() method is ... nice. It picks up images and a few others. But our primary use case is for PDFs and MS Office documents, and it's not really capable of identifying those.

Unless I can talk our higher-ups into bundling Tika, I may have to write my own minimal package to at least identify Office formats. (or scour the web for someone who's already done as much).

Unknown said...

I've made a color-coded spreadsheet, filetypes.xlsx, listing the outcomes for the various methods. Two of the tika methods do the best.