Monday, January 26, 2015

Reading Large Lines Slower in JDK 7 and JDK 8

I recently ran into a case where a particular task (LineContainsRegExp) in an Apache Ant build file ran considerably slower in JDK 7 and JDK 8 than it did in JDK 6 for extremely long lines. Based on a simple example adapted from the Java code used by the LineContainsRegExp task, I was able to determine that the slowness has nothing to do with the regular expression, but rather with reading characters from a file. The remainder of the post demonstrates this.

For my simple test, I first wrote a small Java class to write out a file that includes a line with as many characters as specified on the command line. The simple class, FileMaker, is shown next:

FileMaker.java
import static java.lang.System.out;

import java.io.FileWriter;

/**
 * Writes a file with a line that contains the number of characters provided.
 */
public class FileMaker
{
   /**
    * Create a file with a line that has the number of characters specified.
    *
    * @param arguments Command-line arguments where the first argument is the
    *    name of the file to be written and the second argument is the number
    *   of characters to be written on a single line in the output file.
    */
   public static void main(final String[] arguments)
   {
      if (arguments.length > 1)
      {
         final String fileName = arguments[0];
         final int maxRowSize = Integer.parseInt(arguments[1]);
         try
         {
            final FileWriter fileWriter = new FileWriter(fileName);
            for (int count = 0; count < maxRowSize; count++)
            {
               fileWriter.write('.');
            }
            fileWriter.flush();
            fileWriter.close();
         }
         catch (Exception ex)
         {
            out.println("ERROR: Cannot write file '" + fileName + "': " + ex.toString());
         }
      }
      else
      {
         out.println("USAGE: java FileMaker <fileName> <maxRowSize>");
         System.exit(-1);
      }
   }
}

The above Java class exists solely to generate a file with a line that has as many characters as specified (actually one more than specified when the \n is counted). The next class demonstrates the difference in runtime behavior between Java 6 and Java 7. The code for this Main class is adapted from the Ant classes that perform the file reading used by LineContainsRegExp, but without the regular expression matching. Even with the regular expression support left out, this class executes much more quickly for very large lines when run in Java 6 than when run in Java 7 or Java 8.

Main.java
import static java.lang.System.out;

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.concurrent.TimeUnit;

/**
 * Adapted from and intended to represent the basic character reading from file
 * used by the Apache Ant class org.apache.tools.ant.filters.LineContainsRegExp.
 */
public class Main
{
   private Reader in;
   private String line;

   public Main(final String nameOfFile)
   {
      if (nameOfFile == null || nameOfFile.isEmpty())
      {
         throw new IllegalArgumentException("ERROR: No file name provided.");
      }
      try
      {
         in = new FileReader(nameOfFile);
      }
      catch (Exception ex)
      {
         out.println("ERROR: " + ex.toString());
         System.exit(-1);
      }
   }
   

   /**
    * Read a line of characters through '\n' or end of stream and return that
    * line of characters with '\n'; adapted from readLine() method of Apache Ant
    * class org.apache.tools.ant.filters.BaseFilterReader.
    */
   protected final String readLine() throws IOException
   {
      int ch = in.read();

      if (ch == -1)
      {
         return null;
      }
        
      final StringBuilder line = new StringBuilder();

      while (ch != -1)
      {
         line.append ((char) ch);
         if (ch == '\n')
         {
            break;
         }
         ch = in.read();
      }

      return line.toString();
   }

   /**
    * Provides the next character in the stream; adapted from the method read()
    * in the Apache Ant class org.apache.tools.ant.filters.LineContainsRegExp.
    */
   public int read() throws IOException
   {
      int ch = -1;
 
      if (line != null)
      {
         ch = line.charAt(0);
         if (line.length() == 1)
         {
            line = null;
         }
         else
         {
            line = line.substring(1);
         }
      }
      else
      {
         for (line = readLine(); line != null; line = readLine())
         {
            if (line != null)
            {
               return read();
            }
         }
      }
      return ch;
   }

   /**
    * Process provided file and read characters from that file and display
    * those characters on standard output.
    *
    * @param arguments Command-line arguments; expect one argument which is the
    *    name of the file from which characters should be read.
    */
   public static void main(final String[] arguments) throws Exception
   {
      if (arguments.length > 0)
      {
         final long startTime = System.currentTimeMillis();
         out.println("Processing file '" + arguments[0] + "'...");
         final Main instance = new Main(arguments[0]);
         int characterInt = 0;
         int totalCharacters = 0;
         while (characterInt != -1)
         {
            characterInt = instance.read();
            totalCharacters++;
         }
         final long endTime = System.currentTimeMillis();
         out.println(
              "Elapsed Time of "
            + TimeUnit.MILLISECONDS.toSeconds(endTime - startTime)
            + " seconds for " + totalCharacters + " characters.");
      }
      else
      {
         out.println("ERROR: No file name provided.");
      }
   }
}

The runtime performance difference between Java 6 and Java 7 or Java 8 becomes more pronounced as the number of characters per line grows. The next screen snapshot demonstrates running the example in Java 6 (indicated by "jdk1.6" being part of the path name of the java launcher) and then in Java 8 (no explicit path provided because Java 8 is my default JRE) against a freshly generated file called dustin.txt that includes a line with 1 million (plus one) characters.

Although a Java 7 example is not shown in the screen snapshot above, my tests have shown that Java 7 is similarly slow to Java 8 when processing very long lines. I have seen this in both Windows and RedHat Linux JVMs. As the example indicates, the Java 6 version, even for a million characters in a line, reads the file in what rounds to 0 seconds. When the same compiled-for-Java-6 class file is executed with Java 8, the average time to handle the 1 million characters is over 150 seconds (2 1/2 minutes). The same slowness applies when the class is executed in Java 7, and it persists even when the class is compiled with JDK 7 or JDK 8.

Java 7 and Java 8 seem to slow down far worse than linearly when reading file characters as the number of characters on a line increases. When I raise the 1 million character line to 10 million characters as shown in the next screen snapshot, Java 6 still reads those very fast (still rounded to 0 seconds), but Java 8 requires over 5 hours to complete the task!
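As a rough check on how the cost grows, the two Java 8 timings above can be plugged into a simple power-law fit (time proportional to length raised to some exponent). This is only a sketch: the post reports "over 150 seconds" and "over 5 hours", so the inputs below are lower bounds rather than measurements, and the names GrowthEstimate and exponent are hypothetical.

```java
/**
 * Estimates the exponent k in "time ~ length^k" from two (length, time)
 * observations. The timings used in main() are the approximate figures
 * quoted in the post, not precise measurements.
 */
class GrowthEstimate
{
   static double exponent(
      final double length1, final double seconds1,
      final double length2, final double seconds2)
   {
      // If t = c * n^k, then t2 / t1 = (n2 / n1)^k, so k = log(t2/t1) / log(n2/n1).
      return Math.log(seconds2 / seconds1) / Math.log(length2 / length1);
   }

   public static void main(final String[] arguments)
   {
      final double k = exponent(1000000, 150, 10000000, 5 * 3600);
      System.out.println("Estimated exponent: " + k);
   }
}
```

With these inputs the estimate comes out at roughly 2.08, which suggests growth close to quadratic in the line length rather than exponential. Two data points cannot establish that on their own.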

I don't know why Java 7 and Java 8 read a very long line from a file so much more slowly than Java 6 does. I hope that someone else can explain this. While I have several ideas for working around the issue, I would like to understand why Java 7 and Java 8 read lines with a very large number of characters so much more slowly than Java 6. Here are the observations that can be made based on my testing:

  • The issue appears to be a runtime (JRE) issue rather than a compile-time (JDK) issue because even the file-reading class compiled with JDK 6 runs significantly slower in JRE 7 and JRE 8.
  • Both the Windows 8 and RedHat Linux JRE environments consistently indicated that the file reading is dramatically slower for very large lines in Java 7 and in Java 8 than in Java 6.
  • Processing time for reading very long lines appears to increase far faster than linearly with the number of characters in the line in Java 7 and Java 8.

4 comments:

Christoph Nahr said...

Interesting observation. One obvious guess: the default buffer size for FileReader may have changed since JDK6. Would be worth a try to manually specify the buffer size. Also, I would recommend explicitly specifying the initial StringBuilder buffer size to ensure that hasn't changed either.

Michael Waters said...

Hey, had this forwarded on from a coworker.

I think the behavior change is because of this line, which is causing the new garbage collector to go nuts.

line = line.substring(1);

With a 1M initial line, this is going to create about 500GB of garbage (Sum of 1 to 1,000,000) that will need to be collected. To walk a string, this is sub-optimal code.
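The "sum of 1 to 1,000,000" figure in this comment can be worked through as below. This is a sketch that assumes every substring(1) call copies all remaining characters, and that each character occupies 2 bytes in a char[]; the class and method names are hypothetical.

```java
/**
 * Back-of-the-envelope count of characters copied when a line is consumed
 * one character at a time via line = line.substring(1), ASSUMING each call
 * copies the remaining characters into a new array.
 */
class SubstringCopyEstimate
{
   static long copiedChars(final long lineLength)
   {
      // Successive copies hold (n-1), (n-2), ..., 1 characters: n(n-1)/2 in total.
      return lineLength * (lineLength - 1) / 2;
   }

   public static void main(final String[] arguments)
   {
      final long chars = copiedChars(1000000L);
      System.out.println(chars + " characters copied, or "
         + (chars * 2) + " bytes at 2 bytes per char.");
   }
}
```

For a line of 1,000,000 characters this gives 499,999,500,000 copied characters, which matches the "about 500GB" order of magnitude if each character is counted as one unit. At 2 bytes per char the byte total would be about twice that.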

Michael Martin said...

The issue is the change in how Java 7 handles the substring call of Strings compared to how Java 6 did substrings. Java 6 changed indexes and offsets to return a value which pointed to an existing allocated char[] of a String. Java 7 allocates a brand new String on every substring call. Not only is the overhead of creating and copying this new, large String onto the heap very expensive, garbage collection will now go crazy as Michael Waters stated.
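If the substring(1) call is indeed the cost, one possible way to avoid it is to keep the current line intact and track a position within it. The following is a minimal sketch of that idea, not the fix adopted by Ant or by the post's author. IndexedLineReader is a hypothetical name, and a StringReader stands in for the FileReader so the example is self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical variant of the post's Main class: it returns one character
 * per read() call, but advances an index into the current line instead of
 * replacing the line with line.substring(1) after every character.
 */
class IndexedLineReader
{
   private final Reader in;
   private String line;   // current line, or null when a new one is needed
   private int position;  // index of the next character to return from line

   IndexedLineReader(final Reader reader)
   {
      in = reader;
   }

   /** Reads through '\n' or end of stream; returns null at end of stream. */
   private String readLine() throws IOException
   {
      int ch = in.read();
      if (ch == -1)
      {
         return null;
      }
      final StringBuilder builder = new StringBuilder();
      while (ch != -1)
      {
         builder.append((char) ch);
         if (ch == '\n')
         {
            break;
         }
         ch = in.read();
      }
      return builder.toString();
   }

   /** Returns the next character, or -1 at end of stream. */
   public int read() throws IOException
   {
      if (line == null)
      {
         line = readLine();
         position = 0;
         if (line == null)
         {
            return -1;
         }
      }
      final int ch = line.charAt(position++);
      if (position == line.length())
      {
         line = null;  // line exhausted; fetch the next one on the next call
      }
      return ch;
   }

   public static void main(final String[] arguments) throws IOException
   {
      final IndexedLineReader reader =
         new IndexedLineReader(new StringReader("ab\ncd"));
      int count = 0;
      while (reader.read() != -1)
      {
         count++;
      }
      System.out.println(count + " characters read.");
   }
}
```

Each character is touched a constant number of times here, so the work per line should grow linearly with its length regardless of how substring is implemented in the running JRE.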

@DustinMarx said...

Thanks Christoph, Michael, and Michael for the feedback.

Based on what you wrote (and on a link that Michael Martin sent me), I have written a post describing what happened between Java 6 and Java 7 (actually in Java 7 Update 6). That post is called Reason for Slower Reading of Large Lines in JDK 7 and JDK 8.

Dustin