Monday, January 26, 2015

Reason for Slower Reading of Large Lines in JDK 7 and JDK 8

I earlier posted the blog post Reading Large Lines Slower in JDK 7 and JDK 8 and there were some useful comments on the post describing the issue. This post provides more explanation regarding why the file reading demonstrated in that post (and used by Ant's LineContainsRegExp) is so much slower in Java 7 and Java 8 than in Java 6.

X Wang's post The substring() Method in JDK 6 and JDK 7 describes how String.substring() was changed between JDK 6 and JDK 7. Wang writes in that post that the JDK 6 substring() "creates a new string, but the string's value still points to the same [backing char] array in the heap." He contrasts that with the JDK 7 approach, "In JDK 7, the substring() method actually create a new array in the heap."

Wang's post is very useful for understanding the differences in String.substring() between Java 6 and Java 7. The comments on this post are also insightful. The comments include the sentiment that I can appreciate, "I would say 'different' not 'improved'." There are also explanations of how JDK 7 avoids a potential memory leak that could occur in JDK 6.

The StackOverflow thread Java 7 String - substring complexity explains the motivation of the change and references bug JDK-4513622 : (str) keeping a substring of a field prevents GC for object. That bug states, "An OutOfMemory error [occurs] because objects don't get garbage collected if the caller stores a substring of a field in the object." The bug contains sample code that demonstrates this error occurring. I have adapted that code here:

 * Minimally adapted from Bug JDK-4513622.
 * {@link}
public class TestGC
   private String largeString = new String(new byte[100000]);
   private String getString()
      return this.largeString.substring(0,2);
   public static void main(String[] args)
      java.util.ArrayList<String> list = new java.util.ArrayList<String>();
      for (int i = 0; i < 1000000; i++)
         final TestGC gc = new TestGC();

The next screen snapshot demonstrates that last code snippet (adapted from Bug JDK-4513622) executed with both Java 6 (jdk1.6 is part of the path of the executable Java launcher) and Java 8 (the default version on my host). As the screen snapshot shows, an OutOfMemoryError is thrown when the code is run in Java 6 but is not thrown when it is run in Java 8.

In other words, the change in Java 7 fixes a potential memory leak at the cost of a performance impact when executing String.substring against lengthy Java Strings. This means that any implementations that use String.substring (including Ant's LineContainsRegExp) to process really long lines probably need to be changed to implement this differently or should be avoided when processing very long lines when migrating from Java 6 to Java 7 and beyond.

Once the issue is known (change of String.substring implementation in this case), it is easier to find documentation online regarding what is happening (thanks for the comments that made these resources easy to find). The duplicate bugs of JDK-4513622 have writes-ups that provide additional details. These bugs are JDK-4637640 : Memory leak due to String.substring() implementation and JDK-6294060 : Use of substring() causes memory leak. Other related online resources include Changes to String.substring in Java 7 [which includes a reference to String.intern() – there are better ways], Java 6 vs Java 7: When implementation matters, and the highly commented (over 350 comments) Reddit thread TIL Oracle changed the internal String representation in Java 7 Update 6 increasing the running time of the substring method from constant to N.

The post Changes to String internal representation made in Java 1.7.0_06 provides a good review of this change and summarizes the original issue, the fix, and the new issue associated with the fix:

Now you can forget about a memory leak described above and never ever use new String(String) constructor anymore. As a drawback, you now have to remember that String.substring has now a linear complexity instead of a constant one.

No comments: