Monday, January 4, 2016

Leaner Java Collections with FastUtil

In response to my recent post Discovering a Trove of Java Primitives Collection Handling on the GNU Trove library, TheAlchemist pointed out some advantages of fastutil over trove: "I much prefer fastutil (http://fastutil.di.unimi.it/), because it's still in active development, has more features, supports large sizes (> 2^32), and has better documentation." Attila-Mihaly Balazs has seconded this: "I second @TheAlchemist's recommendation for fastutil! It's a great library." In this post, I look at fastutil from some of the same perspectives that I previously looked at trove.

The main fastutil page describes fastutil as an extension of the JavaTM Collections Framework that provides "type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion" along with "big (64-bit) arrays, sets and lists, and fast, practical I/O classes for binary and text files." The license for fastutil is Apache License, Version 2 and the current version of fastutil requires Java 7 or newer. There are currently (as of this writing) "unmaintained" versions of fastutil available for download as well for Java 6 and Java 5 as well.

Adding elements to a FastUtil collection is accomplished with the same API calls as used with standard Java collections as demonstrated in the next code listing which compares inserting elements into a JDK ArrayList to inserting elements into a FastUtil DoubleArrayList.

Inserting Doubles with JDK and Inserting doubles with FastUtil DoubleArrayList
/**
 * Demonstrate standard JDK {@code ArrayList<Double>}
 * with some JDK 8 functionality.
 */
public void demonstrateJdkArrayListForDoubles()
{
   final ArrayList<Double> doubles = new ArrayList<>();
   doubles.add(15.5);
   doubles.add(24.4);
   doubles.add(36.3);
   doubles.add(67.6);
   doubles.add(10.0);
   out.println("JDK ArrayList<Double>:");
   out.println("\tDoubles List: " + doubles);
}

/**
 * Demonstrate use of DoubleArrayList and show how
 * similar using it is to using {@code ArrayList<Double>}.
 */
public void demonstrateFastUtilArrayListForDoubles()
{
   // Demonstrate adding elements to DoubleArrayList is
   // exactly like adding elements to ArrayList<Double>.
   final DoubleArrayList doubles = new DoubleArrayList();
   doubles.add(15.5);
   doubles.add(24.4);
   doubles.add(36.3);
   doubles.add(67.6);
   doubles.add(10.0);
   out.println("FastUtil DoubleArrayList:");  // DoubleArrayList overrides toString()
   out.println("\tDoubles List: " + doubles);
}

When the two above methods are executed, the list of doubles that are written to standard output appear exactly the same with even the same square braces surrounding the comma-separated doubles values.

FastUtil collections tend to implement the appropriate JDK collection interfaces. For example, the just-demonstrated class DoubleArrayList implements several interfaces including Collection<Double> and List<Double>. It turns out that DoubleArrayList also implements it.unimi.dsi.fastutil.doubles.DoubleStack and it.unimi.dsi.fastutil.Stack<Double>. The ability to use this class as a stack is demonstrated in the next code listing.

Using FastUtil's DoubleArrayList as a Stack
/**
 * Demonstrate FastUtil's Double Stack.
 *
 * FastUtil's DoubleStack allows access to its
 * contents via push, pop, and peek. It is declared
 * as a DoubleArrayList type here so that the size()
 * method is available without casting.
 */
public void demonstrateFastUtilDoubleStack()
{
   final DoubleArrayList stack = new DoubleArrayList();
   stack.push(15.5);
   stack.push(17.3);
   stack.push(16.6);
   stack.push(2.2);
   out.println("FastUtil Stack of Doubles");
   out.println("\tPeek: " + stack.peek(0) + "; After Size: " + stack.size());
   out.println("\tPop:  " + stack.pop() + "; After Size: " + stack.size());
   out.println("\tPeek: " + stack.peek(0) + "; After Size: " + stack.size());
}

As I discussed in the blog post on Trove, Trove provides a gnu.trove.TCollections class that is an analogous (subset) to java.util.Collections. FastUtil provides similar functionality, but this approach of providing static methods to act upon FastUtil collections is separated into type-specific and structure-specific classes with static methods rather than in a single class with static methods. The next code listing demonstrates using one of these type-specific and structure-specific classes with static methods, IntSets, in conjunction with a FastUtil IntLinkedOpenHashSet. As the name suggests, the IntSets class provides "static methods and objects that do useful things with [int]-specific sets."

Using IntSets with IntLinkedOpenHashSet
/**
 * Demonstrate one of FastUtil's "equivalent"s of the
 * java.util.Collections class. FastUtil separates its
 * grouping of static methods into classes that are
 * specific to the data type of the collection and to
 * the data structure type of the collection.
 */
public void demonstrateFastUtilCollectionsClass()
{
   final IntLinkedOpenHashSet integers = new IntLinkedOpenHashSet();
   integers.add(5);
   integers.add(7);
   integers.add(3);
   integers.add(1);
   final IntSet unmodifiableIntegers = IntSets.unmodifiable(integers);
   out.println("Unmodifiable Integers:");
   out.println("\tClass: " + unmodifiableIntegers.getClass().getCanonicalName());
   try
   {
      unmodifiableIntegers.add(15);
   }
   catch (Exception ex)
   {
      out.println("\tException caught: " + ex);
   }
}

FastUtil supports that standard Java iteration approaches of using an explicit iterator and using the Java SE 5-introduced for-each loop. FastUtil collections even support the JDK 8 style using .forEach (assuming code is built and run on JDK 8) because the FastUtil collections implement java.lang.Iterable. These are demonstrated in the next code listing.

Iterating FastUtil Collections in Standard Java Style
/**
 * Demonstrate "traditional" Java iteration of a
 * FastUtil collection.
 */
public void demonstrateIterationWithIterator()
{
   final LongOpenHashSet longs = new LongOpenHashSet();
   longs.add(15);
   longs.add(6);
   longs.add(12);
   longs.add(13);
   longs.add(2);
   final LongIterator longIterator = longs.iterator();
   while (longIterator.hasNext())
   {
      final long longValue = longIterator.next();
      out.print(longValue + " ");
   }
}

/**
 * Demonstrate iteration of a FastUtil collection
 * using Java's enhanced for-each approach.
 */
public void demonstrateIterationWithForEach()
{
   final LongLinkedOpenHashSet longs = new LongLinkedOpenHashSet();
   longs.add(15);
   longs.add(6);
   longs.add(12);
   longs.add(13);
   longs.add(2);
   for (final long longValue : longs)
   {
      out.println(longValue + " ");
   }
}

/**
 * Demonstrate iteration of a FastUtil collection
 * using JDK 8 .forEach approach.
 */
public void demonstrateIterationWithJdk8ForEach()
{
   final LongLinkedOpenHashSet longs = new LongLinkedOpenHashSet();
   longs.add(15);
   longs.add(6);
   longs.add(12);
   longs.add(13);
   longs.add(2);
   longs.forEach(longValue -> out.print(longValue + " "));
}
Additional Observations Related to FastUtil
  • Because FastUtil collections implement standard JDK 8 collection interfaces, the APIs are easy to pick up and use with familiar Java idioms.
  • FastUtil collections generally provide a constructor that accepts an array of the underlying data type and an overridden toArray() method and a type-specific method such as toDoubleArray() [for double-oriented collections] to provide their data elements in form of array of primitives.
  • FastUtil collections generally provide explicitly overridden toString() implementations that allow for the individual data elements to be easily written similar to JDK collections and differently than Java arrays (which require Arrays.toString() methods).
  • FastUtil's Java packages are organized generally by primitive type with specific implementations of the various data structure types for that primitive type all in the same package. For example, packages are named like it.unimi.dsi.fastutil.doubles, it.unimi.dsi.fastutil.ints, and so on.
  • Because each FastUtil collection is specific to a particular primitive data type, each collection does not require a generic parameter and has none of the issues related to generics (such as erasure). I haven't seen FastUtil take advantage of the collections being type-specific for type-specific methods like Trove does, probably because FastUtil more closely implements the corresponding Java collection interfaces.
  • FastUtil's Javadoc API documentation is probably the best place to start with when learning to use FastUtil. The classes, interfaces, enums, and packages tend to be fairly well documented in FastUtil's Javadoc-based API documentation.
Conclusion

Like Trove, FastUtil is a library that can potentially be used to more efficiently (in terms of memory and performance) work with Java collections. While Trove seems to have formerly been the most popular of the many choices available, FastUtil is perhaps the most popular currently for reasons that include those cited by TheAlchemist: "still in active development, has more features, supports large sizes (> 2^32), and has better documentation." Similar libraries besides Trove and FastUtil include High Performance Primitive Collections for Java (HPPC), Koloboke, Goldman Sachs Collections, Mahout collections, and Javolution.

1 comment:

tuolumne said...

I am interested in someone doing an independent performance review of the various available Maps for scanning - Iteration, forEach, and threaded parallel scans. We provide AirConcurrentMap at boilerbay.com to improve these cases and others. It is free for non-commercial use, and licensed commercially with our without source. The parallel scan uses a hidden internal thread pool with a simple API for semi-declarative use, and it can provide parallelism for operations on the Map contents. It is an extended standard ConcurrentNavigableMap. Also there are advantages compared to streams.

We have done our own performance testing, with graphical results at boilerbay.com over various Map sizes for the standard library Maps. Performance is best above 1K entries. The tests also show memory efficiency and concurrency advantages. It is ordered, and improves on ConcurrentSkipListMap for get/put/remove and higher/lower above 1K Entries.

Also we have a Java NoSQL embedded database called InfinityDB. It is an extended standard ConcurrentNavigableMap at boilerbay.com.

Roger Deran
boilerbay.com