Saturday, January 29, 2011

Having the Last Word with Java and Groovy

For the past several months, I have been something of a Groovy evangelist, touting the virtues of Groovy to Java developer colleagues. Recently, a colleague asked if Groovy has a slick way to extract the last word from a sentence. My immediate answer was to use Java's String.split(String), split on a single-space string (" "), and take the last element of the returned array. However, a quick perusal of the Groovy GDK documentation reminded me of the many alternatives available to the Groovy developer. In this post, I cover some of these options.

The following Groovy script, lastWord.groovy, demonstrates multiple ways of getting the final word in a String of words separated by spaces. Although this script contains multiple approaches, I certainly don't claim that it covers all available approaches via Java or Groovy.

lastWord.groovy
#!/usr/bin/env groovy
if (args.length < 1)
{
   println "Please provide a String as a parameter."
   System.exit(-1)
}
def stringWithWords = args[0]
println "Parsing the final word from provided string '${stringWithWords}' ..."

//
// Get last word with Java String's substring method.
//
def javaSubStringWords = stringWithWords.substring(stringWithWords.lastIndexOf(" ") + 1)
println "[Java] String.substring(String.lastIndexOf): ${javaSubStringWords}"

//
// Get last word with Java String's split(regex) where regex is any whitespace
// character (not limited to space).
//
def javaWhiteSpaceSplitWords = stringWithWords.split("\\s+")
println "[Java] String.split(white space): ${javaWhiteSpaceSplitWords[javaWhiteSpaceSplitWords.length-1]}"

//
// Get the last word with Java String's split(" ") method and accessing last
// element in array returned by the split method.
//
def javaWords = stringWithWords.split(" ")
def javaSplitLastWord = javaWords[javaWords.length-1]
println "[Java] String.split(\" \"): ${javaSplitLastWord}"

//
// Get the last word with Groovy's (GDK) String.split()
//
def groovyWords = stringWithWords.split()
def splitLastWord = groovyWords[groovyWords.length-1]
println "[Groovy] String.split(): ${splitLastWord}"

//
// Get the last word with Groovy's (GDK) String.tokenize() and List.last() 
//
println "[Groovy] String.tokenize()/List.last(): ${stringWithWords.tokenize().last()}"

//
// Get last word with Groovy's GDK String.find(String) method using regular
// expression anchored to the end of the line ($).
//
println "[Groovy] String.find(String regex): ${stringWithWords.find('\\w+$')}"

//
// Get last word with "crazy" approach of using Groovy's GDK String reverse()
// method (twice) in conjunction with the find(String regex) method. This is
// not intended to be an example of how this should be done; rather it is an
// example of what Groovy can do (even if it shouldn't be done).
//
print "[Groovy] String.reverse().find(regex).reverse(): "
println stringWithWords.reverse().find('\\w+').reverse()

The output from running the above code is shown in the next screen snapshot.


All approaches used in the above Groovy code provide the "events" string as expected when passed the longer string "When, in the course of human events". If the larger string is passed in with a comma at the end of it ("When, in the course of human events,"), the output is a little different as demonstrated in the next screen snapshot.


As the two screen snapshots above indicate, the approaches used in the Groovy script can lead to different results depending on whether the String to be parsed for the last word ends with punctuation. The next screen snapshot indicates the behavior of the respective approaches if the provided String ends with one or more space characters.


The approaches covered in the example Groovy script are summarized next. Note that although all examples are demonstrated in a Groovy script as shown above, the first three are general Java approaches that can be used directly in Java. The last four listed approaches employ one or more Groovy-specific features.
  • Java's String.substring and Java's String.lastIndexOf(" ") provide the last word by finding the last space in the String and bounding the returned String from the next character after the space (the +1) to the end of the String. This example is also referenced in Java: Fastest Way to Get Last Word in a String. This approach includes the comma as part of the final word when the String ends with one, and it returns an empty String when the String ends with a space, because nothing follows the last space (so calling String.trim() before invoking this would be a good idea).
  • Java's String.split(String) with a regular expression matching one or more whitespace characters returns "events" for the nominal case (when "events" really are the last characters in the provided String) and for the case where the String ends with spaces (String.trim() not needed); it returns "events," when the provided String ends with a comma. Because String.split(String) returns an array, the last element of the array is accessed with array syntax whether in Java or in Groovy. This approach is also mentioned in Java: Fastest Way to Get Last Word in a String.
  • The Java example that uses String.split(String) on a more limited basis (supplying " " to the method rather than \s+) works the same way but splits only on a single space (" ") rather than on any whitespace character.
  • Groovy's GDK String class provides a convenience String.split() method that takes no parameters because it assumes whitespace as the delimiter. It behaves like the previous two Java approaches using Java's String.split(String), but does not require the developer to explicitly specify that whitespace is the delimiter. Like the two Java approaches, this approach returns "events" when the provided String ends with "events" or with "events" followed by spaces, and returns "events," when the provided String literally ends with that String.
  • The Groovy GDK String class provides a String.tokenize() method that works similarly to Groovy's String.split() method, but returns a List rather than an array. Because it works similarly, it is not surprising that it returns "events", "events", and "events," as do those approaches. Note that the GDK String class also provides overloaded versions of the tokenize method that accept a String or a single Character as the delimiter.
  • Groovy provides GDK String methods with the name "find" that allow substrings matching an expression to be "found" in a provided String. In the Groovy script above, this method is invoked with a String representation of the regular expression pattern '\\w+$'. Because this pattern requires one or more word characters immediately before the end of the line (the $ anchor), it returns null when the provided String ends with spaces or with a comma. It returns "events" as expected when that is the last series of characters. Groovy's GDK String.find is overloaded several times, including one version that accepts a Pattern. The behavior is the same and will be demonstrated later in this post.
  • Groovy's GDK String.reverse() method can be used to reverse the order of characters in a given String. With this, another approach is to reverse the provided String, extract the first word, and then re-reverse the extracted String to get it back in the correct order. If performance is an issue, pulling a double reverse (not the American football type) is probably not the best idea. However, one major advantage of this approach is that it concisely (in terms of code space) returns "events" without any other characters for all three test cases. Indeed, it's the only one of the demonstrated approaches that does this. Of course, the other approaches could be tweaked to do this as well. For example, the regular expression passed to the Groovy GDK String.find method (either overloaded version, accepting a String or a Pattern) can be adjusted for different results.
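
Because the first three approaches are plain Java, they can be tried without Groovy at all. The following is a minimal sketch in Java, not part of the original script; the class name LastWordJava and its method names are made up for illustration. It applies each of the three approaches to the three test Strings discussed above.

```java
final class LastWordJava {
    // Approach 1: everything after the last space character.
    static String bySubstring(String text) {
        return text.substring(text.lastIndexOf(" ") + 1);
    }

    // Approach 2: split on runs of any whitespace and take the last element.
    static String byWhitespaceSplit(String text) {
        String[] words = text.split("\\s+");
        return words[words.length - 1];
    }

    // Approach 3: split on a single space and take the last element.
    static String bySingleSpaceSplit(String text) {
        String[] words = text.split(" ");
        return words[words.length - 1];
    }

    public static void main(String[] args) {
        String[] samples = {
            "When, in the course of human events",
            "When, in the course of human events,",
            "When, in the course of human events   "
        };
        for (String sample : samples) {
            System.out.println("Input: '" + sample + "'");
            System.out.println("  substring/lastIndexOf: '" + bySubstring(sample) + "'");
            System.out.println("  split(\"\\\\s+\"):         '" + byWhitespaceSplit(sample) + "'");
            System.out.println("  split(\" \"):            '" + bySingleSpaceSplit(sample) + "'");
        }
    }
}
```

For the three samples, the substring approach yields "events", "events," and an empty String, while both split approaches yield "events", "events," and "events". The split approaches survive trailing spaces because String.split(String) discards trailing empty strings from the array it returns.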

I mentioned previously that Groovy's GDK String provides a find method that accepts an instance of Pattern rather than a String. Groovy's regular expression niceties really shine in this case. A code snippet that demonstrates Groovy's regular expression slashy syntax, ~ for Pattern, and String.find(Pattern) is shown next.

//
// Get last word with Groovy's GDK String.find(Pattern) method using regular
// expression anchored to the end of the line ($).
//
def regExPattern = ~/\w+$/
println "[Groovy] String.find(Pattern): ${stringWithWords.find(regExPattern)}"
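
The earlier remark that the regular expression can be adjusted for different results can be illustrated with one possible tweak. The sketch below is written in plain Java with java.util.regex, on the assumption that the GDK find method returns the first match or null, as described above. The class LastWordRegex, its find helper, and the LENIENT pattern are hypothetical names and choices of mine, not something from the GDK or the original script. The lenient pattern uses a lookahead so that trailing non-word characters (a comma or spaces) are tolerated but not included in the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastWordRegex {
    // Same pattern as in the script: word characters immediately before the end.
    static final Pattern STRICT = Pattern.compile("\\w+$");

    // Hypothetical tweak: word characters followed only by non-word characters
    // up to the end; the lookahead keeps those characters out of the match.
    static final Pattern LENIENT = Pattern.compile("\\w+(?=\\W*$)");

    // Returns the first match of the pattern, or null when nothing matches.
    static String find(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "When, in the course of human events",
            "When, in the course of human events,",
            "When, in the course of human events   "
        };
        for (String sample : samples) {
            System.out.println("strict:  " + find(STRICT, sample));
            System.out.println("lenient: " + find(LENIENT, sample));
        }
    }
}
```

With the strict pattern, only the first sample produces "events" and the other two produce null; with the lenient pattern, all three produce "events". The same pattern text should be usable with Groovy's slashy syntax, although I have only reasoned through the Java form here.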

Many of the approaches discussed above assume that the provided String contains at least one "word"; exceptions such as ArrayIndexOutOfBoundsException will be encountered if this assumption does not hold. For a production script, these conditions should either be prevented by checking the provided String or handled by catching the exception.
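
One way to prevent these conditions is to check the String before splitting it. For example, a String consisting only of spaces splits on " " into an empty array, so indexing its last element fails. The following Java sketch shows one possible guard; the class SafeLastWord and the choice to return an Optional are my own assumptions for illustration, not something the original script does.

```java
import java.util.Optional;

final class SafeLastWord {
    // Returns the last whitespace-delimited word, or an empty Optional when
    // the input is null, empty, or contains only whitespace.
    static Optional<String> lastWord(String text) {
        if (text == null) {
            return Optional.empty();
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        // After trimming, the array is guaranteed to have at least one element.
        String[] words = trimmed.split("\\s+");
        return Optional.of(words[words.length - 1]);
    }

    public static void main(String[] args) {
        System.out.println(lastWord("When, in the course of human events"));
        System.out.println(lastWord("   "));
    }
}
```

Returning an Optional pushes the "no word found" decision to the caller; catching the exception, as mentioned above, would be the other route.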


Conclusion

When I was a young boy, I found satisfaction in tormenting my younger brother by insisting on having the "last word" when we were supposed to be going to sleep. I'd say a word just loud enough so that he'd hear it and know I had the last word. As an adult, getting the last word can still be satisfying, and Groovy makes doing just that particularly easy.
