Monday, May 16, 2016

On the Virtues of Avoiding Parsing or Basing Logic on toString() Result

With Java or any other programming language I've used significantly, I have found that there are occasionally things that can be done in the language, but generally should not be done. Often, these misuses of the language seem harmless and perhaps beneficial when a developer first uses them, but later that same developer or another developer runs into associated issues that are costly to overcome or change. An example of this, and subject of this blog post, is using the results of a toString() call in Java to make a logic choice or to be parsed for content.

In 2010, I wrote in Java toString() Considerations, that I generally prefer it when toString() methods are explicitly available for classes and when they contain the relevant public state of an object of that class. I still feel this way. However, I expect a toString() implementation to be sufficient for a human to read the content of the object via logged statement or debugger and not to be something that is intended to be parsed by code or script. Using the String returned by a toString() method for any type of conditional or logic processing is too fragile. Likewise, parsing the toString()'s returned String for details about the instance's state is also fragile. I warned about (even unintentionally) requiring developers to parse toString() results in the previously mentioned blog post.

Developers may choose to change a toString()'s generated String for a variety of reasons including adding existing fields to the output that may not have been represented before, adding more data to existing fields that were already represented, adding text for newly added fields, removing representation of fields no longer in the class, or changing format for aesthetic reasons. Developers might also change spelling and grammar issues of a toString()'s generated String. If the toString()'s provided String is simply used by humans analyzing an object's state in log messages, these changes are not likely to be an issue unless they remove information of substance. However, if code depends on the entire String or parses the String for certain fields, it can be easily broken by these types of changes.

For the purpose of illustration, consider the following initial version of a Movie class:

package dustin.examples.strings;

/**
 * Motion Picture, Version 1.
 */
public class Movie
{
   private String movieTitle;

   public Movie(final String newMovieTitle)
   {
      this.movieTitle = newMovieTitle;
   }

   public String getMovieTitle()
   {
      return this.movieTitle;
   }

   @Override
   public String toString()
   {
      return this.movieTitle;
   }
}

In this simple and somewhat contrived example, there's only one attribute and so it's not unusual that the class's toString() simply returns that class's single String attribute as the class's representation.

The next code listing contains an unfortunate decision (lines 22-23) to base logic on the Movie class's toString() method.

/**
 * This is a contrived class filled with some ill-advised use
 * of the {@link Movie#toString()} method.
 */
public class FavoriteMoviesFilter
{
   private final static List<Movie> someFavoriteMovies;

   static
   {
      final ArrayList<Movie> tempMovies = new ArrayList<>();
      tempMovies.add(new Movie("Rear Window"));
      tempMovies.add(new Movie("Pink Panther"));
      tempMovies.add(new Movie("Ocean's Eleven"));
      tempMovies.add(new Movie("Ghostbusters"));
      tempMovies.add(new Movie("Taken"));
      someFavoriteMovies = Collections.unmodifiableList(tempMovies);
   }

   public static boolean isMovieFavorite(final String candidateMovieTitle)
   {
      return someFavoriteMovies.stream().anyMatch(
         movie -> movie.toString().equals(candidateMovieTitle));
   }
}

This code might appear to work for a while despite some underlying issues with it when more than one movie shares the same title. However, even before running into those issues, a risk of using toString() in the equality check might be realized if a developer decides he or she wants to change the format of the Movie.toString() representation to what is shown in the next code listing.

@Override
public String toString()
{
   return "Movie: " + this.movieTitle;
}

Perhaps the Movie.toString() returned value was changed to make it clearer that the String being provided is associated with an instance of the Movie class. Regardless of the reason for the change, the previously listed code that uses equality on the movie title is now broken. That code needs to be changed to use contains instead of equals as shown in the next code listing.

public static boolean isMovieFavorite(final String candidateMovieTitle)
{
   return someFavoriteMovies.stream().anyMatch(
      movie -> movie.toString().contains(candidateMovieTitle));
}

When it's realized that the Movie class needs more information to make movies differentiable, a developer might add the release year to the movie class. The new Movie class is shown next.

package dustin.examples.strings;

/**
 * Motion Picture, Version 2.
 */
public class Movie
{
   private String movieTitle;

   private int releaseYear;

   public Movie(final String newMovieTitle, final int newReleaseYear)
   {
      this.movieTitle = newMovieTitle;
      this.releaseYear = newReleaseYear;
   }

   public String getMovieTitle()
   {
      return this.movieTitle;
   }

   public int getReleaseYear()
   {
      return this.releaseYear;
   }

   @Override
   public String toString()
   {
      return "Movie: " + this.movieTitle;
   }
}

Adding a release year helps to differentiate between movies with the same title. This helps to differentiate remakes from originals as well. However, the code that used the Movie class to find favorites will still show all movies with the same title regardless of the year the movies were released. In other words, the 1960 version of Ocean's Eleven (6.6 rating on IMDB currently) will be seen as a favorite alongside the 2001 version of Ocean's Eleven (7.8 rating on IMDB currently) even though I much prefer the newer version. Similarly, the 1988 made-for-TV version of Rear Window (5.6 rating currently on IMDB) would be returned as a favorite alongside the 1954 version of of Rear Window (directed by Alfred Hitchcock, starring James Stewart and Grace Kelly, and rated 8.5 currently in IMDB) even though I much prefer the older version.

I think that a toString() implementation should generally include all publicly available details of an object. However, even if the Movie's toString() method is enhanced to include release year, the client code still will not differentiate based on year because it only performs a contain on movie title.

@Override
public String toString()
{
   return "Movie: " + this.movieTitle + " (" + this.releaseYear + ")";
}

The code above shows release year added to Movie's toString() implementation. The code below shows how the client needs to be changed to respect release year properly.

public static boolean isMovieFavorite(
   final String candidateMovieTitle,
   final int candidateReleaseYear)
{
   return someFavoriteMovies.stream().anyMatch(
      movie ->   movie.toString().contains(candidateMovieTitle)
              && movie.getReleaseYear() == candidateReleaseYear);
}

It is difficult for me to think of cases where it is a good idea to parse a toString() method or base a condition or other logic on the results of a toString() method. In just about any example I think about, there is a better way. In my example above, it'd be better to add equals() (and hashCode()) methods to Movie and then use equality checks against instances of Movie instead of using individual attributes. If individual attributes do need to be compared (such as in cases where object equality is not required and only a field or two needs to be equal), then the appropriate getXXX methods could be employed.

As a developer, if I want users of my classes (which will often end up including myself) to not need to parse toString() results or depend on a certain result, I need to ensure that my classes make any useful information available from toString() available from other easily accessible and more programmatic-friendly sources such as "get" methods and equality and comparison methods. If a developer does not want to expose some data via public API, then it's likely that developer probably doesn't really want to expose it in the returned toString() result either. Joshua Bloch, in Effective Java, articulates this in bold-emphasized text, "... provide programmatic access to all of the information contained in the value returned by toString()."

In Effective Java, Bloch also includes discussion about whether a toString() method should have an advertised format of the String representation it provides. He points out that this representation, if advertised, must be the same from then on out if it's a widely used class to avoid the types of runtime breaks I have demonstrated in this post. He also advises that if the format is not guaranteed to stay the same, that the Javadoc include a statement to that effect as well. In general, because Javadoc and other comments are often more ignored than I'd like and because of the "permanent" nature of an advertised toString() representation, I prefer to not rely on toString() to provide a specific format needed by clients, but instead provide a method specific for that purpose that clients can call. This leaves me the flexibility to change my toString() as the class changes.

An example from the JDK illustrates my preferred approach and also illustrates the dangers of prescribing a particular format to an early version of toString(). BigDecimal's toString() representation was changed between JDK 1.4.2 and Java SE 5 as described in "Incompatibilities in J2SE 5.0 (since 1.4.2)": "The J2SE 5.0 BigDecimal's toString() method behaves differently than in earlier versions." The Javadoc for the 1.4.2 version of BigDecimal.toString() simply states in the method overview: "Returns the string representation of this BigDecimal. The digit-to- character mapping provided by Character.forDigit(int, int) is used. A leading minus sign is used to indicate sign, and the number of digits to the right of the decimal point is used to indicate scale. (This representation is compatible with the (String) constructor.)" The same method overview documentation for BigDecimal.toString() in Java SE 5 and later versions is much more detailed. It's such a lengthy description that I won't show it here.

When BigDecimal.toString() was changed with Java SE 5, other methods were introduced to present different String representations: toEngineeringString() and toPlainString(). The newly introduced method toPlainString() provides what BigDecimal's toString() provided through JDK 1.4.2. My preference is to provide methods that provide specific String representations and formats because those methods can have the specifics of the format described in their names and Javadoc comments and changes and additions to the class are not as likely to impact those methods as they are to impact the general toString() method.

There are some simple classes that might fit the case where an originally implemented toString() method will be fixed once and for all and will "never" change. Those might be candidates for parsing the returned String or basing logic on the String, but even in those cases I prefer to provide an alternate method with an advertised and guaranteed format and leave the toString() representation some flexibility for change. It's not a big deal to have the extra method because, while they return the same thing, the extra method can be simply a one-line method calling the toString. Then, if the toString() does change, the calling method's implementation can be changed to be what toString() formerly provided and any users of that extra method won't see any changes.

Parsing of a toString() result or basing logic on the result of a toString() call are most likely to be done when that particular approach is perceived as the easiest way for a client to access particular data. Making that data available via other, specific publicly available methods should be preferred and class and API designers can help by ensuring that any even potentially useful data that will be in the String provided by toString() is also available in a specific alternate programmatically-accessible method. In short, my preference is to leave toString() as a method for seeing general information about an instance in a representation that is subject to change and provide specific methods for specific pieces of data in representations that are much less likely to change and are easier to programmatically access and base decisions on than a large String that potentially requires format-specific parsing.

No comments: