Saturday, June 15, 2019

History and Motivations Behind Java's Maligned Serialization

Issues related to Java's serialization mechanism are well-advertised. The entire last chapter of both Effective Java 1st Edition (Chapter 10) and Effective Java 2nd Edition (Chapter 11) is dedicated to the subject of serialization in Java. The final chapter of Effective Java 3rd Edition (Chapter 12) is still devoted to serialization, but includes a new item (Item 85) that goes even further to emphasize two assertions related to Java serialization:

  • "The best way to avoid serialization exploits is to never deserialize anything."
  • "There is no reason to use Java serialization in any new system you write."

In the recently released document "Towards Better Serialization," Brian Goetz "explores a possible direction for improving serialization in the Java Platform." Although the main intention of this document is to propose a potential new direction for Java serialization, it is an "exploratory document only and does not constitute a plan for any specific feature." It is therefore an interesting read for the direction Java serialization might take, but there is also significant value in reading it for its summary of Java serialization as it currently exists and of how we got to this place. That is the main theme of the rest of my post, in which I'll reference and summarize the sections of "Towards Better Serialization" that I feel best articulate the current issues with Java's serialization mechanism and why we have these issues.

Goetz opens his document's "Motivation" section with an attention-grabbing paragraph on the "paradox" of Java serialization:

Java's serialization facility is a bit of a paradox. On the one hand, it was probably critical to Java's success --- Java would probably not have risen to dominance without it, as serialization enabled the transparent remoting that in turn enabled the success of Java EE. On the other hand, Java's serialization makes nearly every mistake imaginable, and poses an ongoing tax (in the form of maintenance costs, security risks, and slower evolution) for library maintainers, language developers, and users.

The other paragraph in the "Motivation" section of the Goetz document distinguishes between the general concept of serialization and the specific design of Java's current serialization mechanism:

To be clear, there's nothing wrong with the concept of serialization; the ability to convert an object into a form that can be easily transported across JVMs and reconstituted on the other side is a perfectly reasonable idea. The problem is with the design of serialization in Java, and how it fits (or more precisely, does not fit) into the object model.

Goetz states that "Java's serialization [mistakes] are manifold" and he outlines the "partial list of sins" committed by Java's serialization design. I highly recommend reading the original document for the concise and illustrative descriptions of these "sins" that I only summarize here.

  • "Pretends to be a library feature, but isn't."
    • "Serialization pretends to be a library feature. ... In reality, though, serialization extracts object state and recreates objects via privileged, extralinguistic mechanisms, bypassing constructors and ignoring class and field accessibility."
  • "Pretends to be a statically typed feature, but isn't."
    • "Serializability is a function of an object's dynamic type, not its static type."
    • "implements Serializable doesn't actually mean that instances are serializable, just that they are not overtly serialization-hostile."
  • "The compiler won't help you" identify "all sorts of mistakes one can make when writing serializable classes"
  • "Magic methods and fields" that are not "specified by any base class or interface" yet "affect the behavior of serialization"
  • "Woefully imperative."
  • "Tightly coupled to encoding."
  • "Unfortunate stream format" that is "neither compact, nor efficient, nor human-readable."
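The point that serializability depends on an object's dynamic type rather than its static type can be seen in a few lines. The following is a minimal sketch of my own and is not taken from Goetz's document; the `Holder` class and the `canSerialize` helper are hypothetical names invented for illustration. `Holder` compiles cleanly and declares `implements Serializable`, yet whether an instance can actually be written is only discovered at run time.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class DynamicTypeDemo {

    /** Hypothetical class: claims to be Serializable, but that depends on what 'payload' holds. */
    static final class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Object payload;

        Holder(Object payload) {
            this.payload = payload;
        }
    }

    /** Returns true if the object graph could be written, false if serialization rejected it. */
    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            // Thrown at run time; the compiler gave no warning about this.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new Holder("text")));       // true: String is Serializable
        System.out.println(canSerialize(new Holder(new Object()))); // false: Object is not
    }
}
```

Both `Holder` instances have the same static type, so nothing in the source distinguishes the one that works from the one that fails.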

Goetz also outlines the ramifications of these Java serialization design decisions (see the original document for more background on each of these "serious problems"):

  • "Cripples library maintainers."
    • "Library designers must think very carefully before publishing a serializable class --- as doing so potentially commits you to maintaining compatibility with all the instances that have ever been serialized."
  • "Makes a mockery of encapsulation."
    • "Serialization constitutes an invisible but public constructor, and an invisible but public set of accessors for your internal state."
  • "Readers cannot verify correctness merely by reading the code."
    • "But because serialization constitutes a hidden public constructor, you have to also reason about the state that objects might be in based on previous versions of the code."
    • "By bypassing constructors, serialization completely subverts the integrity of the object model."
  • "Too hard to reason about security."
    • "The variety and subtlety of security exploits that target serialization is impressive; no ordinary developer can keep them all in their head at once."
  • "Impedes language evolution."
    • "Complexity in programming languages comes from unexpected interactions between features, and serialization interacts with nearly everything."
    • "Serialization is an ongoing tax on evolving the language."
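To make the "hidden public constructor" complaint concrete, here is a small sketch of my own (the `Account` class, its fields, and the `roundTrip` helper are hypothetical and do not come from Goetz's document). It counts constructor invocations and relies on a field initializer to keep a field non-null, then shows that an instance produced by deserialization honors neither.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class ConstructorBypassDemo {

    /** Incremented only by Account's constructor. */
    static int constructorCalls = 0;

    /** Hypothetical class whose source suggests 'auditLog' can never be null. */
    static final class Account implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String owner;
        // The initializer runs during normal construction only.
        private final transient List<String> auditLog = new ArrayList<>();

        Account(String owner) {
            constructorCalls++;
            this.owner = owner;
        }

        String owner() {
            return owner;
        }

        List<String> auditLog() {
            return auditLog;
        }
    }

    /** Serializes and then deserializes the given account using an in-memory buffer. */
    static Account roundTrip(Account original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Account) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Account original = new Account("alice");
        Account copy = roundTrip(original);
        System.out.println("distinct objects:  " + (copy != original));
        System.out.println("constructor calls: " + constructorCalls);
        System.out.println("copy's auditLog:   " + copy.auditLog());
    }
}
```

Two `Account` objects exist after the round trip, but the constructor ran once, and the copy's `final` field that "cannot" be null is null. A reader of `Account` alone would have no reason to expect either result.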

Perhaps my favorite section of Goetz's "Towards Better Serialization" document is "The underlying mistake," because the items that Goetz outlines in this section are common reasons for mistakes in other Java code I've written, read, and worked with. In other words, while Goetz is specifically discussing how these design decisions led to problems for Java's serialization mechanism, I have (unsurprisingly) found these general design decisions to cause problems in other areas as well.

Goetz opens the section "The underlying mistake" with this statement: "Many of the design errors listed above stem from a common source --- the choice to implement serialization by 'magic' rather than giving deconstruction and reconstruction a first-class place in the object model itself." I have often found "magic" code written by other developers, and even code I wrote myself and returned to later, to be confusing and difficult to reason about. I've definitely realized that clean, explicit code is often preferable.

Goetz adds, "Worse, the magic does its best to remain invisible to the reader." Invisible "magic" designs often seem clever when we first implement them, but they cause a lot of pain for the developers who must read, maintain, and change the code when those developers suddenly need some visibility into the underlying magic.

Goetz cites Edsger W. Dijkstra and writes, "Serialization, as it is currently implemented, does the exact opposite of minimizing the gap between the text of the program and its computational effect; we could be forgiven for mistakenly assuming that our objects are always initialized by the constructors written in our classes, but we shouldn't have to be."

Goetz concludes "The underlying mistake" section with a paragraph that begins, "In addition to trying to be invisible, serialization also tries to do too much." Although Goetz is writing particularly about Java's serialization currently "serializing programs [rather than] merely serializing data," I have seen this issue countless times in a more general sense. It is tempting for us developers to design and implement code that performs every little feature we think might be useful to someone at some point, even if the vast majority of (or even all currently known) users and use cases only require a simpler subset of the functionality.

Given that the objective of "Towards Better Serialization" is to "explore a possible direction for improving serialization in the Java Platform," it's not surprising that the document goes into significant detail about design and even implementation details that might influence Java's future serialization mechanism. In addition, the Project Amber mailing lists (amber-dev and amber-spec-experts) also have significant discussion on possible future direction of Java serialization. However, the purpose of my post here is not to look at the future of Java's serialization, but to instead focus on how this document has nicely summarized Java's current serialization mechanism and its history.

Although the previously mentioned Project Amber mailing list messages focus on the potential future of Java's serialization mechanism, there are some interesting comments in these posts about Java's current serialization that add to what Goetz summarized in "Towards Better Serialization." Here are some of the most interesting:

  • Goetz's post that announced "Towards Better Serialization" states that the proposal "addresses the risks of serialization at their root" and "brings object serialization into the light, where it needs to be in order to be safer."
  • A post by Brian Goetz reiterates, by implication, that a big part of the problem with Java's serialization today is that it constructs objects without invoking a constructor: "our main security goal [is to allow] deserialization [to] proceed through constructors."
  • Stuart Marks's post states, "The line of reasoning about convenience in the proposal is not that convenience itself is evil, but that in pursuit of convenience, the original design adopted extralinguistic mechanisms to achieve it. This weakens some of the fundamentals of the Java platform, and it has led directly to several bugs and security holes, several of which I've fixed personally."
    • Marks outlines some specific examples of subtle bugs in the JDK due to serialization-related design decisions.
    • Marks outlines the explicit and specific things a constructor must do ("bunch of special characteristics") that are circumvented when current deserialization is used.
    • He concludes, "THIS is the point of the proposal. Bringing serialization into the realm of well-defined language constructs, instead of using extralinguistic 'magic' mechanisms, is a huge step forward in improving quality and security of Java programs."
  • Kevin Bourrillion's post states, "Java's implementation of serialization has been a gaping wound for a long time" and adds that "every framework to support other wire formats has always had to start from scratch."
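The goal quoted above of having deserialization "proceed through constructors" can be approximated in today's Java with the serialization proxy pattern described in Effective Java. The sketch below is my own illustration of that existing pattern and is not the mechanism proposed in "Towards Better Serialization"; the `Period` class and the `hostileProxy` and `roundTrip` helpers are hypothetical names. Note that it depends on three of the "magic" methods (`writeReplace`, `readObject`, and `readResolve`) that no interface declares.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationProxyDemo {

    /** Hypothetical value class whose constructor enforces start <= end. */
    static final class Period implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int start;
        private final int end;

        Period(int start, int end) {
            if (start > end) {
                throw new IllegalArgumentException("start " + start + " is after end " + end);
            }
            this.start = start;
            this.end = end;
        }

        int start() { return start; }
        int end() { return end; }

        // Magic method: write the proxy to the stream instead of this object.
        private Object writeReplace() {
            return new Proxy(start, end);
        }

        // Magic method: reject any stream that claims to contain a raw Period.
        private void readObject(ObjectInputStream in) throws InvalidObjectException {
            throw new InvalidObjectException("Proxy required");
        }

        /** The only form of Period that ever appears in a stream. */
        private static final class Proxy implements Serializable {
            private static final long serialVersionUID = 1L;
            private final int start;
            private final int end;

            Proxy(int start, int end) {
                this.start = start;
                this.end = end;
            }

            // Magic method: rebuild the real object through its validating constructor.
            private Object readResolve() {
                return new Period(start, end);
            }
        }
    }

    /** Stands in for attacker-controlled stream content: a proxy carrying unchecked values. */
    static Object hostileProxy(int start, int end) {
        return new Period.Proxy(start, end);
    }

    /** Serializes and then deserializes the given object using an in-memory buffer. */
    static Object roundTrip(Object original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Period copy = (Period) roundTrip(new Period(1, 5));
        System.out.println(copy.start() + ".." + copy.end());
        try {
            roundTrip(hostileProxy(5, 1));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected by constructor: " + e.getMessage());
        }
    }
}
```

The pattern works, but it is exactly the kind of imperative, invisible plumbing the posts above criticize: every class that wants constructor-checked deserialization has to hand-write it.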

I highly recommend reading "Towards Better Serialization" to anyone interested in Java serialization regardless of whether their primary interest is Java's current serialization mechanism or what it might one day become. It's an interesting document from both perspectives.

Monday, June 10, 2019

JDK 13: VM.events Added to jcmd

CSR (Compatibility and Specification Review) JDK-8224601 ["Provide VM.events diagnostic command"] is implemented in JDK 13 as of JDK 13 Early-Access Build #24 (dated 2019/6/6) and was added via Enhancement JDK-8224600 ["Provide VM.events command"]. The CSR's "Summary" describes this enhancement: "Add a VM.events command to jcmd to display event logs." The CSR's "Solution" states, "Add a command to jcmd to print out event logs. The proposed name is 'VM.events'."

The "Problem" section of CSR JDK-8224601 explains the value achieved from adding VM.events to the already multi-functioning jcmd: "Event logs are a valuable problem analysis tool. Right now the only way to see them is via hs-err file in case the VM died, or as part of the VM.info output."

To demonstrate jcmd's new VM.events in action, I downloaded JDK 13 Early Access Build #24 and used it to compile a simple, contrived Java application. I then ran the jcmd tool delivered with that same JDK 13 Early Access Build #24 against the running application.

The first screen snapshot shown here demonstrates using jcmd to detect the PID of the simple Java application and using jcmd <pid> help to see what jcmd options are available for that particular running Java process. The presence of VM.events is highlighted.

The next screen snapshot demonstrates applying jcmd <pid> help VM.events to see the usage (including available options) for the newly added VM.events command.

The final screen snapshot demonstrates application of jcmd's new VM.events command by showing the top (most) portion of the output from running that command without any options.

The options for the VM.events command are to narrow down results to a specified log to be printed or to limit the number of events shown. By not specifying any options, I was implicitly requesting the default of all logs and all events.
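For readers who would rather script this than type it, here is a rough Java sketch that assembles a jcmd VM.events command line and runs it against the current JVM. The `vmEventsCommand` helper is a hypothetical name of my own, and the option keys `log` and `max` reflect my understanding of the command's help text rather than anything shown in this post, so confirm them with jcmd <pid> help VM.events on your JDK. The sketch also assumes that jcmd from JDK 13 or later is on the PATH.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class VmEventsDemo {

    /**
     * Builds "jcmd <pid> VM.events [log=<name>] [max=<n>]".
     * Pass null for log and 0 for max to request the defaults (all logs, all events).
     * The "log" and "max" keys are assumptions; verify them with "jcmd <pid> help VM.events".
     */
    static List<String> vmEventsCommand(long pid, String log, int max) {
        List<String> command = new ArrayList<>();
        command.add("jcmd");
        command.add(Long.toString(pid));
        command.add("VM.events");
        if (log != null) {
            command.add("log=" + log);
        }
        if (max > 0) {
            command.add("max=" + max);
        }
        return command;
    }

    public static void main(String[] args) throws InterruptedException {
        // Target this JVM itself; jcmd runs as a separate process and attaches to it.
        long pid = ProcessHandle.current().pid();
        List<String> command = vmEventsCommand(pid, null, 5);
        System.out.println(String.join(" ", command));
        try {
            Process jcmd = new ProcessBuilder(command).inheritIO().start();
            if (!jcmd.waitFor(30, TimeUnit.SECONDS)) {
                jcmd.destroyForcibly();
            }
        } catch (IOException e) {
            System.out.println("Could not run jcmd: " + e.getMessage());
        }
    }
}
```

Leaving both options unset reproduces the no-options invocation described above.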

In the last displayed screen snapshot, we can see that the types of JVM events rendered in the output include "compilation events", "deoptimization events", garbage collection events, classes unloaded, classes redefined, and classes loaded.

I have been a big fan of jcmd for a number of years and believe it is still generally an underappreciated command-line tool for many Java developers. The addition of the VM.events command in JDK 13 makes the tool even more useful for diagnosing a wider variety of issues.