Saturday, July 20, 2013

Escaping XML with Groovy 2.1

When posting source code to my blog, I often need to convert less-than signs (<) and greater-than signs (>) to their respective entity references so that the browser does not mistake them for HTML tags when it renders the page. I have typically done this with quick search-and-replace commands such as %s/</\&lt;/g and %s/>/\&gt;/g in vim, or with Perl. However, Groovy 2.1 introduced a method that does this, and in this post I demonstrate a Groovy script that uses that method, groovy.xml.XmlUtil.escapeXml(String).

escapeXml.groovy
#!/usr/bin/env groovy
/*
 * escapeXml.groovy
 *
 * Requires Groovy 2.1 or later.
 */
// Require the name of the file to be processed as the first argument.
if (args.length < 1)
{
   println "USAGE: groovy escapeXml.groovy <xmlFileToBeProcessed>"
   System.exit(-1)
}
def inputFileName = args[0]
println "Processing ${inputFileName}..."
def inputFile = new File(inputFileName)
String outputFileName = inputFileName + ".escaped"
def outputFile = new File(outputFileName)
// createNewFile() returns false if the output file already exists,
// so an existing file is never overwritten.
if (outputFile.createNewFile())
{
   outputFile.text = groovy.xml.XmlUtil.escapeXml(inputFile.text)
}
else
{
   println "Unable to create file ${outputFileName} (it may already exist)"
}

As its Groovydoc states, the XmlUtil.escapeXml method is intended to "escape the following characters " ' & < > with their XML entities." Running source code through it converts these symbols to XML entity references that the browser renders properly. This is particularly helpful with Java code that uses generics, for example.

The Groovydoc states that the following transformations from symbols to corresponding entity references are supported:

Symbol   Entity Reference
"        &quot;
'        &apos;
&        &amp;
<        &lt;
>        &gt;

One advantage of this approach is that I can escape all five of these special symbols in an entire String or file with a single call rather than one symbol at a time.
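As a minimal sketch of that single-call behavior, the following snippet escapes a line of Java source that uses generics and string literals. The javaSource value is a hypothetical example of my own, not something taken from the Groovydoc.

```groovy
import groovy.xml.XmlUtil

// Hypothetical line of Java source with generics and string literals.
def javaSource = 'List<String> names = Arrays.asList("a", "b");'

// One call replaces every occurrence of the five special characters.
def escaped = XmlUtil.escapeXml(javaSource)

// prints: List&lt;String&gt; names = Arrays.asList(&quot;a&quot;, &quot;b&quot;);
println escaped
```

The angle brackets around the type parameter and the double quotes are all converted in the same pass, which is what makes the method convenient for blog-bound code listings.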

The Groovydoc for the XmlUtil.escapeXml method also lists what the method does not do:

  • "Does not escape control characters" [use XmlUtil.escapeControlCharacters(String) for this]
  • "Does not support DTDs or external entities"
  • "Does not treat surrogate pairs specially"
  • "Does not perform Unicode validation on its input"
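The first of these limitations can be illustrated with a short sketch that combines the two methods. The input string is a hypothetical example, and the ordering (escapeXml first, then escapeControlCharacters) is my own assumption about how best to chain them, made so that any ampersand introduced by the second call is not escaped again by the first.

```groovy
import groovy.xml.XmlUtil

// Hypothetical input containing a tab (a control character) and an ampersand.
def raw = "key\tvalue & more"

// escapeXml replaces the ampersand but leaves the tab untouched.
def entitiesOnly = XmlUtil.escapeXml(raw)

// escapeControlCharacters then replaces the remaining control character.
def fullyEscaped = XmlUtil.escapeControlCharacters(entitiesOnly)

println entitiesOnly
println fullyEscaped
```

If control characters do not matter for the output being generated, the call to escapeXml alone is enough.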

My example above showed a Groovy script file that uses XmlUtil.escapeXml(String), but the method can also be run inline on the command line. In DOS, for example, this looks like the following:

type escapeXml.groovy | groovy -e "println groovy.xml.XmlUtil.escapeXml(System.in.text)"

This command pipes the provided file (escapeXml.groovy itself in this case) to Groovy and prints the content with the five special symbols replaced by entity references. The same approach works on Linux/Unix with "cat" in place of "type." This is shown in the next screen snapshot.

This blog post has shown how XmlUtil.escapeXml(String) can be used within a script or on the command line to escape certain commonly problematic XML characters to their entity references. Although not shown here, one could embed such code within a Java application as well.
