Monday, July 19, 2010

split Command for DOS/Windows Via Groovy

One of the commands that I miss most from Linux when working in Windows/DOS environments is the split command.  This extremely handy command splits a large file into multiple smaller files, with the size of each smaller file specified either as a number of lines or as a number of bytes (or kilobytes or megabytes).  There are many uses for such functionality, including fitting files onto certain media and making files "readable" by applications with file length restrictions.  Unfortunately, I'm not aware of a split equivalent built into Windows or DOS.  PowerShell can be scripted to do something like this, but that implementation is specific to PowerShell.  There are also third-party products that provide similar functionality.  However, these existing solutions leave just enough to be desired that I am motivated to implement a split equivalent in Groovy, and that is the subject of this post.  Because Groovy runs on the JVM, this implementation could theoretically be run on any operating system with a modern Java Virtual Machine implementation.

To test and demonstrate the Groovy-based split script, some type of source file is required.  I'll use Groovy to easily generate this source file.  The following simple Groovy script, buildFileToSplit.groovy, creates a simple text file that can be split.

#!/usr/bin/env groovy
//
// buildFileToSplit.groovy
//
// Accepts the name of the file to be generated and, optionally, the number of
// lines to be written to it.  If no number of lines is specified, uses default
// of 100,000 lines.
//
if (!args)
{
   println "\n\nUsage: buildFileToSplit.groovy fileName lineCount\n"
   println "where fileName is name of file to be generated and lineCount is the"
   println "number of lines to be placed in the generated file."
   System.exit(-1)
}
fileName = args[0]
numberOfLines = args.length > 1 ? args[1] as Integer : 100000
file = new File(fileName)
// erases output file if it already existed
file.delete()
1.upto(numberOfLines, {file << "This is line #${it}.\n"})

This simple script uses Groovy's implicitly available "args" handle to access the command-line arguments passed to buildFileToSplit.groovy.  It then creates a single file whose length is determined by the provided number-of-lines argument.  Each line is largely unoriginal and states "This is line #" followed by the line number.  It's not a fancy source file, but it works for the splitting example.  The next screen snapshot shows the script being run and its output.
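
One possible refinement, sketched here rather than taken from the script above: the << operator on a File appends by opening and closing the file for every line, so writing through a single writer may be gentler when generating very large files.  The method name writeNumberedLines and the temporary demo file are illustrative assumptions of mine, not part of buildFileToSplit.groovy.

```groovy
// Hypothetical variant of the final lines of buildFileToSplit.groovy:
// one writer is opened once instead of reopening the file for each line.
def writeNumberedLines(File target, int lineCount)
{
   // withWriter replaces any existing content and closes the writer when done
   target.withWriter { writer ->
      1.upto(lineCount) { writer << "This is line #${it}.\n" }
   }
}

// Illustrative use with a temporary file (the file name is arbitrary)
def demoSourceFile = File.createTempFile("source", ".txt")
writeNumberedLines(demoSourceFile, 250)
println "Wrote ${demoSourceFile.readLines().size()} lines to ${demoSourceFile}"
demoSourceFile.delete()
```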


The generated source.txt file looks like this (only beginning and ending of it is shown here):

This is line #1.
This is line #2.
This is line #3.
This is line #4.
This is line #5.
This is line #6.
This is line #7.
This is line #8.
This is line #9.
This is line #10.
     . . .
This is line #239.
This is line #240.
This is line #241.
This is line #242.
This is line #243.
This is line #244.
This is line #245.
This is line #246.
This is line #247.
This is line #248.
This is line #249.
This is line #250.

There is now a source file available to be split.  The script that splits it is significantly longer than the script that generated the source file because I have made it check for more error conditions, because it needs to handle more command-line parameters, and simply because it does more.  The script, simply called split.groovy, is shown next:

#!/usr/bin/env groovy
//
// split.groovy
//
// Split single file into multiple files similarly to how Unix/Linux split
// command works.  This version of the script is intended for text files only.
//
// This script does differ from the Linux/Unix variant in certain ways.  For
// example, this script's output messages differ in several cases and this
// script requires that the name of the file being split is provided as a
// command-line argument rather than providing the option to provide it as
// standard input.  This script also provides a "-v" ("--version") option not
// advertised for the Linux/Unix version.
//
// CAUTION: This script is intended only as an illustration of using Groovy to
// emulate the Unix/Linux split command.  It is not intended for production
// use as-is.  This script is designed to make back-up copies of files generated
// from the splitting of a single source file, but only one back-up version is
// created and is overwritten by any further requests.
//
// http://marxsoftware.blogspot.com/
//

import java.text.NumberFormat

NEW_LINE = System.getProperty("line.separator")

//
// Use Groovy's CliBuilder for command-line argument processing
//

def cli = new CliBuilder(usage: 'split [OPTION] [INPUT [PREFIX]]')
cli.with
{
   h(longOpt: 'help', 'Usage Information')
   a(longOpt: 'suffix-length', type: Number, 'Use suffixes of length N (default is 2)', args: 1)
   b(longOpt: 'bytes', type: Number, 'Size of each output file in bytes', args: 1)
   l(longOpt: 'lines', type: Number, 'Number of lines per output file', args: 1)
   t(longOpt: 'verbose', 'Print diagnostic to standard error just before each output file is opened', args: 0)
   v(longOpt: 'version', 'Output version and exit', args: 0)
}
def opt = cli.parse(args)
if (!opt || opt.h) {cli.usage(); return}
if (opt.v) {println "Version 0.1 (July 2010)"; return}
if (!opt.b && !opt.l)
{
   println "Specify length of split files with either number of bytes or number of lines"
   cli.usage()
   return
}
if (opt.a && !opt.a.isNumber()) {println "Suffix length must be a number"; cli.usage(); return}
if (opt.b && !opt.b.isNumber()) {println "File size in bytes must be a number"; cli.usage(); return}
if (opt.l && !opt.l.isNumber()) {println "Number of lines must be a number"; cli.usage(); return}

//
// Determine whether split files will be sized by number of lines or number of bytes
//

private enum LINES_OR_BYTES_ENUM { BYTES, LINES }
bytesOrLines = LINES_OR_BYTES_ENUM.LINES
def suffixLength = opt.a ? opt.a.toBigInteger() : 2
if (suffixLength < 0)
{
   suffixLength = 2
}
def numberLines = opt.l ? opt.l.toBigInteger() : 0
def numberBytes = opt.b ? opt.b.toBigInteger() : 0
if (!numberLines && !numberBytes)
{
   println "File size must be specified in either non-zero bytes or non-zero lines."
   return
}
else if (numberLines && numberBytes)
{
   println "Ambiguous: must specify only number of lines or only number of bytes"
   return
}
else if (numberBytes)
{
   bytesOrLines = LINES_OR_BYTES_ENUM.BYTES
}
else
{
   bytesOrLines = LINES_OR_BYTES_ENUM.LINES
}

def verboseMode = opt.t
if (verboseMode)
{
   print "Creating output files of size "
   print "${numberLines ?: numberBytes} ${numberLines ? 'lines' : 'bytes'} each "
   println "and output file suffix size of ${suffixLength}."
}
fileSuffixFormat = NumberFormat.getInstance()
fileSuffixFormat.setMinimumIntegerDigits(suffixLength)
fileSuffixFormat.setGroupingUsed(false)
filename = ""
candidateFileName = opt.arguments()[0]
if (candidateFileName == null)
{
   println "No source file was specified for splitting."
   System.exit(-2)
}
else if (candidateFileName.startsWith("-"))
{
   println "Ignoring option ${candidateFileName} and exiting."
   System.exit(-3)
}
else
{
   println "Processing ${candidateFileName} as source file name."
   filename = candidateFileName
}
def prefix = opt.arguments().size() > 1 ? opt.arguments()[1] : "x"
try
{
   file = new File(filename)
   if (!file.exists())
   {
      println "Source file ${filename} is not a valid source file."
      System.exit(-4)
   }

   int fileCounter = 1
   firstFileName = "${prefix}${fileSuffixFormat.format(0)}"
   if (verboseMode)
   {
      System.err.println "Creating file ${firstFileName}..."
   }
   outFile = createFile(firstFileName)
   if (bytesOrLines == LINES_OR_BYTES_ENUM.BYTES)
   {
      int byteCounter = 0
      file.eachByte
      {
         if (byteCounter < numberBytes)
         {
            outFile << new String(it)
         }
         else
         {
            nextOutputFileName = "${prefix}${fileSuffixFormat.format(fileCounter)}"
            if (verboseMode)
            {
               System.err.println "Creating file ${nextOutputFileName}..."
            }
            outFile = createFile(nextOutputFileName)
            outFile << new String(it)
            fileCounter++
            byteCounter = 0            
         }
         byteCounter++
      }
   }
   else
   {
      int lineCounter = 0
      file.eachLine
      {
         if (lineCounter < numberLines)
         {
            outFile << it << NEW_LINE
         }
         else
         {
            nextOutputFileName = "${prefix}${fileSuffixFormat.format(fileCounter)}"
            if (verboseMode)
            {
               System.err.println "Creating file ${nextOutputFileName}..."
            }
            outFile = createFile(nextOutputFileName)
            outFile << it << NEW_LINE
            fileCounter++
            lineCounter = 0
         }
         lineCounter++
      }
   }
}
catch (FileNotFoundException fnfEx)
{
   println System.properties
   // the script-level variable is "filename"; "fileName" is undefined here
   println "${filename} is not a valid source file: ${fnfEx.toString()}"
   System.exit(-3)
}
catch (NullPointerException npe)
{
   println "NullPointerException encountered: ${npe.toString()}"
   System.exit(-4)
}

/**
 * Create a file with the provided file name.
 *
 * @param fileName Name of file to be created.
 * @return File created with the provided name; null if provided name is null or
 *    empty.
 */
def File createFile(String fileName)
{
   if (!fileName)
   {
      println "Cannot create a file from a null or empty filename."
      return null
   }
   outFile = new File(fileName)
   if (outFile.exists())
   {
      outFile.renameTo(new File(fileName + ".bak"))
      outFile = new File(fileName)
   }
   return outFile
}

This script could be optimized and better modularized, but it fulfills its purpose of demonstrating how Groovy provides a nice approach for implementing platform-independent utility scripts.
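
As one illustration of what "better modularized" might look like, the line-based branch could be pulled into a single function.  This is only a sketch under my own assumptions and not the script as posted: the name splitByLines, the outputDir parameter, and the use of one writer per output file are all hypothetical, and the back-up (.bak) handling of createFile is left out.

```groovy
// A minimal sketch, not the post's script: the line-based splitting loop
// pulled into one function. Assumes linesPerFile is greater than zero.
List<File> splitByLines(File source, int linesPerFile, String prefix, int suffixLength, File outputDir)
{
   def created = []
   def writer = null
   int lineCounter = 0
   try
   {
      source.eachLine { line ->
         // start a new output file every linesPerFile lines
         if (lineCounter % linesPerFile == 0)
         {
            writer?.close()
            def suffix = created.size().toString().padLeft(suffixLength, '0')
            def outFile = new File(outputDir, "${prefix}${suffix}")
            created << outFile
            writer = outFile.newWriter()
         }
         // writeLine appends the platform line separator
         writer.writeLine(line)
         lineCounter++
      }
   }
   finally
   {
      writer?.close()
   }
   return created
}
```

Called with the 250-line source file, 100 lines per file, prefix "x" and a suffix length of 4, this sketch would return three files named x0000, x0001 and x0002, the last one holding the remaining 50 lines.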

The next screen snapshot demonstrates the script's use of Groovy's built-in CLI support.


The next two screen snapshots demonstrate splitting the source file into smaller files by number of lines and by number of bytes respectively (and using different suffix and file name options).  The first image demonstrates that three output files are generated when the 250-line source file is split into files of 100 lines each.  The -a option specifies that four digits will be used in each file name suffix.  Unlike the Linux split, this script does not verify that the user-provided number of digits is sufficient to cover the number of necessary output files.
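
The suffix-length caveat could be addressed by computing up front how many digits the numbering needs.  The helper below is a hypothetical sketch (requiredSuffixLength is my own name); it assumes the total number of lines or bytes in the source file is already known, for example from counting lines or from File.length(), and that numbering starts at 0 as it does in split.groovy.

```groovy
// Hypothetical helper: number of decimal digits needed to give every
// output file a distinct zero-padded suffix (numbering starts at 0).
// Assumes unitsPerFile is greater than zero.
int requiredSuffixLength(long totalUnits, long unitsPerFile)
{
   // ceiling division gives the number of output files
   long fileCount = (totalUnits + unitsPerFile - 1).intdiv(unitsPerFile)
   long highestIndex = Math.max(fileCount - 1, 0L)
   return highestIndex.toString().length()
}

assert requiredSuffixLength(250, 100) == 1      // 3 files, numbered 0..2
assert requiredSuffixLength(100000, 100) == 3   // 1000 files, numbered 0..999
assert requiredSuffixLength(100100, 100) == 4   // 1001 files, numbered 0..1000
```

A script could compare this value with the -a argument and either warn the user or quietly use the larger of the two.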


The second image shows the script splitting the source file based on number of bytes, using a different file name prefix and only two digits for the numbering.


As mentioned above, this script is a "rough cut."  It could be improved in terms of the code itself as well as in terms of functionality (extended to better support binary formats and to make sure file name suffixes are sufficiently long for the number of output files).  However, the script here does demonstrate one of my favorite uses of Groovy: to write platform-independent scripts using familiar Java and Groovy libraries (SDK and GDK).
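
For the binary-format improvement mentioned above, one possible direction is to copy raw bytes through streams rather than converting each byte to a String.  The sketch below is an assumption-laden illustration, not the script as posted: splitByBytes, fillBuffer and the outputDir parameter are hypothetical names, the back-up handling is omitted, and one chunk of bytesPerFile bytes is held in memory at a time.

```groovy
// Reads into the buffer until it is full or the stream ends;
// returns the number of bytes actually placed in the buffer.
int fillBuffer(InputStream input, byte[] buffer)
{
   int total = 0
   while (total < buffer.length)
   {
      int count = input.read(buffer, total, buffer.length - total)
      if (count < 0) { break }
      total += count
   }
   return total
}

// A minimal sketch, not the post's script: split by bytes without any
// character conversion. Assumes bytesPerFile is greater than zero.
List<File> splitByBytes(File source, int bytesPerFile, String prefix, int suffixLength, File outputDir)
{
   def created = []
   byte[] buffer = new byte[bytesPerFile]
   source.withInputStream { input ->
      int count = fillBuffer(input, buffer)
      while (count > 0)
      {
         def suffix = created.size().toString().padLeft(suffixLength, '0')
         def outFile = new File(outputDir, "${prefix}${suffix}")
         // write only the bytes actually read; the last chunk may be shorter
         outFile.bytes = Arrays.copyOf(buffer, count)
         created << outFile
         count = fillBuffer(input, buffer)
      }
   }
   return created
}
```

Because no bytes are ever turned into characters, concatenating the output files in order should reproduce the source file exactly, which is the property a binary-safe split needs.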

3 comments:

Andrew said...

Glad to see you still put your
{
on the right lines!
}

Dustin said...

Andrew,

I do still prefer to have the curly braces lined up in the same column (Allman) for quick and easy visual determination of code blocks. Saving lines (Kernighan and Ritchie) might be a noble cause in printed materials, but it's not worth the visual disadvantages in electronic mediums such as blogs and IDEs. I thought that the blog post The Horstmann Brace Style presented an "interesting" hybrid of the two, though I plan to stick with the lined up braces on their own lines. I assume you still use this clearer approach as well. Now if I could just get those who write code that I read and maintain to do the same ... :)

Andrew said...

You can get split for windows. It is part of the Linux core utils.

You can find them here:
http://gnuwin32.sourceforge.net/packages/coreutils.htm