split Command for DOS/Windows Via Groovy

One of the commands that I miss most from Linux when working in Windows/DOS environments is the split command.  This extremely handy command splits a large file into multiple smaller files whose size is specified either as a number of lines or as a number of bytes (or kilobytes or megabytes).  There are many uses for such functionality, including fitting files onto certain media and making files "readable" by applications with file length restrictions.

Unfortunately, I'm not aware of a split equivalent for Windows or DOS.  PowerShell can be scripted to do something like this, but that implementation is specific to PowerShell.  There are also third-party products available that perform similar functionality.  However, these existing solutions leave just enough to be desired that I have the motivation to implement a split equivalent in Groovy, and that is the subject of this post.  Because Groovy runs on the JVM, this implementation could theoretically run on any operating system with a modern Java Virtual Machine implementation.

To test and demonstrate the Groovy-based split script, some type of source file is required.  I'll use Groovy to easily generate this source file.  The following simple Groovy script, buildFileToSplit.groovy, creates a simple text file that can be split.

    #!/usr/bin/env groovy
    //
    // buildFileToSplit.groovy
    //
    // Accepts the name of the file to be generated and, optionally, the number
    // of lines to be written to it.
    // If no number of lines is specified, uses default of 100,000 lines.
    //
    if (!args)
    {
       println "\n\nUsage: buildFileToSplit.groovy fileName [lineCount]\n"
       println "where fileName is name of file to be generated and lineCount is the"
       println "number of lines to be placed in the generated file."
       System.exit(-1)
    }
    fileName = args[0]
    numberOfLines = args.length > 1 ? args[1] as Integer : 100000
    file = new File(fileName)
    // erases output file if it already existed
    file.delete()
    // appends one numbered line at a time to the output file
    1.upto(numberOfLines, {file << "This is line #${it}.\n"})

This simple script uses Groovy's implicitly available "args" handle to access the command-line arguments passed to buildFileToSplit.groovy.  It then creates a single file whose length is set by the line-count argument.  Each line is largely unoriginal and states "This is line #" followed by the line number.  It's not a fancy source file, but it works for the splitting example.  The next screen snapshot shows the script being run and its output.
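Because the script above appends to the File once per line, the file is reopened for every line written, which gets slow for very large line counts.  A minimal sketch of a variant that keeps one writer open is shown here; the helper name writeNumberedLines and the temporary file used to exercise it are my own assumptions for illustration and are not part of the original script.

```groovy
// Hypothetical helper (not in buildFileToSplit.groovy): write the same
// numbered lines, but through a single writer instead of one append per line.
def writeNumberedLines(File target, int numberOfLines) {
   // withWriter replaces any existing content and closes the writer when done
   target.withWriter { writer ->
      1.upto(numberOfLines) { writer << "This is line #${it}.\n" }
   }
}

// Exercise the helper against a throwaway temporary file
def demoFile = File.createTempFile("source", ".txt")
writeNumberedLines(demoFile, 250)
println "Wrote ${demoFile.readLines().size()} lines to ${demoFile}"
demoFile.delete()
```

The generated content should be the same as before; only the way the file is opened and closed differs.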

The generated source.txt file looks like this (only beginning and ending of it is shown here):

This is line #1.
This is line #2.
This is line #3.
This is line #4.
This is line #5.
This is line #6.
This is line #7.
This is line #8.
This is line #9.
This is line #10.
     . . .
This is line #239.
This is line #240.
This is line #241.
This is line #242.
This is line #243.
This is line #244.
This is line #245.
This is line #246.
This is line #247.
This is line #248.
This is line #249.
This is line #250.

There is now a source file available to be split.  The script that splits it is significantly longer than the one that generated the source file because it checks for more error conditions, handles more command-line parameters, and simply does more.  The script, simply called split.groovy, is shown next:

    #!/usr/bin/env groovy
    //
    // split.groovy
    //
    // Split single file into multiple files similarly to how Unix/Linux split
    // command works.  This version of the script is intended for text files only.
    //
    // This script does differ from the Linux/Unix variant in certain ways.  For
    // example, this script's output messages differ in several cases and this
    // script requires that the name of the file being split is provided as a
    // command-line argument rather than providing the option to provide it as
    // standard input.  This script also provides a "-v" ("--version") option not
    // advertised for the Linux/Unix version.
    //
    // CAUTION: This script is intended only as an illustration of using Groovy to
    // emulate the Unix/Linux split command.  It is not intended for production
    // use as-is.  This script is designed to make back-up copies of files generated
    // from the splitting of a single source file, but only one back-up version is
    // created and is overridden by any further requests.
    //

    import java.text.NumberFormat

    NEW_LINE = System.getProperty("line.separator")

    //
    // Use Groovy's CliBuilder for command-line argument processing
    //

    def cli = new CliBuilder(usage: 'split [OPTION] [INPUT [PREFIX]]')
    cli.with {
       h(longOpt: 'help', 'Usage Information')
       a(longOpt: 'suffix-length', type: Number, 'Use suffixes of length N (default is 2)', args: 1)
       b(longOpt: 'bytes', type: Number, 'Size of each output file in bytes', args: 1)
       l(longOpt: 'lines', type: Number, 'Number of lines per output file', args: 1)
       t(longOpt: 'verbose', 'Print diagnostic to standard error just before each output file is opened', args: 0)
       v(longOpt: 'version', 'Output version and exit', args: 0)
    }
    def opt = cli.parse(args)
    if (!opt || opt.h) {cli.usage(); return}
    if (opt.v) {println "Version 0.1 (July 2010)"; return}
    if (!opt.b && !opt.l)
    {
       println "Specify length of split files with either number of bytes or number of lines"
       cli.usage()
       return
    }
    if (opt.a && !opt.a.isNumber()) {println "Suffix length must be a number"; cli.usage(); return}
    if (opt.b && !opt.b.isNumber()) {println "Files size in bytes must be a number"; cli.usage(); return}
    if (opt.l && !opt.l.isNumber()) {println "Lines number must be a number"; cli.usage(); return}

    //
    // Determine whether split files will be sized by number of lines or number of bytes
    //

    private enum LINES_OR_BYTES_ENUM { BYTES, LINES }
    bytesOrLines = LINES_OR_BYTES_ENUM.LINES
    // NumberFormat.setMinimumIntegerDigits expects an int, so parse as Integer
    def suffixLength = opt.a ? opt.a.toInteger() : 2
    if (suffixLength < 0)
    {
       suffixLength = 2
    }
    def numberLines = opt.l ? opt.l.toBigInteger() : 0
    def numberBytes = opt.b ? opt.b.toBigInteger() : 0
    if (!numberLines && !numberBytes)
    {
       println "File size must be specified in either non-zero bytes or non-zero lines."
       return
    }
    else if (numberLines && numberBytes)
    {
       println "Ambiguous: must specify only number of lines or only number of bytes"
       return
    }
    else if (numberBytes)
    {
       bytesOrLines = LINES_OR_BYTES_ENUM.BYTES
    }
    else
    {
       bytesOrLines = LINES_OR_BYTES_ENUM.LINES
    }

    def verboseMode = opt.t
    if (verboseMode)
    {
       print "Creating output files of size "
       print "${numberLines ?: numberBytes} ${numberLines ? 'lines' : 'bytes'} each "
       println "and output file suffix size of ${suffixLength}."
    }
    // Zero-padded numeric suffix (e.g. 00, 01, ...) appended to the prefix
    fileSuffixFormat = NumberFormat.getInstance()
    fileSuffixFormat.setMinimumIntegerDigits(suffixLength)
    fileSuffixFormat.setGroupingUsed(false)
    filename = ""
    candidateFileName = opt.arguments()[0]
    if (candidateFileName == null)
    {
       println "No source file was specified for splitting."
       System.exit(-2)
    }
    else if (candidateFileName.startsWith("-"))
    {
       println "Ignoring option ${candidateFileName} and exiting."
       System.exit(-3)
    }
    else
    {
       println "Processing ${candidateFileName} as source file name."
       filename = candidateFileName
    }
    def prefix = opt.arguments().size() > 1 ? opt.arguments()[1] : "x"
    try
    {
       file = new File(filename)
       if (!file.exists())
       {
          println "Source file ${filename} is not a valid source file."
          System.exit(-4)
       }

       // The first output file uses suffix 0; fileCounter numbers the files after it
       int fileCounter = 1
       firstFileName = "${prefix}${fileSuffixFormat.format(0)}"
       if (verboseMode)
       {
          System.err.println "Creating file ${firstFileName}..."
       }
       outFile = createFile(firstFileName)
       if (bytesOrLines == LINES_OR_BYTES_ENUM.BYTES)
       {
          int byteCounter = 0
          file.eachByte {
             if (byteCounter < numberBytes)
             {
                // write the raw byte unchanged (no character conversion)
                outFile.append([it] as byte[])
             }
             else
             {
                nextOutputFileName = "${prefix}${fileSuffixFormat.format(fileCounter)}"
                if (verboseMode)
                {
                   System.err.println "Creating file ${nextOutputFileName}..."
                }
                outFile = createFile(nextOutputFileName)
                outFile.append([it] as byte[])
                fileCounter++
                byteCounter = 0
             }
             byteCounter++
          }
       }
       else
       {
          int lineCounter = 0
          file.eachLine {
             if (lineCounter < numberLines)
             {
                outFile << it << NEW_LINE
             }
             else
             {
                nextOutputFileName = "${prefix}${fileSuffixFormat.format(fileCounter)}"
                if (verboseMode)
                {
                   System.err.println "Creating file ${nextOutputFileName}..."
                }
                outFile = createFile(nextOutputFileName)
                outFile << it << NEW_LINE
                fileCounter++
                lineCounter = 0
             }
             lineCounter++
          }
       }
    }
    catch (FileNotFoundException fnfEx)
    {
       println()
       println "${filename} is not a valid source file: ${fnfEx.toString()}"
       System.exit(-3)
    }
    catch (NullPointerException npe)
    {
       println "NullPointerException encountered: ${npe.toString()}"
       System.exit(-4)
    }

    /**
     * Create a file with the provided file name.  If a file with that name
     * already exists, it is first renamed to a single ".bak" back-up copy.
     *
     * @param fileName Name of file to be created.
     * @return File created with the provided name; null if provided name is null or
     *    empty.
     */
    File createFile(String fileName)
    {
       if (!fileName)
       {
          println "Cannot create a file from a null or empty filename."
          return null
       }
       def newFile = new File(fileName)
       if (newFile.exists())
       {
          // Remove any earlier back-up first; renameTo can fail on some
          // platforms when the target file already exists.
          def backupFile = new File(fileName + ".bak")
          backupFile.delete()
          newFile.renameTo(backupFile)
          newFile = new File(fileName)
       }
       return newFile
    }

This script could be optimized and better modularized, but it fulfills its purpose of demonstrating how Groovy provides a nice approach for implementing platform-independent utility scripts.
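As one possible direction for that modularization, the line-based branch could be pulled into a function that keeps a single writer open per output file.  The following is only a sketch under my own assumptions: the function name splitByLines, its parameters (including the explicit output directory), and the temporary directory used to try it out are hypothetical and do not appear in split.groovy.

```groovy
import java.nio.file.Files
import java.text.NumberFormat

// Hypothetical helper: split a text file into pieces of at most linesPerFile
// lines, numbering the pieces from 0, and return the files that were written.
List<File> splitByLines(File source, int linesPerFile, File outputDir,
                        String prefix = "x", int suffixLength = 2) {
   def suffixFormat = NumberFormat.getIntegerInstance()
   suffixFormat.minimumIntegerDigits = suffixLength
   suffixFormat.groupingUsed = false

   def pieces = []
   def writer = null
   int lineCounter = 0
   try {
      source.eachLine { line ->
         // Start a new piece every linesPerFile lines
         if (lineCounter % linesPerFile == 0) {
            writer?.close()
            def piece = new File(outputDir, prefix + suffixFormat.format(pieces.size()))
            pieces << piece
            writer = piece.newWriter()
         }
         writer.writeLine(line)   // writes the line plus the platform line separator
         lineCounter++
      }
   } finally {
      writer?.close()
   }
   return pieces
}

// Try it on a generated 250-line file in a temporary directory
def workDir = Files.createTempDirectory("split-demo").toFile()
def sourceFile = new File(workDir, "source.txt")
sourceFile.withWriter { w -> 1.upto(250) { w << "This is line #${it}.\n" } }
def pieces = splitByLines(sourceFile, 100, workDir, "piece", 4)
pieces.each { println "${it.name}: ${it.readLines().size()} lines" }
workDir.deleteDir()
```

Unlike split.groovy, this sketch does not make back-up copies of existing output files; it simply overwrites them.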

The next screen snapshot demonstrates the script's use of Groovy's built-in CLI support.

The next two screen snapshots demonstrate splitting the source file into smaller files by number of lines and by number of bytes, respectively (and using different suffix and file name options).  The first image demonstrates that three output files are generated when the 250-line source file is split into files of 100 lines each.  The -a option specifies that four digits will be used in each output file name.  Unlike the Linux split, this script does not guarantee that the user-provided number of digits is sufficient to cover the number of necessary output files.
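A check for a sufficient suffix length could be added with a little arithmetic.  Here is a rough sketch, assuming the total number of lines (or bytes) in the source file is known before splitting starts; the helper name requiredSuffixLength is hypothetical and is not part of split.groovy.

```groovy
// Hypothetical helper: smallest number of digits able to number every output
// file, given that numbering starts at 0 as it does in split.groovy.
int requiredSuffixLength(long totalUnits, long unitsPerFile) {
   // Round up: a partially filled last file still needs its own number
   long fileCount = Math.max(1L, (totalUnits + unitsPerFile - 1).intdiv(unitsPerFile))
   return (fileCount - 1).toString().length()
}

// 100,000 lines at 100 lines per file need files numbered 0 through 999
def requestedLength = 2
def neededLength = requiredSuffixLength(100000, 100)
if (neededLength > requestedLength) {
   println "Suffix length ${requestedLength} is too short; at least ${neededLength} digits are needed."
}
```

The script could either stop with a message like this one or quietly widen the suffix; which is preferable is a design choice I have not made here.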

The second image shows the script splitting the source file based on number of bytes, using a different file name prefix and only two digits for the numbering.

As mentioned above, this script is a "rough cut."  It could be improved in terms of the code itself as well as in terms of functionality (extended to better support binary formats and to make sure file name suffixes are sufficiently long for the number of output files).  However, the script here does demonstrate one of my favorite uses of Groovy: writing platform-independent scripts using familiar Java and Groovy libraries (SDK and GDK).
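For the binary-format improvement, one option is to copy raw bytes through Java streams so that nothing is interpreted as text along the way.  The sketch below is an assumption-laden illustration rather than a drop-in replacement: the function name splitByBytes, its parameters, the 8 KB buffer size, and the random test data are all hypothetical.

```groovy
import java.nio.file.Files

// Hypothetical helper (not part of split.groovy): split a file into pieces of
// at most bytesPerFile bytes by copying raw bytes. Pieces are numbered from 0.
List<File> splitByBytes(File source, long bytesPerFile, File outputDir,
                        String prefix = "x", int suffixLength = 2) {
   if (bytesPerFile <= 0) {
      throw new IllegalArgumentException("bytesPerFile must be positive")
   }
   def pieces = []
   def buffer = new byte[8192]
   def input = source.newInputStream()
   OutputStream output = null
   long remaining = 0   // bytes still allowed in the current piece
   try {
      while (true) {
         if (remaining == 0) {
            // Current piece is full (or none started yet); open the next one
            // only once more data has actually been read.
            output?.close()
            output = null
            remaining = bytesPerFile
         }
         int read = input.read(buffer, 0, (int) Math.min((long) buffer.length, remaining))
         if (read == -1) {
            break
         }
         if (output == null) {
            def suffix = pieces.size().toString().padLeft(suffixLength, '0')
            def piece = new File(outputDir, prefix + suffix)
            pieces << piece
            output = piece.newOutputStream()
         }
         output.write(buffer, 0, read)
         remaining -= read
      }
   } finally {
      output?.close()
      input.close()
   }
   return pieces
}

// Try it on 250 random bytes and confirm the pieces rejoin to the original
def workDir = Files.createTempDirectory("split-bytes").toFile()
def sourceFile = new File(workDir, "source.bin")
byte[] data = new byte[250]
new Random(42).nextBytes(data)
sourceFile.bytes = data

def pieces = splitByBytes(sourceFile, 100, workDir, "part")
pieces.each { println "${it.name}: ${it.length()} bytes" }

def rejoined = new ByteArrayOutputStream()
pieces.each { rejoined.write(it.bytes) }
assert Arrays.equals(rejoined.toByteArray(), data)
workDir.deleteDir()
```

Because only a small buffer is held in memory at a time, this approach should also cope with pieces that are much larger than the buffer.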


