Match Command Line Application

Description

Match searches directories for matching files. The tool can be configured with a variety of criteria to determine whether two files match, including file name, file size, last modified time, and file contents. For each criterion, two files may be required either to have the same value or to have different values. The tool is also available as an Ant task.

The application returns one of two numbers: the number of matching files, or the number of matching file groups. The number of matching files is the count of individual files which have at least one other file that matches. The number of matching groups is the number of groups of files that match. For example, if the criteria are "same filename", and the directory trees being searched contain three files named README and four files named LICENSE with no other matching files, then the number of matching files would be seven, and the number of matching groups would be two. One of these numbers is returned (as the exit code) from the application, depending on the arguments.
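The difference between the two counts can be sketched in a few lines of Java. This is only an illustration of the README/LICENSE example above, not the tool's implementation: the class name MatchCounts, its methods, and the directory names a/ through d/ are all hypothetical, and only the "same filename" criterion is modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: MatchCounts is a hypothetical name, not part of the tool.
class MatchCounts {

    // Returns the final path component, e.g. "a/b/README" -> "README".
    static String fileName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Groups paths by file name and keeps only groups with two or more members.
    static List<List<String>> groupsBySameName(List<String> paths) {
        Map<String, List<String>> byName = new LinkedHashMap<>();
        for (String p : paths) {
            byName.computeIfAbsent(fileName(p), k -> new ArrayList<>()).add(p);
        }
        List<List<String>> groups = new ArrayList<>();
        for (List<String> g : byName.values()) {
            if (g.size() > 1) {
                groups.add(g);
            }
        }
        return groups;
    }

    // Number of individual files that have at least one matching file.
    static int fileCount(List<List<String>> groups) {
        int n = 0;
        for (List<String> g : groups) {
            n += g.size();
        }
        return n;
    }

    public static void main(String[] args) {
        // Three README files, four LICENSE files, and one file with no match.
        List<String> paths = List.of(
            "a/README", "b/README", "c/README",
            "a/LICENSE", "b/LICENSE", "c/LICENSE", "d/LICENSE",
            "a/build.xml");
        List<List<String>> groups = groupsBySameName(paths);
        System.out.println("matching files:  " + fileCount(groups)); // 7
        System.out.println("matching groups: " + groups.size());     // 2
    }
}
```

The unmatched file (a/build.xml here) contributes to neither count, because a group needs at least two members.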

The match application can also produce a report of the matching files, in either XML or text format. Both formats contain the absolute path of every matching file, organized into groups. A custom report generator can also be specified. If multiple report generators are specified, each one will produce a report.

Match Criteria

Central to this tool is the idea of "matching files." Whether two files match is determined by comparing various aspects of the files: file name, file size, time of last modification, and file contents. According to the criteria selected, a file may match another file on an aspect if the two values are the same, or if the two values are different. Two files match if and only if all criteria are satisfied.

The criteria are specified with the command-line flags described under Usage (in the Ant task, with the task's attributes). Each aspect can be given one of three values: same, diff, or ignore. By default, each aspect is set to ignore; when set to ignore, that aspect plays no part in determining a match. When set to same, two files satisfy that aspect if and only if they have the same value for it. When set to diff, two files satisfy that aspect if and only if they have different values for it. If two files match on some aspects but do not match on others, then those two files do not match.

Consider these four files:

Path              Size         Last Modified
dist/output.jar   22438 bytes  11:43:03, 12 July 2002
build/output.jar  22438 bytes  11:42:36, 12 July 2002
src/App.java      4720 bytes   22:00:00, 10 July 2002
src/Engine.java   6494 bytes   22:00:00, 10 July 2002

If the criteria were name=same, then the first two would match, since they both have the same name (output.jar); the rest of the path is not relevant. With criteria of name=same, size=same, the first two would still match, since they also have the same size. However, with criteria of name=same, size=diff, there would be no matches, since no two files with the same name have different sizes. With criteria of size=diff, time=same, the last two files would match, since they have the same timestamp but different sizes.
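The rule that every aspect must be satisfied can be checked against the four example files with a short Java sketch. The type and method names below (CriteriaSketch, Mode, FileInfo, matches, matchingPairs) are hypothetical and are not the tool's API. Timestamps are held as plain strings, and the contents aspect is left out because the example table does not include it.

```java
import java.util.List;

// Illustrative only: these names are hypothetical, not the tool's API.
class CriteriaSketch {

    enum Mode { SAME, DIFF, IGNORE }

    // One row of the example table; the timestamp is kept as a plain string.
    static final class FileInfo {
        final String path;
        final long size;
        final String modified;

        FileInfo(String path, long size, String modified) {
            this.path = path;
            this.size = size;
            this.modified = modified;
        }

        // The file name is the part of the path after the final separator.
        String name() {
            return path.substring(path.lastIndexOf('/') + 1);
        }
    }

    // An aspect is satisfied when its mode agrees with whether the values are equal.
    static boolean satisfied(Mode mode, boolean valuesEqual) {
        switch (mode) {
            case SAME: return valuesEqual;
            case DIFF: return !valuesEqual;
            default:   return true; // IGNORE plays no part in the decision
        }
    }

    // Two files match only if every aspect is satisfied.
    static boolean matches(FileInfo a, FileInfo b, Mode name, Mode size, Mode time) {
        return satisfied(name, a.name().equals(b.name()))
            && satisfied(size, a.size == b.size)
            && satisfied(time, a.modified.equals(b.modified));
    }

    // Counts the unordered pairs of files that match under the given criteria.
    static int matchingPairs(List<FileInfo> files, Mode name, Mode size, Mode time) {
        int pairs = 0;
        for (int i = 0; i < files.size(); i++) {
            for (int j = i + 1; j < files.size(); j++) {
                if (matches(files.get(i), files.get(j), name, size, time)) {
                    pairs++;
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<FileInfo> files = List.of(
            new FileInfo("dist/output.jar", 22438, "2002-07-12T11:43:03"),
            new FileInfo("build/output.jar", 22438, "2002-07-12T11:42:36"),
            new FileInfo("src/App.java", 4720, "2002-07-10T22:00:00"),
            new FileInfo("src/Engine.java", 6494, "2002-07-10T22:00:00"));

        // name=same: the two output.jar files match.
        System.out.println(matchingPairs(files, Mode.SAME, Mode.IGNORE, Mode.IGNORE)); // 1
        // name=same, size=same: still the two output.jar files.
        System.out.println(matchingPairs(files, Mode.SAME, Mode.SAME, Mode.IGNORE));   // 1
        // name=same, size=diff: no matches.
        System.out.println(matchingPairs(files, Mode.SAME, Mode.DIFF, Mode.IGNORE));   // 0
        // size=diff, time=same: App.java and Engine.java match.
        System.out.println(matchingPairs(files, Mode.IGNORE, Mode.DIFF, Mode.SAME));   // 1
    }
}
```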

Usage

java -jar match.jar [flags] path [path]*

The path arguments control which files are checked for matching. If a path argument represents a directory, all files in that directory are checked, and all files in any subdirectory, recursively. The path arguments do not expand wildcard characters.
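One way to picture this expansion is with the standard java.nio.file API. The sketch below is an assumption about how such a traversal could be written, not a description of the tool's code. PathCollector and collect are hypothetical names, and reporting absolute paths is borrowed from the report description rather than from any documented traversal behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Illustrative only: PathCollector is a hypothetical name, not part of the tool.
class PathCollector {

    // Expands each argument into regular files. A file argument yields itself;
    // a directory argument is walked recursively. The argument is used literally,
    // so wildcard characters are not expanded here.
    static List<Path> collect(List<String> args) throws IOException {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            try (Stream<Path> walk = Files.walk(Paths.get(arg))) {
                walk.filter(Files::isRegularFile)
                    .map(Path::toAbsolutePath)
                    .forEach(files::add);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        for (Path p : collect(List.of(args))) {
            System.out.println(p);
        }
    }
}
```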

Criteria flags (at least one of these must be specified):

-n, +n
    Controls how files' names are used to determine whether two files match (+n for names must be the same, -n for names must be different). The file name is the portion of the complete path following the final separator character; e.g., on a Unix machine, the path /usr/local/bin/mason.jar has the file name mason.jar.

-s, +s
    Controls how files' sizes (in bytes) are used to determine whether two files match (+s for sizes must be equal, -s for sizes must be unequal).

-t, +t
    Controls how files' last modified times are used to determine whether two files match (+t for times must be equal, -t for times must be unequal). The tool compares the times with millisecond precision, although the operating system may store the times at lower precision.

-c, +c
    Controls how files' contents (byte for byte) are used to determine whether two files match (+c for contents must be identical, -c for contents must differ).

Other flags (all optional):

-fc
    Exit code contains the count of matching files.

-gc
    Exit code contains the count of matching file groups. This is the default.

-txt[=filename]
    Produce a text report, writing the results to the file if specified, or to standard output if no filename is specified.

-xml[=filename]
    Produce an XML report, writing the results to the file if specified, or to standard output if no filename is specified.

-rpt=class[=filename]
    Produce a custom report, writing the results to the file if specified, or to standard output if no filename is specified. The custom report formatter must implement com.bennettconsulting.match.ResultFormatter.
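The +/- convention for the criteria flags can be read as a mapping onto the same, diff, and ignore values from the Match Criteria section. The Java sketch below illustrates that reading. FlagSketch and parseCriteria are hypothetical names, and treating an omitted flag as ignore is an assumption based on the stated default, not a description of the tool's actual argument parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: FlagSketch is a hypothetical name, not the tool's parser.
class FlagSketch {

    enum Mode { SAME, DIFF, IGNORE }

    // Maps each aspect letter (n, s, t, c) to its mode. Arguments that are not
    // two-character criteria flags (for example -fc, -xml=..., or paths) are skipped.
    static Map<Character, Mode> parseCriteria(String[] args) {
        Map<Character, Mode> modes = new LinkedHashMap<>();
        for (char aspect : "nstc".toCharArray()) {
            modes.put(aspect, Mode.IGNORE); // assumed default when the flag is omitted
        }
        for (String arg : args) {
            if (arg.length() == 2 && "nstc".indexOf(arg.charAt(1)) >= 0) {
                if (arg.charAt(0) == '+') {
                    modes.put(arg.charAt(1), Mode.SAME);
                } else if (arg.charAt(0) == '-') {
                    modes.put(arg.charAt(1), Mode.DIFF);
                }
            }
        }
        // At least one criteria flag must be specified.
        if (!modes.containsValue(Mode.SAME) && !modes.containsValue(Mode.DIFF)) {
            throw new IllegalArgumentException("at least one of +n -n +s -s +t -t +c -c is required");
        }
        return modes;
    }

    public static void main(String[] args) {
        String[] example = {"+n", "-s", "-xml=versionmismatch.xml", "dist"};
        System.out.println(parseCriteria(example)); // {n=SAME, s=DIFF, t=IGNORE, c=IGNORE}
    }
}
```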

Examples

>java -jar match.jar +n -s -xml=versionmismatch.xml dist

Checks the directory named dist and all of its subdirectories for files with the same file name but different file sizes, writes an XML-formatted report of those files to a file named versionmismatch.xml, and exits with a code equal to the number of groups of matching files.

>java -jar match.jar +n -fc .

Checks the current directory and all subdirectories of it for files with the same file name, and exits with a code equal to the count of those files.
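Because the count is delivered as the exit code, a calling program has to read the process's exit status to obtain it. The following Java sketch shows one way to do that for the second example using the standard ProcessBuilder API. MatchRunner, command, and run are hypothetical names, and the sketch assumes that match.jar is in the working directory and that java is on the PATH.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: MatchRunner is a hypothetical name, not part of the tool.
class MatchRunner {

    // Builds the command line; "match.jar" is assumed to be in the working directory.
    static List<String> command(String... matchArgs) {
        List<String> cmd = new ArrayList<>(List.of("java", "-jar", "match.jar"));
        cmd.addAll(List.of(matchArgs));
        return cmd;
    }

    // Runs the tool and returns its exit code, which carries the requested count.
    static int run(String... matchArgs) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(matchArgs))
            .inheritIO() // let any report written to standard output pass through
            .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Same flags as the second example: same file name, count of matching files.
        // Note: if the JVM cannot find match.jar, it also exits with a nonzero code.
        int matchingFiles = run("+n", "-fc", ".");
        System.out.println("files sharing a name: " + matchingFiles);
    }
}
```

On POSIX systems only the low 8 bits of an exit status reach the parent process, so a count above 255 may not be reported faithfully this way; a text or XML report is the more reliable source when the number of matches could be large.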

License

This tool is distributed under the Apache 2.0 License. The license is also available on the web.