Match

Description

Match searches directories for matching files. The tool can be configured to use any combination of criteria to decide whether two files match: file name, file size, last modified time, and file contents. Each criterion can require either that the two files have the same value or that they have different values. The tool is also available as a command-line application.

The tool reports two numbers: the number of matching files, and the number of matching file groups. The number of matching files is the count of individual files that have at least one other file that matches. The number of matching groups is the number of groups of files that match. For example, if the criteria are "same filename", and the directory trees being searched contain three files named README and two files named LICENSE with no other matching files, then the number of matching files would be five, and the number of matching groups would be two. These numbers can be placed into properties (named by the fileproperty and groupproperty attributes) for use later in the build file.
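
As a rough sketch of how the two counts might be captured, the fragment below stores them in properties and echoes them. The property names match.files and match.groups are made up for illustration, and the fragment assumes the match task has already been defined with <taskdef>.

```xml
<!-- Sketch only: property names are hypothetical. -->
<match names="same" fileproperty="match.files" groupproperty="match.groups">
    <fileset dir="${basedir}" includes="**/*"/>
</match>
<!-- Report both counts on the console. -->
<echo message="${match.files} matching files in ${match.groups} groups"/>
```

For the README and LICENSE layout described above, the two properties would hold five and two respectively.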

The match task can also produce a report of the matching files. It does this using a formatter. Three formatters are supplied: a plain text formatter, a summary formatter, and an XML formatter. The plain text and XML formatters produce reports containing the full filenames of any matching files, organized into groups. All three formatters include summary information (how many files and how many groups were detected). Alternatively, a custom formatter (implementing com.bennettconsulting.match.ResultFormatter) can be used. If multiple formatters are supplied, each one will produce a report.

Match Criteria

Central to this tool is the idea of "matching files." This is determined by comparing various aspects of the files: file name, file size, time of last modification, and file contents. According to the criteria selected, a file may match another file on an aspect if the two values are the same, or if the two values are different. Two files match if and only if all criteria are satisfied.

The criteria are specified in the task's attributes. Each aspect can be given one of three values, same, diff, or ignore. By default, each aspect is set to ignore; when set to ignore, that aspect plays no part in determining a match. When set to same, the two files are considered to match if and only if they have the same value for that aspect. When set to diff, the two files are considered to match if they have different values for that aspect. If two files match on some aspects but do not match on others, then those two files do not match.

Consider these four files:

Path              Size         Last Modified
dist/output.jar   22438 bytes  11:43:03, 12 July 2002
build/output.jar  22438 bytes  11:42:36, 12 July 2002
src/App.java      4720 bytes   22:00:00, 10 July 2002
src/Engine.java   6494 bytes   22:00:00, 10 July 2002

If the criteria were names=same, then the first two would match, since they both have the same name (output.jar); the rest of the path is not relevant. With criteria of names=same, sizes=same, the first two would still match, since they also have the same size. However, with criteria of names=same, sizes=diff, there would be no matches, since no files that have the same name also have different sizes. With criteria of sizes=diff, times=same, the last two files would match, since they have the same timestamp but different sizes.
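
Expressed as task attributes, the last set of criteria might look like the sketch below. The src directory comes from the table above; the property name candidate.count is hypothetical, and the match task is assumed to have been defined with <taskdef>.

```xml
<!-- Sketch only: files match when their sizes differ
     and their last modified times are equal. -->
<match sizes="diff" times="same" fileproperty="candidate.count">
    <fileset dir="src" includes="**/*.java"/>
</match>
```

With only App.java and Engine.java under src, the two files would be expected to match each other, since both criteria are satisfied.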

Parameters

Attribute Description Required
names Controls how files' names are used to determine whether two files match (either same, diff, or ignore). The file name is that portion of the complete path following the final separator character; e.g., on a Unix machine, the path /usr/local/bin/tokay.jar would have a corresponding file name of tokay.jar. At least one of names, sizes, times, or contents must be specified as either same or diff; the default for each is ignore.
sizes Controls how files' sizes (in bytes) are used to determine whether two files match (either same, diff, or ignore). See names.
times Controls how files' last modified times are used to determine whether two files match (either same, diff, or ignore). The tool compares the times with millisecond precision, although the operating system may store the times at lower precision. See names.
contents Controls how files' contents, byte for byte, are used to determine whether two files match (either same, diff, or ignore). See names.
groupproperty The name of a property to set with the number of groups of matching files. No
fileproperty The name of a property to set with the number of matching files. No
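
To illustrate how these attributes combine, here is a sketch that looks for byte-for-byte identical files regardless of name. The lib directory, the report file name duplicates.txt, and the property name dup.groups are assumptions made for this example only.

```xml
<!-- Sketch only: names, sizes and times are left at their default (ignore),
     so only the contents criterion decides a match. -->
<match contents="same" groupproperty="dup.groups">
    <formatter type="plain" file="duplicates.txt"/>
    <fileset dir="lib" includes="**/*.jar"/>
</match>
```

Because every other aspect is ignored, two jars with different names or timestamps would still be reported together if their bytes are identical.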

Nested Elements

fileset

The match task supports any number of nested <fileset> elements to specify the files to be checked for matches.
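
A sketch with two filesets follows. It assumes that files from all nested filesets are pooled and compared against each other, which the text above implies but does not state outright; the directory names are borrowed from the earlier output.jar example.

```xml
<!-- Sketch only: look for jars that share a name across build and dist
     but differ in contents. -->
<match names="same" contents="diff">
    <formatter type="plain"/>
    <fileset dir="build" includes="**/*.jar"/>
    <fileset dir="dist" includes="**/*.jar"/>
</match>
```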

formatter

The results of the comparisons can be printed in different formats. Output is sent to the file named by the file attribute of the <formatter>, or to standard output if no file is specified. One match task can support any number of formatters. If there are no formatters, then no report is produced.

There are three predefined formatters: one writes the match results in XML format, and the other two emit plain text. The formatter named summary prints a summary of the results as ASCII text. The formatter named plain prints the complete results as ASCII text. The formatter named xml writes an XML representation of the complete results. Custom formatters, which must implement com.bennettconsulting.match.ResultFormatter, can also be specified.

Attribute Description Required
type Use a predefined formatter (one of xml, summary, or plain). Exactly one of type or classname.
classname Name of a custom formatter class. Exactly one of type or classname.
file Name of file to write output to. No; defaults to standard output.
header A string, used as a header in the file. No.
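
The following sketch combines a custom formatter with a predefined one. The class com.example.build.CsvResultFormatter is purely hypothetical; it stands in for any class that implements com.bennettconsulting.match.ResultFormatter and is available on the classpath used to load the task.

```xml
<!-- Sketch only: the classname below is a made-up example. -->
<match names="same">
    <!-- Custom formatter writing to a file. -->
    <formatter classname="com.example.build.CsvResultFormatter" file="matches.csv"/>
    <!-- Predefined summary formatter; no file attribute, so it goes to standard output. -->
    <formatter type="summary"/>
    <fileset dir="dist" includes="**/*"/>
</match>
```

Each formatter produces its own report, so this would yield one file and one console summary.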

Examples

<taskdef resource="matchtask.properties" classpath="match.jar"/>

Establishes the match task.

<match names="same" sizes="diff">
    <formatter type="summary" header="Possible version mismatches in"/>
    <formatter type="xml" file="versionmismatch.xml" header="Possible version mismatches:"/>
    <fileset dir="dist" includes="**/*"/>
</match>

Checks the directory named dist and all subdirectories of it for files with the same file name but with different file sizes, and writes a summary to the console and an XML-formatted report of those files to a file named versionmismatch.xml.

<match fileproperty="matches" names="same">
    <fileset dir="${basedir}" includes="**/*"/>
</match>
<condition property="succeeded">
    <equals arg1="${matches}" arg2="0"/>
</condition>
<fail unless="succeeded" message="${matches} duplicates found!"/>

Checks the base directory and all of its subdirectories for files with the same file name, and places the count of those files into the property matches. (A pattern of "*" would select only the files directly in the base directory, where two files cannot share a name.) The build then fails if there are any matching files, with a message telling the user how many files matched. No output is produced if the build succeeds.

License

This tool is distributed under the Apache 2.0 License. The license is also available on the web.