A Multi-Layered Corroboration-Based Check Reader
Dr. D. Kimura
This paper presents a multi-layered corroboration-based system to achieve very high performance on reading both personal and business checks while controlling the error rate. Techniques for character corroboration and field corroboration are presented. In particular, corroboration between legal, courtesy, and spoken amounts is discussed. A technical description of handwritten phrase recognition is given. A system currently being designed for a large US bank includes the capabilities presented in this paper.
Banks have been reluctant to invest in document reading automation for two distinct reasons. First, predictions of a paperless world have influenced banks in their long term capital investment. Second, the commercially available technology is too expensive and too inadequate (in terms of labor savings) to justify a business case. As discussed in this paper, these reasons are no longer valid. The volume of checks in the US has increased steadily by two to three percent yearly for the past five years10. Advances in computer technology and pattern recognition, coupled with continued customer preference for payment by check, make imaging-based systems a sound solution not only to save labor but also to create new services.
Courtesy amount recognition technology was oversold in its early stages; the expectations of banks were set based on claims of 70 percent read rates,11 whereas commercially available technology achieves a 45-55 percent read rate (at a 1 percent error rate). Thus, some banks have been left with a negative impression of the labor savings potential of ICR. Nevertheless, because check processing is labor intensive, there continues to be a high demand for ICR systems with very high read rate and control over the error rate to meet the requirements for a wide set of applications. The design of such a system is presented in this paper. It is based on a multi-layered corroboration-based ICR system, and processes both machine printed and handwritten checks. Corroboration is performed both at the character level and at the field level. In particular, corroboration between legal, courtesy and spoken amounts is described. A highly accurate handwritten phrase recognizer is being integrated to validate the courtesy amount from the legal amount.
Section 2 defines some technical terms before giving a summary of the market for check processing. This will serve as a motivation and justification for building the multi-layered, corroboration-based ICR system. Section 3 describes the challenge in reading checks for the lockbox application, which goes beyond recognizing the amount. A subjective image analysis for each target field is given in Section 3 using specific examples; this is intended to set expectations of what can be achieved using an ICR system. Section 4 contains the technical details of the proposed multi-layered corroboration-based system. A handwritten legal amount phrase recognition is presented in detail. Section 5 gives the results from each recognition source in isolation and in corroboration. Finally, Section 6 summarizes findings and proposes new directions for still further savings in check processing labor.
The market study and analysis of the potential savings in the check business presented in this section will serve as a motivation for the proposed check reader. But first, a brief discussion of some technical terms is needed to clarify areas of interest.
2.1 Check technical terms
Checks are generally classified by banks into processing categories:1,10 inclearings, over-the-counter, and pre-encoded.
Inclearings are checks written by customers of a bank, but deposited at other financial institutions. After these institutions have processed them for their accounts, inclearing checks are returned to the bank of origin and debited against that bank's accounts. These checks are easy to handle because they have already been encoded.
Over-the-counter: These checks, deposited at bank branches and then delivered by courier to a processing center, include some written on the bank itself and some written on other institutions. They are more challenging for an image system than inclearings since no encoding information is available. These have a potential for labor savings, but may lack the context information of a transaction-based system.
Pre-encoded: These checks come from corporate accounts, or lockbox-based customers, meaning businesses that use the bank to process their incoming checks (usually the "lockbox" is a post office box reserved by the bank for such a customer.) Typically these are remittance transactions, such as magazine subscriptions and monthly utility bills. Being thus part of a transaction for which context is available, these checks should be suitable for automatic processing. Unfortunately, the checks may be business checks, which vary considerably in their design. This is therefore an area where new technology in imaging has potential for labor savings.
2.2 Market analysis and potential savings
Banks are made of billions and billions of pieces of paper: cash, checks, ledgers, account statements, loan applications, and regulatory filings. Financial institutions have long promoted services that reduce this huge daily consumption of paper. Nonetheless, the volume of checks in the US is growing at a rate of nearly 2 billion checks every year and is now reaching 69 billion checks annually. According to the Federal Reserve,3 banks spent 10 billion dollars to clear the 67 billion checks received in 1995. The fact is that neither ATMs, nor debit cards, nor home banking have yet persuaded the populace to write fewer checks. Given such a huge volume of paper-based documents, why have only a handful of banks fully converted to imaging systems? Reasons include:7
In summary, despite the advent of new electronic payment techniques, check-writing remains the primary method Americans use to transact payments. Meanwhile, advances in computer technology9 and pattern recognition now allow design of highly accurate ICR image-based systems at affordable cost.
In remittance processing typically five target fields need to be identified and recognized: signature presence, payee name, legal and courtesy amounts, and date. The challenge in recognizing each of these fields will be described, but first a brief discussion on image quality is important since it is a critical factor affecting the ICR read rates.
3.1 Image Quality
Because of the wide variation in document quality, the problem of adjusting a scanner for the highest recognition rate will always be an issue (Figure 1). Ideally a gray-level image can be captured to allow for software-based background removal29 thus isolating the text regions. However, in some cases banks have already invested in a binary image system. The next generation of document processing undoubtedly will use gray level information. Furthermore the standard spatial resolution for check image processing is quickly moving from 200 to 300 dots per inch.

Figure 1. Image quality problem
3.2 Image analysis and document layout
An important distinction between checks and other documents is the multi-pass printing of checks. First, the check itself (background and text) is created with a high print quality. The text regions will appropriately have various font types and sizes. Then the check is filled out with various quality of type printed or handprinted text. For example, the legal amount shown in Figure 2 has broken characters while the payer name has touching characters. Such potential problems dictate the use of locally adaptive image analysis.

Figure 2. Challenges in reading checks
Other problems associated with multi-pass printing are skewed text regions (Figure 2) and overlapping characters. Furthermore, the image analysis needs to selectively remove any noise (e.g., scanning noise, background scenes and textures) without affecting the text information. To these problems we can add typical character recognition problems (e.g., line/word/character segmentation, font variations, reverse video, broken and touching characters). In lockbox check processing applications for which the content of an envelope is fed manually into a scanner, a skewed image will often be captured, as shown in Figure 3.

Figure 3. Skewed image
Most of the problems discussed thus far can be handled by an ICR system. However, some characteristics unique to checks pose special difficulties. For security reasons, the printing of the legal amount might use "protection fonts" or a highly graphical representation (Figure 4). That is, the legal line may be made deliberately unsuitable for machine processing, to avoid tampering.
3.3 Semantic type context analysis
Once the text fields have been isolated, the next challenge is to identify the semantic type (i.e., legal amount, courtesy amount, payee name, etc.) Physical location is certainly a strong feature, but not sufficient to assign the correct semantic type reliably. For instance, the date field may be near or far from the courtesy amount, to the right, left, or above it. The legal line may be near the payee name line, which makes it difficult to segment the two into separate fields. Using a philosophy of least commitment, we defer final type assignment until recognition results are available. For instance, by recognizing labels such as DATE, PAY TO THE ORDER OF, AMOUNT, etc., we can significantly improve the assignment.
3.4 Field-specific recognition challenges
The challenge of locating and recognizing the five semantic types typically found in remittance applications is now described.
3.4.1 Legal amount recognition challenge
About 80 percent of machine printed business check images tested had a legal line. The image shown in Figure 5 lacks a legal amount. Even in a clean environment, the lack of redundant information will affect the error rate.

Figure 4. Challenges in reading checks: unusual font.

Figure 5. Check with no legal amount
On most personal checks and some business checks, the legal amount is handwritten (e.g., Figure 3.), posing a major challenge. Section 4 of this paper describes how legal line phrases can be successfully recognized.
When the legal amount is not fully spelled out (e.g., Figure 6), the recognition accuracy drops due to lack of redundancy. Of the machine printed checks processed that have a legal amount, an estimated 15.5 percent use a digit-type format.

Figure 6. Digit-type legal amount
3.4.2 Courtesy amount recognition challenge
The courtesy amount is the numerical character representation of the amount on the check. While the legal amount may not be present on all checks, the courtesy amount should (in theory) always be present. Finding the courtesy amount on personal checks is not difficult since these items tend to follow the ANSI standard location. However, business checks do not conform to the ANSI specification, and their courtesy amount locations vary widely.

Figure 7. Example of courtesy amount fused with date field
3.4.3 Date recognition challenge
The date field is the most difficult target to isolate and recognize for several reasons: it has only a few characters; it can be located practically anywhere in the upper region of a check; and it may be fused (in almost any spatial orientation) with the courtesy amount. Once isolated, character recognition engines have problems recognizing short fields because of lack of field-level statistics. Furthermore, the date format varies from a numeric format (07-20-95) to an alphanumeric format (Jul. 20, 1995). Section 5.3.2.3 presents a brief study of the performance of four machine print classifiers on date fields.
3.4.4 Payee name recognition challenge
The payee name field is the easiest field to locate and to recognize, as it usually exhibits a clean background and good quality characters. This is explained by the fact that the payee address block tends to follow USPS Publication 28, which gives recommendations to ensure proper delivery of the mail piece.
Figure 8 illustrates an example in which the payee name is above the legal amount line. A purely geometric-based location algorithm would have misidentified the two strings. Such cases motivate use of semantic clues to correctly parse the target fields.
3.4.5 Signature detection challenge
Signature detection on checks can be done with high accuracy for two reasons: first, the signature location is reasonably well constrained, and second, the system is only looking for presence of a signature (as opposed to fraud detection which requires recognition of the signature shape).

Figure 8. Payee name example
In an application such as wholesale lockbox processing for banks, extremely low error rates (on the order of 1 misread field in 10,000 processed documents) are expected in reading the amount. A trained keyer under speed constraints will miskey anywhere from 1 in 50 keyed characters to fewer than 1 in 500. These generally random errors lead to a per-field error rate that is intolerable in the check processing application. Thus, a two-pass keying scheme is ordinarily used. Because of the cost per keyer, it is advantageous to perform corroboration between a human operator (keying or voice) and an ICR engine, thereby limiting the double keying to discrepancies only. By using legal and courtesy amount corroboration, some amounts are recognized with a reliability equivalent to a two-pass keying system, so that in these cases only minimal human involvement is required.
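The gap between per-character and per-field error rates can be made concrete with a short calculation. The sketch below assumes that keying errors are independent from character to character and that an amount field holds six keyed digits; both are illustrative assumptions, not figures from this paper.

```python
def field_error_rate(char_error_rate: float, field_length: int) -> float:
    """Probability that at least one character of a field is miskeyed,
    assuming independent per-character errors (an assumption)."""
    return 1.0 - (1.0 - char_error_rate) ** field_length


FIELD_LENGTH = 6          # assumed field length, e.g. 1234.56 keyed as six digits
TARGET = 1 / 10_000       # error rate expected by the application

for char_rate in (1 / 50, 1 / 500):
    rate = field_error_rate(char_rate, FIELD_LENGTH)
    print(f"char error {char_rate:.4f} -> field error {rate:.4f} "
          f"({rate / TARGET:.0f}x the target)")
```

Even at the better keying rate, the single-pass field error under these assumptions is roughly two orders of magnitude above the target, which is why a second, independent source is needed.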
A document layout data structure is essential not only to extract structural information but also to allow for management of uncertainty. Section 4.1 describes the document layout class implemented, along with the mechanism to hypothesize fields of interest using both a priori knowledge of the application and information acquired from the image. The most important contribution towards controlling the error rate is presented in Section 4.2 with the layers of corroboration. Section 4.3 describes in detail the subcomponents of the ICR system currently being tested for deployment, and Section 4.4 focuses on the handwritten phrase recognition component.
4.1 Document layout and inference process
The first stage in document understanding is to segment the image into objects that are recursively segmented into text, graphic, and remnant (all non-classified objects). We have designed the hierarchy shown in Figure 9, which was inspired by the ISO/IEC 8613.13

Figure 9. Document layout architecture
A frame for the purpose of check item processing is defined as an isolated region containing one or more complete semantic types (legal line, MICR line, payee name, etc.). Examples of frames are shown in Figure 10. Notice the date label was put into a separate frame (which is not a violation of the frame definition). However, the isolation of the ZIP Code field may be considered a violation of an address block semantic type (but not a violation for a ZIP Code semantic type).
The information of the document layout is obtained using a hybrid approach.14 First, the frames are created by a top-down method. Then, subsequent document layers are obtained through a bottom-up analysis of the connected components. At each layer, line detection is used to identify some line-type graphics (lines, corners, T's, boxes, etc.) Components not classified are stored in the remnant list.
Information extracted from each frame is used in creating the best hypotheses for a given field of interest. Discrimination between machine print and handwritten fields triggers the appropriate enhancement, segmentation and recognition. Because the determination is based on connected components, it is imperative to perform the graphic detection (e.g., of a line connecting all the digits in the field) before determining the font type. The discriminant features used are the mean and variance of connected component heights, and the minimum and maximum vertical extent.
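The features named above can be computed directly from connected component bounding boxes. The following is a minimal sketch: the representation of a component as a `(top, bottom)` pair, the reading of "vertical extent" as the topmost and bottommost rows, and the decision rule with its `max_relative_spread` threshold are all hypothetical and are not the discriminator used in the system.

```python
from statistics import mean, pvariance


def height_features(boxes):
    """Compute discriminant features from connected component boxes.

    boxes: list of (top, bottom) row coordinates, with top <= bottom.
    """
    heights = [bottom - top + 1 for top, bottom in boxes]
    return {
        "mean_height": mean(heights),
        "var_height": pvariance(heights),
        "min_top": min(top for top, _ in boxes),           # assumed meaning of
        "max_bottom": max(bottom for _, bottom in boxes),  # "vertical extent"
    }


def looks_machine_printed(boxes, max_relative_spread=0.15):
    """Hypothetical rule: machine print has nearly uniform component heights."""
    f = height_features(boxes)
    relative_spread = (f["var_height"] ** 0.5) / f["mean_height"]
    return relative_spread <= max_relative_spread


printed = [(10, 29)] * 5
handwritten = [(5, 30), (12, 28), (8, 40), (15, 27)]
print(looks_machine_printed(printed), looks_machine_printed(handwritten))
```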

A) Original Image B) Top down sort of frames
Figure 10. Example of frame extraction
With a document layout representation containing some information about type, number of blocks for each frame, number of lines for each block, and so on, we need a mechanism to efficiently identify the potential fields requested. This differs slightly from form processing, in which a template is stored to help extract the appropriate regions. It is not realistic to provide a template for all possible business checks. For this reason we must define a generic template that contains information about all fields that might be found on a check. Such information contains statistics of the location, and in some cases (e.g., MICR line) it contains the font type. The geometric and logical representation of the generic template is shown in Figure 11.

Figure 11. Geometric and logical representation of a generic template.
Each rectangular region is an indication of where the information may be found. The date, for instance, has a broad area where it may be found, which overlaps with other fields. Using the generic template and the frame objects found, a ranked list of hypotheses for each field of interest is generated. For instance, in recognizing a legal line, the frames can be ranked based on their location and their content indicating propensity to be a legal line. As another example, the MICR line may never be explicitly requested (since it is read through a separate magnetic reader); however, the position and the component shapes of the MICR line are so stable that it should be used for registration of the document. The hypothesized fields will go through recognition and will be parsed according to the semantic type (e.g., legal amount) of interest.
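One simple way to rank frames against a template region is by how much of each frame falls inside that region. The sketch below uses location only; the rectangle format `(left, top, right, bottom)`, the coordinates, and the scoring function are illustrative assumptions, and the content-based evidence described above is not modeled.

```python
def overlap_fraction(frame, region):
    """Fraction of the frame's area that lies inside the template region.

    Rectangles are (left, top, right, bottom) in pixels.
    """
    left, top = max(frame[0], region[0]), max(frame[1], region[1])
    right, bottom = min(frame[2], region[2]), min(frame[3], region[3])
    intersection = max(0, right - left) * max(0, bottom - top)
    area = (frame[2] - frame[0]) * (frame[3] - frame[1])
    return intersection / area if area else 0.0


def rank_frames(frames, region):
    """Return frame names ordered from most to least likely for the field."""
    return sorted(frames,
                  key=lambda name: overlap_fraction(frames[name], region),
                  reverse=True)


# Hypothetical template region for the date field and three extracted frames.
date_region = (400, 0, 800, 150)
frames = {
    "a": (500, 20, 700, 50),   # fully inside the region
    "b": (0, 200, 300, 250),   # outside the region
    "c": (350, 20, 450, 60),   # half inside the region
}
print(rank_frames(frames, date_region))
```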
4.2 ICR Multi-layered corroboration system
There are several layers, as depicted in Figure 12, where corroboration between results can help improve the recognition accuracy.

Figure 12. ICR Corroboration Schemes
At the character level, some significant improvements were achieved using two classifiers (section 4.2.1). Intra-field corroboration further improves the accuracy. In this case several segmentation alternatives are applied on the same field image. These alternatives are ranked using connected component analysis and a variable segmentation cost associated with splitting and merging of components. The top N alternatives go through recognition, where N varies from 1 to 3 depending on the quality of the image. (Limiting the number of such alternatives is important to maintain the required throughput.) Each character is recognized by two separate classifiers, and merged using techniques discussed in 4.2.1. Then, inter-field corroboration (e.g. between legal and courtesy amounts) allows further improvement in the tradeoff between error rate and correct rate. Finally, external sources such as spoken amount, balancing, or keying will complete the resolution of the transaction.
4.2.1 Isolated digit corroboration
Highly accurate OCR engines are now available commercially in software or hardware. The raw accuracy attainable by any one approach appears to have reached a plateau of sorts.18,19 Recently, schemes for further improving the read/error tradeoff by combining the outputs of multiple OCR engines are receiving increasing attention,18 and some commercial OCR products34,35 are already on the market.
There are at least three schools of thought on how to merge recognition results. First, results can be merged based on their ranking15,37 without regard to precise confidence values. This method has proved to be useful especially when several classifiers are available. Second, a normalization on the confidence behavior can be performed so that a simple aggregate operator can be used. Third, the behavior of each classifier can be investigated and the confidences remapped in a combining scheme that adapts to the strengths and weaknesses of each classifier. In this paper, we present experiments with the first two approaches, and more detailed corroboration schemes will be published in a separate paper.
We use the term "corroboration" rather than the common term "voting", for two reasons. First, there is a mathematics for the study of elections,36 which is concerned mainly with cases of many voters each supplying relatively simple decisions; by contrast, the algorithms examined here use only two inputs, but can use detailed confidence information from each. Second, in this study we are able to evaluate accuracy because the ground-truth is known, which is not generally the case in elections.
In Section 5 we present corroboration results using some basic operators to merge pairwise results from four classifiers; two of which are hardware-based classifiers (Mitek and AEG 6160), and the other two software-based (NIST and TRW).
4.2.1.1 Basic Corroboration Operators
We tested simple rules as reference points to investigate the general behavior of our engines in corroboration. For engines that return second choice results and confidences (Mitek, AEG and TRW) these were included in the computation. The three basic operators tested were Minimum, Average, and Bayes, with output contours as depicted in Figure 13.
The Min operator implements a pessimistic approach and is good at isolating errors, but ignores cases where one classifier has strengths that can circumvent a weakness in the other. For a non-zero output, the engines must agree to some extent (at least in the second choice.)
The Avg operator has the advantage of performing well on non-normalized confidence. It behaves more smoothly in that a disagreement is limited to one-half of an engine's confidence. However, this technique also ignores particular strengths of a classifier. For instance two engines in agreement with confidence values 70 and 40 would produce a larger output than if the confidence values were 95 and 5, but our experience is that the latter result is more likely correct if the engines are of good quality.
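The behavior of the Min and Avg operators can be illustrated with a few lines of code. In this sketch each engine's output is modeled as a dictionary from character class to confidence (0 to 100), with absent classes treated as confidence 0; this data layout and the function names are assumptions made for illustration.

```python
def combine_min(conf_a, conf_b):
    """Pessimistic operator: both engines must support the class."""
    return min(conf_a, conf_b)


def combine_avg(conf_a, conf_b):
    """Averaging operator: a disagreement still keeps half a confidence."""
    return (conf_a + conf_b) / 2


def corroborate(choices_a, choices_b, operator):
    """Merge two engines' first/second choices and return the best class."""
    classes = set(choices_a) | set(choices_b)
    scored = {c: operator(choices_a.get(c, 0), choices_b.get(c, 0))
              for c in classes}
    best = max(sorted(scored), key=scored.get)  # sorted() makes ties stable
    return best, scored[best]


engine_a = {"7": 70, "1": 20}
engine_b = {"7": 40, "1": 35}
print(corroborate(engine_a, engine_b, combine_min))  # ('7', 40)
print(corroborate(engine_a, engine_b, combine_avg))  # ('7', 55.0)

# The case discussed above: Avg scores 70/40 higher than 95/5.
print(combine_avg(70, 40), combine_avg(95, 5))       # 55.0 50.0
```

When the engines share no class at all, Min yields zero for every class, whereas Avg still returns half of the stronger engine's confidence.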

Figure 13. Simple operators for corroboration
The Bayes operator considers the confidences as probabilities, and combines results using Bayes' Rule for conditional probabilities, whereby we estimate the probability P(A | (A=B)) of engine A's result being correct given that it agrees with engine B's result. Using CA and CB to represent the confidences assigned to a given character class by engines A and B, the combined confidence is given by
[Equation (1): combined confidence as a function of CA, CB, and N; equation image not available]
where N is the number of character classes. A rule can also be derived for the case where the classifiers disagree, i.e., have no nonzero-confidence classes in common.
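As an illustration of how such a rule can arise, one expression of this kind follows from two assumptions that are not stated in the text: the engines err independently, and a wrong answer is equally likely to fall on any of the N − 1 other classes. With CA and CB scaled to the range 0 to 1, the engines agree either because both are correct, with probability C_A C_B, or because both are wrong and happen to pick the same wrong class, with probability (1 − C_A)(1 − C_B)/(N − 1). Bayes' Rule then gives the sketch below, which should not be read as the exact form of equation (1).

```latex
P(\text{correct} \mid A = B)
  = \frac{C_A\, C_B}
         {C_A\, C_B + \dfrac{(1 - C_A)(1 - C_B)}{N - 1}}
```

Under these assumptions, agreement between two moderately confident engines produces a combined confidence higher than either input, and the effect grows with N because a chance agreement on the same wrong class becomes less likely.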
4.2.1.2 Engine Confidence Calibration and Regression
In view of the varying confidence characteristics of the engines, we considered use of a pre-normalization stage that would render the confidences more directly comparable. Three methods were examined: two attempted to fit a curve to the distribution of correct and incorrect reads at each confidence level; one of these was a low-order polynomial fit such as we have used in previous systems,20 while the other was a logistic regression. The third method was the percentile-rank score used by NIST in its handprint recognizer evaluations. The NIST method has the convenient property of being "unsupervised", i.e., it does not require knowledge of the ground-truth, only of the distribution of output values from the engine.
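The "unsupervised" property of a percentile-rank normalization is easy to see in code: only the engine's observed output values are needed, not the ground truth. The sketch below uses the ordinary definition of percentile rank (the fraction of observed values at or below a given value); the exact definition used in the NIST evaluations is not given here, so this is an assumption.

```python
from bisect import bisect_right


def make_percentile_rank(observed_confidences):
    """Build a normalizer from a sample of one engine's raw confidences."""
    ordered = sorted(observed_confidences)
    count = len(ordered)

    def rank(confidence):
        # Fraction of observed outputs that are <= the given confidence.
        return bisect_right(ordered, confidence) / count

    return rank


rank = make_percentile_rank([10, 20, 30, 40])
print(rank(25), rank(40), rank(5))
```

After this mapping, outputs of engines with very different raw confidence scales share the same 0 to 1 range and can be passed to a simple aggregate operator.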
4.2.2 Field corroboration
In addition to the character-level corroboration, we have implemented three types of field-level corroboration. First, in intra-field corroboration the same image is segmented using different strategies. A cost is associated with each resulting sequence of characters to reflect the segmentation quality. Second, in inter-field corroboration, the results of intra-field corroboration for two separate fields (e.g. legal and courtesy amount) are merged through a canonical form. Finally, an external source such as a spoken amount might be available to corroborate with the courtesy/legal amount. In the case of transaction-based check processing, the results from different items (check(s), stub, and invoice) are kept to allow for balancing which might take place in a further processing stage.
4.3 Semantic type field recognition
In the lockbox application there are five fields of interest, i.e., the date, the courtesy amount, the legal amount, the payee name, and the signature presence. Signature presence, previously discussed in Section 3.4.5, requires no syntax analysis. The other types are discussed below.
4.3.1 Date
Because the date field on business checks can be found practically anywhere in the upper half of the check image, several fields (up to five) are recognized and parsed to ascertain which is the date. To minimize some types of confusion, the date field is processed after the courtesy and the legal (if present) amounts have been resolved.
Specialized tools are being designed to handle date fields. This is an application for which the difference between a '1' and '/' is relevant. The matching algorithm first identifies the bigrams '95', '96', '97', which are common to all date patterns (e.g., Jan25, 1996, JANUARY 25 1996, 01/25/96, 01-25-96, etc.). It next examines the left neighboring characters to determine whether they are the completion of year (i.e. 1996), or a delimiter for the day or month. Although recognizing the date field requires testing several heuristics, we have designed a tool shell which supports both word spotting through inexact matching,39 and the use of wild card characters to search for formatted patterns such as **/**/**, **-**-**, etc.
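The two search mechanisms described above, year bigram spotting and wild card patterns, can be sketched as follows. This sketch uses exact matching only and assumes that each `*` stands for exactly one digit; the inexact matching of the actual tool shell is not reproduced, and the function names are hypothetical.

```python
import re

YEAR_BIGRAMS = ("95", "96", "97")


def wildcard_to_regex(pattern):
    """Translate a pattern such as '**/**/**'; '*' is assumed to be one digit."""
    return re.compile("".join(r"\d" if ch == "*" else re.escape(ch)
                              for ch in pattern))


DATE_PATTERNS = [wildcard_to_regex(p) for p in ("**/**/**", "**-**-**")]


def find_formatted_date(text):
    """Return the first substring matching a formatted date pattern, if any."""
    for regex in DATE_PATTERNS:
        match = regex.search(text)
        if match:
            return match.group(0)
    return None


def year_bigram_positions(text):
    """Return (index, bigram) for every year bigram found in the text."""
    return [(i, text[i:i + 2]) for i in range(len(text) - 1)
            if text[i:i + 2] in YEAR_BIGRAMS]


print(find_formatted_date("DATE 01/25/96"))
print(year_bigram_positions("JANUARY 25 1996"))
```

For the alphanumeric example, the bigram is found at index 13 and its left neighbors "19" complete the year, which is the check described above.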
4.3.2 Courtesy amount
To handle the courtesy amount successfully, some specialized tools were also designed. First we have ensured robust recognition of the dollar sign "$" since it is a reliable anchor, especially on personal checks where it characteristically has an edge bracket to its left as shown in Figure 14.

Figure 14. Dollar sign registration
With the "$" sign identified, a cleanup of the field is needed to remove graphic frames which were not removed successfully during the document layout analysis. For the machine printed amounts, the asterisk '*' symbol is also of great help in finding the courtesy and the legal amount; a string of one or more asterisks is essentially a currency symbol when the check is a business check.
The cents part of the courtesy amount is by far the most challenging piece, and will have the biggest impact on the overall confidence of the field. For this reason, it is worth spending major effort to have a specialized tool to detect the slash since, if present, it will clearly identify the cents and the boundary of the amount. This is not always trivial as depicted in Figure 15:

Figure 15. Courtesy amount slash example
If two connected components are found above the slash, then we can disregard the information below the slash. In some cases of touching digits the recognition of "100" is also a great help in assigning confidence to the segmentation. Other specialized tools recognize "xx" and connected "00", each as a single entity (i.e., no segmentation required).
Results for the courtesy amount need to convey specific information about the format ($DD.cc or $DD cc/100) to allow for separate confidence for dollars and cents. Also, since the resulting amount will be used in corroboration with the legal amount, more than one field result is provided. In the case of handwritten personal checks, the courtesy amount choices are passed to the legal amount phrase recognizer for validation as explained in the next section. With separate confidence for dollars and cents, the workflow manager will be able to request in some cases only cents keying to complete a transaction.
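Keeping separate confidences for dollars and cents lets the workflow decide how much keying to request. The sketch below shows one possible decision rule; the `AmountResult` structure, the `keying_request` function, and the acceptance threshold of 90 are hypothetical and are not taken from the workflow manager described here.

```python
from dataclasses import dataclass


@dataclass
class AmountResult:
    dollars: int
    cents: int
    dollars_conf: int  # confidence on a 0-100 scale
    cents_conf: int


def keying_request(result, accept_at=90):
    """Decide which part of the amount an operator must key (assumed rule)."""
    dollars_ok = result.dollars_conf >= accept_at
    cents_ok = result.cents_conf >= accept_at
    if dollars_ok and cents_ok:
        return "none"
    if dollars_ok:
        return "cents"   # only the cents portion needs keying
    return "full"


print(keying_request(AmountResult(1234, 56, dollars_conf=95, cents_conf=60)))
```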
4.3.3 Legal amount recognition
The approach to recognizing the legal amount is quite different for machine printed as compared to handwritten amounts. In the case of a machine printed legal amount, the recognition is done in parallel with the courtesy amount recognition. The algorithm used is based on spotting words from the legal word lexicon of about 50 words. This technique39 uses confusion matrices from the recognition engine (single engine or corroboration between two engines). Then the recognized words must be parsed to obtain a correct syntax.
4.3.3.1 Legal line parsing
Legal line recognition produces a set of numeric hypotheses with associated digit confidences. This is done so that corroboration between the legal and courtesy amounts can be performed with both sets of results in a normalized form. To do so, each of the most likely interpretations of the legal amount is parsed using a grammar describing the valid constructs found in legal amounts, and each successful parse is used for a syntax-directed translation of the legal amount string. The output of the translation is the set of digits and confidences that constitute the numeric equivalent of the legal line.
To illustrate this process further, consider the recognition results from a legal amount field shown below in Table 1. Only the top three candidates from the ranked lexicon for each word are considered, in order to reduce the number of cases that have to be parsed. Each of the twelve possible combinations is passed as input to a recursive descent parser that incorporates a syntax-directed translator. The grammar was derived from several thousand legal lines taken from ground-truthed business checks. It incorporates the common prefixes found in legal amounts, such as "PAY EXACTLY" or "THE SUM OF".
| | 1st Result | Conf | 2nd Result | Conf | 3rd Result | Conf |
|---|---|---|---|---|---|---|
| Word #1 | FOUR | 98 | | | | |
| Word #2 | THOUSAND | 92 | | | | |
| Word #3 | TWENTY | 76 | TWELVE | 64 | TEN | 51 |
| Word #4 | - | 99 | | | | |
| Word #5 | FOUR | 92 | | | | |
| Word #6 | DOLLARS | 88 | | | | |
| Word #7 | AND | 96 | | | | |
| Word #8 | SIXTY | 72 | FIFTY | 67 | | |
| Word #9 | - | 98 | | | | |
| Word #10 | SEVEN | 91 | | | | |
| Word #11 | CENTS | 83 | SEVEN | 52 | | |
Table 1. Parsing legal line word results.
Only two of the twelve possibilities survive through the parser:
"FOUR THOUSAND TWENTY - FOUR DOLLARS AND SIXTY - SEVEN CENTS"
"FOUR THOUSAND TWENTY - FOUR DOLLARS AND FIFTY - SEVEN CENTS"
Each word is an input symbol to the parser, and each symbol is assigned two numeric attributes: its recognition confidence and its numeric value if it is a number word, or a zero otherwise. The translator uses these attributes to generate output in the form of digit/confidence pairs. For the above two examples the output is:
| 4 (98) | 0 (99) | 2 (76) | 4 (92) | 6 (72) | 7 (91) |
| 4 (98) | 0 (99) | 2 (76) | 4 (92) | 5 (67) | 7 (91) |
Note that the value of the hundreds of dollars place is zero and that the confidence is arbitrarily assigned to 99. Also note that the non-number words in the legal line do not contribute to the confidence of any of the digits.
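The parsing and translation of the Table 1 example can be sketched with a small recursive descent parser. The grammar below is a deliberately simplified stand-in for the one derived from business checks: it handles amounts under ten thousand dollars, ignores prefixes such as "PAY EXACTLY", and assigns the confidence 99 to absent digits as described above. The class and function names are hypothetical.

```python
from itertools import product

UNITS = {w: i for i, w in enumerate(
    "ONE TWO THREE FOUR FIVE SIX SEVEN EIGHT NINE".split(), start=1)}
TEENS = {w: i for i, w in enumerate(
    "TEN ELEVEN TWELVE THIRTEEN FOURTEEN FIFTEEN SIXTEEN SEVENTEEN "
    "EIGHTEEN NINETEEN".split(), start=10)}
TENS = {w: i for i, w in enumerate(
    "TWENTY THIRTY FORTY FIFTY SIXTY SEVENTY EIGHTY NINETY".split(), start=2)}
DEFAULT_CONF = 99  # confidence assigned to digits with no corresponding word


class ParseError(Exception):
    pass


class Parser:
    """Simplified grammar (an assumption, not the production grammar):
    amount   := [units THOUSAND] [units HUNDRED] under100 DOLLARS AND under100 CENTS
    under100 := tens ['-' units] | teen | units | (empty)
    """

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.tokens[i][0] if i < len(self.tokens) else None

    def take(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expect(self, word):
        if self.peek() != word:
            raise ParseError(f"expected {word}, found {self.peek()}")
        self.take()

    def digit_before(self, keyword):
        # Optional "<units> keyword", translated to one (digit, confidence).
        if self.peek() in UNITS and self.peek(1) == keyword:
            word, conf = self.take()
            self.take()  # the keyword itself contributes no confidence
            return (UNITS[word], conf)
        return (0, DEFAULT_CONF)

    def under_100(self):
        word = self.peek()
        if word in TENS:
            w, conf = self.take()
            units = (0, DEFAULT_CONF)
            if self.peek() == "-":
                self.take()
                if self.peek() not in UNITS:
                    raise ParseError("units word expected after hyphen")
                u, uconf = self.take()
                units = (UNITS[u], uconf)
            return [(TENS[w], conf), units]
        if word in TEENS:
            w, conf = self.take()
            return [(1, conf), (TEENS[w] - 10, conf)]
        if word in UNITS:
            w, conf = self.take()
            return [(0, DEFAULT_CONF), (UNITS[w], conf)]
        return [(0, DEFAULT_CONF), (0, DEFAULT_CONF)]

    def amount(self):
        digits = [self.digit_before("THOUSAND"), self.digit_before("HUNDRED")]
        digits += self.under_100()
        self.expect("DOLLARS")
        self.expect("AND")
        digits += self.under_100()
        self.expect("CENTS")
        if self.pos != len(self.tokens):
            raise ParseError("unexpected trailing words")
        return digits


TABLE_1 = [
    [("FOUR", 98)], [("THOUSAND", 92)],
    [("TWENTY", 76), ("TWELVE", 64), ("TEN", 51)],
    [("-", 99)], [("FOUR", 92)], [("DOLLARS", 88)], [("AND", 96)],
    [("SIXTY", 72), ("FIFTY", 67)], [("-", 98)], [("SEVEN", 91)],
    [("CENTS", 83), ("SEVEN", 52)],
]


def surviving_parses(word_candidates):
    """Parse every combination of candidates; keep the successful ones."""
    results = []
    for combination in product(*word_candidates):
        try:
            results.append(Parser(list(combination)).amount())
        except ParseError:
            pass
    return results


for digits in surviving_parses(TABLE_1):
    print(" ".join(f"{d} ({c})" for d, c in digits))
```

With this simplified grammar, the combinations containing TWELVE or TEN fail because a hyphen cannot follow them, and those ending in SEVEN fail because the closing CENTS is missing, leaving the two interpretations listed above.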
4.4 From word to legal line phrase matching
For handwritten amounts, the legal phrase recognition is in a more experimental stage. In an effort to integrate the handwritten legal phrase recognizer in a production system, it was decided to use it as a validation to reinforce or reduce the confidence of the courtesy amount. To appreciate the major challenge associated with reading legal amounts, a detailed technical description is now presented starting at the word level and evolving towards phrase recognition.

a) Original b) Final segmentation-recognition
Figure 16. Initial and final step in word recognition
4.4.1 Handwritten word recognition21,25,32,33
To convert a word image into the final segmented characters (Figure 16) that best match a word in a lexicon, the following steps are needed:
4.4.1.1 Slant Estimation, Correction, and Smoothing
Among the many methods available for slant estimation, the simplest and perhaps one of the most effective approaches utilizes the chain code of the image boundaries. The average slant of characters in a word or in a line is easily estimated as
$$\theta = \tan^{-1}\left(\frac{n_1 - n_3}{n_1 + n_2 + n_3}\right) \qquad (3)$$
where $n_i$ is the number of chain elements at an angle of i × 45° (/ or | or \). A shear transformation is then applied to correct the slant: the image pixel located at (i, j) is moved to location (i, j1), where j1 = j + i·tan(θ). The chain code method is computationally more efficient than the projection method. Note that a filter is required to smooth the serrated borders of the slant-corrected image.
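A minimal sketch of slant estimation and shear correction follows. It assumes the commonly used chain-code estimator, in which the slant angle is the arctangent of (n1 - n3) / (n1 + n2 + n3), and it takes the three chain-element counts as given rather than tracing the contour. The function names, the row index i counted from the top of the image, and the rounding of shifts to whole pixels are illustrative assumptions.

```python
import math
from typing import List


def estimate_slant(n1: int, n2: int, n3: int) -> float:
    """Slant angle in radians from counts of 45, 90 and 135 degree elements."""
    # atan2 equals atan of the ratio when the denominator is positive
    # and returns 0.0 when both arguments are zero.
    return math.atan2(n1 - n3, n1 + n2 + n3)


def shear(image: List[List[int]], theta: float) -> List[List[int]]:
    """Move pixel (i, j) to (i, j + i * tan(theta)), rounded to whole pixels."""
    height = len(image)
    width = len(image[0]) if image else 0
    slope = math.tan(theta)
    shifts = [round(i * slope) for i in range(height)]
    low = min(shifts, default=0)
    high = max(shifts, default=0)
    # Widen the output so that no shifted pixel falls outside it.
    out = [[0] * (width + high - low) for _ in range(height)]
    for i, row in enumerate(image):
        for j, pixel in enumerate(row):
            if pixel:
                out[i][j + shifts[i] - low] = 1
    return out


# A three-pixel stroke leaning to the right ("/"): two 45 degree elements.
stroke = [[0, 0, 1],
          [0, 1, 0],
          [1, 0, 0]]
angle = estimate_slant(2, 0, 0)
for row in shear(stroke, angle):
    print(row)
```

In this toy case the leaning stroke becomes a vertical one. The serrated edges that shearing produces on real images would still need the smoothing filter mentioned above.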
4.4.1.2 Character Segmentation
Character segmentation is the process of extracting characters from a word/phrase image. There are several steps31 to character segmentation including (1) Contour analysis, (2) Run-length analysis and (3) Disjoint box segmentation.
In the first step, possible segmentation points are detected in terms of local extrema analysis of the upper contour of a word. Among the local minima, those that are not deep enough from the adjacent local maxima are sequentially removed. To obtain characters separated by vertical lines, segmentation points may be shifted horizontally as follows. If the minimal point is not open vertically upward, the point is shifted to the right or the left to the optimum point with regard to an evaluation function of the number and the total length of runs. The contour analysis method is used to obtain oversegmented character images. This is important as undersegmented images can lead to significant errors in word recognition.
In the run-length analysis25, ligatures within letter pairs such as "or" and "on" occasionally do not have a valley point in the upper contour. To detect and split these ligatures, a single-run stretch (Figure 17) is detected and split at its middle point. A run is a vertical streak of one or more black pixels, and a single-run is the only run on its vertical line.

Figure 17. Single-run stretch between 'o' and 'r'
A single-run stretch is a horizontal stretch of single-runs shorter than a threshold that depends on the average stroke width. Among these single-run stretches, those which have obvious peaks and valley points in the upper contours are removed. The remaining single-run stretches are split at the middle point.
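The detection of single-run stretches can be sketched as follows. The sketch reads the threshold as a limit on the length of each run, which is one interpretation of the description above, and it omits the removal of stretches that show obvious peaks and valleys in the upper contour. The function names and the toy image are hypothetical.

```python
from typing import List, Tuple


def column_runs(image: List[List[int]], col: int) -> List[Tuple[int, int]]:
    """Return (start_row, length) for each vertical run of black pixels."""
    runs = []
    start = None
    for row in range(len(image)):
        if image[row][col]:
            if start is None:
                start = row
        elif start is not None:
            runs.append((start, row - start))
            start = None
    if start is not None:
        runs.append((start, len(image) - start))
    return runs


def single_run_stretches(image: List[List[int]],
                         max_run_length: int) -> List[Tuple[int, int]]:
    """Return (first_col, last_col) of consecutive thin single-run columns."""
    width = len(image[0]) if image else 0
    stretches = []
    first = None
    for col in range(width):
        runs = column_runs(image, col)
        thin = len(runs) == 1 and runs[0][1] < max_run_length
        if thin and first is None:
            first = col
        elif not thin and first is not None:
            stretches.append((first, col - 1))
            first = None
    if first is not None:
        stretches.append((first, width - 1))
    return stretches


def split_columns(stretches: List[Tuple[int, int]]) -> List[int]:
    """Split each stretch at its middle column."""
    return [(first + last) // 2 for first, last in stretches]


# Toy image: two thick shapes joined by a one-pixel-high ligature.
image = [[1, 1, 0, 0, 0, 1, 1],
         [1, 1, 0, 0, 0, 1, 1],
         [0, 0, 0, 0, 0, 1, 1],
         [1, 1, 1, 1, 1, 1, 1],
         [1, 1, 0, 0, 0, 0, 0]]
stretches = single_run_stretches(image, max_run_length=2)
print(stretches, split_columns(stretches))
```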
For the disjoint box segmentation,25 characters and character segments are more efficiently processed if their bounding boxes are mutually disjoint. If oversegmentation is permitted, horizontal overlapping of character segments is resolved by a simple algorithm. A word image is split vertically at each pre-segmentation point and is separated into horizontally non-overlapping zones. A connected component analysis is applied to the split image to detect the boxes enclosing each connected component. These boxes are usually disjoint and do not include parts of other connected components (Figure 18). Disjoint bounding boxes play an essential role in high-speed feature extraction.

Figure 18. Disjoint box segmentation
4.4.1.3 Word Matching Algorithm
A lexicon directed algorithm24,25 is used to recognize the legal amount. The number of boxes (or segments) obtained by the disjoint box segmentation is generally greater than the number of characters in the word. In order to merge these segments into characters and find the optimal character segmentation, dynamic programming (DP) is applied using the total likelihood of characters as the objective function. The likelihood of each character is given by a discriminant function described below. To apply DP, the boxes are sorted left to right according to the location of their centroids. If two or more boxes have the same centroid x coordinate, they are sorted top to bottom. Numbers above or below the boxes in Figure 18 show the order of the sorted boxes. It is worth observing that the disjoint box segmentation and the box sorting process reduce the segmentation problem to a simple Markov process in most cases. For example, boxes 3 and 4 correspond to the letter "v" of Seventeen, box 5 to "e", box 6 to "n", and so on. These assignments of boxes to letters are represented, for example, by
| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Ai | S | e | v | e | n | t | e | e | n |
| j(i) | 1 | 2 | 4 | 5 | 6 | 8 | 9 | 11 | 13 |
where i denotes the letter number, j(i) denotes the number of the last box corresponding to the i-th letter. Note that the number of the first box corresponding to the i-th letter is j(i-1)+1.
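The box ordering and the j(i) bookkeeping can be sketched in a few lines. The box tuple layout (x_min, y_min, x_max, y_max), the assumption that y grows downward, and the function names are illustrative and are not taken from the system described here.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), y grows downward


def sort_boxes(boxes: List[Box]) -> List[Box]:
    """Sort left to right by centroid x, breaking ties top to bottom."""
    def centroid(box: Box) -> Tuple[float, float]:
        return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
    return sorted(boxes, key=centroid)


def letter_box_ranges(word: str, last_box: List[int]) -> List[Tuple[str, int, int]]:
    """Expand j(i) into (letter, first box, last box); first box is j(i-1)+1."""
    ranges = []
    previous = 0  # j(0) = 0 stands for the virtual box before the first letter
    for letter, last in zip(word, last_box):
        ranges.append((letter, previous + 1, last))
        previous = last
    return ranges


for entry in letter_box_ranges("Seventeen", [1, 2, 4, 5, 6, 8, 9, 11, 13]):
    print(entry)
```

For the assignment tabulated above, the expansion gives boxes 3-4 for "v", box 5 for the following "e", and box 6 for "n", in agreement with the text.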
Given [j(i), i = 1, 2, ..., n], the total likelihood of the characters is represented by
$$L = \sum_{i=1}^{n} l\bigl(i,\; j(i-1)+1,\; j(i)\bigr) \qquad (4)$$
where $l(i, j(i-1)+1, j(i))$ is the likelihood for the i-th letter.
In the lexicon directed algorithm, an ASCII lexicon of possible words is provided and the optimal character segmentation is found for each lexicon word. All lexicon words are then ranked according to their optimal likelihood per character (L*/n) to select the best candidate word.
The optimal assignment (the optimal segmentation), which maximizes the total likelihood, is found by applying the dynamic programming technique to likelihood values for different concatenations of the image segments.
The following example shows the tables of L and j(k-1)* for given k and j(k). In this example, L(n,j(n))* = L(4,5) = 6.71 and j(n)* = 5. Succeeding j(k)'s are 4,3,2,0 respectively. The number "0" corresponds to a virtual box standing for the last box of the letter preceding the first letter "F" in the word shown in Figure 19.

Figure 19. Segmentation of handwritten word
Character likelihood is calculated using a modified quadratic discriminant function (MQDF), which is less sensitive to the estimation error of the covariance matrix and requires less computation time and storage than the ordinary quadratic discriminant function (QDF)26,27. It is given as follows:

where X denotes the input feature vector, M denotes the sample mean vector for each character class, and λi and Φi denote the eigenvalues and eigenvectors of the sample covariance matrix. Values of the constants h² and k are selected experimentally to achieve the best performance28.
| 5 | - | - | - | 6.71 |
| 4 | - | - | 4.87 | 4.57 |
| 3 | - | 3.00 | 3.25 | - |
| 2 | 1.65 | 3.11 | - | - |
| 1 | 1.90 | - | - | - |
| j(k) ↑ k → | 1 | 2 | 3 | 4 |
| Letter → | F | o | u | r |
Table 2(a). L given k, j(k)
| 5 | - | - | - | 4 |
| 4 | - | - | 3 | 3 |
| 3 | - | 2 | 2 | - |
| 2 | 0 | 1 | - | - |
| 1 | 0 | - | - | - |
| j(k) ↑ k → | 1 | 2 | 3 | 4 |
| j(k) → | 1 | 2 | 4 | 5 |
| Letter → | F | o | u | r |
Table 2(b). j(k-1)* given k, j(k)
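The dynamic programming search behind Table 2 can be sketched as below. The per-segment likelihoods are not given in this paper: the values on the surviving paths are back-computed as differences of Table 2(a) entries, and the three marked "invented" are placeholders chosen only so that they lose to the tabulated path. The function and variable names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

NEG_INF = float("-inf")


def best_segmentation(
    n_letters: int,
    n_boxes: int,
    likelihood: Callable[[int, int, int], Optional[float]],
) -> Tuple[float, List[int]]:
    """Maximize the summed letter likelihoods; return (L*, [j(1)..j(n)])."""
    # total[k][j]: best likelihood when letters 1..k consume boxes 1..j.
    total = [[NEG_INF] * (n_boxes + 1) for _ in range(n_letters + 1)]
    back = [[-1] * (n_boxes + 1) for _ in range(n_letters + 1)]
    total[0][0] = 0.0  # virtual box 0 precedes the first letter
    for k in range(1, n_letters + 1):
        for j in range(k, n_boxes + 1):
            for prev in range(k - 1, j):
                if total[k - 1][prev] == NEG_INF:
                    continue
                score = likelihood(k, prev + 1, j)
                if score is None:  # this merge of boxes was not evaluated
                    continue
                if total[k - 1][prev] + score > total[k][j]:
                    total[k][j] = total[k - 1][prev] + score
                    back[k][j] = prev
    if total[n_letters][n_boxes] == NEG_INF:
        return NEG_INF, []
    ends, j = [], n_boxes
    for k in range(n_letters, 0, -1):  # trace j(n), j(n-1), ..., j(1)
        ends.append(j)
        j = back[k][j]
    return total[n_letters][n_boxes], ends[::-1]


# (letter index, first box, last box) -> likelihood for the word "Four".
SEGMENT = {
    (1, 1, 1): 1.90, (1, 1, 2): 1.65,
    (2, 2, 2): 1.21, (2, 3, 3): 1.35, (2, 2, 3): 0.60,  # 0.60 invented
    (3, 3, 3): 0.14, (3, 4, 4): 1.87, (3, 3, 4): 0.90,  # 0.90 invented
    (4, 4, 4): 1.32, (4, 5, 5): 1.84, (4, 4, 5): 1.00,  # 1.00 invented
}
best, ends = best_segmentation(4, 5, lambda k, a, b: SEGMENT.get((k, a, b)))
print(round(best, 2), ends, round(best / 4, 4))
```

Under these assumed values the search returns a total likelihood of 6.71 with last boxes 2, 3, 4, 5, which matches the backtracking sequence 5, 4, 3, 2, 0 quoted in the text. Dividing by the word length gives the per-character score used to rank lexicon words.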

Figure 20. Illustration of feature vector determination
The histograms of the chain codes of the contour elements are used as the feature vector. The rectangular frame enclosing character contours is divided into 4x4 rectangular regions. In each region, a local histogram of the chain codes is calculated (Figure 20).
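A minimal sketch of the local chain-code histogram is given below. It assumes that contour elements are supplied as (x, y, code) triples with 8-direction codes 0-7, and that opposite directions are folded together to give four orientations per region (64 features in total). The folding, the tuple layout, and the function name are assumptions, since the exact feature dimension is not stated here.

```python
from typing import List, Tuple


def chain_code_histogram(
    elements: List[Tuple[int, int, int]],
    frame: Tuple[int, int, int, int],
    grid: int = 4,
    directions: int = 4,
) -> List[int]:
    """Count chain codes per region of a grid x grid partition of the frame.

    elements: (x, y, code) contour elements with code in 0..7.
    frame: (x0, y0, x1, y1), the enclosing rectangle, upper bounds exclusive.
    """
    x0, y0, x1, y1 = frame
    width, height = x1 - x0, y1 - y0
    histogram = [0] * (grid * grid * directions)
    for x, y, code in elements:
        col = min(grid - 1, (x - x0) * grid // width)
        row = min(grid - 1, (y - y0) * grid // height)
        orientation = code % directions  # fold opposite directions together
        histogram[(row * grid + col) * directions + orientation] += 1
    return histogram


features = chain_code_histogram([(0, 0, 0), (7, 7, 5), (7, 7, 1)], (0, 0, 8, 8))
print(len(features), sum(features))
```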
4.4.2 Legal line phrase matching
Legal phrase matching is an extremely challenging task for several reasons. First, most legal line amounts are written in a fully cursive style as opposed to a hand-printed style. Second, word separation is often ambiguous when using projection or profile analysis. Before attempting recognition, the legal phrase image needs some special processing (Figure 21). In addition to overhangs from the previous line, one must consider the removal of various artifacts associated with this legal field. Some issues to consider are (1) removal of the underline without deleting portions of the writing, (2) removal of the numeric portion of the monetary amount, and (3) removal of long strokes drawn to fill any gaps in the field. Correct processing leads to a second stage, which includes noise suppression and slant correction, and to a final stage in which all the words of the phrase have been clearly identified.
Figure 21. Legal line with artifacts
Once the image has been prepared, legal line phrase matching can be accomplished in two modes. First, in a parallel mode, the legal field is recognized using a fixed size word lexicon consisting of numeric words "one" through "twenty", "thirty" through "hundred" in steps of 10, and "thousand". This assumes that clear word boundaries can be found, which is not always the case. Alternatively, in a serial mode, the courtesy amount is recognized first and the result used to generate a lexicon for the legal line field. Advantages and disadvantages of each mode are discussed below.
4.4.2.1 Parallel mode
The advantage in using a parallel mode is to obtain an independent (unbiased) contribution to the recognized amount. Should the legal amount then agree with the courtesy amount the confidence would be such that an extremely low error rate could be achieved. Also, if no recognition resulted from the courtesy amount, the legal line recognition and proper parsing would be the only possible result.
The disadvantage is speed. If the check amount is large, say $1253.76, then the legal expression might read
One thousand (and) two hundred and fifty-three dollars (and)
Thus as many as ten words may be written on a line of about four inches. The connector word "and" often appears at the end to connect with the fractional amount. In this case, gaps between words might become too small to correctly determine the word boundaries (and thus to extract words). Consequently, in many cases the full, independent result desired from the parallel mode may not actually be generated.
4.4.2.2 Serial mode
The advantage of this mode is speed. By restricting the possibilities to a few paths, a reasonable processing time can be achieved. This is currently the best compromise for using handwritten phrase recognition in a live production system. Because in serial mode the lexicon entries are valid phrases, no additional parsing is required to remove invalid sequences (e.g. "Thousand" followed by "Thousand"). However, this requires the creation of the phrase lexicon from the courtesy amount alternatives. One must consider the possibility that there are many variations in the legal field for the same numeric amount, e.g.:
| 1) One Hundred and Nineteen | 2) One Hundred Nineteen | 3) Hundred Nineteen |
| 4) One Hundred and Nineteen Dollars and | 5) One hundred and nineteen and |
In the serial mode of check processing, the courtesy amount is recognized using the corroboration approach described earlier. Rank-ordered multiple choices for the courtesy amount are derived in this process. These multiple choices are used to generate a lexicon of phrases characterizing the legal field. Phrase recognition is then performed in a manner similar to isolated word recognition, to obtain the best sequence of characters/words (Figure 22).

Figure 22. Last stage character alignments in phrase recognition of $119.xx
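Generating a phrase lexicon from courtesy amount alternatives might look like the following sketch. It covers whole-dollar amounts from 1 to 9,999 and only the kinds of variation listed above (optional "and", dropped leading "one", trailing "dollars and" or "and"). The function names and the chosen set of variations are illustrative assumptions rather than the production rules.

```python
from typing import Dict, Iterable, List

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_hundred(n: int) -> str:
    """Spell out 1..99."""
    if n < 20:
        return ONES[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[unit] if unit else "")


def dollars_to_words(n: int, use_and: bool) -> str:
    """Spell out 1..9999, optionally with 'and' before the last group."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, low = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if low:
        if parts and use_and:
            parts.append("and")
        parts.append(below_hundred(low))
    return " ".join(parts)


def legal_variants(dollars: int) -> List[str]:
    """Enumerate plausible legal-line spellings of one dollar amount."""
    variants = set()
    for use_and in (True, False):
        base = dollars_to_words(dollars, use_and)
        forms = [base]
        if base.startswith(("one hundred", "one thousand")):
            forms.append(base[len("one "):])  # writer dropped the leading "one"
        for form in forms:
            for tail in ("", " dollars and", " and"):
                variants.add(form + tail)
    return sorted(variants)


def phrase_lexicon(courtesy_candidates: Iterable[int]) -> Dict[str, int]:
    """Map each candidate phrase back to the amount that generated it."""
    return {phrase: amount
            for amount in courtesy_candidates
            for phrase in legal_variants(amount)}


lexicon = phrase_lexicon([119, 179])
print(len(lexicon), lexicon["one hundred and nineteen dollars and"])
```

Each rank-ordered courtesy amount alternative contributes a handful of phrases, so the lexicon stays small enough for phrase matching to run in reasonable time.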
In either mode, a confidence for each digit is needed to help merge the courtesy and legal amounts. For the example shown in Figure 22, the result might be:
| 1 (0.88) | 1 (0.88) | 9 (0.97) |
(values in parentheses are normalized confidences).
Much work is needed on handwritten phrase recognition and perhaps the best solution will be a hybrid, with short phrases done in a parallel mode and longer phrases handled serially.
4.5 Payee name matching
In lockbox processing, the ICR system receives a list of valid and invalid payee names from the workflow manager as part of the check processing transaction. The average size of the lexicon is about 20. The techniques used for word matching are the same as presented in the previous section on legal line recognition. If the payee name is machine printed, then a graph of the spotted words will be created and traversed to output a list of ranked payee names from the lexicon.
In lockbox processing a payee name is typically a business name, which means that special attention to frequent words (e.g., Inc., MD., etc.) and business related words (e.g., Transport, Banks, Medical Center, etc.) is needed to adequately compute a meaningful confidence. The bank offering this service has a database of valid and invalid payee names for each lockbox. A metadata analysis of this database not only identifies the non-information-bearing "stopwords" (e.g., Inc.), but also identifies the possible abbreviations (e.g., Incorporated, Inc.), and the misspelled words (e.g., Transportation) which will be at the tail of the word frequency histogram. To ensure accurate payee name matching, metadata tools are provided to the bank to analyze their data on an ongoing basis.
4.6 Error control, testing and introduction into a production setting
To bring an ICR system into production requires a large effort in integration. The testing cycle includes:
Designing an appropriate test set is a significant challenge when very low error rates are to be measured. The error rate must be strictly controlled in order to be acceptable in a production setting where actual financial transfers between businesses will take place without (in some cases) human intervention. The system must produce errors at no greater rate than is produced by current systems involving human verification (on the order of 1 incorrectly read amount in 10,000 checks). A caveat worth mentioning, however, is that personal checks in circulation in the United States contain a certain percentage, perhaps the better part of one percent, of incorrectly filled out amount data. The most common case is a check containing a legal amount and a courtesy amount that do not agree. Again, the system should be held to the same standards that are applied to systems using human verification. Usually these incorrect checks will be passed along using either the legal or courtesy amount, particularly if the disagreement is restricted to the cents portion of the amount. Errors of this kind made by the system should not be counted against the requirement for one error in 10,000.
In order to achieve a reasonable level of confidence in the error rates, the system must be tested using samples of millions to tens of millions of checks. For practical reasons it is difficult to create a public database of check images to test against. The expense of collecting, storing and truthing this quantity would be considerable, and the privacy issues may be insurmountable. The best alternative is to insert the check processing system into an existing production system known to have acceptable error characteristics. The existing system would be configured to process checks as it normally would, but also to pass all check images to the check reader. Statistics would be collected to indicate all cases where the result of the check reader differs from the resolved amount, and the image would be saved for further analysis.
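A rough calculation shows why samples of this size are needed. The sketch below uses the normal approximation to the binomial distribution to estimate how many accepted checks are required to measure an error rate to a given relative precision. The function name, the 95 percent confidence level (z = 1.96), and the precision targets are illustrative assumptions.

```python
import math


def checks_needed(error_rate: float, relative_precision: float,
                  z: float = 1.96) -> int:
    """Sample size so that the confidence half-width is
    relative_precision * error_rate (normal approximation)."""
    numerator = z * z * (1.0 - error_rate)
    denominator = relative_precision ** 2 * error_rate
    return math.ceil(numerator / denominator)


# Error rate of 1 in 10,000, measured to within 10 percent and 5 percent.
print(checks_needed(1e-4, 0.10))
print(checks_needed(1e-4, 0.05))
```

Under these assumptions, measuring a 1 in 10,000 error rate to within 10 percent takes roughly 3.8 million accepted checks, and tightening the precision to 5 percent takes roughly 15 million, which is consistent with the sample sizes quoted above.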
The testing process is iterative. Each run of one million or more images is subjected to failure analysis to determine adjustments to the various tuning thresholds. Because the quality of check images can vary greatly depending on the image capture characteristics and the conditions prevalent in the existing system, optimal values for tuning parameters may only be obtainable during this testing phase. As performance during these large tests improves, and repeated successful runs can be demonstrated, acceptance thresholds can be adjusted such that results from the check reader may gradually be accepted without being processed by the existing system. This method of introducing the check reader into a production setting minimizes the risk of adversely affecting the existing operation.
The system described in Section 4 is currently under development for a large US bank. Tuning, massive volume testing, and full integration will be done in the coming months. The unit-test results presented in this section were used not only to influence the overall design but also in the selection of classifiers. Results on a commercially available courtesy amount reader are presented in section 5.1. Section 5.2 presents corroboration results using the basic operators discussed in Section 4. We will then present some preliminary subcomponent results for target fields in a lockbox application.
5.1 Commercially available courtesy amount reader
A database of 1000 courtesy amount images extracted by a human operator was ground-truthed to test the Courtesy Amount Reader (CAR) available from Mitek Systems. Mitek's excellent neural network-based handprint digit classifier shows read rates reaching the high 90's on our databases (over 500,000 characters). The purpose of this experiment was to identify some of the problems in integrating a commercially available CAR package (e.g. should the image be filtered? the graphic objects be removed?)
In a "wide-open" system (i.e., no rejections based on confidence), the Mitek CAR read 62.2 percent correctly (i.e., the reported results after '$' agreed with the ground truth). About 14 percent of all images were rejected; these contained several fragments due to poor image quality. That left more than 23 percent with errors. Of the 234 wrongly recognized courtesy amounts, 58 were deletions, 27 were insertions usually caused by graphic borders, 17 were failures to resolve cents due to lack of punctuation, and the remainder were one or more character substitutions.
Next, we thresholded the confidence values to measure error isolation. At a character confidence of 90 or higher, 45 percent of the amounts were correctly recognized with a four percent error. This is good, given that for this test we have not tried to improve the image (e.g. to remove borders and noise.) After analyzing the errors and rejected fields, we can conclude that the Mitek CAR performance would lead to results in the range of what banks have reported (i.e., 45-55 percent read rate at 1-2 percent error rate).
Although we are currently using the Mitek handprint digit recognizer for corroboration at the character level, the current corroboration scheme does not use the Mitek CAR at the field level. Segmentation is certainly a major challenge, and we handle it through multiple segmentation alternatives (as discussed earlier) rather than having a second complete field recognizer. Corroboration at the field level using two separate CAR systems would be valuable primarily if the underlying digit classifiers are also based on different technologies.
5.2 Pairwise digit classifier corroboration
A study of four engines (allowing six pairwise combinations) was conducted to illuminate the behavior of various simple combining rules. The objectives of the study were (1) to identify success factors in the pairwise combinations (e.g., quality of constituent engines, orthogonality, sensitivity of the algorithms to engine peculiarities), and (2) to indicate how elaborate a combining rule might need to be, to attain compelling results.
We experimentally compared the accuracy of some simple confidence-based corroboration techniques, seeking to identify the factors affecting accuracy (e.g., pairwise orthogonality, individual engine quality, sensitivity of the algorithm to engine-specific properties.) To this end, four recognizers were tested in all pairwise combinations against over 60,000 handprint numeric characters, under various corroboration operators (as described in 4.2.1), for a wide range of read/error tradeoff38 thresholds.
5.2.1 Engine Selection
Two commercial hardware engines were included in the study: Engine A is an AEG-6160 statistical classifier, and Engine B is a Mitek neural network classifier. Engines A and B placed high in NIST handprint studies and are known to have been used successfully in commercial check-reading applications (for both handprint and machineprint documents, although for this test we only examined handprint.) The other two classifiers were non-commercial, software-based classifiers. Engine C is TRW's curvature-based classifier version 1.0. Engine D is the public domain classifier produced by NIST, based on the Karhunen-Loeve (K-L) transform.
The engines chosen exhibit a range of characteristics as shown in Table 3. The software engines chosen for the study are of lower overall quality than the commercial hardware products (but we show that they are quite useful as part of a corroborative pair.) The engines also showed some peculiarities in their output confidence distributions, to which we expected the combining algorithms would be somewhat sensitive. For example, although Engines A and B both return a preponderance of high-confidence, correct results, Engine B's outputs are coarsely quantized, with 90 percent of the outputs falling into only 4 confidence levels. The finest quantization (most confidence levels used) occurs in Engine D, which returns a floating-point confidence; however, the range of these values is extremely narrow and the values tend to be very high (above 99 percent) even for populations that include many errors. It is a truism in discussions of corroboration that the engines' error characteristics should be "orthogonal". If we measure correlation of errors at the wide-open (no threshold) setting, for the above engines we find that the highest correlation is .16 between the two highest-accuracy engines. If we measure orthogonality as the sine of the angle whose cosine is the correlation, all pairs tested were at least .987 orthogonal; thus we did not expect to find, and indeed did not find, that this measure had a large effect on the quality of results.
| | A [AEG] | B [Mitek] | C [TRW] | D [NIST] |
| wide-open accuracy | high 90's | high 90's | low 90's | low 90's |
| quantization of confidence | medium | coarse | fine | fine |
| variance of confidence | medium | medium | wide | narrow |
Table 3. Engine characteristics
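The orthogonality measure described above (the sine of the angle whose cosine is the error correlation) reduces to the square root of one minus the squared correlation. The sketch below computes it from per-character error indicators; the function names and the toy error vectors are hypothetical.

```python
import math
from typing import Sequence


def error_correlation(errors_a: Sequence[int], errors_b: Sequence[int]) -> float:
    """Pearson correlation of two 0/1 error indicator sequences."""
    n = len(errors_a)
    mean_a = sum(errors_a) / n
    mean_b = sum(errors_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(errors_a, errors_b))
    var_a = sum((a - mean_a) ** 2 for a in errors_a)
    var_b = sum((b - mean_b) ** 2 for b in errors_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # an engine with no errors (or only errors) carries no signal
    return cov / math.sqrt(var_a * var_b)


def orthogonality(correlation: float) -> float:
    """sin(arccos(r)), written as sqrt(1 - r^2)."""
    return math.sqrt(max(0.0, 1.0 - correlation * correlation))


print(round(orthogonality(0.16), 3))
```

For the highest correlation reported above, 0.16, this gives approximately 0.987, the lower bound quoted for all pairs tested. The measure stays close to 1 for any small correlation, which helps explain why it discriminates so little between engine pairs.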
5.2.2 Evaluation Criteria
The "quality" of a recognizer, or combination of recognizers, may be difficult to state as a single number since it depends on an interaction between two variables (accept/reject threshold and correct/incorrect results). Rather than attempt a global objective function for the entire curve, we selected two points at which to characterize accuracy: one at a 95 percent accept rate and one at 60 percent accepts. At each point, the quality of recognition was determined from the fraction of accepted results that are incorrect. (Where results are summarized over several combinations of engines, a geometric mean is used; that is, averaging is done over the logs of the error rates.)
The 95 percent accept point represents applications in which the major goal is reduction of keying labor; often in such applications some further means of error reduction will be available beyond the confidence thresholding. The 60 percent point was selected from the observation that for most engines, the errors-per-accept ratio approaches an asymptote, so that 60 percent shows the approximate limits of a recognizer's ability to control errors by confidence alone.
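The two evaluation points and the geometric-mean summary can be sketched as follows. The sketch accepts the highest-confidence fraction of the results and reports the share of accepted results that are wrong. The function names and the toy data are hypothetical; with coarsely quantized confidences, ties would make the exact accept rate only approximately attainable.

```python
import math
from typing import List, Sequence, Tuple


def error_rate_at_accept(results: List[Tuple[float, bool]],
                         accept_rate: float) -> float:
    """results holds (confidence, is_correct); accept the top fraction."""
    ranked = sorted(results, key=lambda item: item[0], reverse=True)
    accepted = ranked[:round(len(ranked) * accept_rate)]
    if not accepted:
        return 0.0
    errors = sum(1 for _, correct in accepted if not correct)
    return errors / len(accepted)


def geometric_mean(rates: Sequence[float]) -> float:
    """Average over the logs of the error rates (all rates must be > 0)."""
    return math.exp(sum(math.log(rate) for rate in rates) / len(rates))


# Ten toy results: confidences 100 down to 91, with errors at 97, 92 and 91.
toy = [(100 - i, (100 - i) not in (97, 92, 91)) for i in range(10)]
print(error_rate_at_accept(toy, 0.60), error_rate_at_accept(toy, 1.00))
print(geometric_mean([0.01, 0.04]))
```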
5.2.3 Digit corroboration results
Six pairwise combinations were tested. Sample results for two of the combinations (AEG-Mitek and AEG-TRW) are shown in Figure 23. The most striking feature of the results is the lack of significance of some quality factors that might have been assumed important. For example:
By far the best predictor of the outcome of a pairwise combination was the accuracy of the constituent engines. In the configurations tested, the engine which had the lowest errors at a given accept rate also showed the best ability to improve the results of the other engines. Overall, the error rate of a combination (at 95 percent accepts) correlated 0.65 with the error rate of the better engine.
By contrast, orthogonality was a poor predictor of pairwise accuracy (in fact, the best two engines were also the least orthogonal pair.) As a predictor of the degree of improvement over the better of the two engines in a pair, orthogonality correlated -0.08 with that improvement. Overall we conclude that the simple measure of orthogonality (errors at a wide-open threshold) is of no use in evaluating the ability of two engines to corroborate each other in a confidence-based system; apparently the orthogonality criterion is so easily satisfied as to exert little influence. This conclusion holds even at high accept rates, where we would have thought that the wide-open error behavior had some predictive value.
5.2.4 Quality of results - Summary
Sensitivities inherent in Min and Bayes operators (or Bayes's cousin LogOdds) cause difficulties for these algorithms in some engine combinations; the differences appear more pronounced at high accept rates. Overall the Avg operator (or its relative Linear) appears the most robust against engine peculiarities. Any of these algorithms could fail to produce results as good as their better constituent engine, if the second engine introduces a very large number of misreads with very high confidence. We summarize the findings of this study as follows:


Figure 23. Two engine combinations, three corroboration operators each
5.3 Target fields segmentation / recognition: preliminary results
The following results were estimated through failure analysis of the training set. They represent an expected potential for successfully extracting the fields on personal and machine printed checks. Some preliminary corroboration results between different fields and different sources (e.g., voice and ICR amounts) were also estimated.
5.3.1 Field segmentation
The field segmentation presented in this section is for business checks. (Fields on personal checks, as discussed previously, can generally be extracted successfully if the image has been deskewed.)
A correctly identified field means that the hand-segmented field was among the fields returned by the ICR system. A 25 percent overlap (not underlap) between the human and the ICR segmentation was accepted (i.e., the ICR segmented field must completely surround the target identified by a human operator).
Figure 25 summarizes the field segmentation performance estimated for each semantic type of interest. The legal amount and the payee name can reliably be segmented in about 95% of the processed checks. Image quality is by far the main factor affecting the ability to segment, followed by large spacing between the dollar amount and the cents.

Figure 25. Business check field segmentation performance
For the courtesy amount and date fields, regrouping (assignment of portions of a field to the wrong field) was the main problem after image quality. Also, the date graphical representation (box, line, etc.) created problems in both segmentation and recognition. It is important to remember that most regrouping segmentation errors are corrected by the postprocessor as it parses the character recognition results.
For the payee name field, image noise was the only significant source of error. Success rate on this field was among the highest, and the exceptions were frequently illegible to humans viewing the image.
For the signature field, location and detection are joined processes (i.e., if the density and regularity measure failed then the segmented field was rejected; failure to locate is failure to detect.) Quality of signature detection was highly subjective since not one single check in the training set truly had no signature (except for the stamp in Figure 2). But to be realistic, we had to reject several checks for which the signature area and the background noise were difficult to differentiate.
5.3.2 Field recognition
Recognition performance for each target field (except signature) is presented separately, and each may have a different optimal read / error tradeoff point.38 The machine-printed character recognition engine used for this study was a Calera MM600. Other engines were also briefly tested and are reported in section 5.3.2.3.
5.3.2.1 Legal amount
Allowing for field segmentation problems, the following results for machine printed checks were obtained:
Preliminary Test Results on Handwritten Word/Phrase Recognition
The algorithms described in the earlier sections were applied to legal line images extracted from bank checks. Many of the images were characterized by overhangs from the previous line field, incomplete elimination of underlines, the frequent presence of a numeric amount (e.g., 76/100) in the legal field, and long pen strokes separating the written amount from the amount of cents. Such images were removed from consideration in this preliminary test phase. A small set of 480 images was processed and the following results were achieved:
| Checks correctly processed | 352 (First Choice) |
| Checks in the top three choices | 88 |
| Checks rejected | 21 |
| Checks with incorrect recognition | 19 |
These results exclude the use of any gap analysis for determining word boundaries. Also, the long strokes between the legal amount and the fractional dollar portion were not removed. Most of the errors were caused by poor writing, incomplete removal of underlines and the crossed out or overwritten words in the legal field.
5.3.2.2 Courtesy Amount
Allowing for field segmentation problems, the following machine printed results were obtained:
The Calera engine used in this part of the test returns only very coarse character recognition confidence, making it difficult to isolate the troublesome cases. Besides this per-digit confidence and the character confusion matrix, there were a few useful heuristics that helped isolate the correct fields (e.g., the last two digits are .00, or an amount 300.00 is more probable than 300.06, for example). Even with these heuristics, the courtesy amount correct vs. error tradeoff is a highly variable and discontinuous function. This reinforces our belief that a single reader's output for the courtesy amount is unlikely ever to attain a low (1 in 10,000) error rate at a significant field read rate.
Clearly, the selection of the machine print recognition engine(s) will have a serious impact on the recognition performance. In cases for which redundancy information is not available (e.g., courtesy amount with no legal amount), the tradeoff between error and correct rate will directly depend on the discriminating power of the confidence reported for each character.
5.3.2.3 Date
Because the date field had the worst segmentation and recognition performance, a small study of four recognition engines was initiated to identify factors affecting the quality of results. The recognition devices were: Calera MM600, AEG 6160m, Mitek QuickStrokes, and ExperVision 3.0 XA TypeReader. The purpose of the study was not only to estimate potential recognition performance, but also to investigate the ability to isolate errors through a meaningful reported confidence.
A very small sample of 500 valid date fields was tested. These fields were segmented by a human operator. To indicate how noisy the fields were, the following results were obtained on trying to spot the year 95 or 1995 using a priori knowledge of substitutions (e.g., S → 5, I → 1, etc.) and up to the second classifier choice:
The complete date field recognition performance with the automated field segmentation, is estimated at:
The error rate in the above results is near 1 percent, i.e., roughly comparable to single-pass keying. The AEG and the Mitek have detailed confidence outputs, permitting thresholding that reduces the error rate to 0.1 percent or better with a read rate of 25-30 percent.
5.3.2.4 Payee name
The payee name reached the best recognition performance of all fields tested. Based on comparing the recognition result with the human-keyed payee name, the edit distance median was at 100 percent matching (i.e., more than half of tested payee names were recognized with no character recognition error). This truly shows the potential of reading clean images, as the payee name field is usually the most cleanly printed.
To more closely resemble a production system, a payee name matching test consisted of matching the located payee name against a lexicon of 20 words. The 20 words were selected at random from the list of all keyed payee names. Using a 0.7 fuzzy match threshold, 80 percent of the payee names were recognized with no matching error.
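Thresholded fuzzy matching against a small lexicon can be illustrated with the sketch below. The similarity measure used in the test is not specified here, so the sketch substitutes the ratio from Python's standard `difflib.SequenceMatcher` as a stand-in; the function name and the sample payee names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def match_payee(recognized: str, lexicon: List[str],
                threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Rank lexicon names by similarity and keep those at or above threshold."""
    scored = [
        (SequenceMatcher(None, recognized.lower(), name.lower()).ratio(), name)
        for name in lexicon
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored if score >= threshold]


# A recognition result with one substitution error (digit 0 for letter O).
candidates = match_payee("ACME TRANSP0RT INC",
                         ["Acme Transport Inc", "Apex Medical Center"])
print(candidates)
```

A production matcher would also need to discount stopwords such as "Inc" and handle known abbreviations, as discussed in Section 4.5, before computing the similarity.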
5.3.3 Amount corroboration
While the above tests show that a single channel can attain error rates comparable to a human keyer, results from corroboration tests showed the capability to reach error rates equivalent to or better than a two-pass keying system, thus representing significant savings in human labor.
5.3.3.1 Courtesy and legal amounts
The corroboration between legal and courtesy amount significantly improves the recognition performance of the check amount (compared to reading each field individually). About 42 percent of legal and courtesy amount pairs agree, with no errors found in that population. In about 4 percent of the tested checks, the courtesy amount had no valid data but the legal had a fully parsed result; all of these were correct.
To better isolate errors in the remaining population, corroboration results were split into 13 categories. Selection criteria included degree of agreement (e.g., complete disagreements showed an error rate of 30 percent). The legal amount classes were: nonexistent, digit-type format, alpha-type format, incomplete field recognition, or rejected. With corroboration, the estimated correct rate with "very low" error rate (actually no errors were found) is in the range of 50-55 percent.
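The idea of routing each check into a corroboration category can be illustrated with a much coarser split than the 13 categories used in the study. The five categories below, and the representation of amounts as integer cents with None meaning "no valid data", are assumptions made only for this sketch.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Optional, Tuple

class Category(Enum):
    AGREE = "courtesy and legal agree"
    DISAGREE = "both read, amounts differ"
    LEGAL_ONLY = "legal fully parsed, courtesy has no valid data"
    COURTESY_ONLY = "courtesy read, legal has no valid data"
    NO_READ = "neither field read"

def categorize(courtesy_cents: Optional[int], legal_cents: Optional[int]) -> Category:
    """Assign one check to a corroboration category (None means no valid data)."""
    if courtesy_cents is None and legal_cents is None:
        return Category.NO_READ
    if courtesy_cents is None:
        return Category.LEGAL_ONLY
    if legal_cents is None:
        return Category.COURTESY_ONLY
    if courtesy_cents == legal_cents:
        return Category.AGREE
    return Category.DISAGREE

def tally(pairs: Iterable[Tuple[Optional[int], Optional[int]]]) -> Counter:
    """Count how many checks fall into each category."""
    return Counter(categorize(c, l) for c, l in pairs)
```

Measuring the error rate separately within each category against keyed ground truth is what allows a bank to decide which categories can be accepted automatically and which must be referred to an operator.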
5.3.3.2 Voiced, courtesy, and legal amounts
From the test set, the courtesy amounts keyed by a human operator were printed out line by line, and read back to a Verbex voice recognition system. The results obtained are consistent with our previous findings that the human-recognized voice for a digit field has a 1 percent field error rate. This also confirms that a single human-voiced system has too high an error rate to be used by itself in a financial application; it must be accompanied by either a human-based keying system or an ICR-based system.
Given three sources (voiced, legal and courtesy), different strategies are possible to control the error rate. For this specific case, the estimated error rate is so low that an estimated 10 million checks would have to be tested to obtain a statistically valid measure.
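Two of the possible strategies, unanimous agreement and two-out-of-three agreement, can be sketched with a single voting function. This is an illustration of the general idea only; the strategies actually evaluated are not detailed here, and the integer-cents representation is an assumption.

```python
from collections import Counter
from typing import Optional, Sequence

def vote(amounts: Sequence[Optional[int]], min_agree: int) -> Optional[int]:
    """Return the amount reported by at least `min_agree` sources, else None.

    `amounts` holds one value per source (e.g., voiced, legal, courtesy);
    None means that source produced no valid amount.
    `min_agree` must exceed half the number of sources so the winner is unique.
    """
    if min_agree * 2 <= len(amounts):
        raise ValueError("min_agree must be a strict majority of the sources")
    counts = Counter(a for a in amounts if a is not None)
    if not counts:
        return None
    amount, n = counts.most_common(1)[0]
    return amount if n >= min_agree else None

# Unanimous agreement (strictest) versus two-out-of-three agreement.
strict = vote([12345, 12345, 12354], min_agree=3)
majority = vote([12345, 12345, 12354], min_agree=2)
```

Requiring all three sources to agree lowers the error rate at the cost of more rejects; accepting any two in agreement raises the read rate, and the appropriate choice depends on the error rate the application can tolerate.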
The estimated corroboration recognition performance between voice and ICR amounts can be summarized as follows:
5.4 Platform implementation
The ICR system requirement is to process 2-4 documents per second on a dual Pentium 166 MHz system equipped with an external recognition device (one or more of the classifiers tested here, such as the Mitek hardware recognizer). In a 10-hour shift, a single ICR unit will process over 100,000 checks per day.
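The relationship between the per-second requirement and the daily volume is simple arithmetic, sketched below under the assumption of a continuous 10-hour shift with no idle time. The function names are ours.

```python
SECONDS_PER_HOUR = 3600

def checks_per_shift(docs_per_second: float, shift_hours: float = 10.0) -> int:
    """Documents processed in one shift at a sustained rate."""
    return int(docs_per_second * shift_hours * SECONDS_PER_HOUR)

def rate_needed(checks: int, shift_hours: float = 10.0) -> float:
    """Sustained documents per second needed to reach `checks` in one shift."""
    return checks / (shift_hours * SECONDS_PER_HOUR)

low = checks_per_shift(2)        # lower end of the 2-4 documents/second range
high = checks_per_shift(4)       # upper end of the range
needed = rate_needed(100_000)    # rate required for 100,000 checks per shift
```

At 2 documents per second a unit handles 72,000 checks in 10 hours, and at 4 documents per second it handles 144,000; exceeding 100,000 checks requires a sustained rate of roughly 2.8 documents per second.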
The check processing market is very active with a large volume expected for another decade. Within the next two years, several banks will commit to image-based systems with document understanding capabilities. The cost of an ICR subsystem is small compared to the overall investment in a full image-based system, but the labor savings from ICR are large if the error rate can be closely controlled. Because the ICR subsystem is integrated as part of an overall data entry capability that also includes keying, the bank must be able to easily control the read rate / error rate tradeoff to modify the application as needed. Under those conditions the bank can use very high ICR accuracy not only to reduce labor but also to offer new services.
We have described the design of a check reader that will achieve very low error rates. It will be used in transaction-based processing, in which the content of an envelope (check(s), stub, and invoice) is processed as a single unit so that the ICR system will be able to use balancing as a way to corroborate between items. The system described is based on corroboration at the character level, at the word image level, between fields such as courtesy and legal amounts, and with external sources. With some simple corroboration operators such as Min, Avg, and Bayes, a significant improvement of the tradeoff between correct and error rates was achieved. The Avg operator is a good tool to quickly identify which pairs of classifiers hold some potential in corroboration. This is merely a prescreening of potential pairs, from which an in-depth analysis of the selected engines can better characterize their strengths and weaknesses and lead to an optimal merging algorithm.
The selection of a best pair of classifiers must also take into account cost, speed, and maintainability. A total of eight classifiers (four handprint and four machine print) were mentioned in this study. The final selection of which pair of classifiers to use for a given character set is still an open decision, since vendors offer new classifier revisions once or twice a year. Furthermore, with advances in general-purpose computers, software ICR solutions are becoming more cost-efficient.
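One common reading of the Min, Avg, and Bayes operators for merging the outputs of two character classifiers is sketched below. It assumes that each classifier reports a confidence in [0, 1] per candidate character, that a missing candidate counts as 0, and that the Bayes operator is a normalized product (equal priors, independent classifiers). These are assumptions of this sketch and may differ from the operator definitions used in the study.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]  # candidate character -> confidence in [0, 1]

def combine(a: Scores, b: Scores, op: Callable[[float, float], float]) -> Scores:
    """Apply `op` to the two confidences of every candidate seen by either classifier."""
    labels = set(a) | set(b)
    return {lab: op(a.get(lab, 0.0), b.get(lab, 0.0)) for lab in labels}

def op_min(x: float, y: float) -> float:
    return min(x, y)

def op_avg(x: float, y: float) -> float:
    return (x + y) / 2.0

def combine_bayes(a: Scores, b: Scores) -> Scores:
    """Product of confidences, renormalized to sum to 1 (naive Bayes, equal priors)."""
    raw = combine(a, b, lambda x, y: x * y)
    total = sum(raw.values())
    if total == 0.0:
        return raw
    return {lab: v / total for lab, v in raw.items()}

def best(scores: Scores) -> Tuple[str, float]:
    """Top candidate and its merged confidence."""
    return max(scores.items(), key=lambda kv: kv[1])

# Hypothetical outputs of two classifiers for one character image.
engine_a = {"5": 0.6, "S": 0.4}
engine_b = {"5": 0.7, "S": 0.2, "6": 0.1}
```

Thresholding the merged confidence, rather than either engine's own confidence, is what moves the tradeoff between correct rate and error rate; the Avg operator is cheap enough to apply to every candidate pair of engines as a first screen.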
The challenge in handwritten legal line phrase recognition was described in detail. A sequential mode (i.e., using the legal line to validate results from the courtesy amount field) is being implemented and will be integrated as part of a system delivered to a large US bank.
Before a bank can rely entirely on an ICR system to complete a transaction without human intervention, the error rate must be validated through mass testing of millions of transactions. Ground truthing has to be an integral part of the bank's on-going quality measurements. Phased conversion from manual data entry to ICR will allow the ICR subsystem to be monitored and tuned against high-quality keyed data.
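The scale of testing needed to validate a very low error rate can be estimated with a standard zero-failure binomial bound. The sketch assumes independent transactions and that no errors at all are observed in the test; it is a textbook calculation, not the validation procedure of any particular bank.

```python
import math

def samples_needed(max_error_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that observing zero errors in n independent transactions
    supports 'error rate <= max_error_rate' at the given confidence level.

    Solves (1 - p)^n <= 1 - confidence for n.
    """
    if not 0.0 < max_error_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("rates must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-max_error_rate))

n_one_per_thousand = samples_needed(1e-3)
n_one_per_million = samples_needed(1e-6)
```

For small error rates the result is close to 3 divided by the target rate, so demonstrating an error rate of one in a million at 95 percent confidence takes about three million error-free transactions, and more if any errors are observed.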
One source of contextual information that is not currently used, but that could improve system performance (especially throughput), is the high correlation between consecutive checks. For instance, several consecutive checks addressed to the same payee differ only in the amount (the courtesy and legal fields). The implementation of a short term memory can thus be considered as part of an ICR design. This short term memory would help acquire knowledge (statistical correlation between consecutive checks) as it runs in a production system. The acquired knowledge can then be rigorously applied in a controlled test environment to evaluate its usefulness before re-integrating it into the production environment.
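A short term memory of this kind could be as simple as a bounded window over the most recent payee names. The class below is a hypothetical sketch; its name, its window size, and the use of the most frequent recent payee as the prediction are all assumptions made for illustration.

```python
from collections import Counter, deque
from typing import Deque, Optional

class ShortTermMemory:
    """Remembers the payees of the last `size` checks and proposes the most
    frequent one as a first hypothesis for the next check."""

    def __init__(self, size: int = 5) -> None:
        self._recent: Deque[str] = deque(maxlen=size)  # oldest entries drop off

    def remember(self, payee: str) -> None:
        self._recent.append(payee)

    def predict(self) -> Optional[str]:
        """Most frequent payee in the window, or None if nothing is stored."""
        if not self._recent:
            return None
        return Counter(self._recent).most_common(1)[0][0]

memory = ShortTermMemory(size=5)
for name in ["ACME INSURANCE", "ACME INSURANCE", "CITY WATER", "ACME INSURANCE"]:
    memory.remember(name)
guess = memory.predict()
```

The prediction is only a hypothesis to be tried first against the recognized payee field; it would still have to be confirmed by the recognizer, so that the memory affects throughput rather than the error rate.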
Acknowledgements
The authors wish to thank the ICR Team and in particular: Halit Canbazoglu for testing the Mitek CAR reader, Larry Sakanashi for testing the AEG 6160, Carol Dietrich for supervising our laboratory and the ground-truthing process, and Jan Moss and Carol Palecki for their considerable assistance in preparing the manuscript.
Email Contact
Authors may be reached via electronic mail:
| Gilles F. Houle | gilles@tfs.com |
| David Bradburn Aragon | dave@alumni.caltech.edu |
| Robert W. Smith | bob@tfs.com |
| Dr. M. Shridhar | mals@umich.edu |
|