|
| Affiliation at time of publication: |
TRW Financial Systems ICR Advanced Development Group 2118 Milvia Street Berkeley, California 94704 |
| Contact information updated as of January 1998: |
|
|
Character segmentation is a necessary preprocessing step for character recognition. Correct segmentation of handwritten addresses requires solutions to several problems, including the splitting of run-together or crossing characters and the merging of fragmented characters. Each case entails a decision based upon information derivable from the mail piece image. In existing systems, character segmentation decisions are usually made by a heuristically derived rule base. Here, we apply well-known pattern recognition techniques so that decision boundaries may be optimized by training over a large data set. Some typical problems are described: character splitting, character merging and word level separation. Training and test data sets of handprint zip codes are compiled, and easily measurable properties of the data are used to train Bayesian and neural network classifiers. We compare the results of the classification methods and results of segmentation guided by conventional heuristically derived rules. |
Motivation
Nearly any system for recognition of unconstrained handprint will encounter a number of character segmentation problems. The tremendous range of handwriting styles means that many of these problems are intrinsically difficult.
Pattern classification techniques have been successful in other aspects of character recognition, mostly focusing on classifying isolated character bitmaps as recognized character values. Lee, Srihari and Gaborski give a comparison of Bayesian and back propagation network classifiers, as applied to handprint digit and alphabetic character recognition [5]. Proposals have also been made to use neural networks for the problem of address block location on mail pieces [1,8]. By contrast, most published handprint segmentation methods tend to rely upon heuristically derived rules [3,7]. The use of classification techniques in this application is largely an unexplored area.
This study is intended to be a preliminary exploration of the use of
classification techniques for handprint character segmentation. The goals
are to find techniques that improve the accuracy of segmentation, and that
give some measure of their likelihood of success. We hold out the hope
of relieving system developers of the burden of making time-consuming investigations
into the heuristic rules that govern each segmentation problem encountered.
Outline of Experiments
Segmentation problems occur as blocks of text are broken down into lines, words and finally characters. Address blocks found on mail pieces in the USPS mail stream provide ample opportunity for the study of such problems. If a problem can be posed in terms of a set of well defined input parameters requiring that some operation be performed in order to achieve a correct segmentation, then that problem may be amenable to pattern classification techniques. A feature vector can be derived from the set of input parameters, and can be presented as input to a classifier. Finding the correct operation to perform is equivalent to finding the appropriate class to which that the input vector belongs. We will present the results of our investigations into three problems. The three problems are of increasing complexity and form a representative sample of the ones we have investigated.
Back propagation neural networks and Bayesian classifiers were selected as techniques to test against human crafted software modules that employ heuristic rules. In this sense, the classifiers are being tested against human ingenuity for their ability to generalize a solution to an inexact problem. For each problem, a training data set and a test data set of roughly equal size were generated. An attempt was made to randomly distribute the available data among the two sets. The pattern classifiers were created with the training sets, and then evaluated against the test sets. The heuristic rules were tuned for optimal performance on the training sets, and then evaluated on the test sets.
Bayesian classifiers were constructed according to Duda and Hart[2]. The training data is used to estimate the a posteriori probability of the occurrence of each class, given each input vector. The Parzen window method was used as a basis to approximate the multivariate probability distributions of the input vectors. During evaluation of the test sets, the a posteriori probabilities are used to select a classification that minimizes the likelihood of error. Results for two Bayesian classifiers, using different Parzen window sizes, are presented for each problem.
Back propagation networks were built according to Rumelhart and McLelland[6]. Each network has an input node for each element of the input vector and an output node for each of the mutually exclusive classes. Input data was normalized to values between 0.0 and 1.0. The output value for the desired class was set to 0.9, others were set to 0.1. Each network was fully connected, and weights were randomly assigned initial values between -0.5 and 0.5. The number of hidden nodes was varied during experimentation. Each network usually had its training data presented for at least 800 epochs, and the data was presented in random order during each epoch. Results for two topologies are presented for each problem.
The heuristic modules against which the classifiers are tested were
created by an experienced programmer familiar with handwritten character
segmentation. While they certainly do not represent the best that can be
done by human-derived rules, they do represent a reasonable effort and
their performance is probably typical of existing systems.
Probability of Segmentation Success
A useful attribute of pattern classifiers is that they can give a measure by which an input vector belongs to a given class. This measure is often normalized into some useful range and expressed as a confidence value. The confidence can in turn be mapped onto a probability that the classification was performed correctly. Context analysis modules of character recognition systems can make use of the probability that a word or field was correctly recognized, sometimes giving significant improvements in overall recognition rates.
By making the assumption that the probability that each character in a word was recognized correctly is independent of the probability that the other characters in the word were correct, the probability that a word consisting of n digits is correct can be approximated by:

Segmentation errors can have a significant effect on the recognition rates of an unconstrained handprint field. It is usually hoped that a segmentation error will lead to malformed characters, which would be recognized with low reported confidence. But it is often true that incorrectly segmented characters can give rise to incorrect recognition results with high reported confidence. A better approximation of the probability that a word was recognized correctly might be:

In order to look into the feasibility of using the classification confidence to derive a suitable value for PS , we have adopted a notion of disjointness, which attempts to measure the correlation between incorrect classifications and low reported confidence. To do so, we first define the error e of a classifier's d-dimensional output vector x and its idealized form y, that is, the classifier's perfect response for making the same classification:

The errors from a set of classifications are divided into two groups, those that were associated with incorrect classifications, 1, and those that were associated with correct classifications, C. The mean and standard deviation are computed for each group, and the disjointness D, over the entire data set, is defined to be:

D attempts to measure the average difference in error between correct and incorrect classifications, expressed in units of standard deviation. Large values of D, say, greater than 0.5, would indicate that a mapping from classification confidence to the probability of correct segmentation is possible.
Data
All data for these tests was obtained from a set of about 18,000 handwritten mail piece images. Address block locations and state and zip code information were assigned by human keyers for each image. The mail piece images were collected from a small number of USPS General Mail Facilities, so that the data has certain characteristics that are not true of the USPS mail stream at large. While this data set is insufficient to generate an analysis of character segmentation problems present in the USPS mail stream at large, it is sufficient to create case studies of representative segmentation problems.
Two data sets of roughly equal size were created for each problem. Features
were extracted from each element of both data sets and are used to create
classifier input vectors. Truth values, determined by human inspection,
were assigned to each input vector, representing the desired classification
for each element.
The Component Splitting Problem
The most successful methods for segmenting a word into individual characters involve generating the connected components for that word. If the number of characters expected in a word is known, as is the case with five digit zip codes, and if the number of connected components agrees with the number of expected characters, then the segmentation problem may be trivial. If the number of connected components is smaller than the number of expected characters, then it is likely that the word contains at least one instance of touching characters. Segmentation logic must first select a component to split, and then perform the split.
Here, we deal with the case where a five-digit zip code is expected, but only four connected components have been found that are of sufficient size to be digits. The problem is to merely select the component to split. It is assumed that the split will be performed by a later module. Classifiers were constructed that took input vectors of eight variables, namely, the widths and number of pixels in each component. These values were produced as a side effect of the connected component generation. The widths were scaled by the largest width, and the number of pixels was scaled by the largest number of pixels. Four output classes were defined, corresponding to the four possible selections of components to split. Figure 1 shows an example of a zip code that corresponds to class 2, that is, the second component should be split.
From the data set, 1338 examples of five digit zip codes with four connected
components were found. Cases that did not consist of adjacent digits simply
touching were then removed, leaving 1174 samples. This was divided into
a training set of 586 samples and a test set of 588 samples. Test results
are summarized in Table 1.
| Training Set (586) | Training Set Disjointness | Test Set (588) | Test Set Disjointness | |||
| Bayesian L | 579 | 98.8% | 0.52 | 554 | 94.5% | 1.12 |
| Bayesian S | 580 | 99.0% | 0.43 | 542 | 92.2% | 1.17 |
| Backprop 4H | 581 | 99.1% | 1.23 | 575 | 97.8% | 1.57 |
| Backprop 6H | 579 | 98.8% | 1.32 | 578 | 98.3% | 1.32 |
| Heuristic 1 | 560 | 95.6% | - | 563 | 95.7% | - |
| Heuristic 2 | 574 | 98.0% | - | 576 | 98.0% | - |
| Heuristic 3 | 573 | 97.8% | - | 574 | 97.6% | - |
Results for two Bayesian classifiers are shown, one with a larger Parzen
window size (L) and one with a smaller window size (S). Results for
back propagation networks with four (4H) and six (6H) hidden nodes are
presented. Results from applying three heuristic rules are also shown.
Heuristic 1 selects the widest component to split. Heuristic 2 selects
the component with the largest number of pixels. Heuristic 3 attempts to
use both width and number of pixels, by summing the squares of the two
normalized numbers and selecting the component with the largest sum.
The Component Merging Problem
A closely related problem is the case where a word has more large connected components that there are expected characters. We now look at the case where a word expected to contain a five-digit zip code is found to have six large connected components. It is likely that one of the digits has become fragmented into two connected components that are adjacent in the word. Selection of a pair of adjacent components to merge is a reasonable segmentation action to take at this point. After the selection is made, subsequent logic may elect to repair the broken digit, or to merely preserve the two pieces in their original spatial relationship for character recognition.
Input vectors for the classifiers consisted of 22 values. The smallest enclosing rectangle was computed for each of the six components. The left, right, top and bottom coordinates for each rectangle, relative to the smallest enclosing rectangle for the entire zip code word, were encoded in the vector. Since the left and right coordinates were scaled by the width of the entire word, the left coordinate of the leftmost component was always 0.0, and the right coordinate of the rightmost component was I.O. Both were omitted from the input vector. The top and bottom coordinates of each rectangle were scaled by the height of the entire word. Five output classes were defined, corresponding to each pair of adjacent components that might be joined. Figure 2 shows an example of a zip code corresponding to class 3 (the second and third components from the left should be merged).
From the data set, 3774 examples of five-digit zip codes with six connected components t: were found. Cases that were too complicated to be resolved by simple merging were eliminated, leaving 3091 samples. Most that were removed either had both touching and fragmented digits, or included an extraneous component. Some required a merge of components that were not adjacent. A training set of 1546 samples and a test set of 1545 samples were created. Test results are summarized in Table 2.
| Training Set (1546) | Training Set
Disjointness |
Test Set (1545) | Test Set Disjointness | |||
| Bayesian L | 1107 | 71.6% | 1.30 | 1078 | 69.8% | 1.34 |
| Bayesian S | 1110 | 71.8% | 1.30 | 1081 | 70.0% | 1.32 |
| Backprop 12H | 1495 | 96.7% | 1.24 | 1513 | 97.9% | 1.13 |
| Backprop 16H | 1473 | 95.3% | 1.09 | 1442 | 93.3% | 0.97 |
| Heuristic 1 | 1262 | 81.6% | - | 1235 | 79.9% | - |
| Heuristic 2 | 952 | 61.6% | - | 1027 | 66.4% | - |
| Heuristic 3 | 1359 | 87.9% | - | 1340 | 86.7% | - |
The City/State/Zip Word Separation Problem
Before individual characters can be isolated from words, it may be necessary to isolate larger structural elements. This is the case with address blocks. First the constituent lines of the block are separated, then the words within each line, and then the characters within each word. An attempt at separating words within a line might be made using only the sizes of the gaps between each connected component in the line. Straightforward approaches such as this often turn out to fail in the face of the great variability of handwriting. Even when the number and size of words can be partially predicted, as in the last line of an address block, difficulties arise when characters within words are fragmented or touching, causing them to have more or fewer connected components than expected.
The third problem in this study occurs when the last line of an address block has been isolated, and is known or assumed to be of the form "City, State Zip". For the purpose of this experiment, only lines containing five digit zip codes were examined. State names may be written out in full or abbreviated. Based upon the hunch that the arrangement of the enclosing rectangles of the connected components in the line occur in definite but not readily quantifiable patterns, we have attempted to find a way to express those patterns as a vector that may be presented to a pattern classifier. Each possible classification would correspond to one of the possible ways to group the connected components in the line into three words. The correct classification would correspond to the grouping that brought all the connected components for each word, and no others, into the same word.
Our approach was to first arrange the connected components from left to right, and compute the enclosing rectangles for each one. A cohesion factor was computed for each pair of adjacent components. A high cohesion value indicates that the two components are likely to belong to the same word. Cohesion was based partially upon the gap between the components, but also took into consideration relative heights and widths and horizontal alignment. Beginning with the pair with the highest cohesion, and progressing in order to the pairs with lower cohesion, adjacent components were joined into the same group. Ideally this process would progress until only three uncohered sets of components remained, representing the city, state and zip code. A set of 10,143 examples of address blocks with city/state/zip lines were isolated. It was found that if cohesion were allowed to progress until only three groups remained, only 3743 correct groupings would ensue, whereas if cohesion stopped when five groups remained, 9379 would still have the possibility for being correctly separated into words.
At the point where five cohered groups remain, the bounding rectangle for each group was computed, and an 18 element vector constructed that describes the gap between each rectangle, and the height, width and number of components in each. Classifiers were built that accept these vectors as input, and that recognize six classes. The classes were labeled "311", "221", "212", "131", "122" and "113", corresponding to the six possible ways that five groups may be joined together. Figure 3 shows an example of class "212" and Figure 4 shows an example of class "311 ". A training set of 4689 samples and a test of 4690 samples were created and submitted to various classifiers. Test results are given in Table 3.
| Training Set (4689) | Training Set
Disjointness |
Test Set (4690) | Test Set Disjointness | |||
| Bayesian L | 2551 | 54.5% | 0.39 | 2531 | 54.0% | 0.44 |
| Bayesian S | 2619 | 55.9% | 0.38 | 2580 | 55.0% | 0.43 |
| Backprop 8H | 4036 | 86.1% | 0.62 | 3970 | 84.6% | 0.56 |
| Backprop 14H | 3926 | 83.7% | 0.55 | 3845 | 81.2% | 0.50 |
| Heuristic 1 | 1879 | 40.1% | - | 1864 | 39.7% | - |
Results for two Bayesian classifiers are shown, one with a larger Parzen window size (L) and one with a smaller window size (S). Results for back propagation networks with eight (8H) and fourteen (14H) hidden nodes are presented. Results from applying two heuristic rule sets are also shown. Heuristic 1 shows the results of merely continuing the process of cohesion.
A similar word separation problem was approached by Seni and Cohen [7]. They attempted to determine the size of inter-word and inter-character gaps. The assumption was made that all inter-character gaps were smaller than all inter-word gaps, so that a discriminating threshold could be found upon which to base word separation. A number of heuristic rules were tried, and they achieved reasonable success in correctly identifying gaps as inter-word or inter-character. However, an address block segmentation module that solely used this technique would have trouble attaining high performance rates. Their result further illustrates the considerable difficulty in crafting segmentation heuristics for unconstrained handprint.
Summary
Back propagation networks show considerable promise as a component of handwritten segmentation modules. In all problems studied there exists at least one network that achieves performance superior to any other method. Especially encouraging is their ability to discriminate classifications that are likely to be incorrect by reporting confidences that correlate to the probability of error. Network performance for a particular problem does seem to be sensitive to the number of hidden nodes of the network. In devising networks to solve particular segmentation problems, we have found it worthwhile to try several topologies before making a final selection.
On the whole, the Bayesian classifiers did not produce creditable results. Good performance was shown on the training set for the component splitting problem, but the classifier did not generalize well to the test set. On the two successively more complex problems, the classifier performed successively worse. At this point it is not clear whether this type of classifier is inappropriate for handprint segmentation problems, or if a better formulation of the classifier is yet to be found.
It is somewhat remarkable that many heuristic modules lagged far behind back propagation networks in performance. The heuristic modules presented here do not represent the best that can be done by a human engineer who is skilled in the craft of handprint segmentation, but to do as well would be a challenge. Although we have generally limited the heuristics here to operate only on the same input vectors that were used for pattern classification, such modules are free to process arbitrarily large amounts of data to base their decisions on. It should be considered an advantage of a pattern classifier, however, that it only requires readily available information.
Further Work
In light of the success reported elsewhere in the use of Bayesian classifiers of binary valued feature vectors for character recognition [4,5], we are investigating the failure of the Bayesian classifiers to produce useful results for character segmentation. Other methods for approximating non-binary. multivariate distributions will be tried. The effect of training set size will be examined.
The City/State/Zip classifier described above is a member of a family of classifiers under development that classify various numbers of cohered regions, and make assignments to a wider variety of syntactic structures. These classifiers will be combined into a module for performing word separation of an entire handwritten address block. Experiments are also being performed to evaluate other formulations of the input vectors.
References
| [1] | A. Bergman and P. G. Mulgaonkar. Neural Networks for Address-Block Reading: A Comparison with Classical Techniques. Proc. of 3rd USPS Advanced Technology Conference, 2:736-750, 1988. |
| [2] | R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973. |
| [3] | R. Fenrich and S. Krishnamoorthy, Segmenting Diverse Quality Handwritten Digit Strings in Near Real-Time, Proc. of 4th USPS Advanced Technology Conference, 1:523-537, 1990. |
| [4] | S. Kahan, T. Pavlidis and H. Baird. On the Recognition of Printed Characters of Any Font and Size, IEEE PAMI-9, No 2, March 1987. |
| [5] | D.-S. Lee, S. Srihari, and R. Gaborski. Experiments in Handwritten Character Recognition with Pattern Recognition and Neural Network Approaches. Proc. of 4th USPS Advanced Technology Conference., 2:1013-1027, 1990. |
| [6] | J. L. McClelland, D. E. Rumelhart and PDP Research Group. Parallel Distributed Processing. MIT Press, 1986. |
| [7] | G. Seni and E. Cohen. Segmenting Handwritten Text into Words by Using Different Distance Measures Between Connected Components. Proc. of SPIE Electronic Imaging Science and Technology Conference, 1992. |
| [8] | J. Weaver and T. Antognini. Assessing the Likelihood that a Region is an Address Block Using Neural Networks. Proc. of 3rd USPS Advanced Technology Conference, 1:782-795, 1988. |




|