Handwriting Recognition

 

Why it sucks. Why it always will suck. Why it can be useful even though it sucks.

 

I was the design lead for Microsoft’s Pen Windows team. In addition to working on the integration of pens into the Windows code, I was the principal architect of the algorithms that my team used to do the handwriting recognition. We had about 5 people working on the recognition code for about 2 years, and in that time we learned quite a bit about what would and would not work.

 

In an academic environment there would have been a strong incentive to publish the results of those ten plus man years of discovery so that others could build on that knowledge. However in the company environment the incentives are different. If the pen computing market had taken off at that time as we had hoped that it would, the information gained would have been passed on from team member to team member and been considered a company “trade secret” that led to a competitive advantage in building recognition code.

 

Unfortunately, in spite of all the hype that surrounded pen computing at the time (1990-92) or perhaps because of all the hype, pen computing did not make a large impact on the personal computer market. Since the market did not take off, there was little continued support for pen computing at Microsoft. This meant that the people who had done all the work to develop the handwriting recognition code drifted on to other projects and most of the knowledge gained was never written up.

 

I have for years intended to remedy this defect in the way that information gets transmitted in the company environment and write up an analysis of what we did, what we learned about handwriting recognition, what will work, and what will not.

 

I have only recently (1998) worked up the enthusiasm to do this documentation chore, for two reasons. First, I am seeing more nonsense in the press again about how faster and faster machines will lead us into the heaven of voice recognition or, in the case of the Palm Pilot, to better handwriting recognition. Secondly, I have other work I should be doing, and suddenly writing up an in-depth analysis of handwriting recognition seems so delightfully removed from that work I should be doing that I cannot resist the procrastination opportunity.

 

Without going into great detail, I will attempt to outline the direction(s) that we took, the surprising discoveries, the tools that we found useful, and generally give enough information to offer some future software engineer pointers that can help them navigate the treacherous waters of recognition. Certainly some of the observations pertain only to handwriting recognition, but others have more general utility.

 

Topics to discuss – in the interest of finishing I will not worry about the overall organization at this point, merely list things that I feel I need to cover before I am done. Actually, let me rephrase that. I am torn between laying things out in a historical approach, which is good for illuminating why we ended up where we did, and laying them out in the most logical presentation. As a mathematician I was trained to cover up all the tracks of history and lay it out as a logical presentation, but I have since concluded that that is not always the best presentation. It often gives the impression that research consists of using great brilliance to make only the correct turns through the maze, thus avoiding the dark alleys. Real investigation often involves wading into the middle of something, working on one class of problems before deciding that those were not the ones most in need of solving.

 

So the organization is to first list all the things that I don’t want to forget to cover and then to have at it.

 

Identify the problem

You don’t read your own handwriting, you remember what you wrote. That’s why no one else can read it.

Birds do it, bees do it, even a Cray can’t do it.

Programming Excel Macros with a pen is NOT natural and NOT easy.

Hundreds of Handwriting methods. Even the Japanese do not write like they say they do.

F from the bottom up.

Little factoids – 1/10 of a second per neuron firing. Ballistic trajectories.

Outline of process – Segment – Classify – Cleanup

Classification

Shape recognition is NOT the problem. Even people can’t recognize shapes. Context is everything.

Surprise – u = n  s=g

A decision tree is only as good as the first decision.

Classification – Features, Prototypes, Metric spaces

Statistical Pattern Recognition – Cluster Analysis, Principal components, Whitening

Neural Nets, Fuzzy Logic, UFOs – Bah Humbug!

Segmentation

Viterbi, approximate string matching, dynamic programming.

Context

Delayed Context – cows suck and zootsoots =

Aud = And

Prototypes

Stroke order independence – thousands of E’s

Dark Matter – when do you have enough prototypes?

Training – Must be on the fly. Humans can not repeat what they did.

If you won’t take the 45 test you do NOT have a user interface.

Hooks

Graffiti – change the problem to one you can solve.

2D parsing.

 

 

You don’t read your own handwriting, you remember what you wrote. That’s why no one else can read it.

The point of this topic is to point out that the argument that humans can read handwriting therefore computers ought to be able to read it as well does not hold water. If you scrawl something quickly in a notebook as a note to yourself, you may have no problem reading it later, but that is not necessarily actual reading. It is quite likely that all you are doing is remembering. You see the paper, it reminds you of the context when you wrote it and you remember what you wrote. The fact that other humans have a difficult time reading your hastily scrawled notes is good evidence that we are looking at memorization, not at reading. Thus it is entirely possible that the “handwriting recognition” problem of instantly translating any random thing that I scrawl onto the screen can NOT be solved. There may simply not be enough information in the scrawl to decode it into ASCII text.

 

The folks from the IBM research group told us that in their research about 90% of the people write clearly and 10% write illegibly. The 90% not only write well, they learn to modify their writing when writing into an electronic system and improve the recognition. However that unfortunate 10% who don’t write well to begin with, do not improve and there appears to be no hope for them.

 

Clearly there will always be a difference between carefully written characters and hastily scrawled ones, and one should expect that recognition would be better for characters in the former class and worse for the hasty ones. We made the decision in Pen Windows to NOT recognize cursive precisely because cursive is written fast, fast is sloppy, and sloppy will lead to poorer recognition. We wanted the market to first experience a restricted input method that had relatively high accuracy, and let people whine at us for not reading their true handwriting, rather than ship a cursive recognizer and have them whine because the accuracy was low. We felt that cursive was better introduced at a later time as a relaxation of restrictions on the original recognizer, rather than as a contract in the first recognizer that we would recognize any random scrawl.

 

Unfortunately no one told Apple about this. They bought a not very robust cursive technology from the Russians and, sure enough, the market fried them for having such crappy recognition.


Birds do it, bees do it, even a Cray can’t do it.

 

Too many times people look at how poor the recognition is and comment that, “computers are getting faster all the time.” The clear hope is that the additional processing power will make it easier to solve the recognition problem. I merely point out that that is unlikely. The problem is usually NOT in the amount of processing power that is available. The amount of processing power available will limit where and how often the recognition technology is deployed but typically will not do much for the quality, at least in the near future.

 

My argument is simply this. Right now the top of the line Cray supercomputers (forgive me if Cray is no longer the top of the line at the time you are reading this) are at least an order of magnitude, if not several orders, faster and larger than the sort of PC that one typically sees on the desktop. Has some researcher shown how they can do 100% accurate voice recognition, handwriting recognition, optical character recognition, face recognition, whatever on the Cray? The answer is simply no. There are very few if any cases where the top of the line system just blows the doors off the quick and dirty PC recognizer. This is a good indication that the problems that we have in building accurate recognition software are NOT evidence that we do not have enough processing power. It is an indication that we have not yet figured out what we should be doing.

 

Do not get me wrong, I am not saying I don’t want fast machines and I am not pessimistic about whether or not we will figure out how to make recognizers better. I am actually rather optimistic. I am just pointing out that the fact that computers are getting faster is NOT likely to be the solution to the fundamental problem.

 

Programming Excel Macros with a pen is NOT natural and NOT easy.

 

The topic of this little flame has nothing to do with handwriting recognition. It has to do with not getting sucked into marketing bullshit, something that I believe damaged the emerging pen market. There is a strong desire to believe that if machines just TALKED or understood natural human writing then it would be so much easier to deal with them. This is confusing two very different things: 1) the input method and 2) the understanding. If we build a handwriting recognition system that recognizes my letters PERFECTLY and I use that to input my characters instead of using a keyboard, I have not really changed the complexity of interfacing to my spreadsheet program at all. Recognizing handwriting does NOT make it easier to program macros in the spreadsheet. Recognizing handwriting (or speech) does NOT mean you will be able to say “Write a Sort Routine in C++, Please” and voilà, the computer can now compose its own code. Yes, being able to ask the computer to do ANYTHING and have it UNDERSTAND your English question and do the right thing would be WONDERFUL. Unfortunately that is NOT what any of the groups working on either speech or handwriting recognition are working on. All they are doing is designing a different way to input character data.

 

I am not opposed to the idea of making it easier to interface with computers. I do not think that computer understanding is impossible, just poorly understood at present and a totally different problem from the one of replicating the keyboard in a different modality. If your database can’t currently field requests of the form, “Umm… I want to go to New York. I mean, Show me all the flights to New …No, wait. Not all of em, just the ones from Chicago, and not in the middle of the night. Oh yeah, show me the cheapest ones and I don’t want any layovers or any of that bullshit,” speech recognition will not help you get there.

Hundreds of Handwriting methods. Even the Japanese do not write like they say they do.

 

Sorry, but the intervening 5 years have removed the actual number from the tip of my tongue and even left me with no pointer to the reference, but I recall reading that the number of different handwriting methods taught in the US is on the order of several hundred. Tennessee teaches different techniques than California, and this local district in California teaches differently than that one. That there should be so many may be a surprise, but from the combinatorial standpoint it is not really that odd. If there are 2 different ways to draw the letter A and three variants of the letter F then you have 2x3 or 6 different teaching possibilities. It does not take too many variants before you can have hundreds of different ways of writing. The rest is just the natural result of not having a single Nationally Legislated Standard method of writing. The price of freedom!

 

The reason to mention this is that when you start on this project you may have thoughts like “The right way to draw X is like Mrs. Smith taught me in 2nd grade.” These thoughts are most certainly wrong! You may have been taught that way, and Mrs. Smith may have been a strict disciplinarian and may have told you it was the right way, but those other people on the other side of the country were just as strictly instilled in some alternate method. You are wrong! Deal with it.

 

The variance in letter shapes that you see in the population is of course not just differences in what is taught, but what is learned as well. No one said all people do things the way that they were taught with no variation whatsoever. Writing methods are typically developed by right-handers for right-handers. It is generally easier and more accurate to draw a line with a pull stroke (pulling the fingers in tighter to the palm) than a push stroke. Left-to-right writing lets the right-handers pull and forces the lefties to invent their own methods.

 

The point of all this is that there is a lot of variance. The Japanese companies that we worked with pointed out that there was quite a difference between Japan, where there IS one single nationally legislated standard system of writing, and America, where there is none. However, when I collected data in Japan it was obvious that there is just as much variance in how the letters are drawn as there is in the U.S. I would collect data from a Japanese fellow who would assure me that their method was EXACTLY the one that Mrs. Smith taught them in 2nd grade and it was the one true government certified correct method. Of course the next person would draw the same character differently and tell me the same story. As in the U.S., one finds that left-handers in Japan invent their own techniques for drawing something that looks like it was done using the “standard.”

 

<Brief digression>

Actually, one of the amusing things that I did during the 6 months that I lived in Japan working on handwriting for Microsoft was to teach myself to draw the Kanji. I sat down with the dictionary for 2 hours every night and systematically went through and drew every character multiple times, being careful to follow the approved stroke order. It would take about 2 weeks to hack through the several thousand characters, and when I got to the end I would just start over again at the beginning and do it again. After a couple months of this, having cranked through the dictionary several times, I could do it. I could write anything. I had no idea how to pronounce any of the characters and had no idea of their meaning, but I could write anything and write it correctly. I got quite the reputation as an idiot savant. I could go to meetings, sit there understanding nothing of what was going on, but when someone would write something on the board, I could dutifully take notes. They were astounded. It was of course great fun to get into arguments about the proper stroke order, and to be told that, being a foreigner unable to speak the language, I could be forgiven my ignorance. I would of course always push for a little wager and always win the bet, much to their chagrin. I would then remind them that while it was true that I did not understand the language, I had learned to write that character just last night and it was very fresh in my memory.

 

I would like to point out to anyone who happens to be studying the Japanese language that I believe I have hit upon a novel and superior way to approach the massive task of learning the written language. By going at it in the way I did, you are able to quickly (2 months) get to a demonstrable milestone (you can quickly and fluently write any character) and you will have seen ALL the radicals (the graphical subcomponents of the Kanji). Having learned them merely as abstract shapes, you will have no collisions or mix-ups with your native language. You just know several thousand shapes and some of them keep showing up a lot. You will find yourself wanting to have a name to call that shape right there. I.e., rather than trying to attach a single funny shape that looks not much different from the other ones to a word that you have learned, you learn all the funny little shapes first and start attaching meaning and sounds to them. Yes, it is backwards from the way that language is acquired by native speakers, but if you are learning Japanese as a second language you are demonstrably NOT a native speaker, and any way you can get a handle on it should be considered.

 

The difference is that instead of learning that “NEKKO” is the Japanese word for cat and here is the funny little indistinguishable shape that you use to write “NEKKO”, you learn a bunch of shapes, and once you can distinguish them, you learn that you call that shape that had no name “NEKKO”, and yes, that means cat. This way there is little language collision. When you see a cat your brain says “CAT” and you must fight to get it to cough up the word “NEKKO”, but you had no name for that funny shape and it is much easier to just learn that it is a “NEKKO”.

</Brief digression>

 

To summarize, you will see considerable variance in the letter shapes regardless of the language and regardless of the strictness of the teaching methods and you are best off assuming that you DO NOT KNOW HOW YOUR OWN LANGUAGE IS WRITTEN!!!! Do not try to look within yourself for the answers. It does not matter if you were taught to come back and cross the T’s after you finished writing the entire word – that is NOT what people do. You must collect data and you must watch what they do. It is fascinating.

 

“F” - from the bottom up.

Self-reflection is certainly of some use, but chances are you do not know what you do. I relate an aggravating experience that I had with an early recognition system, until I figured out what was going on. This system recognized single printed characters and you could train in your own handwriting. I dutifully trained in the alphabet as instructed. It asked me to draw the letter “a”, then “b”… I trained it as directed and then tried writing. It did not recognize my “f”. I went into the training mode and carefully drew some more f’s that looked just like the one I had drawn. I switched back to recognition mode and it still missed about half of my f’s. It took quite some time to find out why it was having such a terrible time with my “f”. The reason is that when I draw a single “f” I start at the top and draw downward. Every “f” that I drew into the trainer was drawn top down. However, in the middle of writing I occasionally draw the “f” from the bottom up. The training part of the program, which only took in isolated characters, had never seen one of my bottom-up “f” characters. In this system you could look at the training data, and I could see no difference between the training data and the input that it could not recognize.

 

This little experience taught two important lessons for the system design.

 

1) Do not train only in isolation. Not only do people write characters differently than they were taught in school, they also write them differently in different contexts. Your training method must be able to capture real data on the fly.

2) It is not nice to present users with a misleading visualization of the prototype. I.e., in the above system stroke direction was significant, but that was not indicated in the on-screen representation. Thus real data and prototype data that looked identical on screen were in fact quite different in the internal representation.

Little factoids – 1/10 of a second per neuron firing. Ballistic trajectories.

You will accumulate little bits of biophysical data. For example: neurons can fire about 10 times a second; it takes about a tenth of a second for the thing to reset. Handwriting is done by firing off a bunch of neurons that tighten (or loosen) a bunch of muscle tissue. That muscle tissue acts like a spring, and you have just changed the spring constant in this dynamical system (a mass, your hand, swinging on the end of a bunch of springs, your muscles). Once you have fired the nerves, the pen is traveling in a ballistic trajectory (ballistic in the sense that once the cannonball leaves the gun it is in free flight tracing the parabolic curve imposed by gravity). The pen will follow some rather deterministic (unalterable by the human) trajectory until the nerves reset and can fire the springs into a new configuration. From this you can notice things like: the average character has about 3 or 4 direction changes, and since you can only manage about 10 in a second, you should only be able to write about 2 or 3 characters a second (without skipping over some of the character shape, which is what actually happens when someone writes cursive rapidly). You can start looking for features that ignore the actual shape of the trajectory (since it is mostly ballistic it is deterministic and has no information content) and instead concentrate on those places where the trajectory suddenly changed (such as velocity minima).
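
To make the velocity-minima idea concrete, here is a minimal sketch in Python of picking out the speed minima in a pen trace. It assumes the tablet delivers (x, y) samples at a fixed rate; the function name, the smoothing window, and the tie-breaking rule are illustrative assumptions, not the Pen Windows code.

import math

def velocity_minima(points, window=3):
    """Return indices of local speed minima in a list of (x, y) pen samples."""
    # Speed between consecutive samples; the sampling interval is constant,
    # so raw point-to-point distance is proportional to speed.
    speeds = [math.dist(a, b) for a, b in zip(points, points[1:])]
    if len(speeds) < 3:
        return []
    # Light box-car smoothing to knock down digitizer jitter.
    half = window // 2
    smooth = []
    for i in range(len(speeds)):
        chunk = speeds[max(0, i - half):i + half + 1]
        smooth.append(sum(chunk) / len(chunk))
    # A local minimum in speed marks a sudden change of trajectory,
    # i.e. the end of one ballistic segment and the start of the next.
    return [i for i in range(1, len(smooth) - 1)
            if smooth[i] <= smooth[i - 1] and smooth[i] < smooth[i + 1]]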

 

I am not attempting to be vague about what feature space you use, except in the sense that I don’t believe that there is any one correct feature space. I think any of several will do just fine. I will eventually get around to describing the one that we used. You should, however, start collecting little bits of this sort of fluff, because it is out of these little factoids that one builds the nest. I mentioned this particular one because it matches other little bits of UI folklore, like that anything that happens on screen in 1/10th of a second or less is perceived by the user as happening instantaneously, whereas if it takes a whole second it will be perceived as pig slow.

 

It also means that you have huge amounts of time, whole tenths of seconds to do real processing, while you wait for the next significant bit of data to come in. It also means that you have some justification for having the sampling rate of the pen set higher than 10 samples a second but also to believe that going from 100 to 200 will not greatly improve things.

Outline of process – Segment – Classify – Cleanup

 

This is less a description of how to build a recognizer than it is a way of attaching names to some of the components that will probably show up in any recognition system. First of all you get a huge collection of sequential XY coordinates. This represents a line, perhaps several lines of text, and certainly more than one character. Segmentation is the process of breaking the whole thing into chunks: lines, words, letters, and strokes. Classification is the process of looking at one of those segments and identifying it: “That is an X.” Cleanup is anything that you do after you have done the first two stages, where you look at the results that you got, decide that they are garbage, but maybe you can fix them up. It is naive to believe that you will be able to write a program that correctly segments independent of the classification step. The three items listed are not sequential modules in a program; they are names for information that you must extract in the process of solving the problem.

 

The architecture is more likely to be something like this. You will have strong indicators of possible segmentation points, e.g. large white spaces between words. Based on what kind of problem you are dealing with, you will CHOOSE to err either on the side of having too many segmentation points (meaning you’ve chopped letters up into pieces) or too few segmentation points (meaning you’ve left yourself with run-ons). You will decide which of these problems you would rather deal with and split it up that way.

 

Next, you will start stuffing the segments off to get them classified, possibly using the classification results to re-segment. For example, you may find that the first two segments separately do not classify as anything, but the two taken together as a single segment make a good character. (I refer you to the discussion of Graffiti, the artificial handwriting language for the Palm Pilot. They cure the segmentation problem by FORCING the user to use the rule 1 stroke = 1 character. The user is segmenting, so the program has a trivial segmentation problem. Discrete speech recognition, where – you – must – pause – briefly – between – each – word – as – you – say – it, is another example of a place where the segmentation problem is pushed upon the user.)

 

Consolidating those results can be tricky. You will be forced into a choice: I can segment this way and get foo, or I can segment that way and get bar. Which is more likely, foo or bar? Unless your classification method also has something that corresponds to likelihood, or certainty of the classification, you will find yourself at a loss to choose one possibility over another.
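
As a concrete illustration of how segment choices and classification scores get consolidated, here is a minimal dynamic-programming sketch in Python. It assumes you already have a classify function that takes any candidate group of strokes and returns a (label, score) pair where a higher score means a more plausible character; that function, its score scale, and the maximum group size are illustrative assumptions, not a description of our actual code.

def best_segmentation(strokes, classify, max_group=3):
    """Return (labels, score) for the highest-scoring way to group the strokes."""
    n = len(strokes)
    # best[i] = (score, labels) for the best reading of the first i strokes.
    best = [(0.0, [])] + [(float("-inf"), []) for _ in range(n)]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_group), end):
            label, score = classify(strokes[start:end])
            prev_score, prev_labels = best[start]
            if prev_score + score > best[end][0]:
                best[end] = (prev_score + score, prev_labels + [label])
    total, labels = best[n]
    return labels, total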

 

The cleanup is so dependent upon the particulars that it is worthless to discuss it in general, except to point out that the cleanup is so important as to be required. Also, it will by its very nature destroy any possibility that the recognition code will have any conceptual elegance and modularity. The cleanup will force you to go in and hack in exceptions that will nearly obliterate any of the original design concepts. I can’t say this with certainty, but I believe that the reason that recognition problems are hard (in code on computers) is that they lack the regularity that computers do so well with. They present a lumpy and uneven landscape, and the code is necessarily lumpy and uneven as well.

Shape recognition is NOT the problem. Even people can’t recognize shapes. Context is everything.

 

One of the first things we did on the project was collect a bunch of handwriting data. We segmented it by hand. Human beings carefully split it all up: this is an A, this is a Q… We gave people sentences to copy, we let them write freely, and we classified it all, the good, the bad and the ugly. Before we had done a lick of work on our statistical classifier, before we even knew what statistical classification was, we chose a random feature space and a fairly ad hoc method for classifying the letters. We tested the system on our data set and it recognized about 90% of the characters. Not great, but not bad for a first shot. Just to see how that compared with human recognition, we wrote a program to show the characters one at a time in random order on the screen and let a person classify them. 87%. Humm… must be a mistake, the machine did better than the human. We tried it again with a different human and got 89%. We tried another: 86%. That’s when we knew this problem was gonna be a piece of cake. Our very first hacked-out attempt at recognition did better than human-level recognition. We looked at some of the characters where the people made mistakes and the computer got it right. Sure enough, we had not screwed up the data; the program was just better at classification than we were.

 

Needless to say, in the next couple of years we disabused ourselves of the notion that this was going to be easy. The mistake we made was in thinking that the problem was recognizing characters. The fact that people could NOT recognize the characters in isolation should have been the tip-off that the information necessary to recognize writing is NOT in the shape of the character. A human deduces the characters from the words being read, and does not deduce the words from the characters (actually it is an interplay between the two). We spent a lot of time on the character recognizer, not because it was bad but rather to make it easier to compare scores and make choices between alternatives, but even so our overall recognition was nowhere near the capabilities of a human.

 

People are able to bring a very broad range of information to bear on the problem, including the ability to understand the language being written. “John and Jane wxnt to the store,” is fairly easy to read regardless of what scribbled mess goes in the place of the letter x. You are using lots of language knowledge when you conclude that the word is “went”. You know that there ought to be a verb in the sentence and “want” doesn’t fit well in that sentence. People unconsciously and effortlessly use a great deal of contextual information to disambiguate garbage, and the low scores on the character recognition test we gave them show that the garbage content is quite high. If the average character is only recognizable 90% of the time and the average word is five characters long, the chance that every single character in a word is recognizable is 0.9^5 = 0.59. Thus if you segment flawlessly, classify characters better than humans, and use no context to correct, you will be in the sorry state where nearly every other word written will have a mistake in it!

 

It is trivial to get to the 90% level. Getting beyond that is hard work, because beyond that the work is all about trying to introduce context. Let’s stuff in a dictionary; oops, that bloats the size up by a factor of 10. Oops, common names like Fred and Mississippi aren’t in the dictionary. Oops, neither are uncommon names like Marlin or Eller. Maybe we need a language model so we can see when names are coming, so we can know when to use the dictionary and when not.
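
The topic list above mentions approximate string matching and dynamic programming, and dictionary cleanup is the classic place they show up. Here is a minimal sketch in Python; the word list, the one-edit threshold, and the tie-breaking are illustrative assumptions, not the rules we actually shipped.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def snap_to_dictionary(word, dictionary, max_edits=1):
    """Replace a recognized word with its nearest dictionary entry, if it is close."""
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_edits else word

print(snap_to_dictionary("wxnt", ["went", "want", "wind"]))       # 'went' (ties go to list order)
print(snap_to_dictionary("zootsoots", ["went", "want", "wind"]))  # 'zootsoots' (too far, left alone)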

Surprise – u = n  s=g

In your head, conjure up the image of the prototypical perfect lowercase “u” and the lowercase “n”. Now you would think from those images that it should be fairly easy to distinguish a “u” from an “n”. One curves up, one curves down. The “u” has a cusp at the top of the right hand side, the “n” has a cusp at the bottom of the left hand side. Totally different shapes! Well, surprise. You are looking in your head at the UR-characters that you were taught in 2nd grade. Those perfect specimens that were carefully printed in the book are totally different from each other. They also happen to be totally different from what people write, which looks more like a backwards capital “N”, i.e. a down-up-down stroke moving gracefully to the right without a cusp on either end. The character as it is typically drawn is something halfway between the two. Nothing particularly deep here. People don’t draw that carefully. Given that U is a vowel and N is not, if you read at the word level they are not often in situations where they will be confused.

 

The caveat here is that if you just spent a lot of time creating code, or even just designing a feature space that looks for cusps (because that will help you distinguish characters that have cusps), you have probably wasted your time. You just designed a feature space to decode the characters that were printed on the chart in the front of the room in second grade. Those idealized characters had cusps. The characters in your head have cusps. Those characters do not exist in the real world written on paper.

A decision tree is only as good as the first decision

 

At the start of the project we ordered up the biggest collection of published articles on handwriting recognition that we could from the library (something on the order of a couple hundred journal articles) and started plowing through them. One of the earlier papers (Stone Age stuff from the 60’s, when you had to fit your code and data into a hundred bytes or so) was using decision trees for its classification. They would go test some feature, like whether there was a cusp in the stroke (which they of course defined), and if there was, then it was one of these characters, and if there was not, it was one of those.

 

It was obvious to more modern eyes that this was a pathetically weak architecture. If you make a mistake in your first decision then you are off into the wrong branch of the tree and everything thereafter will be wrong. The accuracy of the system will be something less than the accuracy of that first decision.

 

This was useful in that it inspired us to avoid making any single decision until the last possible moment.  It was this thinking that led us to dealing with delayed context.

Hooks

Delayed Context – cows suck and zootsoots =

Capital “O”, lower case “o”, and the number zero are all written with the same glyph. (We used the term “glyph” to identify a drawn shape and distinguish that from the term “letter”, which is an ASCII code that is the interpretation of a glyph. Often different glyphs are used to represent the same letter, and in the O-zero case a single glyph is used to represent multiple letters.) Some of the ambiguous glyphs in the English language are { C,c – O,o,0 – 2,Z,z – 5,S,s – P,p – t,+ – U,u – V,v – W,w – X,x } – not an exhaustive list, and clearly the list grows as you add additional symbols that you wish to recognize that start colliding with other glyphs.

 

Often the only way to disambiguate which was the intended interpretation for an ambiguous glyph is through the use of context. For example if you know where the baseline is you can easily distinguish between the Upper case P and the Lower case P.

 

We used the phrase “cows suck” as a shorthand to remind us of the problem of delayed context. If you consider the difference between the handwritten lower case version and the upper case version “COWS SUCK”, you will see that the only real difference is the shape of the final “k”. The choice of what letter to assign to the initial “C” is not known until the user finally writes a disambiguating character that determines the midline and thus determines all the intervening characters. “cows suck” was just chosen to show that you could have rather long sequences of such indeterminate characters.

 

Similarly “zootsoots =” is disambiguated by the presence of the terminal equals sign into actually being a numerical expression “200 + 500 + 5 =”

 

The point being, if you start in on an unconstrained handwriting recognition program, you will need to deal with the fact that there are ambiguous characters whose only resolution may be context, and that context may be considerably delayed. Of course there are ways that do not involve recognition, such as constraints, that can deal with this. For example, you can present lined paper on the screen and save yourself from trying to infer a baseline and a midline, or you can have a numeric mode where letters are not permitted. Of course in real life, we automatically shift in and out of these modes easily. We see something that must be a phone number and we know phone numbers are numbers so we see numbers, except of course when the phone number is something like 1-800-SUN-HAWK.

 

In brief outline, the way we tackled the cows suck problem was this. As we recognize each group of strokes as a character (or rather, as we hypothesize a character for a certain group of strokes) we can also, once we believe that we know what the character is, produce a number for its baseline, its midline, and its height. If the group of strokes is basically a circle, then we are confused as to whether it is a capital-Oh or a little-Oh. In both cases the baseline is the bottom of the character, but for the capital the midline is in the middle of the character, while for the little-Oh the midline is at the top of the character. Thus at the same time that we are modeling the characters and trying to choose what interpretation we should attach to the strokes, we are also modeling the baseline, midline, ascender height and descender height.

 

If you were drawing on lined paper, those numbers would be supplied to you and you would discriminate between the uppercase and the lowercase Oh by whether the high point on the circle is closer to the ascender line or to the midline. If you are not drawing on lined paper, you are creating a model that assigns a penalty every time any of those supposedly constant lines jerks around. Thus when you write “COW” and all the heights come to the same place, you can essentially eliminate the interpretations cOW, CoW, cOw, etc., because you must pay a high penalty for having the midline from one character not align very well with the midline of the next character. You are only left with two penalty-free interpretations: either they are all uppercase or they are all lowercase. When you finally hit an unambiguous character or characters, “COWBOY” resolves itself into the proper case.
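
Here is a minimal sketch in Python of that penalty idea, just to make it concrete. It brute-forces all interpretations of a short word; a real system would fold the same cost into the dynamic programming that does segmentation, and the midline numbers and penalty weight below are illustrative assumptions.

from itertools import product

def best_case_reading(glyphs, jump_cost=1.0):
    """glyphs: one dict per glyph, mapping candidate letter -> implied midline height.
    Returns the letter string whose implied midline track is most consistent."""
    best_seq, best_cost = None, float("inf")
    # Brute force over all interpretations; a real recognizer folds this
    # cost into its dynamic programming instead of enumerating.
    for seq in product(*(g.items() for g in glyphs)):
        cost = sum(abs(seq[i][1] - seq[i + 1][1]) * jump_cost
                   for i in range(len(seq) - 1))
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return "".join(letter for letter, _ in best_seq), best_cost

# "cow" written all at one size: the uppercase reading of each glyph implies
# a midline halfway up (0.5), the lowercase reading implies a midline at the
# top of the glyph (1.0). Mixed-case readings pay for the midline jumps.
glyphs = [{"C": 0.5, "c": 1.0}, {"O": 0.5, "o": 1.0}, {"W": 0.5, "w": 1.0}]
print(best_case_reading(glyphs))  # ('COW', 0.0); 'cow' ties at zero, 'cOw' and friends do not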

 

In short, you recognize the context, namely the baseline and midline, at the same time that you are recognizing the characters, and you choose the model and the characters that give you the best score. You can also do the identical thing with numbers, using the rather ad hoc requirement that numbers fit better with numbers and letters with letters, rather than intermixing them. This works great for disambiguating things like telephone numbers and street addresses; it works poorly for copy protection serial numbers on the back of a CD. (Of course, any application that intermixes the number zero with the letter Oh and expects the user to be able to see the difference is a poorly designed application!) It does need a little help (context) for something like “105th St”. Let’s see, was that Lowercase-L, Capital-Oh, Capital-S, Lowercase-T, Lowercase-H, or was that OneHundredFive Plus Lowercase-H? Humm… really bad upper/lower case mix in the one, number-to-letter transition in the other. Maybe it was really Ten Sth and there was a word break between the number and the Sth, but then Sth is not in the dictionary so that is unlikely. Sure is a good thing that the engineers hacked in a kludge that removes the penalty for just a few special digit/letter transitions: 1st 2nd 3rd 0th 1th 2th 3th 4th …
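
That kludge is easy to picture in code. A minimal sketch, with made-up penalty values and an illustrative exception list; nothing here is the actual table we used.

ORDINAL_SUFFIXES = ("st", "nd", "rd", "th")

def transition_penalty(word, mix_cost=2.0):
    """Charge for each digit/letter boundary inside a word, waiving ordinals."""
    for suffix in ORDINAL_SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)].isdigit():
            return 0.0                      # "1st", "105th", ... get a free pass
    cost = 0.0
    for a, b in zip(word, word[1:]):
        if a.isdigit() != b.isdigit():
            cost += mix_cost                # penalize every digit<->letter switch
    return cost

print(transition_penalty("105th"))  # 0.0 - the hardwired exception at work
print(transition_penalty("l05th"))  # 4.0 - the "Lowercase-L, zero, five..." reading pays twice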

 

You are quickly drawn into creating language models that look at plausible patterns for intermixing numbers and letters and Upper and Lower case.

Classification – Features, Prototypes, Metric spaces

 

This paper is NOT going to be a general introduction to AI pattern recognition techniques; however, classification systems, which are an important component of a recognition system, all seem to share certain structural similarities, so we will give a quick overview of the basics.

 

A classification system is a system that is presented some data (like the pen trace of a single glyph), which it is attempting to classify (return an ASCII character value, or return a glyph number). The data typically represents one thing out of a set of possible things and you are supposed to identify which one in the set it represents. Clearly there are many recognition problems like cursive or speech where the data actually represents several things and you need some other step to split the initial data into chunks. That is called the segmentation problem. We deal here only with the Classification problem.

 

In outline form, you are going to collect some prototypical data, which you will classify by hand. This data is an “A”, this data over here is a “B”, this one over here is a different kind of “A”. These will be your prototypes. You will have a Metric, or a distance function, that is capable of comparing a chunk of input data to any prototype and can tell you how close it is. This is called Nearest Neighbor Classification; you simply claim that the input data most likely has the same class as the prototype that is “closest” to it. Finally, since the “raw input data” is usually large and noisy, there is often a preprocessing chunk of code that smashes the input data into a smaller, easier to recognize chunk of data. Reducing the data is called “feature extraction”.

 

As you might guess, much of the art in building a classifier is in the feature extraction stage. You do not want your efforts to eliminate the noise to also eliminate the information that you need in order to solve the problem. As some examples, suppose your data input is a cloud of XY data points and you are looking for something that is tall and narrow as opposed to something short and wide. Rather than keep all of the hundreds of data points you will probably preprocess them down by computing a bounding box and just looking at the ratio of height to width. Quite a size reduction! However if the data is noisy and some of the points don’t belong to the set you might find that the single number is quite sensitive to a single outlying wrong data point. In that case you might consider keeping a range of widths and heights based on tossing out up to 5 extraneous points.

 

In any case, you choose features that will reduce the computational complexity, usually by using your human knowledge about the problem space to throw out stuff that you know is irrelevant to the decision and to emphasize that which is central to the decision. Clearly for problems where it is clear what is irrelevant and what is central, you can dispense with the AI aspects altogether and just make the calculations. The AI aspects come in when the problem is less clear, and in those cases caution in the choice of features is to be recommended. (Note: do not make the mistake of assuming that you will make fewer mistakes in recognition if you keep all the original data around just to be safe. Not all data is relevant, and any that is not simply complicates the final recognition process.)

 

Regardless of whether you end up using neural nets, fuzzy logic, statistical methods or any other ad hoc method, chances are you will be able to identify sections that basically consist of the three main components:

A) mashing the raw data into some simpler internal format (the feature vector)

B) some closeness function (the metric) for comparing an input feature vector with a prototype vector

C) finally, a bunch of prototypes to compare against the input
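
A minimal sketch in Python of how those three components fit together. The particular feature extractor here (aspect ratio plus stroke count) is purely illustrative; it is nothing like a real feature space, it just makes the shape of the code visible.

import math

def features(strokes):
    """Mash raw strokes (lists of (x, y) points) into a tiny feature vector."""
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    width = (max(xs) - min(xs)) or 1e-6
    height = (max(ys) - min(ys)) or 1e-6
    return [height / width, float(len(strokes))]

def metric(a, b):
    """Euclidean distance between two feature vectors."""
    return math.dist(a, b)

def classify(strokes, prototypes):
    """Nearest neighbor: prototypes is a list of (label, feature_vector) pairs."""
    v = features(strokes)
    label, _ = min(prototypes, key=lambda p: metric(v, p[1]))
    return label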

Statistical Pattern Recognition – Cluster Analysis, Principal components, Whitening

 

I will elaborate just a bit more on statistical pattern recognition, partly because I am a mathematician and partly because it is the form that we eventually chose as the best means for the problems that we had to work with. Basically it casts the problem in terms of classifying things that are points in some Euclidean N-dimensional space, simply because we know so many really good tools for working with Euclidean N-space (matrices and vectors, oh joy!).

 

The choice of Euclidean N-space essentially eliminates any work in choosing the metric, in that you just use the Euclidean metric: square root of the sum of squares. Your prototypes are represented as points in that space. The only work is to mash the messy input into a single point in that space. (This may be a challenge because your data may be some arbitrary number of XY coordinates that represent the pen trace of a glyph, and you have to reduce that down to a fixed number of points, say 10, so that they all fit in a single 20-dimensional space.)
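
One common way to do that mashing is to resample the trace to a fixed number of points, evenly spaced by arc length, and flatten them into one vector. A minimal sketch, assuming the trace has at least two samples; the point count of 10 just matches the example above.

import math

def resample(points, n=10):
    """Resample a pen trace of (x, y) points to n points, evenly spaced by arc length."""
    # Cumulative arc length along the trace (assumes at least two samples).
    dists = [0.0]
    for a, b in zip(points, points[1:]):
        dists.append(dists[-1] + math.dist(a, b))
    total = dists[-1] or 1e-6
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Walk forward to the segment containing the target arc length.
        while j < len(points) - 2 and dists[j + 1] < target:
            j += 1
        span = (dists[j + 1] - dists[j]) or 1e-6
        t = (target - dists[j]) / span
        out.append((points[j][0] + t * (points[j + 1][0] - points[j][0]),
                    points[j][1] + t * (points[j + 1][1] - points[j][1])))
    return out

def to_vector(points, n=10):
    """Flatten the resampled trace into a single 2n-dimensional point."""
    return [coord for p in resample(points, n) for coord in p]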

 

You then can use the tools of Cluster Analysis, Principal Component Analysis, and Whitening Matrices to deal with some of the problems. Again, it is not my intent to give details on any of these things, but rather to give a brief description of what the tools are so that if you decide you need that tool you can go out and look up the details and use the tool.

 

Cluster Analysis is used to help you select prototypes from some collection of data. This may be easy and it may not. Imagine that you have collected a lot of data, many samples from many people of all the different characters. Each character is rendered into a single point in N-space. If there is but a single way to draw the character A, you can just take the average value (something that makes sense in Euclidean N-space!) of all the A’s and call it the prototypical A. However, suppose there are two distinctly different ways to draw the A. You really want to have two different prototypes describing the two ways. Cluster Analysis is a bag of tricks that looks at a whole bunch of data points representing all the A’s and tries to break it up into distinct clusters. If your clusters are little Gaussian clouds (i.e. there really are two or three or nine prototypical ways of writing the A and the input data is just bell-shaped-curve Gaussian noise added to the prototype) then you will find that Cluster Analysis helps. If your clusters are not Gaussian noise added to classic prototypes, you might want to try to change your feature space to make the shapes look more like little Gaussian clouds. (This is the way that real mathematics works: you find a tool that solves a problem, like finding the average of a Gaussian cloud, and then you try to mash every problem into that form, because then you can use the cool tool!)

 

What does one of these little Gaussian clouds look like? Basically it is a collection of dots in some ellipsoid-shaped thing centered around the average prototype. The dots (the data points) are clustered thickly around the center and they thin out as you move away. Since these take the shape of a bell curve, you can talk about standard deviations as you move away from the center. The reason that they are ellipsoidal is that there is not necessarily any relationship between the scales of the different dimensions in your N-space: one might be in inches and another might be a mass in grams. Also, the major axis of the ellipsoid may lie at some funny angle not aligned with the coordinate axes.

 

Principal Component Analysis is the tool that you use to find the shape of an ellipsoid. It will tell you in which direction the major axis lies and how long it is (how far you go out to be one standard deviation from the center). It will then tell you in which direction the next axis lies and how long it is, and so on for each different axis of the ellipsoid. Note: the dimensions in which the data does not spread out are the areas that you are usually interested in. Places where the data does not display much variance are places where the data are showing great similarity to one another. Places where it is spread out are places where large differences are tolerated and are thus not very important. Note that it is quite possible to do further noise reduction and feature extraction at this point. If you find that in 3 dimensions the data is quite constrained but it varies all over in the other dimensions, you can often throw out those useless dimensions and reduce the problem to a smaller space. Principal Component Analysis will basically take in the data that represents a single Gaussian cloud and produce a matrix that you can apply to the data to rotate it, shear it, and stretch it into a form where the data is all in a simple normalized coordinate system.

 

The matrix that you get from the Principal Component Analysis is called the whitening matrix. When you apply it to the data, the data is now white, i.e. uniform in all directions. Note, as mentioned above, the matrix does not need to be square; you can use a rectangular matrix and throw out dimensions that do not appear to have any utility.

 

It is important to note that every cloud will have its own little whitening matrix. Just because one character like “O” has tight constraints that the beginning point and the ending point are close together does not make that a good feature for some different character like “L”. The point is that if you want to see how “close” some piece of input data is to the “O” prototype, you should compare the input data in the white data space surrounding the “O” prototype. Then if the beginning point and the end point are several deviations away from what you typically see for “O”, you can see that it is not close to being an “O”, whereas the “L” data space may be much more tolerant of separation between the endpoints.
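
For the record, here is a minimal sketch of per-class whitening, assuming numpy is available. The option to drop weak axes and the small epsilon are illustrative choices; the point is only that each class gets its own matrix and that distance to a prototype is measured in that class's whitened space.

import numpy as np

def whitener(samples, keep=None, eps=1e-6):
    """samples: (m, n) array of feature vectors for ONE class.
    Returns (mean, W) such that W @ (x - mean) is white for that class."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # principal components
    order = np.argsort(eigvals)[::-1]            # strongest axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if keep is not None:                         # optionally drop the weak axes
        eigvals, eigvecs = eigvals[:keep], eigvecs[:, :keep]
    W = np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W

def whitened_distance(x, mean, W):
    """Roughly: how many standard deviations x sits from this class's center."""
    return float(np.linalg.norm(W @ (np.asarray(x, dtype=float) - mean)))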

Neural Nets, Fuzzy Logic, UFOs – Bah Humbug!

 

One of the things that interested me about working on the handwriting recognition problem was that as I read through the many different research papers I seemed to notice a trend. Someone would use a neural net to work on the classification problem and would write something up about what they did and how it worked, and it became clear in the reading that they really did not care about solving the handwriting recognition problem, rather they were using handwriting recognition as a tool for probing the problems with neural nets. More and more papers, each doing one little thing and NOT doing the things that the other researchers had already concluded that you should do. I even had one frank academic explain to me that the purpose of the research was to find something that could be published, the less work involved the better. He didn’t want to solve handwriting recognition, though that would be nice if it happened, rather he just tried to find something that no one else had done so he could do it and write it up. Publish or Perish in action.

 

I concluded that we had a chance to do something that no one else had done before, namely to take all the results and observations that have been made by these many academicians over the years and put them together into a single working system.

Viterbi, approximate string matching, dynamic programming.

Aud = And

 

After all this whining about things that do not work, and how essential context is, and blah blah blah, you may be feeling that this is all too hard. I mentioned how gross cleanup is and how it destroys the beauty of the code and yet how essential it is. Let me step back from the theory for just a moment and give an example of cleanup that worked. It worked well and it was a total kludge.

 

As mentioned earlier, the letters u and n are often indistinguishable. This is not just a computer classification problem; humans will look at the u that they themselves drew and clearly see an n. We were having problems where the word “and” would get misrecognized as “aud”. “And” is such a common word and “aud” is virtually nonexistent. We finally decided to hardwire the recognizer to always convert the solo word “aud” to “and”.

 

Rather, to be honest, I argued for a long time with some of my programmers to change the system to work in that manner, and they were disgusted to put in a change that would make it ABSOLUTELY IMPOSSIBLE for the user to write the word “aud” if that is what they wanted. (They used my own 45 test argument against me – see below.) I pointed out that we were simply using context, namely English word frequency, to bias the result in favor of the much more likely outcome. After much imploring I finally insisted and it was done. Guess what? “Hey, the recognition has gotten much better. You guys are making some progress.” The fact that the system was hardwired to NOT recognize the string “aud” means that if that is what you want, you get what you think is a misrecognition error and you have to go fix the ‘n’ to be a ‘u’. No one ever complained that they could not write “aud”, but in return, we NEVER made mistakes on the word “and” any more.

 

One disgusting hardwired kludge that was trivial to implement and one more common problem was eliminated.

 

Sometimes ya just gotta know when to stop thinking and start kludging.

 

 

Stroke order independence – thousands of E’s

 

There are actually two different kinds of handwriting recognition software: on-line and off-line. The off-line version is what the post office cares about and is also known as OCR or Optical Character Recognition. This is the form where you have been given a paper that has some characters on it and you scan it into the computer and try to recognize the characters that were printed or written on the page. The second form, on-line, differs from OCR largely in that a) you don’t need to deal with differing pen types (fat wide felt-tip writing as well as fine-tip pencil) and b) you know what order the characters were written in and even what direction the strokes were drawn in.

 

It is clear that stroke order and stroke direction are NOT required pieces of information in order to read; otherwise we could not read a printed page, where such information is not available.

 

<Irrelevant Digression>

While stroke direction is NOT captured in characters written with a pen, it actually IS captured in Brush writing. If you look closely at traditionally written Chinese characters when you see a vertical line, you will actually see something more like a wedge, thick and round at the top where you first SPLAT down with the brush, mashing the bristles into the page, and then tapering off to a point where you lift it off the page.

 

There are characters in the katakana, one of the phonetic Japanese character sets, that differ primarily by stroke direction. One has a diagonal stroke running from the top right to the bottom left and the other has a stroke running from the bottom left up to the top right.

 

In the days when these were written with a brush there was no difficulty distinguishing these two characters. Now that everyone writes with a ballpoint pen, the stroke direction is no longer obvious in the handwritten language. So what people do to clarify their writing is that they draw a little hook on the end that they start on so that the starting end looks just a little fatter than the finishing end.

 

Once again, changing technology in writing implements and writing media has an impact on the evolution of the written language. Kind of makes you wonder how much of our Roman character shapes is determined by the requirement that they be easy to chisel into marble columns, or to scratch into clay tablets with flattened sticks, or to write onto papyrus with a reed pen.

 

Makes you wonder where the written language will go as we start writing into our Palm Pilots and start writing on paper in some of the same ways.

</Irrelevant Digression>

 

There would seem to be some argument for developing a feature space that eliminates the stroke order and the stroke direction information. This would typically be done by developing some canonical direction and order for strokes. However, except when it bites you – as it did with my bottom-up F’s and as it does with capital E – it seems much easier to just leave the stroke direction in. I.e., it is much easier to do the matching by assuming that stroke i of the prototype matches with stroke i of the input. Most individuals write fairly uniformly, always using one or possibly two variants for a shape. Even the population in general doesn’t get too wild in the way that they draw things.

 

For example, the capital letter E can be drawn with 4 strokes, with 3 strokes, or with 2 strokes. The 2-stroke variant is usually drawn by drawing something like a capital “C” followed by the mid line, similar to the way one would draw the Greek epsilon. Even if we ignore all but the 4-stroke variants, we conclude that if you allow the user to draw the four lines in any order you get 4!, or 24, possible stroke orders (including unlikely ones like drawing the bottom cross bar first, then the vertical stem, followed by the mid bar, and concluding with the top bar). If you also allow the single strokes to be drawn from either end, you get 2^4 possible stroke directions. When you multiply this all out you have 384 different E’s. While you could keep prototypes for all of these shapes, it seems obvious to reduce the complexity by inserting code to reduce the characters to a canonical form that is stroke order independent.
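
A minimal sketch of what such canonicalization might look like in Python. The particular rules (flip each stroke to run top-to-bottom then left-to-right, in screen coordinates where y grows downward, and sort strokes by their starting point) are illustrative assumptions, not what we actually did.

def canonicalize(strokes):
    """strokes: list of strokes, each a list of (x, y) points (y grows downward)."""
    fixed = []
    for stroke in strokes:
        x0, y0 = stroke[0]
        x1, y1 = stroke[-1]
        # Flip each stroke so it runs top-to-bottom, ties broken left-to-right.
        if (y0, x0) > (y1, x1):
            stroke = list(reversed(stroke))
        fixed.append(stroke)
    # Order the strokes by their (now canonical) starting points.
    return sorted(fixed, key=lambda s: (s[0][1], s[0][0]))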

 

However, in reality, most of those E shapes are never used. In all the data that we collected (thousands of characters and hundreds of users) we found about 12 or 14 E shapes actually in use. Thus depending on your needs you may decide that it is easier to just keep the prototypes and NOT canonically reorder the input. The risk is that the underlying model then fits what people do, but not what they see. I.e., they do draw strokes from one end to the other, but they don’t see that information.

Dark Matter – when do you have enough prototypes?

 

Throughout the project we collected data, handwriting samples. Every time we collected data, we ran across a new prototype: “Hey, Charlie, look at the way this bozo draws his capital E. Never seen that one before, have we.” You’ve got your marketing people asking you how accurate you are. “99.99%+ for me. I wrote it. I trained it. I fixed it for every one of my idiosyncratic behaviors. Works great!” Unfortunately the random individual who gets this fresh out of the box does not get the same mileage. If they draw their “Z” in some way that we have never seen, the computer has NO idea what that could be. How many such mistakes will we make? How bad will the recognition be for a first timer?

 

We call this the dark matter problem. The dark matter problem is that there appears to be more stuff out in the universe producing gravity than we can see. What is that stuff out there that we know must exist but can’t see? Similarly, we know from our collecting that there will always be character prototypes that we have not yet seen, and every one we have not seen causes a misrecognition for every user that writes that way. It is easy to know how many prototypes you have seen, but how do you know how many you have not seen?

 

This is one of those problems where the difficulty was less in solving the problem and more in recognizing that it was a problem and casting it into a form that we could solve. Any statisticians out there will immediately recognize that estimating the density function of something in a population from a sample is done all the time.

 

<digression> I got my masters degree in mathematics but in all that time was never required to study statistics, a disgusting applied math. My wife, of course, got her PhD in statistics. Only after several years of working on the recognizer did we identify the dark matter problem and figure out a form of a solution. When I proudly told the wife of the great advance we had finally made on a long standing problem, I got the "No Duh" look. Had I noticed that the problem was a stat problem I could have gone to the statistician for the simple solution long ago.

</digression>

 

Without getting into the grubby details, we cast the problem into the form of a ball and urn problem (something that mathematicians know and love) and it becomes easier to deal with. Without going into the math involved, I will give a rough description that yields a good rule-of-thumb first approximation to the dark matter problem.

 

Imagine that you have an urn filled with balls of different colors (each different color represents a different prototype). You may have thousands of red balls, representing a very common prototype, hundreds of green balls representing a less common variant, and a single blue ball representing a true iconoclast. You get to pull a fixed-size sample from the urn, say 100 balls. You look at the sample, record the colors you see, and return the balls to the urn. Now you get to pull out a single ball, and you want to know: what is the probability that you have pulled out a color you have never seen before? This is the dark matter problem, estimating the size of the unknown stuff.
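
To make the setup concrete, here is a small Monte Carlo sketch of the question (the urn contents are entirely made up for illustration): draw a 100-ball sample, put it back, draw one more ball, and count how often that last ball is a color the sample never showed.

    import random

    # A hypothetical urn: one very common color, one less common color, and
    # a hundred colors that each appear exactly once.
    urn = ["red"] * 5000 + ["green"] * 300 + [f"rare{i}" for i in range(100)]

    def unseen_fraction(trials=10000, sample_size=100):
        hits = 0
        for _ in range(trials):
            seen = set(random.sample(urn, sample_size))   # look, record, return
            if random.choice(urn) not in seen:            # now draw one more ball
                hits += 1
        return hits / trials

    print(unseen_fraction())   # fraction of draws that come up a never-seen color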

 

Let us look at some special cases. Suppose the urn has nothing but white balls (a whole lot of those) and some other color(s). If you pull out 100 white balls, you can NOT assume that the urn contains only white balls, but what you can conclude is that if there is some other color mixed in with the white, it is VERY UNLIKELY that half of the balls are non-white. If the balls were half and half you would expect to see something like 50 whites and 50 of some other color. The fact that you saw 100 white balls means that the whites most likely outnumber the non-whites by a factor of something like 100 to 1. So there may be a thousand different unseen colors, there may be a million different unknown prototypes out there, but the likelihood of tripping over one of them is on the order of 1 in 100.
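
A quick back-of-the-envelope check of that intuition: the chance of drawing 100 whites in a row for a few possible non-white fractions.

    # Probability of a run of 100 white balls if some fraction of the urn is
    # non-white (treating the draws as independent, which is close enough here).
    for non_white in (0.5, 0.1, 0.01):
        print(non_white, (1 - non_white) ** 100)
    # 0.5  -> ~8e-31  (a half-and-half urn is essentially ruled out)
    # 0.1  -> ~3e-5   (1 in 10 non-white is also very unlikely)
    # 0.01 -> ~0.37   (1 in 100 non-white is entirely consistent with the sample)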

 

Suppose every single ball in the urn has a different color. Then the 100 you draw out will probably each have a different color. And when you go back to the urn to pick the single ball, the chances that it will match one that you have seen are minuscule. Does the one purple ball that you drew out of the hundred mean that purple balls have a density out in the population of 1 in 100? No, it could range anywhere from nearly 0 up to 2 or 3 in a hundred, but it is unlikely to be some large number like 50 in 100, because then you would have expected to see more like 50 purple balls in your sample of 100.

 

It is not too far wrong to reason as follows: suppose I saw 50 red balls, 25 greens, and 25 whites. This means that approximately 50% is red, 25% is green, and 25% is white. However, I could have been off by 1 in each count (of course I could have been off by more than one, and in either direction, but we are just waving hands here), so I pretend that the density of red is really 49%, green 24%, and white 24%, for a total estimate of 3% dark matter. You can see that if I got 100 white balls I would estimate 1% dark matter. If I have 100 balls, all different, I estimate 100% dark matter.
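
That hand-waving rule, knock one off each count and call the leftover mass dark matter, works out to simply (number of distinct prototypes seen) divided by (number of samples collected). As a sketch in Python:

    def dark_matter_estimate(samples):
        # Knock one off each observed count; the leftover probability mass is
        # one draw per distinct prototype, i.e. (#distinct seen) / (#samples).
        return len(set(samples)) / len(samples)

    print(dark_matter_estimate(["red"] * 50 + ["green"] * 25 + ["white"] * 25))  # 0.03
    print(dark_matter_estimate(["white"] * 100))                                 # 0.01
    print(dark_matter_estimate([f"color{i}" for i in range(100)]))               # 1.0

Plugging in the capital E numbers from the next paragraph gives the same kind of answer: 14 distinct prototypes out of 2000 samples is an estimated 0.7% dark matter.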

 

Thus if we collect 2000 samples of the capital "E" and we see 14 different prototypes, we can estimate that there is somewhere around 14/2000 = 0.7% dark matter.

 

The utility of this kind of reasoning is that you can estimate how much you do not yet know based on the sample that you do know, and you can use that estimate to determine how much more data you need to collect in order to reduce the chance that someone will produce a character in a form you have never seen before.

 

For example, suppose you want 99% prototype coverage. Given about 50 characters (upper case, lower case, numbers) and about 3 prototypes per character, you need about 100*3*50 = 15,000 character samples. If you want 99.9% prototype coverage you would need about 150,000 samples. Note, however, that in the first case, even with a 99% chance that a new user draws any one character in a known way, the likelihood that that user draws all 50 characters in known ways is .99^50 = 60%. So while 99% sounds great for an isolated character, it still leaves 40% of the people with at least one character that the system cannot recognize. Note that .999^50 = 95%, a much more reasonable coverage.
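
The same arithmetic in Python, as a sketch; the character and prototype counts are the rough figures used above, and the sample-size rule is just the dark matter rule of thumb turned around:

    def samples_needed(n_chars=50, prototypes_per_char=3, coverage=0.99):
        # Rule of thumb inverted: dark matter ~ (#prototypes)/(#samples),
        # so #samples ~ (#prototypes)/(1 - coverage).
        return round(n_chars * prototypes_per_char / (1.0 - coverage))

    def whole_user_coverage(per_char_coverage, n_chars=50):
        # Chance that a brand-new user draws every character a known way.
        return per_char_coverage ** n_chars

    print(samples_needed(coverage=0.99))              # 15000
    print(samples_needed(coverage=0.999))             # 150000
    print(round(whole_user_coverage(0.99), 2))        # ~0.61, i.e. roughly 60%
    print(round(whole_user_coverage(0.999), 2))       # ~0.95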

 

After doing these calculations we all started chanting that we need "MO BETTA DATA." However, there is an equally important design conclusion: you are not going to collect enough data; you will have a dark matter problem; your marketing geeks had better explain that to the users of the system so that they are not rudely surprised when it does not work right out of the box; and your technical geeks had better make sure that there is a painless way to add new prototypes into the system ON THE FLY, prototypes that you will NOT have the luxury to grovel over with your statistical tools, carefully clustering, whitening, and discriminating from all the other prototypes.

Training – Must be on the fly. Humans cannot repeat what they did.

 

I’ve already mentioned this a couple of times. Not much to add. Because of dark matter you MUST do training. Because people do not know what they do, and because they do it differently in context, you can NOT present them with a canned training package and expect it to resolve all their problems. (You probably DO want to include a canned training package, because it will both lower the customer's expectation of having the thing work perfectly out of the box and immediately deal with the obvious prototype differences.)
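
As a sketch of what on-the-fly prototype addition might look like (the PrototypeStore class and its methods here are hypothetical, not the Pen Windows design, and the shape-distance function is left abstract):

    class PrototypeStore:
        """Nearest-prototype recognizer that can learn new shapes immediately."""

        def __init__(self, distance):
            self.distance = distance     # whatever shape metric the recognizer uses
            self.prototypes = {}         # char -> list of feature vectors

        def add(self, char, features):
            self.prototypes.setdefault(char, []).append(features)

        def recognize(self, features):
            best_char, best_d = None, float("inf")
            for char, protos in self.prototypes.items():
                for p in protos:
                    d = self.distance(features, p)
                    if d < best_d:
                        best_char, best_d = char, d
            return best_char

        def correct(self, features, intended_char):
            # The user says "no, that was an E": store the offending ink as a
            # new prototype on the spot, with no offline clustering pass.
            self.add(intended_char, features)

The point is only that a correction path has to exist and has to be cheap; the actual prototype representation and matching metric are whatever the rest of the recognizer already uses.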

If you won’t take the 45 test you do NOT have a user interface.

 

I knew we were doomed when none of my engineers volunteered to take the 45 test. The 45 test is simple. I offer some modest reward, like a hundred dollars, to any of my engineers who helped write the recognition system if they can write a single character of my choosing correctly. They built the system. They can load in all the prototypes. They know how sensitive the system is to hooks. They know exactly what shapes are confusing. In short, they get to stack the deck as well as it can be stacked. I just want them to get one single character right. Oh yes, I forgot the 45 part. You get to do this test with a .45 held to your head. Make a mistake and it's all over.

 

No one took me up on my offer. I used this graphically violent image to try to impress upon my engineers how irritating it is to NOT be able to get what you want, and how important it is that a UI lets you get what you want. Ask yourself, would you take the 45 test for a hundred bucks with a keyboard? I tell you to type the character J; a hundred bucks if you get it right, a bullet to the brain if you fail. How about if I make it easier? I give you 3 chances to type the J; if you screw up the first time, you can change the keyboard, reboot the system, do whatever you want to get it the way you like it. You can even practice the key several times before you take the test to be sure that everything is working. All you gotta do is type in one J for me.

 

Would you take the 45 test with a keyboard? Would you do it with your favorite pen recognition software?

How confident are you that you can get the character you want when you want it?

Hooks

Graffiti – change the problem to one you can solve.

2D parsing.