Marlin’s Gottahack – 1999 Jan 07

The VERI Clustering Algorithm

---------------------------------------

My Dad found an interesting clustering algorithm on the web, at www.sandia.gov/1100/1155Web/users.htm. You can either go read it there, or you can read the quick and dirty summary here. Well, actually not as quick and dirty as I’d like: I’ve thrown in a review of what clustering is, plus a traditional cluster analysis tool for contrast.

Cluster Analysis is the problem where you are given a set of points in some Euclidean space and you want to assign them to groups. This is typically done as a precursor to building a recognizer. As an example, suppose you are working on a handwriting recognizer (don’t ask me why anyone would ever want to do that, this is just a random example ;) You collect your character samples, a few E’s, a couple of T’s. You crunch each one into some data representation which you think of as a single point in some Euclidean N-space, and then comes the cluster analysis. You might want to replace all the different A’s that you collected with a single A prototype, typically the center of gravity of the "cluster" of A’s that you have. However, you will be screwed if there are two or more totally different ways of drawing the A’s: you will average two radically different ways of drawing the letter A and get a prototype that looks like nothing. Cluster analysis is the attempt to solve that problem by telling you that you have one group over here and another one over there.

Before we get into the VERI algorithm let me run you through a standard simple clustering algorithm so that you can compare the advantages and disadvantages of the new algorithm to an old one. The old one that I am going to haul out is the K-MEANS clustering algorithm.

K-MEANS

--------------

You’ve got your set of data points in Rn (Euclidean space of n dimensions). You ARBITRARILY decide that you have K different clusters. (This may or may not be a problem in your application. In the handwriting example, having too few clusters compared to the true number is disastrous, but having too many is merely a little inefficient.) You seed the algorithm with K randomly located "center" points in Rn. These K points are going to be the centers of our K clusters. You now take each of your original data points and find which of the K centers it is closest to. This partitions your data points into K subsets. For each of those subsets you calculate the mean of the subset and replace the old "center" point with the newly calculated mean point. Iterate this process until it converges.
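
Just to make that concrete, here is the loop as a quick Python sketch (mine, not from any particular source; note that I seed with K of the data points rather than K random locations, a common variant that sidesteps the empty-partition failure described below):

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        """Naive k-means. points is an (N, n) array: N points in R^n."""
        rng = np.random.default_rng(seed)
        # Seed with k of the data points (random locations also work, but
        # can strand a center in the boondocks -- see below).
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Partition: assign every point to its nearest center.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            owner = dists.argmin(axis=1)
            # Update: replace each center with the mean of its subset.
            new = np.array([points[owner == j].mean(axis=0)
                            if (owner == j).any() else centers[j]
                            for j in range(k)])
            if np.allclose(new, centers):      # converged
                break
            centers = new
        # Final partition against the final centers.
        owner = np.linalg.norm(points[:, None, :] - centers[None, :, :],
                               axis=2).argmin(axis=1)
        return centers, owner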

If you imagine this algorithm with a cloud of points over here and a second cloud of points over there, and you do the k-means with K=2, you will see that it mostly does just what you want. Unfortunately, mostly isn’t always.

Suppose your first randomly chosen center point was right in the middle between the two clouds, at the mean of the entire dataset, and the second one was way off in the boondocks. In this case, when you do the partition operation, the first point captures all the data and the second one captures nothing. The system converges to the wrong thing. This is clearly easy to detect, i.e. if one of your partitions is empty, you know something is wrong. I am merely pointing out that the system is sensitive to the initial starting conditions and you must write some extra code to notice this.

Note that a single outlier in your data set can either incorrectly capture a center or skew the entire result.

The K-means algorithm is easy to do, works in simple cases and mostly does the right thing. It is a good choice if you know the shape of your data. The drawbacks are primarily that you had to choose K before you got started, and that for oddly shaped clusters (e.g. one cluster of points right around the origin and a second cluster near the surface of the unit sphere, i.e. one cluster buried inside another) it totally blows chunks.

OK, that’s enough background. On to the new thing.

VERI

---------

This method is built around an attempt to model what people do when they "see" data clustering and, most importantly, it eliminates any need to "know" anything about the data in advance, i.e. you do not need to first eliminate outliers, guess the number of clusters, decide on thresholds or any other such stuff.

Step 1. You map out the human visual perceptual system by doing something like the following experiment. You nail down two points (call them blue) and then randomly introduce a third (call it red). You ask the human to tell you whether they see the two blue points as belonging to the same group or not. Clearly if the red point is far away from the blue ones, the blue ones will be seen as a cluster with a remote red one. On the other hand, if the red one is put right next to a blue one, the red is seen as clustering with that blue and the other blue one is left out. What you are mapping out is the Region Of Interference (ROI): the region around the two points where the third point fucks up the clustering.

Vision researchers do the experiment and find a shape that looks somewhat like two circles centered on the two points, each with radius equal to half the distance between the points. Thus the two circles are tangent at the midpoint between the two points. The actual shape is something different from the two circles, more of a dumbbell shape, but it is pretty close to two kissing circles.
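
In code, the two-kissing-circles approximation is about three lines (a sketch of mine, not Sandia’s measured polygon): point c interferes with the pair (a, b) if it falls inside either ball of radius half the a-to-b distance. And since balls make sense in any R^n, this version never needs to construct a plane at all, a point I will come back to at the end.

    import numpy as np

    def interferes(a, b, c):
        """True if c lies inside the two-kissing-balls approximation of the
        region of interference for the pair (a, b)."""
        r = np.linalg.norm(a - b) / 2.0        # radius = half the pair distance
        return (np.linalg.norm(c - a) < r or
                np.linalg.norm(c - b) < r)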

I point out that this step one here is not really a step in the algorithm. It is a precursor that has already been done. All you need is the polygon that represents the shape of the region of interference.

Step 2. With that one planar region of interference you are ready to cluster the set of points. For each pair of points, you cluster them together iff there is NO third point that interferes with the clustering. Recall that 3 points determine a plane, so even if you are dealing with a space of many dimensions you are really just looking at the 2D ROI in the plane containing the three points. All you do is look to see whether some third point lies within the region of interference in that plane.

Step 3. The connected components are the clusters.
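
Here is a minimal sketch of steps 2 and 3, using the two-ball approximation above in place of their measured polygon (so this is my simplification, not the actual VERI region test):

    import numpy as np

    def interferes(a, b, c):
        # Two-kissing-balls stand-in for the ROI (see the earlier sketch).
        r = np.linalg.norm(a - b) / 2.0
        return np.linalg.norm(c - a) < r or np.linalg.norm(c - b) < r

    def veri_clusters(points):
        """points: (N, n) array. Returns one cluster label per point."""
        n = len(points)
        # Step 2: link i and j iff NO third point interferes.
        linked = [[] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if not any(interferes(points[i], points[j], points[k])
                           for k in range(n) if k != i and k != j):
                    linked[i].append(j)
                    linked[j].append(i)
        # Step 3: sweep out the connected components.
        label = [-1] * n
        comp = 0
        for start in range(n):
            if label[start] != -1:
                continue
            stack = [start]
            label[start] = comp
            while stack:
                v = stack.pop()
                for w in linked[v]:
                    if label[w] == -1:
                        label[w] = comp
                        stack.append(w)
            comp += 1
        return label

An outlier simply comes back as a component of size one, which is exactly the detection promised in the advantages below.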

Piece of cake. On the surface it is an O(N^3) algorithm. In practice you can do some quick and dirty hacking to make it more like N^2. In particular, you find every point’s nearest neighbor. Chances are that the nearest neighbor both clusters with you and simultaneously prevents nearly any other clustering, which lets you locate interfering points at something closer to O(1).
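
My guess at what that hack looks like (theirs may differ): precompute every point’s nearest neighbor once, and when testing a pair, try those two neighbors as interferers before falling back to the full scan.

    def pair_links(points, i, j, nn, interferes):
        """Step-2 test with a quick reject. nn[i] is the index of point i's
        nearest neighbor, precomputed once up front."""
        a, b = points[i], points[j]
        # The nearest neighbors are the likeliest blockers; test them first.
        for k in (nn[i], nn[j]):
            if k != i and k != j and interferes(a, b, points[k]):
                return False                   # rejected in O(1)
        # Rare fallback: the full O(N) scan.
        return not any(interferes(a, b, points[k])
                       for k in range(len(points)) if k != i and k != j)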

Now for all the wonderful advantages:

  1. There is not a single arbitrary initial parameter (no choosing K up front, no being careful about initial conditions).
  2. Outliers are detected because they cluster with NOTHING.
  3. In a classified data set you detect regions of overlap. By this I mean the following. In the handwriting data you cluster everything all at once. If the clouds are separate, some A’s over here, some A’s over there, the B’s way over there, you find that each connected component has identical class markings. However, when two clouds overlap, like lowercase u with lowercase n, you find a region where u shapes and n shapes are commingled: you have a connected component that does not have a single unique classification.
  4. You can use the cluster analysis directly as a classifier in the following way (see the sketch after this list). To identify a new point, you treat it like a new addition to the original data set and see who it clusters with. If the new point is an outlier, it clusters with no one and you realize that you can’t classify it. If it clusters with multiple groups, like some u’s and some n’s, you know that you don’t have an exact match.
  5. The clusters that it identifies closely match the clusters that a human sees when looking at data because the system is modeled after the way that we tend to perceive clusters. You can either look at this as an advantage or a disadvantage. Either we are brilliant at cluster matching and this system models it or else we are totally blind to a much higher order, truly intelligent means of clustering and since we can’t see it, neither can this algorithm. Take your pick.
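
Here is a sketch of point 4, again with my two-ball stand-in for the real region: an empty set means outlier, one label means a clean classification, and several labels mean you are sitting in an overlap region.

    import numpy as np

    def interferes(a, b, c):
        # Two-kissing-balls stand-in for the ROI (see the earlier sketch).
        r = np.linalg.norm(a - b) / 2.0
        return np.linalg.norm(c - a) < r or np.linalg.norm(c - b) < r

    def classify(points, labels, x):
        """Treat x as a new addition to the data set and collect the labels
        of every point it clusters with."""
        hits = set()
        for i in range(len(points)):
            # x clusters with point i iff no third data point interferes.
            if not any(interferes(x, points[i], points[k])
                       for k in range(len(points)) if k != i):
                hits.add(labels[i])
        return hits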

 

That’s about it. My own impression, without having done any testing, is that because they came at it from a human perceptual region-of-interference point of view, they attach entirely TOO MUCH importance to the shape of the region that they measured in their human experiments. Their programming is clearly complicated by having to calculate whether a third point is within their complicated polygonal region. They point out some of the tricks that they do to get a quick true rejection, a quick true acceptance, and a harder test for those that fail the two quick tests. My guess would be that if you replaced their complicated region with my approximate description of the two balls centered on the two points, you would get nearly identical results with much less computation (and no need to ever even consider the plane that contains the 3 points).

Color Clusters

------------------

The one particularly entertaining thing that they did with this is some clustering on X-ray images of brain tissue. They simply clustered pixels having the same basic color and the same basic location.
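
In other words, each pixel becomes a point in a space that mixes position with color, something like the sketch below (the weight trading off "same basic location" against "same basic color" is my invention; they presumably tuned their own):

    import numpy as np

    def pixel_features(image, spatial_weight=0.5):
        """image: (H, W, 3) array. Returns an (H*W, 5) array of points
        (weighted row, weighted col, r, g, b) ready to hand to a clusterer."""
        h, w, _ = image.shape
        rows, cols = np.mgrid[0:h, 0:w]
        return np.column_stack([
            rows.ravel() * spatial_weight,     # "same basic location"
            cols.ravel() * spatial_weight,
            image.reshape(-1, 3),              # "same basic color"
        ])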

It reminded me of an experiment I once did in developing a progressive image download format. The idea was to replace a costly-to-download complex image with a collection of solid color polygons that approximates the image. The method I used was K-means to choose the K best color centers and then posterizing the image into those K clusters, i.e. converting each pixel to the nearest poster color. The next step was to encode the edges of the polygons. We used a turtle graphics model where you start at the coordinates of a corner of a pixel and, using Up, Down, Left, Right, march around the boundaries of the single-colored groups.
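
Roughly like this, reusing the kmeans() sketch from the K-MEANS section above (names and details are mine, reconstructed from memory of the idea, not the original code):

    import numpy as np
    # Assumes the kmeans() function sketched in the K-MEANS section.

    def posterize(image, k):
        """Reduce an image to its k best poster colors."""
        pixels = image.reshape(-1, 3).astype(float)
        centers, owner = kmeans(pixels, k)
        # Convert each pixel to the nearest poster color.
        return centers[owner].reshape(image.shape).astype(image.dtype)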

What made the thing progressive was that we would first code it up using K=1, which would just give you a rectangle of the average color of the image. Then we would use K=2 and split the image into two color clusters and partition the image into those two colors. At each stage we would take one of the existing colors in the partially transmitted image, split it into two colors, and transmit the boundaries that separate them. It looked pretty cool. We never did use it though. Discovered wavelets and got busy with other things.

Joke

------

A reporter is interviewing a blind man who is making a name for himself by taking up skydiving.

Reporter: So how exactly do you skydive if you can’t see?

Blindman: Oh, it is very easy. I just jump out of the plane, count to 10 and pull the ripcord.

Reporter: Yes, I understand that, but how can you tell if you are over flat ground or trees if you can’t see?

Blindman: Well, being blind forces me to rely on my other senses. I can smell the trees and steer away from them.

Reporter: I see, but tell me, how do you know when you are going to hit the ground?

Blindman: That’s easy. Just before I hit the ground my dog’s leash goes slack.

Sorry, make that Bad Joke. See ya in a few.