Marlin’s Gottahack
– 1999 Jan 07 –
The VERI Clustering Algorithm
---------------------------------------
My Dad found an interesting clustering
algorithm on the web. It is at www.sandia.gov/1100/1155Web/users.htm. You can either go read it there, or you can read the quick and dirty
summary here. Well, actually not as quick and dirty as I’d like. I’ve thrown in
a review of what clustering is and a traditional cluster analysis tool for
contrast.
Cluster Analysis is the problem where you are
given a set of points in some Euclidean space and you want to assign them to
groups. This is done typically as a precursor to building a recognizer. As an
example suppose you are working on a handwriting recognizer (don’t ask me why
anyone would ever want to do that, this is just a random example ;) You collect
your character samples, a few E’s, a couple of T’s. You crunch them all into some
data representation which you think of as a single point in some Euclidean
N-space, and then comes the cluster analysis. You might want to replace all the
different A’s that you collected with a single A prototype, typically the
center of gravity of the "cluster" of A’s that you have. However, you
will be screwed if there are two or more totally different ways of drawing the
A’s. You will average two radically different ways of drawing the letter A and
get a prototype that looks like nothing. Cluster analysis is the attempt to
solve that problem by telling you that you have one group over here and another
one over there.
Before we get into the VERI algorithm let me
run you through a standard simple clustering algorithm so that you can compare
the advantages and disadvantages of the new algorithm to an old one. The old
one that I am going to haul out is the K-MEANS clustering algorithm.
K-MEANS
--------------
You got your set of data points in Rn
(Euclidean space of n dimensions). You ARBITRARILY decide that you have K
different clusters (this may or may not be a problem in your application. In
the example of handwriting, having too few clusters compared to the true number
is disastrous, but having too many is merely a little inefficient.) You seed
the algorithm with K randomly located "center" points in Rn. These K
points are going to be the centers of our K clusters. You now take each of your
original datapoints and find which of the K centers it is closest to. This
partitions your data points into K subsets. For each one of those subsets you
calculate the mean of the subset and replace the old "center" point
with the newly calculated mean point. Iterate this process until it converges.
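Here is a minimal sketch of that loop in Python. This is just my own illustration of the idea, not anybody's official code; the dist2 helper and the keep-the-old-center handling of an empty partition are my additions.

import random

def dist2(a, b):
    # squared Euclidean distance between two points (tuples of equal length)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100):
    # points: list of n-dimensional tuples; k: the number of clusters you
    # ARBITRARILY decided on
    centers = random.sample(points, k)          # k randomly chosen seed centers
    groups = []
    for _ in range(iterations):
        # partition: assign every data point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[i].append(p)
        # replace each center with the mean of the points it captured
        new_centers = []
        for old, g in zip(centers, groups):
            if g:
                new_centers.append(tuple(sum(xs) / len(g) for xs in zip(*g)))
            else:
                # an empty partition is the "captured nothing" failure discussed
                # below; keeping the old center is the lazy fix, reseeding and
                # flagging the run is the better one
                new_centers.append(old)
        if new_centers == centers:              # converged
            break
        centers = new_centers
    return centers, groups

Call it as centers, groups = kmeans(data, 2) and you have your two clusters, most of the time.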
If you imagine this algorithm where you got a
cloud of points over here and a second cloud of points over there, and you do
the k-means with k=2, you will see that it mostly does just what you want.
Unfortunately, mostly isn’t always.
Suppose your first randomly chosen center
point was right in the middle between the two clouds, the mean of the entire
dataset, and the second one was way off in the boondocks. In this case when you
do the partition operation, the first point captures all the data and the
second one captures nothing. The system converges to the wrong thing. This is
clearly easy to detect i.e. if one of your partitions is empty, you know
something is wrong. I am merely pointing out that the system is sensitive to
the initial starting conditions and you must write some extra code to notice
this.
Note that a single outlier in your data set
can either incorrectly capture a center or skew the entire result.
The K-Means algorithm is easy to do, works in
simple cases and mostly does the right thing. It is a good choice if you know
the shape of your data. Drawbacks are primarily that you had to choose K before
you got started and for oddly shaped clusters (e.g. one cluster of points right
around the origin and the second cluster near the surface of the unit sphere
i.e. one cluster buried inside another) it totally blows chunks.
OK, that’s enough background. On to the new
thing.
VERI
---------
This method is built around an attempt to
model what people do when they "see" data clustering and, most
importantly, eliminates any need to "know" anything about the data in
advance, i.e. you do not need to first eliminate outliers, guess the number of
clusters, decide on thresholds or any other such stuff.
Step 1. You map out the human visual
perceptual system by doing something like the following experiment. You
nail down two points (call them blue) and then randomly introduce a third (call
it red). You ask the human to tell you whether they see the two blue points as
belonging to the same group or not. Clearly if the red point is far away from
the blue ones, the blue ones will be seen as a cluster with a remote red one.
On the other hand, if the red one is put right next to a blue one, the red is
seen as clustering with that blue and the other blue one is left out. That is,
what you are mapping out is the Region Of Interference (ROI), the region around
the two points where the third point fucks up the clustering.
Vision researchers do the experiment and find
a shape that looks somewhat like the two circles centered on the two points
with radius equal to half the distance between the two points. Thus the two
circles are tangent at the midpoint between the two points. The actual shape
is something different from the two circles, more of a dumbbell shape, but it
is pretty close to two kissing circles.
I point out that this step one here is not
really a step in the algorithm. It is a precursor that has already been done.
All you need is to have the polygon that represents the shape of the region of
interference.
Step 2. With that one planar region of
interference you are ready to cluster the set of points. For each pair of
points you cluster them together iff there is NO third point that interferes
with the clustering. Recall that 3 points determine a plane, so even if you are
dealing with a space of many dimensions you are really just looking at the 2D ROI
in the plane containing the three points. All you do is look to see if some
third point lies within that region of interference in that plane.
Step 3. The connected components are the
clusters.
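Here is a quick Python sketch of steps 2 and 3, brute force over all the triples. The function names are mine, and the interference test is left as a plug-in since the real thing uses the measured polygon.

def veri_clusters(points, interferes):
    # interferes(a, b, z) should return True when point z lands inside the
    # region of interference for the pair (a, b). The real algorithm tests
    # the measured polygon in the plane of the three points; any stand-in
    # with that signature works here.
    n = len(points)
    # Step 2: link a pair iff NO third point interferes with it
    links = [(i, j)
             for i in range(n) for j in range(i + 1, n)
             if not any(interferes(points[i], points[j], points[z])
                        for z in range(n) if z != i and z != j)]
    # Step 3: connected components of the link graph are the clusters
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x
    for i, j in links:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

The union-find business at the end is just one convenient way to pull out the connected components.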
Piece of cake. On the surface it is an O(N^3)
algorithm. In practice you can do some quick and dirty hacking to make it more
like O(N^2). In particular, you find every point’s nearest neighbor. Chances are
that the nearest neighbor clusters with you and also simultaneously prevents
nearly any other clustering. This lets you locate an interfering point in
something closer to O(1).
Now for all the wonderful advantages: you do not have to pick the number of clusters in advance, there are no thresholds to fiddle with, you do not have to weed out outliers first, and oddly shaped clusters (the one-buried-inside-another case that sinks K-Means) are handled just fine.
That’s about it. My own impression without
doing any testing of it is that because they came at it from a human perceptual
region of interference point of view, they attach entirely TOO MUCH importance
to the shape of the region that they measured in their human experiments. Their
programming is clearly complicated by having to calculate whether a third point
is within their complicated polygonal region. They point out some of the tricks
that they do to get a quick true rejection, a quick true acceptance and a
harder test for those that fail the two quick tests. My guess would be that if
you replace their complicated region with my approximate description of the two
balls centered on the two points you would get nearly identical results with
much less computation (and no need to ever even consider the plane that
contains the 3 points).
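For what it is worth, that two-ball stand-in is only a few lines; this is my guess at it, not anything the VERI folks published. It reuses dist2 from the K-Means sketch and plugs straight into veri_clusters as the interferes argument.

def two_ball_interferes(a, b, z):
    # True if z falls inside either ball of radius |a-b|/2 centered on a or
    # on b, the "kissing circles" stand-in for the measured ROI. Works
    # directly in any number of dimensions, no projection onto the plane of
    # the three points required.
    d2 = dist2(a, b)                 # squared distance between the pair
    return 4 * dist2(a, z) < d2 or 4 * dist2(b, z) < d2

Then veri_clusters(points, two_ball_interferes) gives you my cheapened version of the whole thing.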
Color Clusters
------------------
The one particularly entertaining thing that
they did with this is to do some clustering in X-Ray images of brain tissue.
They simply clustered pixels having the same basic color and the same basic
location.
It reminded me of an experiment I once did in
developing a progressive image download format. The idea was to replace a
costly-to-download complex image with a collection of solid color polygons that
approximates the image. The method was to use K-Means to choose the K
best color centers and then to posterize the image into those K clusters, i.e.
convert each pixel to the nearest poster color. The next step was to encode
the edges of the polygons. We used a turtle graphics model where you start at
the coordinates of a corner of a pixel and using Up, Down, Left, Right you
march around the boundaries of the single colored groups.
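For the curious, the posterize step is just K-Means run on pixel colors. Here is a rough sketch reusing the kmeans and dist2 sketches from above; the pixel layout is an assumption on my part, and the turtle graphics boundary coding is left out.

def posterize(image, k):
    # image: list of rows, each row a list of (r, g, b) tuples
    pixels = [px for row in image for px in row]
    centers, _ = kmeans(pixels, k)              # the K "poster" colors
    def nearest(px):
        return min(centers, key=lambda c: dist2(px, c))
    # snap every pixel to its nearest poster color
    return [[nearest(px) for px in row] for row in image]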
What made the thing progressive was that we
would first code it up using K=1, which would just give you a rectangle of the
average color of the image. Then we would use K=2 and split the image into two
color clusters and partition the image into those two colors. At each stage we
would take one of the existing colors in the partially transmitted image and split
it into two colors and transmit the boundaries that separate the colors. It
looked pretty cool. We never did use it though. Discovered Wavelets and got
busy with other things.
Joke
------
Reporter is interviewing a blind man who is
making a name for himself by taking up skydiving.
Reporter: So how exactly do you skydive if
you can’t see?
Blindman: Oh, it is very easy. I just jump
out of the plane, count to 10 and pull the ripcord.
Reporter: Yes, I understand that, but how can
you tell if you are over flat ground or trees if you can’t see?
Blindman: Well, being blind forces me to rely
on my other senses. I can smell the trees and steer away from them.
Reporter: I see, but tell me, how do you know
when you are going to hit the ground?
Blindman: That’s easy. Just before I hit the
ground my dog’s leash goes slack.
Sorry, make that Bad Joke. See ya in a few.