Data Classification Algorithms, DBSCAN and Mahalanobis Distance

by Codewiz51 October 22, 2011 07:16

I've been working on data classification algorithms for the last several weeks.  For the multi-feature (multi-dimensional) data I am working with, I've settled on a modified version of DBSCAN that uses Mahalanobis distance.  Using the Mahalanobis distance instead of the Euclidean distance (ED) offers some great advantages.  (Mahalanobis distance will be referred to as MD in the rest of this posting to save me some typing.)  MD has an advantage over ED (no pun intended) in that it is not as sensitive to large differences in the order of magnitude of features (dimensions).  For example, in multi-featured data, one feature may have an order of magnitude (OOM) of 10**5 and another feature may have an OOM of 10**-2.  If you attempt to use ED, the first feature is so much larger than the second that it dominates the distance calculation.  MD weights the differences between points by the inverse of the covariance matrix, so each feature is effectively scaled by its own variance and one or more large OOM features do not dominate the result.   See this reference for a visualization of this effect.
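To make the scaling effect concrete, here is a minimal sketch for the two-feature case, using only the Python standard library. The synthetic data, the seed, and the helper names (covariance_2x2, inverse_2x2, mahalanobis) are my own assumptions for illustration; they are not taken from the modified DBSCAN described above. It measures the distance from the origin to a point one standard deviation out along each feature.

```python
import math
import random

random.seed(0)
n = 500
# Synthetic, uncorrelated features with very different scales (assumed data).
xs = [random.gauss(0.0, 1e5) for _ in range(n)]   # OOM 10**5
ys = [random.gauss(0.0, 1e-2) for _ in range(n)]  # OOM 10**-2

def covariance_2x2(xs, ys):
    """Sample covariance of two features, returned as (sxx, sxy, syy)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mx) ** 2 for x in xs) / (m - 1)
    syy = sum((y - my) ** 2 for y in ys) / (m - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (m - 1)
    return sxx, sxy, syy

def inverse_2x2(sxx, sxy, syy):
    """Inverse of the symmetric matrix [[sxx, sxy], [sxy, syy]]."""
    det = sxx * syy - sxy * sxy
    return syy / det, -sxy / det, sxx / det

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mahalanobis(a, b, inv):
    """sqrt(d^T * inv * d) for a two-feature difference vector d."""
    ixx, ixy, iyy = inv
    dx, dy = a[0] - b[0], a[1] - b[1]
    return math.sqrt(dx * dx * ixx + 2.0 * dx * dy * ixy + dy * dy * iyy)

inv = inverse_2x2(*covariance_2x2(xs, ys))
origin = (0.0, 0.0)
p_big = (1e5, 0.0)     # about one standard deviation along feature 1
p_small = (0.0, 1e-2)  # about one standard deviation along feature 2

ed_big, ed_small = euclidean(origin, p_big), euclidean(origin, p_small)
md_big, md_small = mahalanobis(origin, p_big, inv), mahalanobis(origin, p_small, inv)
print("ED:", ed_big, ed_small)
print("MD:", md_big, md_small)
```

The two Euclidean distances differ by about seven orders of magnitude, while both Mahalanobis distances come out close to 1, because each step is one standard deviation in its own feature.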

For all of its promise and effective results, I am discovering that the entire process is very dependent upon my assumptions.  Two guesses have a great effect on how points are classified: the limiting MD used to judge whether a point is in a group (DBSCAN's eps parameter), and how many points a point must be bound to before it is considered a member of the group (DBSCAN's minPts parameter).  Often, slight changes to the MD limit or to the number of points necessary to define a group can rearrange the entire output of the algorithm.  This becomes a very real problem when there are more than three features (dimensions) to your data.  If you cannot visualize your data, then how do you determine if the starting assumptions are "good"?
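The sensitivity is easy to reproduce on a toy data set. The sketch below is a plain textbook DBSCAN with a pluggable distance function, not the modified version I am using. The points, the variances, and the names (dbscan, md_diag, count_clusters) are made up for illustration, and the distance assumes a diagonal covariance matrix (uncorrelated features), which reduces MD to a variance-scaled Euclidean distance.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts, dist):
    """Textbook DBSCAN. Returns one label per point: 0, 1, ... or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force region query; includes the point itself.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE      # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_nbrs)
    return labels

# Assumed per-feature variances (diagonal covariance, illustration only).
VAR = (1e10, 1e-4)

def md_diag(a, b):
    """Mahalanobis distance under the diagonal-covariance assumption."""
    return math.sqrt(sum((p - q) ** 2 / v for p, q, v in zip(a, b, VAR)))

def count_clusters(labels):
    return len(set(labels) - {NOISE})

# Two groups along feature 1, separated by a gap of two standard deviations.
steps = [0, 1, 2, 3, 5, 6, 7, 8]
points = [(s * 1e5, 0.001 * (i % 2)) for i, s in enumerate(steps)]

for eps, min_pts in [(0.5, 3), (1.5, 3), (2.5, 3), (1.5, 4)]:
    labels = dbscan(points, eps, min_pts, md_diag)
    print(eps, min_pts, count_clusters(labels), labels)
```

On these eight points, going from an MD limit of 0.5 to 1.5 to 2.5 moves the result from all noise, to two groups, to a single merged group, and raising the required point count from 3 to 4 at a limit of 1.5 sends everything back to noise.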

I'll address my partial solution to this issue as I am able to test methods for improving the algorithm.

