Title: "Acceleration and Scalability for c-Means Clustering"

Abstract: This talk begins by characterizing the three canonical problems of clustering: tendency assessment (does the data have cluster substructure?); clustering (how do we find partitions of the data?); and validation (are the partitions we find accurate and/or useful?). I identify the four types of models used in clustering: partition only (single linkage); prototype only (self-organizing maps); partition and prototypes (hard, fuzzy, and possibilistic c-means); and (partition, prototype, other parameter) models (the EM algorithm for Gaussian mixture decomposition). I will give a brief account of the basic models and algorithms for hard, fuzzy, and possibilistic c-means. These algorithms are reliable, but they can be slow and may not scale well to very large data sets. Hence, the main issues when dealing with huge data sets are acceleration and scaling.
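As a point of reference, the basic alternating-optimization loop of FCM can be sketched in a few lines; the following is an illustrative pure-Python version (the function name and toy data are my own, not from the talk):

```python
import math

def fcm(data, c, m=2.0, max_iter=100, tol=1e-5, init=None):
    """Minimal fuzzy c-means sketch (illustrative only).

    data: list of points (each a list of floats); c: number of clusters;
    m > 1: fuzzifier. Returns (centers, U) with U[k][i] = membership u_ik.
    """
    n, d = len(data), len(data[0])
    centers = [list(p) for p in (init if init is not None else data[:c])]
    exp = 2.0 / (m - 1.0)
    U = [[0.0] * c for _ in range(n)]
    for _ in range(max_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        for k, x in enumerate(data):
            dists = [max(math.dist(x, v), 1e-12) for v in centers]
            U[k] = [1.0 / sum((di / dj) ** exp for dj in dists) for di in dists]
        # Center update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        new_centers = []
        for i in range(c):
            w = [U[k][i] ** m for k in range(n)]
            tot = sum(w)
            new_centers.append([sum(w[k] * data[k][j] for k in range(n)) / tot
                                for j in range(d)])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once the centers no longer move
            break
    return centers, U
```

Hard c-means (k-means) is the crisp limit of this loop, where each u_ik is forced to 0 or 1; possibilistic c-means relaxes the constraint that each row of U sums to one.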


Many ideas for accelerating FCM have been advanced over the years. I will review eight methods that can be used to speed up clustering with FCM when the entire data set fits in memory. Included are AFCM, mrFCM, brFCM, and a method that accelerates FCM by running the algorithm on a GPU instead of a conventional CPU. Reported improvements in speed range from 2:1 to 100:1.
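One family of these speedups shrinks the number of points the iteration loop must touch, for example by aggregating coincident or quantized values into weighted exemplars and running a weighted FCM over them. The sketch below illustrates that general idea on 1-D, image-like data; it is my own hedged reconstruction of the bin-reduction concept, not the published brFCM code:

```python
import math
from collections import Counter

def weighted_fcm_1d(values, weights, c, m=2.0, iters=50):
    """Weighted FCM on 1-D exemplars: exemplar x_k counts w_k times, so the
    center update becomes v_i = sum_k w_k u_ik^m x_k / sum_k w_k u_ik^m."""
    lo, hi = min(values), max(values)
    centers = [lo + i * (hi - lo) / (c - 1) for i in range(c)]  # spread init
    exp = 2.0 / (m - 1.0)
    for _ in range(iters):
        U = []
        for x in values:
            dists = [max(abs(x - v), 1e-12) for v in centers]
            U.append([1.0 / sum((di / dj) ** exp for dj in dists)
                      for di in dists])
        centers = [
            sum(w * (u[i] ** m) * x for x, w, u in zip(values, weights, U))
            / sum(w * (u[i] ** m) for w, u in zip(weights, U))
            for i in range(c)
        ]
    return centers

# Aggregate a data set with many repeated values into (value, count) exemplars,
# so the FCM loop visits 4 exemplars instead of 200 raw points.
pixels = [0] * 50 + [1] * 50 + [9] * 40 + [10] * 60
hist = Counter(pixels)
centers = weighted_fcm_1d(list(hist.keys()), list(hist.values()), c=2)
```

Because the per-iteration cost of FCM is roughly linear in the number of points, collapsing 200 raw values to 4 weighted exemplars gives a proportional speedup while leaving the center updates mathematically equivalent.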


The final part of the talk covers two approaches to (approximate) FCM clustering of very large (VL) data. First, I will discuss two methods based on incremental, distributed clustering (spFCM and oFCM). The second set of three methods approximates FCM clusters for VL data by a quite different route – viz., sampling followed by non-iterative extension. This general technique can also be used with EM and many other algorithms not discussed in this talk. The three algorithms are eFFCM (efficient fast FCM, for image data); geFFCM (for feature vector data); and eNERF (for relational data). We will have time to look at only eFFCM here.
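The extension step in the sample-then-extend scheme is non-iterative: once FCM has been run on a tractable sample, the fitted centers assign fuzzy memberships to every remaining point in a single pass. Here is a minimal sketch of that step (my own illustration of the general idea, not the eFFCM/geFFCM/eNERF code):

```python
import math

def extend_memberships(points, centers, m=2.0):
    """Given centers fitted by FCM on a sample, compute fuzzy memberships
    for out-of-sample points directly -- no further iteration needed."""
    exp = 2.0 / (m - 1.0)
    U = []
    for x in points:
        dists = [max(math.dist(x, v), 1e-12) for v in centers]
        U.append([1.0 / sum((di / dj) ** exp for dj in dists) for di in dists])
    return U

# The centers are assumed to come from FCM run on a small sample of a VL set.
centers = [[0.0], [10.0]]
U = extend_memberships([[1.0], [9.0], [5.0]], centers)
```

The one-pass cost is what makes the scheme attractive for VL data: the expensive iterative optimization only ever sees the sample, while the full data set is touched exactly once.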

James Bezdek: Jim received the PhD in Applied Mathematics from Cornell University in 1973. Jim is past president of NAFIPS (North American Fuzzy Information Processing Society), IFSA (International Fuzzy Systems Association), and the IEEE CIS (Computational Intelligence Society); founding editor of the Int'l. Jo. Approximate Reasoning and the IEEE Transactions on Fuzzy Systems; Life Fellow of the IEEE and IFSA; and a recipient of the IEEE 3rd Millennium medal, the IEEE CIS Fuzzy Systems Pioneer award, and the IEEE Frank Rosenblatt technical field award. Jim's interests: woodworking, optimization, motorcycles, pattern recognition, cigars, clustering in very large data, fishing, visual methods for clustering, blues music, wireless sensor networks, poker, and co-clustering in rectangular relational data. Jim retired in 2007, and will be coming to a university near you soon (especially if there is fishing nearby).