A good review paper on Statistical Pattern Recognition:

  • Anil K. Jain, Robert P.W. Duin, Jianchang Mao, "Statistical Pattern Recognition: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000

Abstract: The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

A paper dealing with general considerations on feature extraction:

  • I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003

Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

A technical report on Bayesian Classification Theory

  • Robert Hanson, John Stutz, Peter Cheeseman, "Bayesian Classification Theory", NASA Ames Research Center (Artificial Intelligence Research Branch)

Abstract: The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass system searches for the most probable classifications, automatically choosing the number of classes and the complexity of class descriptions. A simpler version of AutoClass has been applied to many large real data sets, has discovered new independently verified phenomena, and has been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters through a class hierarchy. In this paper we summarize the mathematical foundations of AutoClass.

A 1967 paper introducing the nearest neighbor algorithm and bounding its error in terms of the Bayes probability of error

  • T. Cover and P. Hart, "Nearest neighbor pattern classification", IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, Jan. 1967

Abstract: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points. This rule is independent of the underlying joint distribution on the sample points and their classifications, and hence the probability of error R of such a rule must be at least as great as the Bayes probability of error R* -- the minimum probability of error over all decision rules taking underlying probability structure into account. However, in a large sample analysis, we will show in the M-category case that R* <= R <= R*(2 - MR*/(M-1)), where these bounds are the tightest possible, for all suitably smooth underlying distributions. Thus for any number of categories, the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error. In this sense, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
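The decision rule in the abstract's first sentence is simple enough to sketch directly. Below is a minimal illustration of the 1-NN rule; the function name and the toy 2-D points are ours, not Cover and Hart's:

```python
import math

def nearest_neighbor_classify(train, query):
    """Cover-Hart rule: give `query` the label of the nearest of a set
    of previously classified points. `train` is a list of (point, label)
    pairs; points are coordinate tuples."""
    _, label = min(train, key=lambda pl: math.dist(pl[0], query))
    return label

# Made-up two-class training set.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(nearest_neighbor_classify(train, (0.2, 0.1)))  # A
```

The rule uses no distributional assumptions at all, which is exactly why the paper's bound on its error relative to the Bayes risk is striking.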

Two papers on a different approach to the Bayes classifier, the so-called naive Bayes classifier

  • P. Domingos and M. Pazzani, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss", Machine Learning, vol. 29, pp. 103-130, 1997.
  • I. Rish "An empirical study of the naive Bayes classifier", Workshop on Empirical Methods in Artificial Intelligence, IJCAI 2001

An article from the Journal of Multivariate Analysis on Bayesian Estimators for Normal Discriminant Functions

  • A.L. Rukhin, "Generalized Bayes estimators of a normal discriminant function", Journal of Multivariate Analysis, vol. 41, no. 1, pp. 154-162, Apr. 1992.

Abstract: It is shown that the traditional estimator of a discriminant coefficient vector is the generalized Bayes estimator with respect to a prior which can be approximated by proper priors. As a corollary the admissibility of this procedure in the class of all scale equivariant estimators is derived.

A historical paper about how R.A. Fisher introduced the Maximum Likelihood method in 1922:

  • J. Aldrich, "R.A. Fisher and the Making of Maximum Likelihood 1912-1922", Statistical Science, vol. 12, pp. 162-176, 1997.

Abstract: In 1922 R.A. Fisher introduced the method of maximum likelihood. He first presented the numerical procedure in 1912. This paper considers Fisher's changing justifications for the method, the concepts he developed around it (including likelihood, sufficiency, efficiency and information) and the approaches he discarded (including inverse probability).

A tutorial on Maximum Likelihood Estimation

  • In Jae Myung, "Tutorial on maximum likelihood estimation", Journal of Mathematical Psychology, vol. 47, pp. 90-100, 2003

Abstract: In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial is researchers who practice mathematical modeling of cognition but are unfamiliar with estimation theory. MLE is an indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of some of the basic principles.
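As a concrete illustration of the method the tutorial covers, here is a minimal sketch of MLE for a Bernoulli parameter via a grid search; the data set and grid are invented for illustration (for this model the MLE is simply the sample mean):

```python
import math

def log_likelihood(p, data):
    """Log-likelihood of Bernoulli parameter p for 0/1 observations."""
    return sum(math.log(p) if x else math.log(1.0 - p) for x in data)

# Invented data: 7 successes in 10 trials, so the MLE is 0.7.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Maximize the log-likelihood over a grid of candidate values of p.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, data))
print(p_hat)  # 0.7
```

In non-linear models with non-normal data, the same idea applies but the maximization is done numerically rather than over a grid or in closed form.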

A paper that describes Fisher's linear discriminant for pattern recognition

  • T. Cooke, "Two variations on Fisher's Linear Discriminant for pattern recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, Feb. 2002, pp. 268-273

Abstract: Discriminants are often used in pattern recognition to separate clusters of points in some multidimensional feature space. The paper provides two fast and simple techniques for improving on the classification performance provided by Fisher's linear discriminant for two classes. Both of these methods are also extended to nonlinear decision surfaces through the use of Mercer Kernels.
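For reference, the standard two-class Fisher discriminant that Cooke's paper builds on can be sketched as follows. This is the textbook direction w = Sw^{-1}(m_a - m_b), not either of the paper's two variations, and the toy data are made up:

```python
def fisher_direction(class_a, class_b):
    """Textbook two-class Fisher discriminant direction
    w = Sw^{-1} (m_a - m_b) for 2-D points, where Sw is the pooled
    within-class scatter matrix (a sketch, not Cooke's variants)."""
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        # Sum of outer products (x - m)(x - m)^T over the class.
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            s[0][0] += dx * dx; s[0][1] += dx * dy
            s[1][0] += dy * dx; s[1][1] += dy * dy
        return s

    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    d = (ma[0] - mb[0], ma[1] - mb[1])
    # w = Sw^{-1} d, with the 2x2 inverse written out explicitly.
    return ((sw[1][1] * d[0] - sw[0][1] * d[1]) / det,
            (-sw[1][0] * d[0] + sw[0][0] * d[1]) / det)

a = [(1, 1), (2, 1), (1, 2), (2, 2)]
b = [(5, 5), (6, 5), (5, 6), (6, 6)]
print(fisher_direction(a, b))  # (-2.0, -2.0)
```

Projecting points onto w separates the two clusters; kernelizing this projection (via Mercer kernels) is what extends such discriminants to nonlinear decision surfaces.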

An article dealing with face recognition by using Fisher Linear Discriminant

  • C. Liu and H. Wechsler, "Enhanced Fisher linear discriminant models for face recognition", Proceedings of the 14th International Conference on Pattern Recognition, Queensland, Australia, August 1998.

Abstract: We introduce in this paper two Enhanced Fisher Linear Discriminant (FLD) Models (EFM) in order to improve the generalization ability of standard FLD-based classifiers such as Fisherfaces. Similar to Fisherfaces, both EFM models first apply Principal Component Analysis (PCA) for dimensionality reduction before proceeding with FLD-type analysis.

A paper describing a method for binary classification using the Rayleigh quotient (generalized Rayleigh quotient) for malignant tumor detection

  • T. Mu, A.K. Nandi, and R.M. Rangayyan, "Pairwise Rayleigh quotient classifier with application to the analysis of breast tumors", Proceedings of the 4th IASTED International Conference on Signal Processing, Pattern Recognition and Applications, Innsbruck, Austria, pp. 356-361, 2007

Abstract: In this paper, we propose a new supervised learning method for binary classification, named the pairwise Rayleigh quotient (PRQ) classifier, in which the nonlinearity is achieved by employing kernel functions. The PRQ classifier generates a Rayleigh quotient based on a set of pairwise constraints, which consequently leads to a generalized eigenvalue problem with low complexity of implementation. The PRQ classifier is applied in the original feature space for linear classification, as well as in a transformed feature space by employing the triangle kernel for nonlinear classification, to discriminate malignant breast tumors from a set of 57 regions in mammograms, of which 20 are related to malignant tumors and 37 to benign masses. Nine different feature combinations are studied. Experimental results demonstrate that the proposed linear PRQ classifier provides results comparable to those obtained with Fisher linear discriminant analysis (FLDA). In the case of nonlinear classification, the PRQ classifier with the triangle kernel provides a perfect performance of 1.0 for all nine feature combinations, evaluated in terms of the area under the Receiver Operating Characteristic curve, but with robustness limited to settings of the kernel parameter within a certain range. We propose a measure of robustness to evaluate the PRQ classifier.

The 1958 paper in which Rosenblatt introduced the perceptron

  • F. Rosenblatt, "The Perceptron: A probabilistic model for information storage and organization in the brain", Psychological Review, Vol. 65, No. 6, 1958.

A paper explaining several neural-network algorithms including perceptron

  • B. Widrow, and M.A. Lehr, "30 years of adaptive neural networks: perceptron, Madaline, and backpropagation", Proceedings of the IEEE, Vol. 78, pp. 1415-1442, Sept 1990.

Abstract: Fundamental developments in feedforward artificial neural networks from the past thirty years are reviewed. The history, origination, operating characteristics, and basic theory of several supervised neural-network training algorithms (including the perceptron rule, the least-mean-square algorithm, three Madaline rules, and the backpropagation technique) are described. The concept underlying these iterative adaptation algorithms is the minimal disturbance principle, which suggests that during training it is advisable to inject new information into a network in a manner that disturbs stored information to the smallest extent possible. The two principal kinds of online rules that have developed for altering the weights of a network are examined for both single-threshold elements and multielement networks. They are error-correction rules, which alter the weights of a network to correct error in the output response to the present input pattern, and gradient rules, which alter the weights of a network during each pattern presentation by gradient descent with the objective of reducing mean-square error (averaged over all training patterns).
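The error-correction idea described in the abstract (adjust weights only when the response to the present pattern is wrong) can be sketched with the classic perceptron rule; the data set and parameters below are invented for illustration:

```python
def train_perceptron(samples, epochs=20, eta=1.0):
    """Perceptron error-correction rule: update weights only when the
    current response to a pattern is wrong (a minimal 2-input sketch).
    `samples` is a list of ((x1, x2), label) pairs with labels +1/-1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in samples:
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += eta * y * x1
                w[1] += eta * y * x2
                b += eta * y
    return w, b

# Linearly separable toy data (an AND-like task).
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1),
        ((1.0, 0.0), -1), ((1.0, 1.0), 1)]
w, b = train_perceptron(data)
print(all(y * (w[0] * x[0] + w[1] * x[1] + b) > 0 for x, y in data))  # True
```

On linearly separable data this rule converges in a finite number of updates; gradient rules such as LMS and backpropagation instead take a descent step on every pattern presentation.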

A paper by one of the authors we talked about in class (Feb 19th): Vapnik

  • C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning, vol. 20, pp. 273-297, 1995.

Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

Application of NN (I): This paper is about modelling of object behaviours using statistical models

  • N. Johnson and D. Hogg, "Learning the Distribution of Object Trajectories for Event Recognition", Journal of Image and Vision Computing, Vol. 14, pp. 609-615, 1996.

Abstract: The advent in recent years of robust, real-time, model-based tracking techniques for rigid and non-rigid moving objects has made automated surveillance and event recognition a possibility. We present a statistically based model of object trajectories which is learnt from image sequences. Trajectory data is supplied by a tracker using Active Shape Models, from which a model of the distribution of typical trajectories is learnt. Experimental results are included to show that the generated model can be used for the identification of incidents, event recognition and trajectory prediction.

Application of NN (II): Speechreading (Lipreading)

  • M. Vogt, "Fast Matching of a Dynamic Lip Model to Color Video Sequences under Regular Illumination Conditions", Speechreading by Humans and Machines, vol. 150 of NATO ASI Series, Computer and Systems Sciences, pp. 399-408, 1996.

Abstract: Automatic speechreading is based on a robust lip image analysis. In this work, no special illumination or lip make-up is used. Compared to other approaches the analysis is based on true color video images. The system allows for realtime tracking and storage of the lip region and robust off-line lip model matching. The proposed model is based on cubic outline curves. A neural classifier detects visibility of teeth edges and other attributes. At this stage of the approach the edge between the closed lips is automatically modeled if applicable, based on a neural network's decision. Further interior model parts for teeth and lip opening may be applied in the future. To allow fast model adaptation, image processing is only performed where necessary. An image processing cache stores results for fast further access. The energy function, which is minimized during adaptation, considers internal model forces as well as processed image data, outer forces, and dynamic constraints. The rotationally invariant model allows easy extraction of spatial and dynamic parameters. Also principal component analysis or neural dimension reduction techniques may be applied to obtain significant features for recognition.

Application of NN (III): Real-time Target Identification for Security Applications

  • S.J. McKenna, and S. Gong, "Non-intrusive person authentication for access control by visual tracking and face recognition", Lecture Notes in Computer Science, Springer, Vol. 1206, pp. 177-183, 1997.

Abstract: Face recognition systems typically operate robustly only within highly constrained environments. This paper describes work aimed at performing face recognition in more unconstrained environments such as occur in security applications based on closed-circuit television (CCTV). The system described detects and tracks several people as they move through complex scenes. It uses a single, fixed camera and extracts segmented face sequences which can be used to perform face recognition or verification. Example sequences are given to illustrate performance. An application in non-intrusive access control is discussed.

Decision Trees and Applications in Pattern Recognition and Intrusion Detection

  • Xiao-Bai Li, "A scalable decision tree system and its application in pattern recognition and intrusion detection", Decision Support Systems, vol. 41, pp. 112-130, 2005.

Abstract: The publication discusses one of the major problems of data mining: dealing with very large datasets. It is hard to develop scalable algorithms for such datasets, since their sizes exceed the memory capacity of a computer. The author suggests a new decision tree algorithm called SURPASS (Scaling Up Recursive Partitioning with Sufficient Statistics), which is very effective when dealing with these types of datasets. SURPASS uses linear discriminants to partition the decision tree recursively. The algorithm summarizes the data needed to build the decision tree as a set of sufficient statistics, gathering the data sequentially in small segments and operating on these subsets in main memory one at a time. Consequently, the algorithm achieves high efficiency when working with very large collections. The paper presents experiments on three datasets involving pattern recognition and intrusion detection. The author claims that SURPASS scales well to large datasets and produces an efficient and accurate decision tree.

The seminal paper of S. Johnson discussed in class (04/10)

  • S. Johnson, "Hierarchical clustering schemes", Psychometrika, vol. 32, No. 3, pp. 241-254, 1967.

Abstract: Techniques for partitioning objects into optimally homogeneous groups on the basis of empirical measures of similarity among those objects have received increasing attention in several different fields. This paper develops a useful correspondence between any hierarchical system of such clusters, and a particular type of distance measure. The correspondence gives rise to two methods of clustering that are computationally rapid and invariant under monotonic transformations of the data. In an explicitly defined sense, one method forms clusters that are optimally connected, while the other forms clusters that are optimally compact.
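One of the two methods the abstract alludes to, the "minimum" (single-link) scheme associated with optimally connected clusters, can be sketched as a naive agglomerative procedure. The points and distance function below are made up; Johnson's paper works from a given similarity matrix:

```python
def single_link(points, dist, k):
    """Naive agglomerative clustering with the 'minimum' (single-link)
    criterion: repeatedly merge the two clusters whose closest members
    are nearest, until k clusters remain."""
    clusters = [{p} for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters.pop(j)  # merge the closest pair
    return clusters

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
out = single_link(pts, lambda a, b: abs(a - b), 2)
print(sorted(sorted(c) for c in out))  # [[0.0, 0.1, 0.2], [5.0, 5.1]]
```

The "maximum" (complete-link) variant, which yields optimally compact clusters, differs only in replacing `min` with `max` in the between-cluster distance.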

The paper by Benjamin King describing a step-wise procedure for clustering (class 04/10)

  • B. King, "Step-Wise Clustering Procedures", Journal of the American Statistical Association, vol. 62, No. 317, pp. 86-101, 1967.

Abstract: A simple step-wise procedure for the clustering of variables is described. Two alternative criteria for the merger of groups at each pass are discussed: (1) maximization of the pairwise correlation between the centroids of two groups, and (2) minimization of Wilks' statistic to test the hypothesis of independence between two groups. For a set of sample covariance matrices the step-wise solution for each criterion is compared with the optimal two-group separation of variables found by total enumeration of the possible groupings.

Optimization-based clustering methods

  • P. Hansen and B. Jaumard, “Cluster Analysis and Mathematical Programming,” Math. Programming, vol. 79, pp. 191-215, 1997.

Abstract: Cluster analysis involves the problem of optimal partitioning of a given set of entities into a pre-assigned number of mutually exclusive and exhaustive clusters. Here the problem is formulated in two different ways with the distance function (a) of minimizing the within groups sums of squares and (b) minimizing the maximum distance within groups. These lead to different kinds of linear and non-linear (0-1) integer programming problems. Computational difficulties are discussed.

  • J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” Proc. Fifth Berkeley Symp. Math. Statistics and Probability, vol. I: Statistics, pp. 281-297, 1967.

Abstract: This paper describes a number of applications of the 'k-means' procedure for classifying a random sample of points in E_N. The procedure consists of starting with k groups which each consist of a single random point, and thereafter adding the points one after another to the group whose mean each point is nearest. After a point is added to a group, the mean of that group is adjusted so as to take account of the new point. Thus at each stage there are in fact k means, one for each group. After the sample is processed in this way, the points are classified on the basis of nearness to the final means. The partitions which result tend to be efficient in the sense of having low within-class variance. Applications are suggested for the problems of non-linear prediction, efficient communication, non-parametric tests of independence, similarity grouping, and automatic file construction. The extension of the methods to general metric spaces is indicated.
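The sequential procedure the abstract describes can be sketched almost line for line; this is a 1-D toy example with invented data (MacQueen's setting is general E_N):

```python
def macqueen_kmeans(points, k):
    """MacQueen's sequential procedure, as in the abstract: seed each of
    the k groups with a single point, add the remaining points one after
    another to the group whose mean is nearest, updating that mean, and
    finally classify every point by nearness to the final means."""
    means = list(points[:k])       # start: k groups of one point each
    counts = [1] * k
    for x in points[k:]:
        j = min(range(k), key=lambda i: abs(x - means[i]))
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]  # incremental mean update
    return [min(range(k), key=lambda i: abs(x - means[i])) for x in points]

data = [0.0, 10.0, 0.2, 0.4, 9.8, 9.6]
print(macqueen_kmeans(data, 2))  # [0, 1, 0, 0, 1, 1]
```

Unlike the batch k-means taught today, this version makes a single pass over the data, which is why it needs only one running mean per group.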

  • A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, “Clustering with Bregman Divergences,” Proc. Fourth SIAM Int’l Conf. Data Mining, pp. 234-245, 2004.

Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.

  • R.T. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proc. 20th Int’l Conf. Very Large Data Bases, pp. 144-155, 1994.

Abstract: Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms. Furthermore, experiments conducted to compare the performance of CLARANS with that of existing clustering methods show that CLARANS is the most efficient.

Hierarchical Clustering Methods

  • C. Ding and X. He, “Cluster Aggregate Inequality and Multi-Level Hierarchical Clustering,” Proc. Ninth European Conf. Principles of Data Mining and Knowledge Discovery, pp. 71-83, 2005.

Abstract: We show that (1) in hierarchical clustering, many linkage functions satisfy a cluster aggregate inequality, which allows an exact O(N^2) multi-level (using mutual nearest neighbor) implementation of the standard O(N^3) agglomerative hierarchical clustering algorithm; (2) a desirable close-friends cohesion of clusters can be translated into kNN consistency, which is guaranteed by the multi-level algorithm; (3) for similarity-based linkage functions, the multi-level algorithm is naturally implemented as graph contraction. The effectiveness of our algorithms is demonstrated on a number of real-life applications.

  • G. Karypis, E. Han, and V. Kumar, “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling,” Computer, vol. 32, pp. 68-75, 1999.

Abstract: Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK, are designed to find clusters that fit some static models. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms break down when the data consists of clusters that are of diverse shapes, densities, and sizes. In this paper, we present a novel hierarchical clustering algorithm called Chameleon that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between two clusters are high relative to the internal inter-connectivity of the clusters and closeness of items within the clusters. The merging process using the dynamic model presented in this paper facilitates discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in Chameleon is applicable to all types of data as long as a similarity matrix can be constructed. We demonstrate the effectiveness of Chameleon on a number of data sets that contain points in 2D space, and contain clusters of different shapes, densities, sizes, noise, and artifacts. Experimental results on these data sets show that Chameleon can discover natural clusters that many existing state-of-the-art clustering algorithms fail to find.

Density-Based Clustering Methods

  • M. Ester, H. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proc. Int’l Conf. Knowledge Discovery and Data Mining, pp. 226-231, 1996.

Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, their application to large spatial databases raises the following requirements: minimal domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
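The density-based notion of clusters the abstract describes can be sketched with a naive implementation: core points (at least min_pts neighbors within eps, counting themselves) grow clusters, and points reachable from no core point become noise. The neighborhood search here is brute force, unlike the paper's efficient region queries, and the toy points and parameters are made up:

```python
import math

def dbscan(points, eps, min_pts):
    """Compact sketch of the density-based idea behind DBSCAN.
    Returns a cluster label per point; -1 means noise."""
    n = len(points)
    labels = [None] * n
    # Brute-force eps-neighborhoods (the paper uses an R*-tree instead).
    neighbors = [[j for j in range(n)
                  if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise, unless later reached from a core point
            continue
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: re-label, do not expand
            if labels[j] is None:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:  # core point: keep growing
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

pts = [(0, 0), (0.1, 0), (0.2, 0), (5, 5), (5.1, 5), (5.2, 5), (10, 10)]
print(dbscan(pts, eps=0.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Because clusters are defined by density reachability rather than by a fixed number of centroids, the same code finds arbitrarily shaped clusters and flags isolated points as noise.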

  • J. Sander, M. Ester, H. Kriegel, and X. Xu, “Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 169-194, 1998.

Abstract: The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we generalize this algorithm in two important directions. The generalized algorithm—called GDBSCAN—can cluster point objects as well as spatially extended objects according to both, their spatial and their nonspatial attributes. In addition, four applications using 2D points (astronomy), 3D points (biology), 5D points (earth science) and 2D polygons (geography) are presented, demonstrating the applicability of GDBSCAN to real-world problems.

  • R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” Proc. ACM-SIGMOD Int’l Conf. Management of Data, pp. 94-105, 1998.

Abstract: Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.

  • A. Hinneburg and D.A. Keim, “An Efficient Approach to Clustering in Large Multimedia Databases with Noise,” Proc. Int’l Conf. Knowledge Discovery and Data Mining, pp. 58-65, 1998.

Abstract: Clustering in data mining is used for identifying useful patterns and interesting distributions in the underlying data. Several algorithms for clustering large data sets have been proposed in the literature using different techniques. Density-based method is one of these methodologies which can detect arbitrary shaped clusters where clusters are defined as dense regions separated by low density regions. We present a new clustering algorithm to enhance the density-based algorithm DBSCAN. Synthetic datasets are used for experimental evaluation which shows that the new clustering algorithm is faster and more scalable than the original DBSCAN.

  • M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander, “OPTICS: Ordering Points to Identify the Clustering Structure,” Proc. ACMSIGMOD Int’l Conf. Management of Data, pp. 49-60, 1999.

Abstract: Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real-data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only 'traditional' clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure offering additional insights into the distribution and correlation of the data.

Grid-Based Clustering Methods

  • W. Wang, J. Yang, and R. Muntz, “STING: A Statistical Information Grid Approach to Spatial Data Mining,” Proc. 23rd Very Large Data Bases Conf., pp. 186-195, 1997.

Abstract: Spatial data mining, i.e., discovery of interesting characteristics and patterns that may implicitly exist in spatial databases, is a challenging task due to the huge amounts of spatial data and to the new conceptual nature of the problems which must account for spatial distance. Clustering and region oriented queries are common problems in this domain. Several approaches have been presented in recent years, all of which require at least one scan of all individual objects (points).

  • G. Sheikholeslami, S. Chatterjee, and A. Zhang, “WaveCluster: A Wavelet-Based Clustering Approach for Spatial Data in Very Large Databases,” The Very Large Databases J., vol. 8, pp. 289-304, 2000.

Abstract: Many applications require the management of spatial data in a multidimensional feature space. Clustering large spatial databases is an important problem, which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to the noise (outliers) and the order of input data. We propose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using the multiresolution property of wavelet transforms, we can effectively identify arbitrarily shaped clusters at different degrees of detail. We also demonstrate that WaveCluster is highly efficient in terms of time complexity. Experimental results on very large datasets are presented, which show the efficiency and effectiveness of the proposed approach compared to the other recent clustering methods.

Graph-Based Clustering Methods

  • C. Ding, X. He, H. Zha, M. Gu, and H. Simon, “Spectral Min-Max Cut for Graph Partitioning and Data Clustering,” Proc. First IEEE Int’l Conf. Data Mining, pp. 107-114, 2001.

Abstract: An important application of graph partitioning is data clustering using a graph model - the pairwise similarities between all data objects form a weighted graph adjacency matrix that contains all necessary information for clustering. Here we propose a new algorithm for graph partition with an objective function that follows the min-max clustering principle. The relaxed version of the optimization of the min-max cut objective function leads to the Fiedler vector in spectral graph partition. Theoretical analyses of min-max cut indicate that it leads to balanced partitions, and lower bounds are derived. The min-max cut algorithm is tested on newsgroup datasets and is found to outperform other current popular partitioning/clustering methods. The linkage-based refinements in the algorithm further improve the quality of clustering substantially. We also demonstrate that the linearized search order based on linkage differential is better than that based on the Fiedler vector, providing another effective partition method.
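The spectral relaxation the abstract refers to can be shown in a few lines: the second-smallest eigenvector of the graph Laplacian (the Fiedler vector) gives a relaxed two-way cut, which the sign of each entry rounds to a partition. A sketch of that relaxation only, on a toy similarity graph, not the paper's full min-max cut algorithm with linkage-based refinement:

```python
import numpy as np

def fiedler_split(W):
    """Two-way partition from the Fiedler vector of a similarity matrix W."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                  # graph Laplacian
    vals, vecs = np.linalg.eigh(L)      # eigenpairs in ascending order
    fiedler = vecs[:, 1]                # second-smallest eigenvector
    return fiedler >= 0                 # sign of each entry gives the cut

# Two triangles joined by one weak edge: the cut should fall on the bridge
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1                 # weak bridge between the triangles
side = fiedler_split(W)
```

The Ertoz et al. paper below, by contrast, replaces the similarity definition itself rather than the cut objective.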

  • L. Ertoz, M. Steinbach, and V. Kumar, “Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data,” Proc. Third SIAM Int’l Conf. Data Mining, pp. 47-58, 2003.

Abstract: The problem of finding clusters in data is challenging when clusters are of widely differing sizes, densities and shapes, and when the data contains large amounts of noise and outliers. Many of these issues become even more significant when the data is of very high dimensionality, such as text or time series data. In this paper we present a novel clustering technique that addresses these issues. Our algorithm first finds the nearest neighbors of each data point and then redefines the similarity between pairs of points in terms of how many nearest neighbors the two points share. Using this new definition of similarity, we eliminate noise and outliers, identify core points, and then build clusters around the core points. The use of a shared nearest neighbor definition of similarity removes problems with varying density, while the use of core points handles problems with shape and size. We experimentally show that our algorithm performs better than traditional methods (e.g., K-means) on a variety of data sets: KDD Cup '99 network intrusion data, NASA Earth science time series data, and two dimensional point sets. While our algorithm can find the “dense” clusters that other clustering algorithms find, it also finds clusters that these approaches overlook, i.e., clusters of low or medium density which are of interest because they represent relatively uniform regions “surrounded” by non-uniform or higher density areas. The run-time complexity of our technique is O(n²) since the similarity matrix has to be constructed. However, we discuss a number of optimizations that allow the algorithm to handle large datasets efficiently. For example, 100,000 documents from the TREC collection can be clustered within an hour on a desktop computer.
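The redefined similarity at the heart of this abstract, the count of shared k-nearest neighbors, is easy to state in code. A sketch of just that step (brute-force kNN, illustrative data), not the full noise-removal and core-point clustering pipeline:

```python
def snn_similarity(points, k):
    """Shared-nearest-neighbor similarity:
    sim(i, j) = number of k-nearest neighbors that points i and j share."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # k-nearest-neighbor set of each point (excluding the point itself)
    knn = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: dist2(p, points[j]))
        knn.append(set(order[1:k + 1]))

    n = len(points)
    return [[len(knn[i] & knn[j]) for j in range(n)] for i in range(n)]

# Two well-separated squares: high shared-neighbor counts within a square,
# zero across squares
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 0), (10, 1), (11, 0), (11, 1)]
sim = snn_similarity(pts, k=3)
```

Because the count of shared neighbors is bounded by k regardless of absolute distances, this similarity behaves uniformly across regions of different density, which is the property the abstract credits for handling varying-density data.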

Links for papers and tutorials on k-means clustering

Andrew Moore: “K-means and Hierarchical Clustering - Tutorial Slides”

Brian T. Luke: “K-Means Clustering”

Tariq Rashid: “Clustering”

Hans-Joachim Mucha and Hizir Sofyan: “Cluster Analysis”

Frigui Hichem: “Similarity Measures and Criterion Functions for clustering”

Osmar R. Zaïane: “Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering”

Pier Luca Lanzi: “Ingegneria della Conoscenza e Sistemi Esperti – Lezione 2: Apprendimento non supervisionato”

Stephen P. Borgatti: “How to explain hierarchical clustering”

Maria Irene Miranda: “Clustering methods and algorithms”

Jia Li: “Data Mining - Clustering by Mixture Models”
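The k-means procedure the tutorials above cover is Lloyd's algorithm: alternate assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, until the assignments stop changing. A minimal sketch with illustrative data:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid recomputation until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
               else centroids[i]        # keep an empty cluster's centroid
               for i, cl in enumerate(clusters)]
        if new == centroids:            # assignments stable: converged
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(pts, k=2)
```

As several of the tutorials note, the result depends on the random initialization and k must be chosen in advance, limitations that motivate the density-, grid-, and graph-based methods listed above.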
