Revision as of 15:42, 10 April 2008 by Mboschru (Talk)

Contents

A good review paper on Statistical Pattern Recognition:

  • Anil K. Jain, Robert P.W. Duin, Jianchang Mao, "Statistical Pattern Recognition: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan., 2000

Abstract:The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

A paper dealing with general considerations on feature extraction:

  • I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, vol. 3, pp.1157-1182, 2003

Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

A technical report on Bayesian Classification Theory

  • Robert Hanson, John Stutz, Peter Cheeseman, "Bayesian Classification Theory", NASA Ames Research Center (Artificial Intelligence Research Branch) pdf

Abstract: The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass system searches for the most probable classifications, automatically choosing the number of classes and complexity of class descriptions. A simpler version of AutoClass has been applied to many large real data sets, have discovered new independently-verified phenomena, and have been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters though a class hierarchy. In this paper we summarize the mathematical foundations of Autoclass.

A 1967 paper introducing Nearest neighbor algorithm using the Bayes probability of error

  • T. Cover and P. Hart, "Nearest neighbor pattern classification", IEEE Transactions on Information Theory vol. 13, Issue 1, Jan 1967, pp21-27

Abstract: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points. This rule is independent of the underlying joint distribution on the sample points and their classifications, and hence the probability of errorRof such a rule must be at least as great as the Bayes probability of errorR^{ast}--the minimum probability of error over all decision rules taking underlying probability structure into account. However, in a large sample analysis, we will show in theM-category case thatR^{ast} leq R leq R^{ast}(2 --MR^{ast}/(M-1)), where these bounds are the tightest possible, for all suitably smooth underlying distributions. Thus for any number of categories, the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error. In this sense, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.

Two papers on a different approach for the Bayes classifier, the so-called Naive Bayes Classifier

  • P. Domingos and M. Pazzani, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss", Journal of Machine Learning, vol 29, pp 103-130, year 1997.
  • I. Rish "An empirical study of the naive Bayes classifier", Workshop on Empirical Methods in Artificial Intelligence, IJCAI 2001

An article from the Journal of Multivariate Analysis on Bayesian Estimators for Normal Discriminant Functions

  • Rukhin, A. L. 1992. Generalized Bayes estimators of a normal discriminant function. J. Multivar. Anal. 41, 1 (Apr. 1992), 154-162.

Abstract: It is shown that the traditional estimator of a discriminant coefficient vector is the generalized Bayes estimator with respect to a prior which can be approximated by proper priors. As a corollary the admissibility of this procedure in the class of all scale equivariant estimators is derived.

An historical paper about how R.A. Fisher introduced the Maximum Likelihood method in 1922:

  • J. Aldridge, "R.A. Fisher and the making of Maximum Likelihood 1912-1922", Statistical Science, 1997, vol. 12, pp. 162-176.

Abstract: In 1922 R.A. Fisher introduced the method of maximum likelihood. He first presented the numerical procedure in 1912. This paper considers Fisher's changing justifications for the method, the concepts he developed around it (including likelihood, sufficiency, efficiency and information) and the approaches he discarded (including inverse probability).

A tutorial on Maximum Likelihood Estimation

  • In Jae Myung, "Tutorial on Maximum Estimation", Journal of Mathematical Psychology, vol. 47, pp. 90-100, 2003

Abstract: In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial are researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation in statistics and is an indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of some of the basic principles.

A paper that describes the fisher's linear discriminant for patter recognition

  • T.Cooke, "Two variations on Fischer's Linear Discriminant for pattern recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, Feb. 2002, pp. 268-273

Abstract: Discriminants are often used in pattern recognition to separate clusters of points in some multidimensional feature space. The paper provides two fast and simple techniques for improving on the classification performance provided by Fisher's linear discriminant for two classes. Both of these methods are also extended to nonlinear decision surfaces through the use of Mercer Kernels.

An article dealing with face recognition by using Fisher Linear Discriminant

  • C. Liu and H. Wechsler, "Enhanced Fisher linear discriminant models for face recognition", Proceedings of the 14th International Conference on Pattern Recognition, Queensland, Australia, August 1998.

Abstract: We introduce in this paper two Enhanced Fischer Linear Discriminant (FLD) Models (EFM) in order to improve the generalization ability of the standard FLD based classifiers such as Fisherfaces. Similar to Fisherfaces, both EFM models apply first Principal Compenent Analysis (PCA) for dimensionality reduction before proceeding with FLD type of analysis.

A paper describing a method for binary classfication using the Rayleigh quotient (Generalized Rayleigh Quotient) for malignant tumors detection

  • T. Mu, A.K. Nandi, M. Rangaraj, and Rangay yan, "Pairwise Rayleigh quotient classifier with application to the analysis of breast tumors", Proceedings of the 4th conference on IASTED International Conference: Signal Processing, Pattern Recognition and Applications, Innsbruck, Austria, pp. 356-361, 2007

Abstract: In this paper, we propose a new supervised learning method for binary classification, named the pairwise Rayleigh quotient (PRQ) classifier, in which the nonlinearity is achieved by employing kernel functions. The PRQ classifier generates a Rayleigh quotient based on a set of pairwise constraints, which consequently leads to a generalized eigenvalue problem with low complexity of implementation. The PRQ classifier is applied in the original feature space for linear classification, as well as in a transformed feature space by employing the triangle kernel for nonlinear classification, to discriminate malignant breast tumors from a set of 57 regions in mammograms, of which 20 are related to malignant tumors and 37 to benign masses. Nine different feature combinations are studied. Experimental results demonstrate that the proposed linear PRQ classifier provides results comparable to those obtained with Fisher linear discriminant analysis (FLDA). In the case of nonlinear classification, the PRQ classifier with the triangle kernel provides a perfect performance of 1.0 for all of the nine feature combinations evaluated in terms of the area under the Receiver Operating Characteristics curve, but with good robustness limited to the setting of the kernel parameter in a certain range. We propose a measure of robustness to evaluate the PRQ classifier.

A paper from 1958 when the perceptron was introduced by Rosenblatt

  • F. Rosenblatt, "The Perceptron: A probabilistic model for information storage and organization in the brain", Psynchological review, Vol. 65, No. 6, 1958.

A paper explaining several neural-network algorithms including perceptron

  • B. Widrow, and M.A. Lehr, "30 years of adaptive neural networks: perceptron, Madaline, and backpropagation", Proceedings of the IEEE, Vol. 78, pp. 1415-1442, Sept 1990.

Abstract:Fundamental developments in feedforward artificial neural networks from the past thirty years are reviewed. The history, origination, operating characteristics, and basic theory of several supervised neural-network training algorithms (including the perceptron rule, the least-mean-square algorithm, three Madaline rules, and the backpropagation technique) are described. The concept underlying these iterative adaptation algorithms is the minimal disturbance principle, which suggests that during training it is advisable to inject new information into a network in a manner that disturbs stored information to the smallest extent possible. The two principal kinds of online rules that have developed for altering the weights of a network are examined for both single-threshold elements and multielement networks. They are error-correction rules, which alter the weights of a network to correct error in the output response to the present input pattern, and gradient rules, which alter the weights of a network during each pattern presentation by gradient descent with the objective of reducing mean-square error (averaged over all training patterns).

A paper from one of the authors that we talked in class (Feb 19th): Vapnik

  • C. Cortes, and V. Vapnik, "Support-Vector Networks", Journal of Machine Learning, vol. 20, pp. 273-297, 1995.

Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the supportvector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non separable training data. High generalization ability of support vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

Application of NN (I): This paper is about modelling of object behaviours using statistical models

  • N.Johnson, and D. Hogg, "Learning the Distribution of object Trajectories for Event Recognition", Journal of Image and Vision Computing, Vol. 14, pp. 609-615,1996.

Abstract: The advent in recent years of robust, real-time, model-based tracking techniques for rigid and non-rigid moving objects has made automated surveillance and event recognition a possibility. We present a statistically based model of object trajectories which is learnt from image sequences. Trajectory data is supplied by a tracker using Active Shape Models, from which a model of the distribution of typical trajectories is learnt. Experimental results are included to show the generation of the model can be used for the identification of incidents, event recognition and trajectory prediction.

Application of NN (II): Speechreading (Lipreading)

  • M. Vogt, "Fast Matching of a Dynamic Lip Model to color video Sequences under Regular Illumination Conditions", Speechreading bu humans and Machines, vol. 150 of NATO ASI Series. Computer and Systems Sciences, pp. 399-408, 1996.

Abstract: Automatic speechreading is based on a robust lip image analysis. In this work, no special illumination or lip make-up is used. Compared to other approaches the analysis is based on true color video images. The system allows for realtime tracking and storage of the lip region and robust off-line lip model matching. The proposed model is based on cubic outline curves. A neural classifier detects visibility of teeth edges and other attributes. At this stage of the approach the edge between the closed lips is automatically modeled if applicable, based on a neural network's decision. Further interior model parts for teeth and lip opening may be applied in the future. To allow fast model adaptation, image processing is only performed where necessary. An image processing cache stores results for fast further access. The energy function, which is minimized during adaptation, considers internal model forces as well as processed image data, outer forces, and dynamic constraints. The rotationally invariant model allows easy extraction of spatial and dynamic parameters. Also principal component analysis or neural dimension reduction techniques may be applied to obtain significant features for recognition.

Application of NN (III): Real-time Target Identification for Security Applications

  • S.J. McKenna, and S. Gong, "Non-intrusive person authentication for access control by visual tracking and face recognition", Lecture Notes in Computer Science, Springer, Vol. 1206, pp. 177-183, 1997.

Abstract: Face recognition systems typically operate robustly only within highly constrained environments. This paper describes work aimed at performing face recognition in more unconstrained environments such as occur in security applications based on closed-circuit television (CCTV). The system described detects and tracks several people as they move through complex scenes. It uses a single, fixed camera and extracts segmented face sequences which can be used to perform face recognition or verification. Example sequences are given to illustrate performance. An application in non-intrusive access control is discussed.

Decision Trees and Applications in Pattern Recognition and Intrusion Detection

  • Xiao-Bai Li, "A scalable decision tree system and its application in pattern recognition and intrusion detection", ACM Digital Library, Decision Support Systems, Volume 41, pg. 112-130, 2005.

Abstract: The publication discusses one of the major problems of data mining dealing with very large datasets. It is very hard to develop scalable algorithms to solve these datasets, since their sizes exceed the maximum memory capacity of a computer. The author suggests using a new decision tree algorithm called SURPASS, Scaling Up Recursive Partitioning with Sufficient Statistics, which is very effectively when dealing with these types of datasets. The SURPASS technique uses linear discriminants to partition the decision tree recursively. This algorithm represents the data needed to create the decision tree into a group of statistics about the collection. The algorithm will then gather the data sequentially in small segments, and operates on these subsets in main memory one at a time. Consequently, the algorithm is able to achieve higher efficiency in working with very large collections. The paper presents an experiment which operates on 3 sets of pattern recognition and intrusion detection. The author claims that SURPASS is a scalable algorithm when used with large sets, and this produces an efficient and accurate decision tree.

The seminal paper of S. Johnson discussed on class (04/10)

  • S. Johnson, "Hierarchical clustering schemes", Psychometrika, vol. 32, No. 3, pp. 241-254, 1967.

Abstract: Abstract Techniques for partitioning objects into optimally homogeneous groups on the basis of empirical measures of similarity among those objects have received increasing attention in several different fields. This paper develops a useful correspondence between any hierarchical system of such clusters, and a particular type of distance measure. The correspondence gives rise to two methods of clustering that are computationally rapid and invariant under monotonic transformations of the data. In an explicitly defined sense, one method forms clusters that are optimally connected, while the other forms clusters that are optimally compact.

The paper by Benjamin King describing a step-wise procedure for the clustering (class 04/10)

  • B. King, "Step-Wise Clustering Procedures", Journal of the American Statistical Association, vol. 62, No. 317, pp. 86-101, 1967.

Abstract: A simple step-wise procedure for the clustering of variables is described. Two alternative criteria for the merger of groups at each pass are discussed: (1) maximization of the pairwise correlation between the centroids of two groups, and (2) minimization of Wilks' statistic to test the hypothesis of independence between two groups. For a set of sample covariance matrices the step-wise solution for each criterion is compared with the optimal two-group separation of variables found by total enumeration of the possible groupings.

Alumni Liaison

Ph.D. 2007, working on developing cool imaging technologies for digital cameras, camera phones, and video surveillance cameras.

Buyue Zhang