Glossary for "Decision Theory" (ECE662)

If you want to add an entry to the glossary, create the article for the topic and enter the information there. Include the term here on this page by providing a reference to include that article, or instead write a brief description. Follow the example for the other entries with sqare brackets that link to the topic's article and double equal signs.

Bayes' decision rule creates an objective function which minimizes the probability of error (misclassification). This method assumes a known distribution for the feature vectors given each class.

Bayesian Parameter Estimation

Bayesian Parameter Estimation is a technique for parameter estimation which uses probability densities as estimates of the parameters instead of exact deterministic ones, as MLE_Old Kiwi does.

Central Limit Theorem

The CLT explains why many distributions tend to be close to the normal distribution. The important point is that the "random variable" being observed should be the sum or mean of many independent random variables. (variables need not be iid)(See the PROOF )

Classes

The categories into which we are trying to classify the patterns.

Classification

Assigning a class to a measurement, or equivalently, identifying the probabilistic source of a measurement. The only statistical model that is needed is the conditional model of the class variable given the measurement. This conditional model can be obtained from a joint model or it can be learned directly. The former approach is generative since it models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models. The latter approach is discriminative since it focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbor. See "Discriminative and Learning".

Clustering

Grouping similar objects in a multidimensional space. It is useful for constructing new features which are abstractions of the existing features. Some algorithms, like k-means, simply partition the feature space. Other algorithms, like single-link agglomeration, create nested partitionings which form a taxonomy. Another possibility is to learn a graph structure between the clusters, as in the Growing Neural Gas. The quality of the clustering depends crucially on the distance metric in the space. Most techniques are very sensitive to irrelevant features, so they should be combined with feature selection.

References

Consistent Estimator

An unbiased estimator that converges more closely to the true value as the sample size increases is called a consistent estimator.

Curse of Dimensionality

Refers to the problem caused by exponential growth of hypervolume as a function of dimensionality. This term was coined by Richard Bellman in 1961.

Decision Tree

A decision tree is a classifier that maps from the observation about an item to the conclusion about its target value. It is also realized to be deterministic ancestor of the hierarchical mixture of experts. Decison tree use can be traced back from problems that require probabilistic inference which is a core technology in AI, largely due to developments in graph-theoretic methods for the representation and manipulation of complex probability distributions. Whether in their guise as directed graphs (Bayesian networks) or as undirected graphs (Markov random fields), probabilistic decision trees/models have a number of virtues as representations of uncertainty and as inference engines. The foundation formed by decision trees forms the basis for Graphical Models.(see Graphical Models_Old Kiwi).

References

1.Decision Tree Tutorial

2.DM Tutorial-Decision Trees

Discriminant Analysis

Constructing new features via linear combination so that classification is easier. The notion of "easier" is quantified by Fisher's criterion, which applies exactly to Gaussian classes with equal variance and approximately to other models. Variants like Flexible discriminant analysis consider nonlinear combinations as well. See Duda&Hart, "Eigenfaces vs. Fisherfaces", and "Flexible Discriminant and Mixture Models".

Discriminant Function

One way to represent classifiers. Given classes w1, w2, .., wk and feature vector x, which is in n-dimensional space, the discriminant functions g1(x), g2(x), .., gk(x) where g#(x) maps n-dimensional space to real numbers, are used to make decisions. Decisions are defined as w# if g#(x) >= gj(x) for all j.

Distance

See Metric_Old Kiwi

Drawbacks of Euclidean distance

The drawback of the E.D approach is that it assumes that the sample points are distributed about the sample mean in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of a 'test point' belonging to the set to depend not only on the distance from the sample mean but also on the DIRECTION.

Dual Problem

Every linear programming problem, referred to as a primal problem, can be converted in a dual problem, which provides an upper bound to the optimal value of the primal problem. In matrix form, we express the primal problem as:

Maximize $$ c^Tx $$

Subject to $Ax \geq b, x\geq 0$

The corresponding dual problem is:

Minimize $$ b^Ty $$

Subject to $A^Ty\geq c, y\geq 0$

Where y is used instead of x as the variable vector.

There are two ideas fundamental to duality theory. One is the fact that the dual of a dual linear program is the original primal linear program. Additionally, every feasible solution for a linear program gives a bound on the optimal value of the objective function of its dual.

Article on Dual Problems

Euclidean Distance (ED)

Is the distance between two points in a cartesian coordinate system that is "measured by a ruler". The Euclidean distance between the points X=[x1,x2,...xn] and Y=[y1,y2,...yn] is defined as $dist(X,Y)=\sqrt{\sum_{i=1}^n{(x_i-y_i)^2}}$ .

Expectation Maximization

Expectation Maximization is a popular clustering method. Particularly, If the data to be clustered consists of Gaussian clusters, it starts with randomly initialized means and covariances for the clusters, then using maximum likelihood approach, it tries to find the actual means and the covariances of the clusters.

Evidence

It a scale factor that insures that the posterior probabilities sum to one.

Feature Vector

Vector containing a numerical representation of features.

Features

The measurements which represent the data. The statistical model one uses is crucially dependent on the choice of features. Hence it is useful to consider alternative representations of the same measurements (i.e. different features). For example, different representations of the color values in an image. General techniques for finding new representations include discriminant analysis, principal component analysis, and clustering.

Fisher Linear Discriminant

Fisher's linear discriminant is a classification method that projects high-dimensional data onto a line and performs classification in this one-dimensional space. The projection maximizes the distance between the means of the two classes while minimizing the variance within each class. See Lecture 10 for detailed explanation.

The Lecture and Example of Fisher Linear Discriminant

Generalized Rayleigh Quotient

$J(\vec{w}) = \frac{\vec{w}^{\top}S_B \vec{w}}{\vec{w}^{\top}S_w \vec{w}}$

Generic Property

In mathematics, properties that hold for typical examples are called generic properties. For instance, a generic property of a class of functions is one that is true of most of those functions, as in the statements, " A generic polynomial does not have a root at zero," or "A generic matrix is invertible." As another example, a generic property of a space is a property that holds at most points of the space, as in the statementm, "if f: M->N is a smooth function between smooth manifolds, then a generic point of N is not a critical value of f" (This is by Sard's theorem.)

Graphical Models

Graphical models allow a separation between qualitative, structural aspects of uncertain knowledge and the quantitative, parametric aspects of uncertainty—the former represented via patterns of edges in the graph and the latter represented as numerical values associated with subsets of nodes in the graph. This separation is often found to be natural by domain experts, taming some of the problems associated with structuring, interpreting, and troubleshooting the model. Even more importantly, the graph-theoretic framework has allowed for the development of general inference algorithms, which in many cases provide orders of magnitude speedups over brute-force methods.(see Decision trees_Old Kiwi.

Hilbert Space

A Hilbert space named after the German mathematician David Hilbert, generalizes the notion of Euclidean space in a way that extends methods of vector algebra from the two-dimensional plane and three-dimensional space to infinite-dimensional space. A Hilbert space is an inner product space which is complete under the induced metric. A Hilbert space is a special class of Banach space in the norm induced by the inner product, since the norm and the inner product both induce the same metric. Any finite-dimensional inner product space is a Hilbert space, but it is worth mentioning that some authors require the space to be infinite dimensional for it to be called a Hilbert space.

References

1.Hilbert space definition and proposition

2.Wikepedia-Hilbert Space

3.Mathworld-Hilbert Space

Histogram Density Estimation

Histogram Density Estimation is one of the primitive and easiest non-parametric density estimation methods. The given feature space is divided into equally-spaced bins or cells. The number of training feature samples that fall into each category is computed, (i.e) if ' $$ n_i $$ ' is the number of feature samples that fall into the bin 'i', and 'V' is the volume of each cell (the volume of all the cells are equal because they are equi-spaced), the density is given by $$ p(x) = n_i/V $$ .

Impurity

Impurity of a class is defined as the amount of data misclassified into that class. It is zero when all training data belongs to one class. See Lecture 21 - Decision Trees(Continued)_Old Kiwi for the definitions of different types of impurities

Informative Prior

In the Bayesian framework, the prior distribution of the variable is called an informative prior if it provide some specific information about the variable. Informative priors allows to incorporate a prior somewhat defined knowledge about the random variable under consideration. (See also: Uninformative Prior and Prior)

K-means

A parametric algorithm for clustering data into exactly k clusters. First, define some initial cluster parameters. Second, assign data points to clusters. Third, recompute better cluster parameters, given the data assignment. Iterate back to step two. It is related to the Expectation-Maximization algorithm for fitting a mixture of Gaussians.

Kernel Functions

These functions operate in the feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the mappings of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. Kernel functions have been introduced for sequence data, text, images, as well as vectors. (REF: wiki)

KNN-K Nearest Neighbor

The classifiers do not use any model to fit the data and only based on memory. The KNN uses neighborhood classification as the predication value of the new query. It has advantages - nonparametric architecture, simple and powerful, requires no traning time, but it also has disadvantage - memory intensive, classification and estimation are slow. Please refer to KNN tutorial website.

1. KNN Tutorial : Contents are below / How K-Nearest Neighbor (KNN) Algorithm works? / Numerical Example (hand computation) / KNN for Smoothing and Prediction / How do we use the spreadsheet for KNN? / Strength and Weakness of K-Nearest Neighbor Algorithm / Resources for K Nearest Neighbors Algorithm

2. KNN

3. Class Prediction using KNN

4. WIKIPEDIA

Kruskal's Algorithm

Kruskal's algorithm is an algorithm of graph theory that constructs the minimum spanning tree of a graph by adding edges to the spanning tree one-by-one. At all points during its execution the set of edges selected by Kruskal's algorithm forms a forest of trees. The steps in this algorithm can be seen here http://students.ceid.upatras.gr/~papagel/project/pseukrus.htm

Kruskal's algorithm is conceptually quite simple. The edges are selected and added to the spanning tree in increasing order of their weights. An edge is added to the tree only if it does not create a cycle.

Lagrange Multipliers

In mathematical optimization problems, the method of Lagrange multipliers, named after Joseph Louis Lagrange, is a method for finding the extrema of a function of several variables subject to one or more constraints; it is the basic tool in nonlinear constrained optimization. Simply put, the technique is able to determine where on a particular set of points (such as a circle, sphere, or plane) a particular function is the smallest (or largest).

Likelihood Principle

The likelihood principle is a principle of statistical inference which asserts that all of the information in a sample is contained in the likelihood function. A likelihood function arises from a conditional probability distribution considered as a function of its second argument, holding the first fixed. Eg: consider a model which gives the probability density function of observable random variable X as a function of parameter O. Then for a specific value x of X, the function L(O|x) = P(X=x|O) is a likelihood function of O. It gives a measure of how likely any particular value of O is, if we know that X has a value x. Two likelihood functions are equivalent if one is a scalar multiple of the other.

Linear Discriminant Functions (LDF)

Functions that are linear combinations of x.

$$ g(x) = w^t x + w_0 $$ Where $$ w $$ is the weight vector and $$ w_0 $$ is the bias as threshold. In the two category case $$ g(x) $$ defines the decisions surface where: $$ w_1 $$ , when $$ g(x) > 0 $$ or $$ w_2 $$ , when $$ g(x) < 0 $$

LUT - Look-Up Table

In classification applications, LUT's can be used for comparing decisions, as long as memory is available. This method requires a discrete feature space that is reasonably small. (This method is typically not mentioned in textbooks.)

Mahalanobis Distance

Differs from Euclidean Distance in that it takes into account the correlations in the data (by using the covariance matrix). Mahalanobis distance is widely used in cluster analysis and other classification techniques. (Ref - Wikipedia)

Manhattan Distance

Also known as taxicab metric. The Manhattan distance between two points (X,Y) in a cartesian system is defined as $dist(X,Y)=\sum_{i=1}^n{|x_i-y_i|}$ . This is equal to the length of paths connecting X and Y in a combination across all dimensions.

MAP or Maximum A Posteriori

A parameter estimation heuristic that seeks parameter values that maximize the posterior density of the parameter. This implies that you have a prior distribution over the parameter, i.e. that you are Bayesian. MAP estimation has the highest chance of getting the parameter exactly right. But for predicting future data, it can be worse than Maximum Likelihood; predictive estimation is a better Bayesian method for that purpose. MAP is also not invariant to reparameterization.

Mercer's Condition Valid kernel functions are sometimes more precisely referred to as Mercer kernels , because they must satisfy Mercer's condition. for any given $g(\vec{x})$ such that $\int{g(\vec{x})^2d \vec{x}}$ is finite,

$\int{K(\vec{x},\vec{z})g(\vec{x})g(\vec{z})d\vec{x}d\vec{z}} \geq 0$

We say that $K(\vec{x},\vec{z})$ satisfies Mercer's condition if the above equation must hold for every $g(\vec{x})$ . A kernel function $K(\vec{x},\vec{z})$ must be continuous, symmetric, and have a positive definite gram matrix. Such a $K(\vec{x},\vec{z})$ means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function $K(\vec{x},\vec{z})$ .

Reference:

Wikipedia's Mercer's Condition[1]

Metric

Also known as distance. Is a function that defines the distance between two elements of a set. Examples of metrics are the Manhattan, Euclidean and, Mahalanobis.

Minimum Spanning Trees

Given a connected, undirected graph, a spanning tree is a graph that is a tree and connects all the vertices together. There may be many different spanning trees for a graph. One of them can be selected based on the weights of the edges. A minimum spanning tree is a spanning tree with weight less than or equal to the weight of every other spanning tree. A maximum spanning tree, in a similar way, is the spanning tree with the sum of the weights higher than or equal to the other possible trees. A minimum spanning tree can be determined using Prim's or Kruskal's algorithms. Also see Prim's Algorithm_Old Kiwi and Kruskal's Algorithm_Old Kiwi

Minkowski Metric

The k-Minkowski metric between two points $$ P_1 = (x_1,x_2,...,x_n) $$ and $$ P_2 = (y_1,y_2,...,y_n) $$ is defined as $d_k = (\sum_{i=1}^n \parallel x_i-y_i \parallel ^k)^{\frac{1}{k}}$ . Manhattan metric (k=1) and Euclidean metric(k=2) are the special cases of Minkowski metric. Here $$ P_1 $$ and $$ P_2 $$ are the points corresponding to feature vectors.

Mixture Models

These models use joint distributions over latent variables. By using observed variables, they try to create complicated joint distributions. Mixture Models can be used to cluster datasets.

MLE

See Maximum Likelihood Estimation

Nonparametric regression/density estimation

An approach to regression/density estimation that doesn't require much prior knowledge but only a large amount of data. This includes histograms, kernel smoothing, and nearest-neighbor.

Non-parametric Model

Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term nonparametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.

Overfitting_Old Kiwi

In statistics, overfitting means that some of the relationships that appear statistically significant are actually just noise. A model with overfitting has much more freedom degrees than the data that we have. Consequently, this model doesn't replicate well and does a lousy job when predicting future responses.

In machine learning, overfitting training occurs when the training is too long or the training set is rare. As a result, the learner may adjust to very specific random features that has no causal relation to the target function. This may lead to the case when the training error is reducing while the testing is increasing. The process of applying a technique to avoid overfitting is referred to as Regularization_Old Kiwi.

[Parameter Estimation

Density estimation when the density is assumed to be in a specific parametric family. Special cases include maximum likelihood, maximum a posteriori, unbiased estimation, and predictive estimation.

Parametric Classifiers

We find parametric decision boundaries to approximate true decision boundaries between classes. (Discussed in Lecture 9 - Linear Discriminant Functions_Old Kiwi)

Parametric Model

A parametric model is a set of related mathematical equations in which alternative scenarios are defined by changing the assumed values of a set of fixed coefficients (parameters). In statistics, a parametric model is a parametrized family of probability distributions, one of which is presumed to describe the way a population is distributed.

Parzen Window

Parzen windows are very similar to K nearest neighborhoods(KNN). Both methods can generate very complex decision boundaries. The main difference is that instead of looking at the k closest points to a piece of training data, all points within a fixed distance are considered. In practice, one difference is that datasets with large gaps get treated much different. kNN will pick far away points, but it is possible that relatively small Parzen Windows will actually enclose zero points. In this case, only the priors can be used to classify. Unfortunately, this dataset had many holes in it at the fringes Thhe Parzen-window density estimate using n training samples and the window function tex: \pi is defined by

$p_n(x) = 1/n \sum_{i=1}^{n} 1/V_n \pi({x-x_i}/h_n)$

The estimate $$ p_n(x) $$ is an average of (window) functions. Usually the window function has its maximum at the origin and its values become smaller when we move further away from the origin. Then each training sample is contributing to the estimate in accordance with its distance from x.

References

1. Parzen-window

2. Parzen-window density estimation

3. Parzen-window density estimation

Partial Differential Equations (PDE)

PDE's are differentiable equations of several independent variables in which can be differentiated with respect to those variables. In this course PDE's are used to find cluster boundaries.

Links:

Lecture 27

Wikipedia

Mathworld

Patterns

The items that we are trying to classify are vectors of these features.

Penalty Methods

In optimization, penalty methods are used to reformulate a constraint optimization problem into several unconstrained optimization problems. It can be shown that the solutions of these derived unconstrained optimization problems will converge to the solution of the original constrained problem. For example, a minimization problem with inequality constraints can be recast as as unconstrained problems by adding a penalty term which value is a function of the original constraints. Except for minor variations penalty methods are very similar to the so-called Barrier methods.

Perceptron

Maps an input to a single binary output value. (First introduced in Lecture 9)

Posterior

The probability, using prior knowledge, that a case belongs to a group.

Prim's Algorithm

Prim's algorithm is a graph algorithm to determine a minimum spanning tree for a connected weighted graph. This means it finds a subset of the edges that forms a tree including every vertex, in which the sum of weights of all the edges in the tree is minimized. The steps in this algorithm can be seen here http://students.ceid.upatras.gr/~papagel/project/pseuprim.htm

Principal Component Analysis

Constructing new features which are the principal components of a data set. The principal components are random variables of maximal variance constructed from linear combinations of the input features. Equivalently, they are the projections onto the principal component axes, which are lines that minimize the average squared distance to each point in the data set. To ensure uniqueness, all of the principal component axes must be orthogonal. PCA is a maximum-likelihood technique for linear regression in the presence of Gaussian noise on both x and y. In some cases, PCA corresponds to a Fourier transform, such as the DCT used in JPEG image compression.

Prior

The probability that adjusts the classification of cases by using known information, hence the name prior. (See Also: Informative Prior and Uninformative Prior)

Pruning

Pruning (also called as inverse splitting) is a term in mathematics and informatics which allows to cut parts of a decision tree. Pruning is done by taking two leaves that have a common parent and merging them if it helps to improve the classification accuracy of decision tree. Pruned parts of the tree are no longer considered because the algorithm knows based on already collected data (e.g. through sampling) that these subtrees do not contain the searched object. The decision tree is simplified by removing some decisions. Pruning increases generalization.

Quadratic Programming Problem

A linearly constrained optimization problem with a quadratic objective function is called a quadratic program (QP). Quadratic programming is often considered as a discipline in and of itself owing to its vast number of applications.

Regularization

Regularization_Old Kiwi refers to a set of techniques used to avoid overfitting.

Reinforcement learning

A reinforcement learning machine learns to perform actions which will maximize rewards (or minimize punishments).

ROC Curves

The Receiver Operating Characteristic (ROC) Curve is a commonly-used technique for showing the trade-off between false-positives and false-negatives as a threshold is varied.

Slack Variable

In linear programming , a slack variable is referred to as an additional variable that has been introduced to the optimization problem to turn a inequality constraint into an equality constraint. This procedure is often performed to formulate linear optimization problems into a form which can be efficiently solve using a fast algorithm: the simplex algorithm. As a result a slack variable is always positive since this is a requirement for variables in the simplex method.

Supervised learning

Imagine an organism or machine which experiences a series of inputs from different sensors: x1, x2, x3, x4, . . . The machine is also given desired outputs y1, y2, . . ., and its goal is to learn to produce the correct output given a new input x. (See also: Unsupervised learning_Old Kiwi)

Support Vector Machines or SVM

Algorithms which define optimal hyperplanes as the hyperplane whose distance from the nearest training patters (the support vectors) is maximum. http://en.wikipedia.org/wiki/Support_vector_machine

ECE662:Glossary Old Kiwi - Rhea

Glossary for "Decision Theory" (ECE662)

Contents

MAP or Maximum A Posteriori

Support Vector Machines or SVM

Alumni Liaison