Difference between revisions of "Curse of Dimensionality" - Rhea

Latest revision as of 10:41, 22 January 2015

Curse of Dimensionality

A slecture by ECE student Haonan Yu

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

Needle in a Haystack

You lost your needle in a haystack. If the world you live in has just one dimension, then you are lucky because the haystack is a line segment. Finding the needle only requires you to start from one end of the segment, heading towards the other end until hitting the needle:

Not bad, right? If you are a 2D creature, a needle in a haystack now becomes a small trouble because it can be any point in a square:

There is much more space for the possible location: within the haystack, the needle can move along either the $$ x $$ -axis or the $$ y $$ -axis. Things get worse and worse as the number of dimensions grows. You can't imagine how alien high-dimensional creatures would look for something missing in their world because even in three dimensions, this task turns out to be frustrating (from everyday life experience):

The difficulty of finding the needle starting from a certain point in a high-dimensional haystack implies an import fact: the points tend to distribute more sparsely (i.e., they are more dissimilar to each other) in high-dimensional space, because the Euclidean distance between two points is large as long as they are far apart in the direction of at least one dimension, even though they may be quite close in the direction of all the other dimensions. This fact can be further illustrated quantitatively by the following example. In $$ N $$ -dimensional space, suppose we want to count all the neighbors around a central point within a hypercube with the edge length equal to 1. We also consider a bigger hypercube around the central point with the edge length equal to $1+2\epsilon$ , i.e., the smaller hypercube is nested inside the bigger one. As a special case, in 3D space we have the following scenario:

Now if by assumption the points are evenly distributed in the space, then the ratio of the number of the points in the smaller hypercube to the number of the points in the bigger hypercube is given by

  $(\frac{1}{1+2\epsilon})^N$

If we take $\epsilon=0.01$ , then for $$ N= $$ 1,2,3,5,10,100, and 1000 we have the ratio equal to 0.98, 0.96, 0.94, 0.91, 0.82, 0.14, and $2.51\text{e}^{-09}$ . It can be seen that when $$ N >= 10 $$ , the neighbors around a given point become so few that nearly all points are located inside the $\epsilon$ -shell around the smaller hypercube. To maintain a certain number of neighbors, exponentially many new points should be added into the space.

Curse of Dimensionality

Math theorems can often generalize from low dimensions to higher dimensions. If you know how to compute gradients for a function with two variables, then you probably can also figure out how to do that for multi-variable functions. However, this is not the case when you try to generalize your two-dimensional density estimator to hundreds or thousands of dimensions. Theoretically, your estimator should still work in high dimension. But that requires you to collect an exponential number of training data to avoid sparsity. In practice, the available training data are often quite limited.

Your two-dimensional estimator stops working in high dimension because the limited training data in your hand are not statistically significant due to their sparsity, i.e., there is not enough information for you to recover the true underlying distribution. The concept of statistical significance can be illustrated in the one dimensional case as follows. Suppose you are drawing some points from a normal distribution and trying to estimate its parameters. It makes a big difference between if you draw just 10 points and if you draw 100 points from the distribution.

In the above, the first one is the true underlying normal distribution. The second one is being estimated from 10 points. The third one is being estimated from 100 points. Insufficient training points result in a poor estimation.

The curse of dimensionality describes a scenario in which you can't possibly obtain an exponential number of training points in high dimensional space which is just like that somehow after being cursed in the 1D world you can't easily obtain over 10 training points in the above example, thus resulting in high difficulty of reliably estimating density function. This difficulty also arises in seeking a decision boundary between classes using a discriminative classifier. You never know whether the boundary drawn to separate the training points just barely separates some special training samples from the whole population or actually generalizes well to other data. The lack of ability to reliably estimate and generalize is called overfitting. Even with enough training data in low dimensional space, overfitting can still exist due to various reasons (e.g., high model complexity, incorrect collection of training data, etc). However, in high dimensional space, overfitting is more likely to happen because of the lack of training data.

There are several ways to break the curse:

feature extraction: Most people don't realize that they are actually dealing with the curse of dimensionality in their daily work. No one will directly use all the pixel intensities in an image as the input to their face recognition algorithm. It's simply because there is too much redundant information in that. For example, if you have a 100x100 image, then the raw vector of intensities has a length of 10000! However, if you use the edge pixels or corner points extracted from the image, the dimensionality can be reduced to as small as hundreds or even dozens, with only a small amount of information lost. Because of this reason, lots of people in machine learning and statistics work on the extraction/selection of features. Sometimes it requires prior knowledge to decide on good features.

dimensionality reduction analysis: People came up with statistical methods to automatically reduce the dimensionality. Depending on different criteria, these methods usually have very different reduction results. If the goal is to represent the original high dimensional data in low dimensional space losing as little information as possible (signal representation), then the components that have the largest variance in feature values within classes are kept, which is known as Principle component analysis (PCA). If the goal of reduction is to discriminate different classes in low dimensional space, then the components containing feature values that have the best separation across classes are kept, which is known as Linear discriminative analysis (LDA). The following figure (borrowed from reference 2) shows the difference between the two methods.

kernel: The needle in a haystack example shows that points tend to be outliers in high dimensional space. Dealing with those points directly is lots of pain. Kernel trick used by support vector machines (SVMs) implicitly maps data to much higher or even infinite dimensions but without ever explicitly representing the high dimensional features. The mapping makes the data that are not linearly separable in low dimensional space linearly separable in high dimensional space (which is also why people want to use many different features). While the classifier learns the decision boundary in a very high dimensional feature space, it actually deals with low dimensionality.

None of the above spells is a complete cure for the curse of dimensionality. There are still many open problems on this topic.

References

Questions and comments

If you have any questions, comments, etc. please post them on this page.

Back to ECE662, Spring 2014

@@ Line 6: / Line 6: @@
 <center><font size= 4>
-维数灾难
+Curse of Dimensionality
 </font size>
-[https://www.projectrhea.org/learning/slectures.php 学生课件] by [[ECE]] 学生 Haonan Yu
+A [http://www.projectrhea.org/learning/slectures.php slecture] by [[ECE]] student Haonan Yu
-部分取材于 [[2014_Spring_ECE_662_Boutin|ECE662 Spring 2014 lecture]] [[user:mboutin|Mireille Boutin]]教授的课堂讲义
+Partly based on the [[2014_Spring_ECE_662_Boutin_Statistical_Pattern_recognition_slectures|ECE662 Spring 2014 lecture]] material of [[user:mboutin|Prof. Mireille Boutin]].
 </center>
+[[Curse_of_Dimensionality_Chinese|Version of this page in Chinese]]
 ----
-== 稻草里堆捞针 ==
+== Needle in a Haystack ==
-你不小心把针掉到了一个稻草堆里。如果你生活在一个一维世界里，在稻草堆里找到这枚针是一件很容易的事情。因为如果针是一个点，那么稻草堆只能是一条线段。所以你可以从线段的一端开始，朝另一端出发，不停寻找直到在线段某点看到那枚针。
+You lost your needle in a haystack. If the world you live in has just one dimension, then you are lucky because the haystack is a line segment. Finding the needle only requires you to start from one end of the segment, heading towards the other end until hitting the needle:
 <center>
@@ Line 21: / Line 23: @@
 </center>
-还不算太难，对吧？当然，如果你是一个二维生物，那么这个问题就不是那么简单了，因为现在你要在一个正方形里寻找一个点。
+Not bad, right? If you are a 2D creature, a needle in a haystack now becomes a small trouble because it can be any point in a square:
 <center>
@@ Line 27: / Line 29: @@
 </center>
-这个正方形里有更多的空间：那枚针可以沿着<math>x</math>轴或者<math>y</math>轴移动。在日常生活中的三维，针变成了正方体里的一个点。
+There is much more space for the possible location: within the haystack, the needle can move along either the <math>x</math>-axis or the <math>y</math>-axis. Things get worse and worse as the number of dimensions grows. You can't imagine how alien high-dimensional creatures would look for something missing in their world because even in three dimensions, this task turns out to be frustrating (from everyday life experience):
 <center>
@@ Line 33: / Line 35: @@
 </center>
-这个找针的问题会随着维数的增加变得越来越难。你或许都已经想象不到假如有生活在超高维度的外星生物存在于宇宙某个角落，当他们丢了东西时，应该怎么寻找？
+The difficulty of finding the needle starting from a certain point in a high-dimensional haystack implies an import fact: '''the points tend to distribute more sparsely (''i.e.'', they are more dissimilar to each other) in high-dimensional space''', because the Euclidean distance between two points is large as long as they are far apart in the direction of at least one dimension, even though they may be quite close in the direction of all the other dimensions. This fact can be further illustrated quantitatively by the following example. In <math>N</math>-dimensional space, suppose we want to count all the neighbors around a central point within a hypercube with the edge length equal to 1. We also consider a bigger hypercube around the central point with the edge length equal to <math>1+2\epsilon</math>, ''i.e.'', the smaller hypercube is nested inside the bigger one. As a special case, in 3D space we have the following scenario:
-在高维空间里的稻草堆某一点出发试图找到一枚丢失的针的超高难度说明了一个重要的事实，那就是点在空间中的分布变得越来越离散（也就是说点和点之间的不相似性越来越倾向于变大）。这个结论也很容易从欧拉距离的定义看出来：两个点之间的欧拉距离足够大只要这两个点在至少一个维度方向的差别足够大就行了，虽然不排除这两个点有可能在其他维度方向上十分相似甚至完全相同。我们还可以根据下面这个例子来定量地证明这个结论。在<math>N</math>维空间中，假设我们想在一个边长为1的超正方体中计数某个处于超正方体正中的一个点周围的邻居点个数。同时我们考虑嵌套在这个超正方体外面的一个更大的变长为<math>1+2\epsilon</math>的超正方体。作为一个特殊例子，在三维空间里我们有：
 <center>
@@ Line 41: / Line 41: @@
 </center>
-再假设空间中所有的点都是均匀分布的，那么位于里面那个小超正方体中的点的个数和位于外面那个大超正方体中的点的个数的比值可以表示为:
+Now if by assumption the points are evenly distributed in the space, then the ratio of the number of the points in the smaller hypercube to the number of the points in the bigger hypercube is given by
 <center>
@@ Line 47: / Line 47: @@
 </center>
-取<math> \epsilon=0.01 </math>，那么对于<math>N=</math>1,2,3,5,10,100和1000，这个比值分别是0.98,0.96,0.94,0.91,0.82,0.14和<math>2.51\text{e}^{-09}</math>。可以看出当<math>N >= 10</math>的时候，这个点周围已经基本上没有邻居在较小的超正方体里了，大部分的点都集中在了围绕它的那个厚度<math>\epsilon</math>的“壳”里。如果要想继续保持一定数量的邻居点，那么所有点的个数必须成指数增加。
+If we take <math> \epsilon=0.01 </math>, then for <math>N=</math>1,2,3,5,10,100, and 1000 we have the ratio equal to 0.98, 0.96, 0.94, 0.91, 0.82, 0.14, and <math>2.51\text{e}^{-09}</math>. It can be seen that when <math>N >= 10</math>, the neighbors around a given point become so few that nearly all points are located inside the ''<math>\epsilon</math>-shell'' around the smaller hypercube. To maintain a certain number of neighbors, ''exponentially'' many new points should be added into the space.
-== 维数灾难 ==
+== Curse of Dimensionality ==
-数学定理通常都能很轻易地从低维一般化到高维。这就是为什么如果你懂得如何计算两个变量的函数的梯度导数，你基本上就知道怎么计算含有成千上万个变量的函数的梯度。但是在机器统计学习里，你会发现你高高兴兴地写了一个拟合二维概率密度函数的算法，却发现基本上不能直接用在高维空间中。理论上来说，你的算法确确实实经过维数的小改动可以原封不动地拿到高维里去用，但是前提条件是你必须要有足够的指数增长的训练数据。然而实际情况中，训练数据通常有限，训练点在空间中分布稀疏。
+Math theorems can often generalize from low dimensions to higher dimensions. If you know how to compute gradients for a function with two variables, then you probably can also figure out how to do that for multi-variable functions. However, this is not the case when you try to generalize your two-dimensional density estimator to hundreds or thousands of dimensions. Theoretically, your estimator should still work in high dimension. But that requires you to collect an exponential number of training data to avoid ''sparsity''. In practice, the available training data are often quite limited.
-这些有限的训练数据不能很好地让你的算法工作的原因不是因为你算法出了理论上的问题，而是因为他们因为太稀疏从而根本就不统计显著。也就是说，训练数据提供给你算法的信息太少了所以很难估计到正确的概率模型。统计显著这个概率可以进一步用下面一个一维空间中的简单例子来说明。假如你从一个正态分布中抽取一定数量的点来估计这个分布的均值和方差。这个分布如下：
-<center>
-   [[Image:true.png|320px|the true distribution]]
-</center>
-如果你只抽取了10个点得到的结果：
-<center>
-[[Image:insignificant.png|320px|estimated from 10 points]]
-</center>
-如果你抽取了100个点：
+Your two-dimensional estimator stops working in high dimension because the limited training data in your hand are not '''statistically significant''' due to their sparsity, ''i.e.'', there is not enough information for you to recover the true underlying distribution. The concept of statistical significance can be illustrated in the one dimensional case as follows. Suppose you are drawing some points from a normal distribution and trying to estimate its parameters. It makes a big difference between if you draw just 10 points and if you draw 100 points from the distribution.
 <center>
-[[Image:significant.png|320px|estimated from 100 points]]
+[[Image:true.png|320px|the true distribution]] [[Image:insignificant.png|320px|estimated from 10 points]] [[Image:significant.png|320px|estimated from 100 points]]
 </center>
-可以看出抽取10个点和抽取100个点相差非常大。因为在10个点的情况下，那10个点可能并不能很好地代表整个空间中点的分布情况。所以你的算法结果只取决于抽签运气好不好了。
+In the above, the first one is the true underlying normal distribution. The second one is being estimated from 10 points. The third one is being estimated from 100 points. Insufficient training points result in a poor estimation.
-维数灾难指的就是因为正常情况下在高维空间中获取足够训练数据基本上不可能所以很难准确估计出高维空间中的概率密度函数。用上面一维的例子来进行类比，就是假如你突然中了诅咒，每次用来训练的数据量不能超过10个，否则你的电脑就会死机。即使你不估计概率密度函数，而是直接通过分类器来寻找不同类别之间的决策边界，你根本就没办法知道到底在有限的训练集上得到的决策边界仅仅是能够将这些特殊的不能代表整个空间中点分布的数据区分开还是能够一般化到整个数据空间。这种不能准确地估计并且一般化到未知数据的情况被称为“过拟合”。即使在低维空间中有足够的训练数据，过拟合还是会发生（比如因为过度复杂的概率模型，不科学的数据采集方法等）。但是过拟合在高维空间中更容易发生因为通常都没有足够的训练数据。
+The '''curse of dimensionality''' describes a scenario in which you can't possibly obtain an exponential number of training points in high dimensional space which is just like that ''somehow after being cursed in the 1D world you can't easily obtain over 10 training points in the above example'', thus resulting in high difficulty of reliably estimating density function. This difficulty also arises in seeking a decision boundary between classes using a discriminative classifier. You never know whether the boundary drawn to separate the training points just barely separates some special training samples from the whole population or actually generalizes well to other data. The lack of ability to reliably estimate and generalize is called '''overfitting'''. Even with enough training data in low dimensional space, overfitting can still exist due to various reasons (''e.g.'', high model complexity, incorrect collection of training data, ''etc''). However, in high dimensional space, overfitting is more likely to happen because of the lack of training data.
-一般来说有几种缓解维数灾难的方法：
+There are several ways to break the curse:
-. 特征提取。其实很多人在没听说过维数灾难这个概念之前就已经在解决这个问题了！比如说你想识别一个图像中的人脸。一般人稍微有点儿经验的就不会把全部像素的灰度当成一个巨大的向量拿去给算法当成输入来运行，因为这个向量中参杂的冗余信息太多了。举个例子，加入你有一张100x100的图像，如果不提取特征的话用来进行分类的灰度向量有10000这么长！但是，如果你能运用从图像中提取的边缘，角点信息，就可以把维数降到几百甚至几十，同时只丢掉少量的信息。这种方式的特征选择与提取通常要求一些先验知识，这些知识可以从其他学科而来，比如脑科学，视觉心理学等等。
+* ''feature extraction'': Most people don't realize that they are actually dealing with the curse of dimensionality in their daily work. No one will directly use all the pixel intensities in an image as the input to their face recognition algorithm. It's simply because there is too much redundant information in that. For example, if you have a 100x100 image, then the raw vector of intensities has a length of 10000! However, if you use the edge pixels or corner points extracted from the image, the dimensionality can be reduced to as small as hundreds or even dozens, with only a small amount of information lost. Because of this reason, lots of people in machine learning and statistics work on the extraction/selection of features. Sometimes it requires prior knowledge to decide on good features.
-. 简化维数。人们提出一些统计方法来自动简化维数。根据需求的不同，简化的结果通常区别较大。如果目标是将原来高维空间的数据在低维空间中尽可能准确地表示出来（丢去少量信息），那么在一个类别中特征值方差最大的方向会被保存下来，也就是[[PCA|''主成分分析法（PCA）'']]。如果目标是在低维空间中最好地区分不同的类别，那么在类别之间特征值分布相对分离的方向会被保存下来，也就是[http://en.wikipedia.org/wiki/Linear_discriminant_analysis ''线性判别分析法（LDA）'']。下面这个图(来自 [[#References|reference 2]])说明了这两种方法的不同。
+* ''dimensionality reduction analysis'': People came up with statistical methods to automatically reduce the dimensionality. Depending on different criteria, these methods usually have very different reduction results. If the goal is to represent the original high dimensional data in low dimensional space losing as little information as possible (signal representation), then the components that have the largest variance in feature values within classes are kept, which is known as [[PCA|''Principle component analysis (PCA)'']]. If the goal of reduction is to discriminate different classes in low dimensional space, then the components containing feature values that have the best separation across classes are kept, which is known as [http://en.wikipedia.org/wiki/Linear_discriminant_analysis ''Linear discriminative analysis (LDA)'']. The following figure (borrowed from [[#References|reference 2]]) shows the difference between the two methods.
 <center>
@@ Line 84: / Line 72: @@
 </center>
-. 核。[[#Needle in a Haystack|稻草堆中找针|]]的例子说明了当数据量不足时，高维空间中的每个点看起来都像是某个分布的异常值。直接用分类器处理这些高维空间中的点简直就是痛苦。[http://en.wikipedia.org/wiki/Support_vector_machine ''支持向量机（SVM）'']中使用的核函数能够自动将数据映射到决策边界更好寻找的高维甚至无限维空间中。这样的映射使得在低维空间中线性不可分的特征在高维空间中线性可分（这也是人们为什么喜欢使用大量数据的原因）。换句话说，你的分类器虽然在很高维度的特征空间中学习决策边界，实际上它只用处理在低维度空间中的参数估计。
+* ''kernel'': The [[#Needle in a Haystack|needle in a haystack]] example shows that points tend to be outliers in high dimensional space. Dealing with those points directly is lots of pain. Kernel trick used by [http://en.wikipedia.org/wiki/Support_vector_machine ''support vector machines (SVMs)''] implicitly maps data to much higher or even infinite dimensions but without ever explicitly representing the high dimensional features. The mapping makes the data that are not linearly separable in low dimensional space linearly separable in high dimensional space (which is also why people want to use many different features). While the classifier learns the decision boundary in a very high dimensional feature space, it actually deals with low dimensionality.
-所有上面的方法都不是万能之策。维数灾难还是一个有待进一步研究的问题。
+None of the above spells is a complete cure for the curse of dimensionality. There are still many open problems on this topic.
 == References ==
@@ Line 92: / Line 80: @@
 # http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf
 ----
-==[[slecture_title_of_slecture_review|Questions and comments]]==
+==[[reviews_on_curse_of_dimensionality|Questions and comments]]==
-If you have any questions, comments, etc. please post them on [[slecture_title_of_slecture_review|this page]].
+If you have any questions, comments, etc. please post them on [[reviews_on_curse_of_dimensionality|this page]].
 ----
 [[2014_Spring_ECE_662_Boutin|Back to ECE662, Spring 2014]]

Difference between revisions of "Curse of Dimensionality" - Rhea

Latest revision as of 10:41, 22 January 2015

Contents

Needle in a Haystack

Curse of Dimensionality

References

Questions and comments

Alumni Liaison