Bayes Classification: Experiments and Notes Old Kiwi - Rhea

Experiment setup

Maybe you will think that the categorization accuracy of Bayes system goes down as the number of features goes up. The fact is that as we increase the size of the feature vector we have more elements to describe and categorize a class. Therefore, more features should not be bad. The issue is that if we add features with high correlation between classes, the information provided by those features is almost useless. The classes at those features are not separable, so we cannot make a good decision based on highly correlated features. Consequently, when we are characterizing classes by their features, we need to select features that have poor correlation between the classes. Then, the information in the feature vector is meaningful for our categorization. The more meaningful features we have the more accurate the system will be.

The setup for the experiment should be the following. We only have two classes, Class 1 and Class 2. You can test your algorithms for different priori probabilities for Class 1 and Class 2. However if you are working with synthetic data, it is better to have equal priori probabilities to analyze the data. First, generate Gaussian data to train and test the system. We generated N = 10⁵ samples. Also, the variance for each feature was the same and the mean of each class feature changed depending on the hypothesis we wanted to test. Then for each data set, compute the probability model parameters—mean and covariance matrix. Consequently, use Bayes decision formula to categorize the data. Kept track of the accuracy of the system for each feature vector size. The size of our feature vector was in the range of 1 to 20 features. Our measure of accuracy was the number of samples classified correctly divided by the total number of samples for the class (N = 10⁵).

Experiments with highly correlated data

Figure 1

Figure 2

As you can see in Figure 1 the classes are highly correlated. In Figure 2, you can observe that as we increase the size of the feature vector the accuracy converges close to 50%. The brute force approach is to classify the sample in either class, for an expected accuracy of 50%. Consequently, data that is highly correlated, increasing the feature vector size does not help. We need to recall that the accuracy results directly depend on the data we are analyzing. Therefore, we can expect the accuracy to goes up or down for different data sets. Although, we can conclude that the accuracy will be close to 50% for very similar data distributions.

Experiments with poorly correlated data

Figure 3

Figure 4

In this experiment, we expected the accuracy to converge to one. There are a low correlation between features in class 1 and class2. Therefore, with each extra feature we can be sure that we are doing a better categorization of the samples. Our expectations were achieved as shown in Figure 4.

Experiments with highly/poorly correlated data

Figure 5

Figure 6

In this experiment, the first two features have a low correlation between Class 1 and Class 2. Then the next two features are highly correlated. As expected, we can see, in Figure 6, that there are no improvements in the accuracy of the system; the accuracy goes flat when we add feature 3 and 4.

Conclusions and final comments

For optimal results we should add features where their inter-class correlation is low. The more uncorrelated the feature is, the best classification we can achieve. The tradeoff of adding more features shows up when we have computational and memory constraints. We can add more features, but with the cost of more processing time and computation effort. The only way we can get worst results from extra features is that the model used to analyze the data provided in those features is wrong. Therefore, if there is a significant difference between inter-class features and we assume a correct statistical model for the data at hand, we can expect the accuracy of the system to get better as we add more features. However, the accuracy of a classification algorithm based on Bayes decision theory alone depends greatly on the data we are trying to classify.