Virtue of Decision Trees (DT) when the # of categories is large
We discussed about the virtue of DT when the # of categories is large in class. Here is some more thoughts about it.
To understand the virtue of DT, let's consider it in comparison to classifiers which based on density estimation or seperation hyperplanes. As I see it, the most significant difference between the two groups of methods is that the 2nd group tries to separate all samples in a category from the rest at one shoot, while DT adopts a divide-and-conquer strategy - partition samples into smaller sub-groups which can be separated from the rest more easily.
When # categories is larger than 2, the method based on density estimation or seperation hyperplane has to consider pair-wise or one-vs-all discrimination between categories. The problem gets a lot more difficult when then # categories becomes large. However, for DT this would not be a big issue. DT aims to find a set of rules which partitions the data space into several subspaces in which the sample impurity is minimized. In constructing the rules, we focused on the internal impurity within subspaces rather than the capacity of discrimination between categories. That's why # of categories would affect DT much (or at least not as much as to the other methods).