Top 10 algorithms in data mining
combinations of splitters within each node, a capability that has not been included in the
released software.
10.7 Cost-sensitive learning
Costs are central to statistical decision theory, but cost-sensitive learning received only modest
attention before Domingos [21]. Since then, several conferences have been devoted exclusively
to this topic and a large number of research papers have appeared in the subsequent
scientific literature. It is therefore useful to note that the CART monograph introduced two
strategies for cost-sensitive learning and the entire mathematical machinery describing CART
is cast in terms of the costs of misclassification. The cost of misclassifying an instance of
class i as class j is C(i, j) and is assumed to be equal to 1 unless specified otherwise;
C(i, i) = 0 for all i. The complete set of costs is represented in the matrix C, containing a
row and a column for each target class. Any classification tree can have a total cost computed
for its terminal node assignments by summing costs over all misclassifications. The issue in
cost-sensitive learning is to induce a tree that takes the costs into account during its growing
and pruning phases.
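The total-cost accounting described above can be sketched in a few lines of Python; the cost matrix values and the helper `total_cost` below are illustrative, not part of the CART software:

```python
# Hypothetical 3-class cost matrix: C[i][j] is the cost of classifying an
# instance of class i as class j. Off-diagonal entries default to 1; C[i][i] = 0.
C = [
    [0.0, 1.0, 5.0],   # misclassifying class 0 as class 2 is very costly
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

def total_cost(y_true, y_pred, cost_matrix):
    """Total cost of a tree's terminal-node assignments: sum C(i, j)
    over all (true class, assigned class) pairs."""
    return sum(cost_matrix[i][j] for i, j in zip(y_true, y_pred))

# One expensive error (0 -> 2) and one ordinary error (2 -> 1):
print(total_cost([0, 0, 1, 2, 2], [0, 2, 1, 2, 1], C))   # 5.0 + 1.0 = 6.0
```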
The first and most straightforward method for handling costs makes use of weighting:
instances belonging to classes that are costly to misclassify are weighted upwards, with a
common weight applying to all instances of a given class, a method recently rediscovered
by Ting [78]. As implemented in CART, the weighting is accomplished transparently so that
all node counts are reported in their raw unweighted form. For multi-class problems BFOS
suggested that the entries in the misclassification cost matrix be summed across each row
to obtain relative class weights that approximately reflect costs. This technique ignores the
detail within the matrix but has now been widely adopted due to its simplicity. For the Gini
splitting rule the CART authors show that it is possible to embed the entire cost matrix into
the splitting rule, but only after it has been symmetrized. The “symGini” splitting rule generates
trees sensitive to the difference in costs C(i, j) and C(i, k), and is most useful when the
symmetrized cost matrix is an acceptable representation of the decision maker’s problem.
In contrast, the instance weighting approach assigns a single cost to all misclassifications of
objects of class i . BFOS report that pruning the tree using the full cost matrix is essential to
successful cost-sensitive learning.
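Both strategies reduce to simple operations on the cost matrix. A minimal sketch, with an illustrative asymmetric cost matrix (the names `class_weights` and `C_sym` are ours, not CART's):

```python
# Hypothetical asymmetric cost matrix for a 3-class problem.
C = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
n = len(C)

# BFOS instance weighting: sum each row to obtain one relative weight per
# class; the per-pair detail within the matrix is deliberately ignored.
class_weights = [sum(row) for row in C]

# Symmetrization used by the "symGini" rule: average C with its transpose,
# so the embedded costs satisfy C(i, j) = C(j, i).
C_sym = [[(C[i][j] + C[j][i]) / 2 for j in range(n)] for i in range(n)]

print(class_weights)   # [6.0, 2.0, 3.0]
print(C_sym)
```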
10.8 Stopping rules, pruning, tree sequences, and tree selection
The earliest work on decision trees did not allow for pruning. Instead, trees were grown
until they encountered some stopping condition and the resulting tree was considered final.
In the CART monograph the authors argued that no rule intended to stop tree growth can
guarantee that it will not miss important data structure (e.g., consider the two-dimensional
XOR problem). They therefore elected to grow trees without stopping. The resulting overly
large tree provides the raw material from which a final optimal model is extracted.
The pruning mechanism is based strictly on the training data and begins with a cost-
complexity measure defined as
R_a(T) = R(T) + a|T|,                                    (31)
where R(T) is the training sample cost of the tree, |T| is the number of terminal nodes in the
tree and a is a penalty imposed on each node. If a = 0 then the minimum cost-complexity
tree is clearly the largest possible. As a is allowed to increase progressively, the minimum
cost-complexity tree will become smaller since the splits at the bottom of the tree that reduce
R(T) the least will be cut away. The parameter a is progressively increased from 0 to a value
sufficient to prune away all splits. BFOS prove that any tree of size Q extracted in this way
will exhibit a cost R(Q) that is minimum within the class of all trees with Q terminal nodes.
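As a rough illustration of how increasing a trades training cost against tree size, consider a hypothetical pruning sequence (all subtree costs below are made up for the example):

```python
def cost_complexity(R_T, n_leaves, a):
    """Cost-complexity measure of Eq. (31): R_a(T) = R(T) + a * |T|."""
    return R_T + a * n_leaves

# Hypothetical pruning sequence: (training cost R(T), terminal nodes |T|).
subtrees = [(0.10, 31), (0.12, 17), (0.15, 9), (0.22, 5), (0.35, 2)]

def best_subtree(a):
    """Subtree minimizing R_a(T) for a given penalty a."""
    return min(subtrees, key=lambda t: cost_complexity(t[0], t[1], a))

for a in (0.0, 0.005, 0.02, 0.05):
    # As the penalty a grows, the minimizing subtree shrinks.
    print(a, best_subtree(a))
```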
The optimal tree is defined as that tree in the pruned sequence that achieves minimum cost
on test data. Because test misclassification cost measurement is subject to sampling error,
uncertainty always remains regarding which tree in the pruning sequence is optimal. BFOS
recommend selecting the “1 SE” tree that is the smallest tree with an estimated cost within
1 standard error of the minimum cost (or “0 SE”) tree.
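The 1 SE selection rule can be sketched as follows, assuming each candidate tree in the pruning sequence comes with an estimated test cost and the standard error of that estimate (the numbers here are illustrative only):

```python
# Candidate trees from a pruning sequence:
# (terminal nodes, estimated test cost, standard error of that estimate).
pruned_sequence = [
    (31, 0.240, 0.012),
    (17, 0.231, 0.011),
    (9,  0.228, 0.011),   # "0 SE" tree: minimum estimated test cost
    (5,  0.236, 0.012),
    (2,  0.305, 0.014),
]

def one_se_tree(sequence):
    """Smallest tree whose estimated cost lies within one standard error
    of the minimum-cost ("0 SE") tree."""
    best_size, best_cost, best_se = min(sequence, key=lambda t: t[1])
    threshold = best_cost + best_se
    within = [t for t in sequence if t[1] <= threshold]
    return min(within, key=lambda t: t[0])

print(one_se_tree(pruned_sequence))   # -> (5, 0.236, 0.012)
```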
10.9 Probability trees
Probability trees have recently been discussed in a series of insightful articles elucidating
their properties and seeking to improve their performance (see Provost and Domingos 2000).
The CART monograph includes what appears to be the first detailed discussion of probabil-
ity trees and the CART software offers a dedicated splitting rule for the growing of “class
probability trees.” A key difference between classification trees and probability trees is that
the latter want to keep splits that generate terminal node children assigned to the same class
whereas the former will not (such a split accomplishes nothing so far as classification accu-
racy is concerned). A probability tree will also be pruned differently than its counterpart
classification tree, therefore, the final structure of the two optimal trees can be somewhat
different (although the differences are usually modest). The primary drawback of probabil-
ity trees is that the probability estimates based on training data in the terminal nodes tend
to be biased (e.g., towards 0 or 1 in the case of the binary target) with the bias increasing
with the depth of the node. In the recent ML literature the use of the Laplace adjustment has
been recommended to reduce this bias (Provost and Domingos 2002). The CART monograph
offers a somewhat more complex method to adjust the terminal node estimates that has rarely
been discussed in the literature. Dubbed the “Breiman adjustment”, it adjusts the estimated
misclassification rate r*(t) of any terminal node upwards by
r*(t) = r(t) + e/(q(t) + S)                              (32)
where r(t) is the train sample estimate within the node, q(t) is the fraction of the training
sample in the node, and S and e are parameters that are solved for as a function of the
difference between the train and test error rates for a given tree. In contrast to the Laplace
method, the Breiman adjustment does not depend on the raw predicted probability in the
node and the adjustment can be very small if the test data show that the tree is not overfit.
Bloch et al. [5] reported very good performance for the Breiman adjustment in a series of
empirical experiments.
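A minimal sketch contrasting the two adjustments; note that the values of e and S passed to the Breiman formula are illustrative constants, whereas in CART they are solved from the train/test error gap:

```python
def laplace_estimate(n_class, n_node, k=2):
    """Laplace-adjusted class probability in a terminal node:
    (n_class + 1) / (n_node + k) for a k-class problem."""
    return (n_class + 1) / (n_node + k)

def breiman_adjusted_rate(r_t, q_t, e, S):
    """Breiman adjustment of Eq. (32): r*(t) = r(t) + e / (q(t) + S),
    where q_t is the fraction of the training sample in the node."""
    return r_t + e / (q_t + S)

# A pure node (5 of 5 training cases in one class) no longer estimates 1.0:
print(laplace_estimate(5, 5))                               # 6/7 ~ 0.857
# The Breiman shift stays small when e (tied to the degree of overfit) is small:
print(breiman_adjusted_rate(0.0, 0.02, e=0.001, S=0.05))
```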
10.10 Theoretical foundations
The earliest work on decision trees was entirely atheoretical. Trees were proposed as methods
that appeared to be useful and conclusions regarding their properties were based on observing
tree performance on a handful of empirical examples. While this approach remains popular
in Machine Learning, the recent tendency in the discipline has been to reach for stronger
theoretical foundations. The CART monograph tackles theory with sophistication, offering
important technical insights and proofs for several key results. For example, the authors
derive the expected misclassification rate for the maximal (largest possible) tree, showing
that it is bounded from above by twice the Bayes rate. The authors also discuss the bias-variance
tradeoff in trees and show how the bias is affected by the number of attributes.