distribution in two child nodes:
$$ I(\mathrm{split}) = .25\,(q(1-q))^{u}\Big[\sum_{k} |p_L(k) - p_R(k)|\Big]^{2}, \qquad (29) $$
where k indexes the target classes, p_L(·) and p_R(·) are the probability distributions of the
target in the left and right child nodes respectively, and the power term u embeds a user-controllable
penalty on splits generating unequal-sized child nodes. (This splitter is a modified
version of Messenger and Mandell [58].) They also introduce a variant of the twoing split
criterion that treats the classes of the target as ordered; ordered twoing attempts to ensure
target classes represented on the left of a split are ranked below those represented on the right.
In our experience the twoing criterion is often a superior performer on multi-class targets
as well as on inherently difficult-to-predict (e.g. noisy) binary targets. For regression (con-
tinuous targets), CART offers a choice of Least Squares (LS) and Least Absolute Deviation
(LAD) criteria as the basis for measuring the improvement of a split. Three other splitting
rules for cost-sensitive learning and probability trees are discussed separately below.
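As an illustration of the twoing computation in Eq. (29), here is a minimal Python sketch; the function name and NumPy usage are ours, and we take q to be the fraction of cases sent to the left child, as in the BFOS formulation:

```python
import numpy as np

def twoing_improvement(y_left, y_right, u=1.0):
    """Twoing improvement of a candidate split, per Eq. (29).

    y_left, y_right: class labels of the cases sent to each child.
    u: the user-controllable penalty on unequal-sized children.
    """
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    q = len(y_left) / (len(y_left) + len(y_right))  # fraction of cases sent left
    classes = np.union1d(y_left, y_right)           # k indexes the target classes
    # p_L(k) and p_R(k): target distributions in the left and right children
    p_l = np.array([np.mean(y_left == k) for k in classes])
    p_r = np.array([np.mean(y_right == k) for k in classes])
    return 0.25 * (q * (1 - q)) ** u * np.sum(np.abs(p_l - p_r)) ** 2

# A perfectly separating, evenly sized split scores .25 * .25 * 2**2 = 0.25:
print(twoing_improvement([0, 0], [1, 1]))  # -> 0.25
```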
10.3 Prior probabilities and class balancing
In its default classification mode CART always calculates class frequencies in any node rel-
ative to the class frequencies in the root. This is equivalent to automatically reweighting the
data to balance the classes, and ensures that the tree selected as optimal minimizes balanced
class error. The reweighting is implicit in the calculation of all probabilities and improve-
ments and requires no user intervention; the reported sample counts in each node thus reflect
the unweighted data. For a binary (0/1) target any node is classified as class 1 if, and only if,
$$ N_1(\mathrm{node})/N_1(\mathrm{root}) > N_0(\mathrm{node})/N_0(\mathrm{root}). \qquad (30) $$
This default mode is referred to as “priors equal” in the monograph. It has allowed CART
users to work readily with any unbalanced data, requiring no special measures regarding class
rebalancing or the introduction of manually constructed weights. To work effectively with
unbalanced data it is sufficient to run CART using its default settings. Implicit reweighting
can be turned off by selecting the “priors data” option, and the modeler can also elect to
specify an arbitrary set of priors to reflect costs, or potential differences between training
data and future data target class distributions.
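The "priors equal" rule of Eq. (30) amounts to comparing within-class node fractions. Below is a minimal sketch, generalized beyond two classes by picking the class with the largest fraction; the function name and the counts-based representation are ours:

```python
import numpy as np

def assign_class_priors_equal(node_counts, root_counts):
    """Label a node under 'priors equal': choose the class k maximizing
    N_k(node) / N_k(root), i.e., Eq. (30) extended beyond two classes.

    node_counts, root_counts: per-class sample counts, same class order.
    """
    node_counts = np.asarray(node_counts, dtype=float)
    root_counts = np.asarray(root_counts, dtype=float)
    return int(np.argmax(node_counts / root_counts))

# 30 of 100 root cases of class 1 versus 50 of 400 of class 0:
# 30/100 > 50/400, so the node is labeled class 1.
print(assign_class_priors_equal([50, 30], [400, 100]))  # -> 1
```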
10.4 Missing value handling
Missing values appear frequently in real-world, and especially business-related, databases,
and the need to deal with them is a vexing challenge for all modelers. One of the major
contributions of CART was to include a fully automated and highly effective mechanism
for handling missing values. Decision trees require a missing value-handling mechanism at
three levels: (a) during splitter evaluation, (b) when moving the training data through a node,
and (c) when moving test data through a node for final class assignment. (See [63] for a clear
discussion of these points.) Regarding (a), the first version of CART evaluated each splitter
strictly on its performance on the subset of data for which the splitter is available. Later
versions offer a family of penalties that reduce the split improvement measure as a function
of the degree of missingness. For (b) and (c), the CART mechanism discovers “surrogate”
or substitute splitters for every node of the tree, whether missing values occur in the training
data or not. The surrogates are thus available should the tree be applied to new data that
does include missing values. This is in contrast to machines that can only learn about miss-
ing value handling from training data that include missing values. Friedman [25] suggested
moving instances with missing splitter attributes into both left and right child nodes and
making a final class assignment by pooling all nodes in which an instance appears. Quinlan
[63] opted for a weighted variant of Friedman's approach in his study of alternative miss-
ing value-handling methods. Our own assessments of the effectiveness of CART surrogate
performance in the presence of missing data are largely favorable, while Quinlan remains
agnostic on the basis of the approximate surrogates he implements for test purposes [63].
Friedman et al. [26] noted that 50% of the CART code was devoted to missing value handling;
it is thus unlikely that Quinlan’s experimental version properly replicated the entire CART
surrogate mechanism.
In CART the missing value handling mechanism is fully automatic and locally adaptive
at every node. At each node in the tree the chosen splitter induces a binary partition of the
data (e.g., X1 <= c1 and X1 > c1). A surrogate splitter is a single attribute Z that can predict
this partition, where the surrogate itself is in the form of a binary splitter (e.g., Z <= d and
Z > d). In other words, every splitter becomes a new target which is to be predicted with
a single split binary tree. Surrogates are ranked by an association score that measures the
advantage of the surrogate over the default rule predicting that all cases go to the larger child
node. To qualify as a surrogate, the variable must outperform this default rule (and thus it
may not always be possible to find surrogates). When a missing value is encountered in a
CART tree the instance is moved to the left or the right according to the top-ranked surro-
gate. If this surrogate is also missing then the second-ranked surrogate is used instead (and
so on). If all surrogates are missing the default rule assigns the instance to the larger child
node (possibly adjusting node sizes for priors). Ties are broken by moving an instance to the
left.
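The sketch below illustrates this mechanism under our assumptions: an agreement-over-default association score (a common formulation that may differ in detail from CART's exact formula) and the fallback routing through ranked surrogates. All names are ours, and missing values are encoded as NaN:

```python
import numpy as np

def association(primary_goes_left, surrogate_goes_left):
    """Association of a surrogate with the primary split: its improvement
    over the default rule that sends every case to the larger child.
    Only candidates scoring above zero qualify as surrogates.
    """
    primary = np.asarray(primary_goes_left, dtype=bool)
    surrogate = np.asarray(surrogate_goes_left, dtype=bool)
    p_default = max(primary.mean(), 1 - primary.mean())
    if p_default == 1.0:                 # degenerate split: nothing to beat
        return 0.0
    p_agree = np.mean(primary == surrogate)
    return (p_agree - p_default) / (1 - p_default)

def route(x, splitter, surrogates, larger_child="left"):
    """Send one instance down, falling back through ranked surrogates when
    values are missing (NaN), then to the default (larger-child) rule.
    `splitter` and each surrogate are (attribute_index, threshold) pairs,
    surrogates already sorted by association; ties are broken to the left.
    """
    for attr, thresh in [splitter] + surrogates:
        if not np.isnan(x[attr]):
            return "left" if x[attr] <= thresh else "right"
    return larger_child
```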
10.5 Attribute importance
The importance of an attribute is based on the sum of the improvements in all nodes in which
the attribute appears as a splitter (weighted by the fraction of the training data in each node
split). Surrogates are also included in the importance calculations, which means that even a
variable that never splits a node may be assigned a large importance score. This allows the
variable importance rankings to reveal variable masking and nonlinear correlation among
the attributes. Importance scores may optionally be confined to splitters, and comparing the
splitters-only and the full importance rankings is a useful diagnostic.
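A minimal sketch of this accounting follows, using a hypothetical node representation (the dict layout below is ours, not CART's internal structure):

```python
from collections import defaultdict

def attribute_importance(nodes, include_surrogates=True):
    """Sum split improvements per attribute, each weighted by the fraction
    of the training data in the node being split. `nodes` is a list of
    dicts such as:
      {"frac": 0.4,                       # fraction of training data here
       "splitter": ("age", 0.12),         # (attribute, improvement)
       "surrogates": [("income", 0.09)]}  # ranked surrogate splits
    """
    scores = defaultdict(float)
    for node in nodes:
        attr, imp = node["splitter"]
        scores[attr] += node["frac"] * imp
        if include_surrogates:
            for attr, imp in node["surrogates"]:
                scores[attr] += node["frac"] * imp
    return dict(scores)
```

Calling this twice, once with include_surrogates=False, yields the splitters-only and full rankings whose comparison is the diagnostic mentioned above.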
10.6 Dynamic feature construction
Friedman [25] discussed the automatic construction of new features within each node and,
for the binary target, recommends adding the single feature x ∗ w, where x is the original
attribute vector and w is a scaled difference-of-means vector across the two classes (the
direction of the Fisher linear discriminant). This is similar to running a logistic regression
on all available attributes in the node and using the estimated logit as a predictor.
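A minimal sketch of this node-level feature under our assumptions (pooled within-class covariance, a small ridge term for stability; `fisher_feature` is our name):

```python
import numpy as np

def fisher_feature(X, y):
    """Friedman's node-level feature x * w for a binary target, with w in
    the direction of the Fisher linear discriminant: the difference of
    class means scaled by the pooled within-class covariance.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    cov = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    # a small ridge term keeps the solve stable in small nodes
    w = np.linalg.solve(cov + 1e-6 * np.eye(X.shape[1]),
                        X1.mean(axis=0) - X0.mean(axis=0))
    return X @ w  # one new candidate splitter value per instance
```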
In the CART monograph, the authors discuss the automatic construction of linear
combinations that include feature selection; this capability has been available from the
first release of the CART software. BFOS also present a method for constructing Boolean