Top 10 algorithms in data mining
adopting the naive Bayes independence assumption. However, exactly the same structure for
the ratio results if we model f(x|1) by g(x) ∏_{j=1}^p h_1(x_j) and f(x|0) by
g(x) ∏_{j=1}^p h_0(x_j), where the function g(x) is the same in each model. The ratio is thus
    P(1|x)/P(0|x) = [P(1) g(x) ∏_{j=1}^p h_1(x_j)] / [P(0) g(x) ∏_{j=1}^p h_0(x_j)]
                  = (P(1)/P(0)) · ∏_{j=1}^p h_1(x_j) / ∏_{j=1}^p h_0(x_j).     (26)
Here, the h_i(x_j) do not even have to be probability density functions—it is sufficient that
the g(x) ∏_{j=1}^p h_i(x_j) are densities. The model in (26) is just as simple as the naive Bayes
model, and takes exactly the same form—take logs and we have a sum as in (23)—but it is
much more flexible because it does not assume independence of the x_j in each class. In fact,
it permits arbitrary dependence structures, via the g(x) function, which can take any form.
The point is, however, that this dependence is the same in the two classes, so that it cancels
out in the ratio in (26). Of course, this considerable extra flexibility of the logistic regression
model is not obtained without cost. Although the resulting model form is identical to the
naive Bayes model form (with different parameter values, of course), it cannot be estimated
by looking at the univariate marginals separately: an iterative procedure has to be used.
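The contrast between the two estimation routes can be illustrated with a small sketch: a naive-Bayes-style classifier for binary attributes whose parameters come from univariate marginal counts alone, so that the log of the ratio in (26) becomes a sum of per-attribute terms. The toy data and the Laplace smoothing constant below are invented for illustration; the more flexible model with a shared g(x) would instead require iterative fitting, as the text notes.

```python
# Hedged sketch: naive-Bayes-style log-odds for binary attributes, estimated
# from univariate marginal counts alone, so that taking logs of the ratio in
# (26) yields a sum of per-attribute terms. Data are invented for illustration.
import math

X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 1), (1, 0)]  # toy binary attributes
y = [1, 1, 0, 0, 1, 0]                                 # toy 0/1 class labels
p = 2                                                  # number of attributes

def fit_marginals(X, y, cls, alpha=1.0):
    """Per-attribute estimate of P(x_j = 1 | class), with Laplace smoothing."""
    rows = [x for x, c in zip(X, y) if c == cls]
    return [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(p)]

h1 = fit_marginals(X, y, 1)
h0 = fit_marginals(X, y, 0)
prior1 = sum(y) / len(y)

def log_odds(x):
    # log P(1|x)/P(0|x) = log P(1)/P(0) + sum_j log h1(x_j)/h0(x_j)
    s = math.log(prior1 / (1 - prior1))
    for j in range(p):
        t1 = h1[j] if x[j] else 1 - h1[j]
        t0 = h0[j] if x[j] else 1 - h0[j]
        s += math.log(t1 / t0)
    return s

print(log_odds((1, 1)))  # positive, so class 1 is predicted
```

Each attribute contributes one additive term to the log-odds, which is exactly the structure obtained by taking logs in (26).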
9.4 Concluding remarks on naive Bayes
The naive Bayes model is tremendously appealing because of its simplicity, elegance, and
robustness. It is one of the oldest formal classification algorithms, and yet even in its simplest
form it is often surprisingly effective. It is widely used in areas such as text classification
and spam filtering. A large number of modifications have been introduced, by the statistical,
data mining, machine learning, and pattern recognition communities, in an attempt to make it
more flexible, but one has to recognize that such modifications are necessarily complications,
which detract from its basic simplicity. Some such modifications are described in [27,66].
10 CART
The 1984 monograph, “CART: Classification and Regression Trees,” co-authored by
Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone [9], represents a major
milestone in the evolution of Artificial Intelligence, Machine Learning, non-parametric
statistics, and data mining. The work is important for the comprehensiveness of its study
of decision trees, the technical innovations it introduces, its sophisticated discussion of
tree-structured data analysis, and its authoritative treatment of large sample theory for trees. While
CART citations can be found in almost any domain, far more appear in fields such as electrical
engineering, biology, medical research, and finance than, for example, in marketing
research or sociology, where other tree methods are more popular. This section is intended to
highlight key themes treated in the CART monograph so as to encourage readers to return to
the original source for more detail.
10.1 Overview
The CART decision tree is a binary recursive partitioning procedure capable of processing
continuous and nominal attributes both as targets and predictors. Data are handled in their
raw form; no binning is required or recommended. Trees are grown to a maximal size
without the use of a stopping rule and then pruned back (essentially split by split) to the root
via cost-complexity pruning. The next split to be pruned is the one contributing least to the
overall performance of the tree on training data (and more than one split may be removed
at a time). The procedure produces trees that are invariant under any order-preserving
transformation of the predictor attributes. The CART mechanism is intended to produce not one,
but a sequence of nested pruned trees, all of which are candidate optimal trees. The “right
sized” or “honest” tree is identified by evaluating the predictive performance of every tree
in the pruning sequence. CART offers no internal performance measures for tree selection
based on the training data as such measures are deemed suspect. Instead, tree performance is
always measured on independent test data (or via cross validation) and tree selection proceeds
only after test-data-based evaluation. If no test data exist and cross validation has not been
performed, CART will remain agnostic regarding which tree in the sequence is best. This
is in sharp contrast to methods such as C4.5 that generate preferred models on the basis of
training data measures.
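The weakest-link step of the pruning sequence can be sketched as follows. This is a hedged illustration, not the CART implementation: the dictionary tree representation, node names, and error figures are invented, and g(t) here denotes the standard cost-complexity link strength (R(t) − R(T_t)) / (|leaves(T_t)| − 1), where R(t) is the training error of node t collapsed to a leaf and R(T_t) is the summed error of its subtree's leaves.

```python
# Hedged sketch of weakest-link (cost-complexity) pruning on a toy tree of
# dicts. Node names and error figures are invented for illustration.

def leaves(node):
    """All terminal nodes below (or at) this node."""
    if "children" not in node:
        return [node]
    return [l for c in node["children"] for l in leaves(c)]

def subtree_error(node):
    """R(T_t): summed training error of the subtree's current leaves."""
    return sum(l["error"] for l in leaves(node))

def link_strength(node):
    """g(t) = (R(t) - R(T_t)) / (number of leaves - 1)."""
    return (node["error_as_leaf"] - subtree_error(node)) / (len(leaves(node)) - 1)

def weakest_link(node, best=None):
    """Internal node with the smallest g(t): the next one to collapse."""
    if "children" in node:
        if best is None or link_strength(node) < link_strength(best):
            best = node
        for c in node["children"]:
            best = weakest_link(c, best)
    return best

# Toy tree: a root with two internal children, each holding two leaves.
tree = {
    "name": "root", "error_as_leaf": 10,
    "children": [
        {"name": "a", "error_as_leaf": 4,
         "children": [{"error": 1}, {"error": 2}]},
        {"name": "b", "error_as_leaf": 5,
         "children": [{"error": 1}, {"error": 1}]},
    ],
}
print(weakest_link(tree)["name"])  # "a": collapsing it costs the least per leaf
```

Collapsing the weakest link and repeating yields the nested sequence of pruned candidate trees described above; test data or cross validation then selects among them.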
The CART mechanism includes automatic (optional) class balancing, automatic missing
value handling, and allows for cost-sensitive learning, dynamic feature construction, and
probability tree estimation. The final reports include a novel attribute importance ranking.
The CART authors also broke new ground in showing how cross validation can be used
to assess performance for every tree in the pruning sequence given that trees in different
CV folds may not align on the number of terminal nodes. Each of these major features is
discussed below.
10.2 Splitting rules
CART splitting rules are always couched in the form
An instance goes left if CONDITION, and goes right otherwise,
where the CONDITION is expressed as “attribute X_i ≤ C” for continuous attributes. For
nominal attributes the CONDITION is expressed as membership in an explicit list of values.
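The two condition forms can be sketched in a few lines; the attribute values, threshold, and value list below are hypothetical.

```python
# Hedged sketch of CART-style split conditions: an instance goes left if the
# CONDITION holds and right otherwise. Threshold and value list are invented.
def goes_left_continuous(x, i, C):
    """Continuous attribute: left if X_i <= C."""
    return x[i] <= C

def goes_left_nominal(x, i, left_values):
    """Nominal attribute: left if X_i is in an explicit list of values."""
    return x[i] in left_values

x = (3.7, "red")
print(goes_left_continuous(x, 0, 5.0))           # True: 3.7 <= 5.0
print(goes_left_nominal(x, 1, {"red", "blue"}))  # True: "red" is in the list
```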
The CART authors argue that binary splits are to be preferred because (1) they fragment
the data more slowly than multi-way splits, and (2) repeated splits on the same attribute
are allowed and, if selected, will eventually generate as many partitions for an attribute as
required. Any loss of ease in reading the tree is expected to be offset by improved
performance. A third implicit reason is that the large sample theory developed by the authors was
restricted to binary partitioning.
The CART monograph focuses most of its discussion on the Gini rule, which is similar
to the better known entropy or information-gain criterion. For a binary (0/1) target the “Gini
measure of impurity” of a node t is
    G(t) = 1 − p(t)^2 − (1 − p(t))^2,     (27)

where p(t) is the (possibly weighted) relative frequency of class 1 in the node, and the
improvement (gain) generated by a split of the parent node P into left and right children L
and R is

    I(P) = G(P) − qG(L) − (1 − q)G(R).     (28)
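A minimal sketch of (27) and (28) for an unweighted 0/1 target; the toy labels are invented, and q is computed as the fraction of instances sent to the left child.

```python
# Hedged sketch of the Gini impurity (27) and split gain (28) for an
# unweighted binary (0/1) target. Labels below are invented for illustration.
def gini(p):
    """G(t) = 1 - p^2 - (1 - p)^2, with p the class-1 frequency in the node."""
    return 1 - p * p - (1 - p) * (1 - p)

def class1_freq(labels):
    """p(t): relative frequency of class 1 among the node's instances."""
    return sum(labels) / len(labels)

def split_gain(parent, left, right):
    """I(P) = G(P) - q*G(L) - (1 - q)*G(R), q = fraction of instances going left."""
    q = len(left) / len(parent)
    return (gini(class1_freq(parent))
            - q * gini(class1_freq(left))
            - (1 - q) * gini(class1_freq(right)))

labels = [1, 1, 1, 0, 0, 0]
print(split_gain(labels, [1, 1, 1], [0, 0, 0]))  # 0.5: a perfect split
```

The parent here has G(P) = 0.5 and both children are pure, so the gain equals the parent's entire impurity.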
Here, q is the (possibly weighted) fraction of instances going left. The CART authors favor
the Gini criterion over information gain because the Gini can be readily extended to include
symmetrized costs (see below) and is computed more rapidly than information gain. (Later
versions of CART have added information gain as an optional splitting rule.) They introduce
the modified twoing rule, which is based on a direct comparison of the target attribute