Top 10 algorithms in data mining
adopting the naive Bayes independence assumption. However, exactly the same structure for
the ratio results if we model f(x|1) by g(x) ∏_{j=1}^p h_1(x_j) and f(x|0) by
g(x) ∏_{j=1}^p h_0(x_j), where the function g(x) is the same in each model. The ratio is thus
    P(1|x)/P(0|x) = [P(1) g(x) ∏_{j=1}^p h_1(x_j)] / [P(0) g(x) ∏_{j=1}^p h_0(x_j)]
                  = (P(1)/P(0)) · ∏_{j=1}^p h_1(x_j) / ∏_{j=1}^p h_0(x_j).     (26)
Here, the h_i(x_j) do not even have to be probability density functions—it is sufficient that
the g(x) ∏_{j=1}^p h_i(x_j) are densities. The model in (26) is just as simple as the naive Bayes
model, and takes exactly the same form—take logs and we have a sum as in (23)—but it is
much more flexible because it does not assume independence of the x_j in each class. In fact,
it permits arbitrary dependence structures, via the g(x) function, which can take any form.
The point is, however, that this dependence is the same in the two classes, so that it cancels
out in the ratio in (26). Of course, this considerable extra flexibility of the logistic regression
model is not obtained without cost. Although the resulting model form is identical to the
naive Bayes model form (with different parameter values, of course), it cannot be estimated
by looking at the univariate marginals separately: an iterative procedure has to be used.
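The contrast between the two estimation routes can be illustrated with a small sketch: a naive-Bayes-style classifier for binary attributes whose parameters come from univariate marginal counts alone, so that the log of the ratio in (26) becomes a sum of per-attribute terms. The toy data and the Laplace smoothing constant below are invented for illustration; the more flexible model with a shared g(x) would instead require iterative fitting, as the text notes.

```python
# Hedged sketch: naive-Bayes-style log-odds for binary attributes, estimated
# from univariate marginal counts alone, so that taking logs of the ratio in
# (26) yields a sum of per-attribute terms. Data are invented for illustration.
import math

X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 1), (1, 0)]  # toy binary attributes
y = [1, 1, 0, 0, 1, 0]                                 # toy 0/1 class labels
p = 2                                                  # number of attributes

def fit_marginals(X, y, cls, alpha=1.0):
    """Per-attribute estimate of P(x_j = 1 | class), with Laplace smoothing."""
    rows = [x for x, c in zip(X, y) if c == cls]
    return [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(p)]

h1 = fit_marginals(X, y, 1)
h0 = fit_marginals(X, y, 0)
prior1 = sum(y) / len(y)

def log_odds(x):
    # log P(1|x)/P(0|x) = log P(1)/P(0) + sum_j log h1(x_j)/h0(x_j)
    s = math.log(prior1 / (1 - prior1))
    for j in range(p):
        t1 = h1[j] if x[j] else 1 - h1[j]
        t0 = h0[j] if x[j] else 1 - h0[j]
        s += math.log(t1 / t0)
    return s

print(log_odds((1, 1)))  # positive, so class 1 is predicted
```

Each attribute contributes one additive term to the log-odds, which is exactly the structure obtained by taking logs in (26).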
9.4 Concluding remarks on naive Bayes
The naive Bayes model is tremendously appealing because of its simplicity, elegance, and
robustness. It is one of the oldest formal classification algorithms, and yet even in its simplest
form it is often surprisingly effective. It is widely used in areas such as text classification
and spam filtering. A large number of modifications have been introduced, by the statistical,
data mining, machine learning, and pattern recognition communities, in an attempt to make it
more flexible, but one has to recognize that such modifications are necessarily complications,
which detract from its basic simplicity. Some such modifications are described in [27,66].
10 CART
The 1984 monograph, “CART: Classification and Regression Trees,” co-authored by
Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone [9], represents a major
milestone in the evolution of Artificial Intelligence, Machine Learning, non-parametric
statistics, and data mining. The work is important for the comprehensiveness of its study
of decision trees, the technical innovations it introduces, its sophisticated discussion of
tree-structured data analysis, and its authoritative treatment of large sample theory for trees. While
CART citations can be found in almost any domain, far more appear in fields such as electrical
engineering, biology, medical research, and finance than, for example, in marketing
research or sociology, where other tree methods are more popular. This section is intended to
highlight key themes treated in the CART monograph so as to encourage readers to return to
the original source for more detail.
10.1 Overview
The CART decision tree is a binary recursive partitioning procedure capable of processing
continuous and nominal attributes both as targets and predictors. Data are handled in their
raw form; no binning is required or recommended. Trees are grown to a maximal size
without the use of a stopping rule and then pruned back (essentially split by split) to the root
via cost-complexity pruning. The next split to be pruned is the one contributing least to the
overall performance of the tree on training data (and more than one split may be removed
at a time). The procedure produces trees that are invariant under any order-preserving
transformation of the predictor attributes. The CART mechanism is intended to produce not one,
but a sequence of nested pruned trees, all of which are candidate optimal trees. The “right
sized” or “honest” tree is identified by evaluating the predictive performance of every tree
in the pruning sequence. CART offers no internal performance measures for tree selection
based on the training data as such measures are deemed suspect. Instead, tree performance is
always measured on independent test data (or via cross validation) and tree selection proceeds
only after test-data-based evaluation. If no test data exist and cross validation has not been
performed, CART will remain agnostic regarding which tree in the sequence is best. This
is in sharp contrast to methods such as C4.5 that generate preferred models on the basis of
training data measures.
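The weakest-link step of the pruning sequence can be sketched as follows. This is a hedged illustration, not the CART implementation: the dictionary tree representation, node names, and error figures are invented, and g(t) here denotes the standard cost-complexity link strength (R(t) − R(T_t)) / (|leaves(T_t)| − 1), where R(t) is the training error of node t collapsed to a leaf and R(T_t) is the summed error of its subtree's leaves.

```python
# Hedged sketch of weakest-link (cost-complexity) pruning on a toy tree of
# dicts. Node names and error figures are invented for illustration.

def leaves(node):
    """All terminal nodes below (or at) this node."""
    if "children" not in node:
        return [node]
    return [l for c in node["children"] for l in leaves(c)]

def subtree_error(node):
    """R(T_t): summed training error of the subtree's current leaves."""
    return sum(l["error"] for l in leaves(node))

def link_strength(node):
    """g(t) = (R(t) - R(T_t)) / (number of leaves - 1)."""
    return (node["error_as_leaf"] - subtree_error(node)) / (len(leaves(node)) - 1)

def weakest_link(node, best=None):
    """Internal node with the smallest g(t): the next one to collapse."""
    if "children" in node:
        if best is None or link_strength(node) < link_strength(best):
            best = node
        for c in node["children"]:
            best = weakest_link(c, best)
    return best

# Toy tree: a root with two internal children, each holding two leaves.
tree = {
    "name": "root", "error_as_leaf": 10,
    "children": [
        {"name": "a", "error_as_leaf": 4,
         "children": [{"error": 1}, {"error": 2}]},
        {"name": "b", "error_as_leaf": 5,
         "children": [{"error": 1}, {"error": 1}]},
    ],
}
print(weakest_link(tree)["name"])  # "a": collapsing it costs the least per leaf
```

Collapsing the weakest link and repeating yields the nested sequence of pruned candidate trees described above; test data or cross validation then selects among them.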
The CART mechanism includes automatic (optional) class balancing, automatic missing
value handling, and allows for cost-sensitive learning, dynamic feature construction, and
probability tree estimation. The final reports include a novel attribute importance ranking.
The CART authors also broke new ground in showing how cross validation can be used
to assess performance for every tree in the pruning sequence given that trees in different
CV folds may not align on the number of terminal nodes. Each of these major features is
discussed below.
10.2 Splitting rules
CART splitting rules are always couched in the form
An instance goes left if CONDITION, and goes right otherwise,
where the CONDITION is expressed as “attribute X_i ≤ C” for continuous attributes. For
nominal attributes the CONDITION is expressed as membership in an explicit list of values.
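The two condition forms can be sketched in a few lines; the attribute values, threshold, and value list below are hypothetical.

```python
# Hedged sketch of CART-style split conditions: an instance goes left if the
# CONDITION holds and right otherwise. Threshold and value list are invented.
def goes_left_continuous(x, i, C):
    """Continuous attribute: left if X_i <= C."""
    return x[i] <= C

def goes_left_nominal(x, i, left_values):
    """Nominal attribute: left if X_i is in an explicit list of values."""
    return x[i] in left_values

x = (3.7, "red")
print(goes_left_continuous(x, 0, 5.0))           # True: 3.7 <= 5.0
print(goes_left_nominal(x, 1, {"red", "blue"}))  # True: "red" is in the list
```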
The CART authors argue that binary splits are to be preferred because (1) they fragment
the data more slowly than multi-way splits, and (2) repeated splits on the same attribute
are allowed and, if selected, will eventually generate as many partitions for an attribute as
required. Any loss of ease in reading the tree is expected to be offset by improved
performance. A third implicit reason is that the large sample theory developed by the authors was
restricted to binary partitioning.
The CART monograph focuses most of its discussion on the Gini rule, which is similar
to the better known entropy or information-gain criterion. For a binary (0/1) target the “Gini
measure of impurity” of a node t is
    G(t) = 1 − p(t)^2 − (1 − p(t))^2,     (27)

where p(t) is the (possibly weighted) relative frequency of class 1 in the node, and the
improvement (gain) generated by a split of the parent node P into left and right children L
and R is

    I(P) = G(P) − qG(L) − (1 − q)G(R).     (28)
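A minimal sketch of (27) and (28) for an unweighted 0/1 target; the toy labels are invented, and q is computed as the fraction of instances sent to the left child.

```python
# Hedged sketch of the Gini impurity (27) and split gain (28) for an
# unweighted binary (0/1) target. Labels below are invented for illustration.
def gini(p):
    """G(t) = 1 - p^2 - (1 - p)^2, with p the class-1 frequency in the node."""
    return 1 - p * p - (1 - p) * (1 - p)

def class1_freq(labels):
    """p(t): relative frequency of class 1 among the node's instances."""
    return sum(labels) / len(labels)

def split_gain(parent, left, right):
    """I(P) = G(P) - q*G(L) - (1 - q)*G(R), q = fraction of instances going left."""
    q = len(left) / len(parent)
    return (gini(class1_freq(parent))
            - q * gini(class1_freq(left))
            - (1 - q) * gini(class1_freq(right)))

labels = [1, 1, 1, 0, 0, 0]
print(split_gain(labels, [1, 1, 1], [0, 0, 0]))  # 0.5: a perfect split
```

The parent here has G(P) = 0.5 and both children are pure, so the gain equals the parent's entire impurity.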
Here, q is the (possibly weighted) fraction of instances going left. The CART authors favor
the Gini criterion over information gain because the Gini can be readily extended to include
symmetrized costs (see below) and is computed more rapidly than information gain. (Later
versions of CART have added information gain as an optional splitting rule.) They introduce
the modified twoing rule, which is based on a direct comparison of the target attribute