Top 10 algorithms in data mining
combinations of splitters within each node, a capability that has not been included in the
released software.
10.7 Cost-sensitive learning
Costs are central to statistical decision theory, but cost-sensitive learning received only modest
attention before Domingos [21]. Since then, several conferences have been devoted exclusively
to this topic and a large number of research papers have appeared in the subsequent
scientific literature. It is therefore useful to note that the CART monograph introduced two
strategies for cost-sensitive learning and the entire mathematical machinery describing CART
is cast in terms of the costs of misclassification. The cost of misclassifying an instance of
class i as class j is C(i, j) and is assumed to be equal to 1 unless specified otherwise;
C(i, i) = 0 for all i. The complete set of costs is represented in the matrix C, containing a
row and a column for each target class. Any classification tree can have a total cost computed
for its terminal node assignments by summing costs over all misclassifications. The issue in
cost-sensitive learning is to induce a tree that takes the costs into account during its growing
and pruning phases.
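The total-cost accounting described above can be sketched in a few lines of Python; the cost matrix values and the helper `total_cost` below are illustrative, not part of the CART software:

```python
# Hypothetical 3-class cost matrix: C[i][j] is the cost of classifying an
# instance of class i as class j. Off-diagonal entries default to 1; C[i][i] = 0.
C = [
    [0.0, 1.0, 5.0],   # misclassifying class 0 as class 2 is very costly
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

def total_cost(y_true, y_pred, cost_matrix):
    """Total cost of a tree's terminal-node assignments: sum C(i, j)
    over all (true class, assigned class) pairs."""
    return sum(cost_matrix[i][j] for i, j in zip(y_true, y_pred))

# One expensive error (0 -> 2) and one ordinary error (2 -> 1):
print(total_cost([0, 0, 1, 2, 2], [0, 2, 1, 2, 1], C))   # 5.0 + 1.0 = 6.0
```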
The first and most straightforward method for handling costs makes use of weighting:
instances belonging to classes that are costly to misclassify are weighted upwards, with a
common weight applying to all instances of a given class, a method recently rediscovered
by Ting [78]. As implemented in CART, the weighting is accomplished transparently so that
all node counts are reported in their raw unweighted form. For multi-class problems BFOS
suggested that the entries in the misclassification cost matrix be summed across each row
to obtain relative class weights that approximately reflect costs. This technique ignores the
detail within the matrix but has now been widely adopted due to its simplicity. For the Gini
splitting rule the CART authors show that it is possible to embed the entire cost matrix into
the splitting rule, but only after it has been symmetrized. The “symGini” splitting rule generates
trees sensitive to the difference in costs C(i, j) and C(i, k), and is most useful when the
symmetrized cost matrix is an acceptable representation of the decision maker’s problem.
In contrast, the instance weighting approach assigns a single cost to all misclassifications of
objects of class i . BFOS report that pruning the tree using the full cost matrix is essential to
successful cost-sensitive learning.
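Both strategies reduce to simple operations on the cost matrix. A minimal sketch, with an illustrative asymmetric cost matrix (the names `class_weights` and `C_sym` are ours, not CART's):

```python
# Hypothetical asymmetric cost matrix for a 3-class problem.
C = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
n = len(C)

# BFOS instance weighting: sum each row to obtain one relative weight per
# class; the per-pair detail within the matrix is deliberately ignored.
class_weights = [sum(row) for row in C]

# Symmetrization used by the "symGini" rule: average C with its transpose,
# so the embedded costs satisfy C(i, j) = C(j, i).
C_sym = [[(C[i][j] + C[j][i]) / 2 for j in range(n)] for i in range(n)]

print(class_weights)   # [6.0, 2.0, 3.0]
print(C_sym)
```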
10.8 Stopping rules, pruning, tree sequences, and tree selection
The earliest work on decision trees did not allow for pruning. Instead, trees were grown
until they encountered some stopping condition and the resulting tree was considered final.
In the CART monograph the authors argued that no rule intended to stop tree growth can
guarantee that it will not miss important data structure (e.g., consider the two-dimensional
XOR problem). They therefore elected to grow trees without stopping. The resulting overly
large tree provides the raw material from which a final optimal model is extracted.
The pruning mechanism is based strictly on the training data and begins with a cost-
complexity measure defined as
R_a(T) = R(T) + a|T|,                                    (31)
where R(T) is the training sample cost of the tree, |T| is the number of terminal nodes in the
tree and a is a penalty imposed on each node. If a = 0 then the minimum cost-complexity
tree is clearly the largest possible. As a is allowed to increase progressively, the minimum
cost-complexity tree will become smaller since the splits at the bottom of the tree that reduce
R(T) the least will be cut away. The parameter a is progressively increased from 0 to a value
sufficient to prune away all splits. BFOS prove that any tree of size Q extracted in this way
will exhibit a cost R(Q) that is minimum within the class of all trees with Q terminal nodes.
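As a rough illustration of how increasing a trades training cost against tree size, consider a hypothetical pruning sequence (all subtree costs below are made up for the example):

```python
def cost_complexity(R_T, n_leaves, a):
    """Cost-complexity measure of Eq. (31): R_a(T) = R(T) + a * |T|."""
    return R_T + a * n_leaves

# Hypothetical pruning sequence: (training cost R(T), terminal nodes |T|).
subtrees = [(0.10, 31), (0.12, 17), (0.15, 9), (0.22, 5), (0.35, 2)]

def best_subtree(a):
    """Subtree minimizing R_a(T) for a given penalty a."""
    return min(subtrees, key=lambda t: cost_complexity(t[0], t[1], a))

for a in (0.0, 0.005, 0.02, 0.05):
    # As the penalty a grows, the minimizing subtree shrinks.
    print(a, best_subtree(a))
```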
The optimal tree is defined as that tree in the pruned sequence that achieves minimum cost
on test data. Because test misclassification cost measurement is subject to sampling error,
uncertainty always remains regarding which tree in the pruning sequence is optimal. BFOS
recommend selecting the “1 SE” tree that is the smallest tree with an estimated cost within
1 standard error of the minimum cost (or “0 SE”) tree.
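The 1 SE selection rule can be sketched as follows, assuming each candidate tree in the pruning sequence comes with an estimated test cost and the standard error of that estimate (the numbers here are illustrative only):

```python
# Candidate trees from a pruning sequence:
# (terminal nodes, estimated test cost, standard error of that estimate).
pruned_sequence = [
    (31, 0.240, 0.012),
    (17, 0.231, 0.011),
    (9,  0.228, 0.011),   # "0 SE" tree: minimum estimated test cost
    (5,  0.236, 0.012),
    (2,  0.305, 0.014),
]

def one_se_tree(sequence):
    """Smallest tree whose estimated cost lies within one standard error
    of the minimum-cost ("0 SE") tree."""
    best_size, best_cost, best_se = min(sequence, key=lambda t: t[1])
    threshold = best_cost + best_se
    within = [t for t in sequence if t[1] <= threshold]
    return min(within, key=lambda t: t[0])

print(one_se_tree(pruned_sequence))   # -> (5, 0.236, 0.012)
```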
10.9 Probability trees
Probability trees have recently been discussed in a series of insightful articles elucidating
their properties and seeking to improve their performance (see Provost and Domingos 2000).
The CART monograph includes what appears to be the first detailed discussion of probabil-
ity trees and the CART software offers a dedicated splitting rule for the growing of “class
probability trees.” A key difference between classification trees and probability trees is that
the latter want to keep splits that generate terminal node children assigned to the same class
whereas the former will not (such a split accomplishes nothing so far as classification accu-
racy is concerned). A probability tree will also be pruned differently than its counterpart
classification tree, therefore, the final structure of the two optimal trees can be somewhat
different (although the differences are usually modest). The primary drawback of probabil-
ity trees is that the probability estimates based on training data in the terminal nodes tend
to be biased (e.g., towards 0 or 1 in the case of the binary target) with the bias increasing
with the depth of the node. In the recent ML literature the use of the Laplace adjustment has
been recommended to reduce this bias (Provost and Domingos 2002). The CART monograph
offers a somewhat more complex method to adjust the terminal node estimates that has rarely
been discussed in the literature. Dubbed the “Breiman adjustment”, it adjusts the estimated
misclassification rate r*(t) of any terminal node upwards by
r*(t) = r(t) + e/(q(t) + S)                              (32)
where r(t) is the train sample estimate within the node, q(t) is the fraction of the training
sample in the node, and S and e are parameters that are solved for as a function of the
difference between the train and test error rates for a given tree. In contrast to the Laplace
method, the Breiman adjustment does not depend on the raw predicted probability in the
node and the adjustment can be very small if the test data show that the tree is not overfit.
Bloch et al. [5] reported very good performance for the Breiman adjustment in a series of
empirical experiments.
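A minimal sketch contrasting the two adjustments; note that the values of e and S passed to the Breiman formula are illustrative constants, whereas in CART they are solved from the train/test error gap:

```python
def laplace_estimate(n_class, n_node, k=2):
    """Laplace-adjusted class probability in a terminal node:
    (n_class + 1) / (n_node + k) for a k-class problem."""
    return (n_class + 1) / (n_node + k)

def breiman_adjusted_rate(r_t, q_t, e, S):
    """Breiman adjustment of Eq. (32): r*(t) = r(t) + e / (q(t) + S),
    where q_t is the fraction of the training sample in the node."""
    return r_t + e / (q_t + S)

# A pure node (5 of 5 training cases in one class) no longer estimates 1.0:
print(laplace_estimate(5, 5))                               # 6/7 ~ 0.857
# The Breiman shift stays small when e (tied to the degree of overfit) is small:
print(breiman_adjusted_rate(0.0, 0.02, e=0.001, S=0.05))
```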
10.10 Theoretical foundations
The earliest work on decision trees was entirely atheoretical. Trees were proposed as methods
that appeared to be useful and conclusions regarding their properties were based on observing
tree performance on a handful of empirical examples. While this approach remains popular
in Machine Learning, the recent tendency in the discipline has been to reach for stronger
theoretical foundations. The CART monograph tackles theory with sophistication, offering
important technical insights and proofs for several key results. For example, the authors
derive the expected misclassification rate for the maximal (largest possible) tree, showing
that it is bounded from above by twice the Bayes rate. The authors also discuss the bias-variance
tradeoff in trees and show how the bias is affected by the number of attributes.