distribution in two child nodes:
$$ I(\mathrm{split}) = .25\,(q(1-q))^{u}\Big[\sum_{k} |p_L(k) - p_R(k)|\Big]^{2}, \qquad (29) $$
where k indexes the target classes, p_L(·) and p_R(·) are the probability distributions of the
target in the left and right child nodes respectively, and the power term u embeds a user-controllable
penalty on splits generating unequal-sized child nodes. (This splitter is a modified
version of Messenger and Mandell [58].) They also introduce a variant of the twoing split
criterion that treats the classes of the target as ordered; ordered twoing attempts to ensure
target classes represented on the left of a split are ranked below those represented on the right.
In our experience the twoing criterion is often a superior performer on multi-class targets
as well as on inherently difficult-to-predict (e.g. noisy) binary targets. For regression (con-
tinuous targets), CART offers a choice of Least Squares (LS) and Least Absolute Deviation
(LAD) criteria as the basis for measuring the improvement of a split. Three other splitting
rules for cost-sensitive learning and probability trees are discussed separately below.
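As an illustration of the twoing computation in Eq. (29), here is a minimal Python sketch; the function name and NumPy usage are ours, and we take q to be the fraction of cases sent to the left child, as in the BFOS formulation:

```python
import numpy as np

def twoing_improvement(y_left, y_right, u=1.0):
    """Twoing improvement of a candidate split, per Eq. (29).

    y_left, y_right: class labels of the cases sent to each child.
    u: the user-controllable penalty on unequal-sized children.
    """
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    q = len(y_left) / (len(y_left) + len(y_right))  # fraction of cases sent left
    classes = np.union1d(y_left, y_right)           # k indexes the target classes
    # p_L(k) and p_R(k): target distributions in the left and right children
    p_l = np.array([np.mean(y_left == k) for k in classes])
    p_r = np.array([np.mean(y_right == k) for k in classes])
    return 0.25 * (q * (1 - q)) ** u * np.sum(np.abs(p_l - p_r)) ** 2

# A perfectly separating, evenly sized split scores .25 * .25 * 2**2 = 0.25:
print(twoing_improvement([0, 0], [1, 1]))  # -> 0.25
```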
10.3 Prior probabilities and class balancing
In its default classification mode CART always calculates class frequencies in any node rel-
ative to the class frequencies in the root. This is equivalent to automatically reweighting the
data to balance the classes, and ensures that the tree selected as optimal minimizes balanced
class error. The reweighting is implicit in the calculation of all probabilities and improve-
ments and requires no user intervention; the reported sample counts in each node thus reflect
the unweighted data. For a binary (0/1) target any node is classified as class 1 if, and only if,
$$ N_1(\mathrm{node})/N_1(\mathrm{root}) > N_0(\mathrm{node})/N_0(\mathrm{root}). \qquad (30) $$
This default mode is referred to as “priors equal” in the monograph. It has allowed CART
users to work readily with any unbalanced data, requiring no special measures regarding class
rebalancing or the introduction of manually constructed weights. To work effectively with
unbalanced data it is sufficient to run CART using its default settings. Implicit reweighting
can be turned off by selecting the “priors data” option, and the modeler can also elect to
specify an arbitrary set of priors to reflect costs, or potential differences between training
data and future data target class distributions.
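The "priors equal" rule of Eq. (30) amounts to comparing within-class node fractions. Below is a minimal sketch, generalized beyond two classes by picking the class with the largest fraction; the function name and the counts-based representation are ours:

```python
import numpy as np

def assign_class_priors_equal(node_counts, root_counts):
    """Label a node under 'priors equal': choose the class k maximizing
    N_k(node) / N_k(root), i.e., Eq. (30) extended beyond two classes.

    node_counts, root_counts: per-class sample counts, same class order.
    """
    node_counts = np.asarray(node_counts, dtype=float)
    root_counts = np.asarray(root_counts, dtype=float)
    return int(np.argmax(node_counts / root_counts))

# 30 of 100 root cases of class 1 versus 50 of 400 of class 0:
# 30/100 > 50/400, so the node is labeled class 1.
print(assign_class_priors_equal([50, 30], [400, 100]))  # -> 1
```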
10.4 Missing value handling
Missing values appear frequently in real-world, and especially business-related, databases,
and the need to deal with them is a vexing challenge for all modelers. One of the major
contributions of CART was to include a fully automated and highly effective mechanism
for handling missing values. Decision trees require a missing value-handling mechanism at
three levels: (a) during splitter evaluation, (b) when moving the training data through a node,
and (c) when moving test data through a node for final class assignment. (See [63] for a clear
discussion of these points.) Regarding (a), the first version of CART evaluated each splitter
strictly on its performance on the subset of data for which the splitter is available. Later
versions offer a family of penalties that reduce the split improvement measure as a function
of the degree of missingness. For (b) and (c), the CART mechanism discovers “surrogate”
or substitute splitters for every node of the tree, whether missing values occur in the training
data or not. The surrogates are thus available should the tree be applied to new data that
does include missing values. This is in contrast to machines that can only learn about miss-
ing value handling from training data that include missing values. Friedman [25] suggested
moving instances with missing splitter attributes into both left and right child nodes and
making a final class assignment by pooling all nodes in which an instance appears. Quinlan
[63] opted for a weighted variant of Friedman's approach in his study of alternative miss-
ing value-handling methods. Our own assessments of the effectiveness of CART surrogate
performance in the presence of missing data are largely favorable, while Quinlan remains
agnostic on the basis of the approximate surrogates he implements for test purposes [63].
Friedman et al. [26] noted that 50% of the CART code was devoted to missing value handling;
it is thus unlikely that Quinlan’s experimental version properly replicated the entire CART
surrogate mechanism.
In CART the missing value handling mechanism is fully automatic and locally adaptive
at every node. At each node in the tree the chosen splitter induces a binary partition of the
data (e.g., X1 <= c1 and X1 > c1). A surrogate splitter is a single attribute Z that can predict
this partition, where the surrogate itself is in the form of a binary splitter (e.g., Z <= d and
Z > d). In other words, every splitter becomes a new target which is to be predicted with
a single split binary tree. Surrogates are ranked by an association score that measures the
advantage of the surrogate over the default rule predicting that all cases go to the larger child
node. To qualify as a surrogate, the variable must outperform this default rule (and thus it
may not always be possible to find surrogates). When a missing value is encountered in a
CART tree the instance is moved to the left or the right according to the top-ranked surro-
gate. If this surrogate is also missing then the second-ranked surrogate is used instead (and
so on). If all surrogates are missing the default rule assigns the instance to the larger child
node (possibly adjusting node sizes for priors). Ties are broken by moving an instance to the
left.
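The sketch below illustrates this mechanism under our assumptions: an agreement-over-default association score (a common formulation that may differ in detail from CART's exact formula) and the fallback routing through ranked surrogates. All names are ours, and missing values are encoded as NaN:

```python
import numpy as np

def association(primary_goes_left, surrogate_goes_left):
    """Association of a surrogate with the primary split: its improvement
    over the default rule that sends every case to the larger child.
    Only candidates scoring above zero qualify as surrogates.
    """
    primary = np.asarray(primary_goes_left, dtype=bool)
    surrogate = np.asarray(surrogate_goes_left, dtype=bool)
    p_default = max(primary.mean(), 1 - primary.mean())
    if p_default == 1.0:                 # degenerate split: nothing to beat
        return 0.0
    p_agree = np.mean(primary == surrogate)
    return (p_agree - p_default) / (1 - p_default)

def route(x, splitter, surrogates, larger_child="left"):
    """Send one instance down, falling back through ranked surrogates when
    values are missing (NaN), then to the default (larger-child) rule.
    `splitter` and each surrogate are (attribute_index, threshold) pairs,
    surrogates already sorted by association; ties are broken to the left.
    """
    for attr, thresh in [splitter] + surrogates:
        if not np.isnan(x[attr]):
            return "left" if x[attr] <= thresh else "right"
    return larger_child
```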
10.5 Attribute importance
The importance of an attribute is based on the sum of the improvements in all nodes in which
the attribute appears as a splitter (weighted by the fraction of the training data in each node
split). Surrogates are also included in the importance calculations, which means that even a
variable that never splits a node may be assigned a large importance score. This allows the
variable importance rankings to reveal variable masking and nonlinear correlation among
the attributes. Importance scores may optionally be confined to splitters, and comparing the
splitters-only and the full importance rankings is a useful diagnostic.
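A minimal sketch of this accounting follows, using a hypothetical node representation (the dict layout below is ours, not CART's internal structure):

```python
from collections import defaultdict

def attribute_importance(nodes, include_surrogates=True):
    """Sum split improvements per attribute, each weighted by the fraction
    of the training data in the node being split. `nodes` is a list of
    dicts such as:
      {"frac": 0.4,                       # fraction of training data here
       "splitter": ("age", 0.12),         # (attribute, improvement)
       "surrogates": [("income", 0.09)]}  # ranked surrogate splits
    """
    scores = defaultdict(float)
    for node in nodes:
        attr, imp = node["splitter"]
        scores[attr] += node["frac"] * imp
        if include_surrogates:
            for attr, imp in node["surrogates"]:
                scores[attr] += node["frac"] * imp
    return dict(scores)
```

Calling this twice, once with include_surrogates=False, yields the splitters-only and full rankings whose comparison is the diagnostic mentioned above.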
10.6 Dynamic feature construction
Friedman [25] discussed the automatic construction of new features within each node and,
for the binary target, recommends adding the single feature x ∗ w, where x is the original
attribute vector and w is a scaled difference-of-means vector across the two classes (the
direction of the Fisher linear discriminant). This is similar to running a logistic regression
on all available attributes in the node and using the estimated logit as a predictor.
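A minimal sketch of this node-level feature under our assumptions (pooled within-class covariance, a small ridge term for stability; `fisher_feature` is our name):

```python
import numpy as np

def fisher_feature(X, y):
    """Friedman's node-level feature x * w for a binary target, with w in
    the direction of the Fisher linear discriminant: the difference of
    class means scaled by the pooled within-class covariance.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    cov = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    # a small ridge term keeps the solve stable in small nodes
    w = np.linalg.solve(cov + 1e-6 * np.eye(X.shape[1]),
                        X1.mean(axis=0) - X0.mean(axis=0))
    return X @ w  # one new candidate splitter value per instance
```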
In the CART monograph, the authors discuss the automatic construction of linear
combinations that include feature selection; this capability has been available from the
first release of the CART software. BFOS also present a method for constructing Boolean