S u rv e y pa p e r

Yüklə 471,61 Kb.

Pdf görüntüsü

səhifə	2/16
tarix	08.10.2017
ölçüsü	471,61 Kb.
	#3814

1 2 3 4 5 6 7 8 9 ... 16

X. Wu et al.

There are usually many tests that could be chosen in this last step. C4.5 uses two heuristic

criteria to rank possible tests: information gain, which minimizes the total entropy of the

subsets {S

i

} (but is heavily biased towards tests with numerous outcomes), and the default

gain ratio that divides information gain by the information provided by the test outcomes.

Attributes can be either numeric or nominal and this determines the format of the test

outcomes. For a numeric attribute A they are { A

≤ h, A > h} where the threshold h is found

by sorting S on the values of A and choosing the split between successive values that max-

imizes the criterion above. An attribute A with discrete values has by default one outcome

for each value, but an option allows the values to be grouped into two or more subsets with

one outcome for each subset.

The initial tree is then pruned to avoid overﬁtting. The pruning algorithm is based on a

pessimistic estimate of the error rate associated with a set of N cases, E of which do not

belong to the most frequent class. Instead of E

/N, C4.5 determines the upper limit of the

binomial probability when E events have been observed in N trials, using a user-speciﬁed

conﬁdence whose default value is 0.25.

Pruning is carried out from the leaves to the root. The estimated error at a leaf with N

cases and E errors is N times the pessimistic error rate as above. For a subtree, C4.5 adds

the estimated errors of the branches and compares this to the estimated error if the subtree is

replaced by a leaf; if the latter is no higher than the former, the subtree is pruned. Similarly,

C4.5 checks the estimated error if the subtree is replaced by one of its branches and when

this appears beneﬁcial the tree is modiﬁed accordingly. The pruning process is completed in

one pass through the tree.

C4.5’s tree-construction algorithm differs in several respects from CART [

], for instance:

• Tests in CART are always binary, but C4.5 allows two or more outcomes.

• CART uses the Gini diversity index to rank tests, whereas C4.5 uses information-based

criteria.

• CART prunes trees using a cost-complexity model whose parameters are estimated by

cross-validation; C4.5 uses a single-pass algorithm derived from binomial conﬁdence

limits.

• This brief discussion has not mentioned what happens when some of a case’s values are

unknown. CART looks for surrogate tests that approximate the outcomes when the tested

attribute has an unknown value, but C4.5 apportions the case probabilistically among the

outcomes.

1.3 Ruleset classiﬁers

Complex decision trees can be difﬁcult to understand, for instance because information about

one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism

consisting of a list of rules of the form “if A and B and C and ... then class X”, where rules for

each class are grouped together. A case is classiﬁed by ﬁnding the ﬁrst rule whose conditions

are satisﬁed by the case; if no rule is satisﬁed, the case is assigned to a default class.

C4.5 rulesets are formed from the initial (unpruned) decision tree. Each path from the root

of the tree to a leaf becomes a prototype rule whose conditions are the outcomes along the path

and whose class is the label of the leaf. This rule is then simpliﬁed by determining the effect of

discarding each condition in turn. Dropping a condition may increase the number N of cases

covered by the rule, and also the number E of cases that do not belong to the class nominated

by the rule, and may lower the pessimistic error rate determined as above. A hill-climbing

algorithm is used to drop conditions until the lowest pessimistic error rate is found.

123

Top 10 algorithms in data mining

To complete the process, a subset of simpliﬁed rules is selected for each class in turn.

These class subsets are ordered to minimize the error on the training cases and a default

class is chosen. The ﬁnal ruleset usually has far fewer rules than the number of leaves on the

pruned decision tree.

The principal disadvantage of C4.5’s rulesets is the amount of CPU time and memory that

they require. In one experiment, samples ranging from 10,000 to 100,000 cases were drawn

from a large dataset. For decision trees, moving from 10 to 100K cases increased CPU time

on a PC from 1.4 to 61 s, a factor of 44. The time required for rulesets, however, increased

from 32 to 9,715 s, a factor of 300.

1.4 See5/C5.0

C4.5 was superseded in 1997 by a commercial system See5/C5.0 (or C5.0 for short). The

changes encompass new capabilities as well as much-improved efﬁciency, and include:

• A variant of boosting [

], which constructs an ensemble of classiﬁers that are then voted

to give a ﬁnal classiﬁcation. Boosting often leads to a dramatic improvement in predictive

accuracy.

• New data types (e.g., dates), “not applicable” values, variable misclassiﬁcation costs, and

mechanisms to pre-ﬁlter attributes.

• Unordered rulesets—when a case is classiﬁed, all applicable rules are found and voted.

This improves both the interpretability of rulesets and their predictive accuracy.

• Greatly improved scalability of both decision trees and (particularly) rulesets. Scalabili-

ty is enhanced by multi-threading; C5.0 can take advantage of computers with multiple

CPUs and/or cores.

More details are available from

http://rulequest.com/see5-comparison.html.

1.5 Research issues

We have frequently heard colleagues express the view that decision trees are a “solved prob-

lem.” We do not agree with this proposition and will close with a couple of open research

problems.

Stable trees. It is well known that the error rate of a tree on the cases from which it was con-

structed (the resubstitution error rate) is much lower than the error rate on unseen cases (the

predictive error rate). For example, on a well-known letter recognition dataset with 20,000

cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out

(20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from

20,000 often affects the tree that is constructed!

Suppose now that we could develop a non-trivial tree-construction algorithm that was

hardly ever affected by omitting a single case. For such stable trees, the resubstitution error

rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree

is of the “right” size.

Decomposing complex trees. Ensemble classiﬁers, whether generated by boosting, bag-

ging, weight randomization, or other techniques, usually offer improved predictive accuracy.

Now, given a small number of decision trees, it is possible to generate a single (very complex)

tree that is exactly equivalent to voting the original trees, but can we go the other way? That

is, can a complex tree be broken down to a small collection of simple trees that, when voted

together, give the same result as the complex tree? Such decomposition would be of great

help in producing comprehensible decision trees.

123

Yüklə 471,61 Kb.

Dostları ilə paylaş:

1 2 3 4 5 6 7 8 9 ... 16