Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
             + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX.
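To make the arithmetic concrete, here is a minimal sketch (not the book's code) that evaluates this regression equation as a weighted sum of attribute values plus an intercept. The machine's attribute values below are chosen only for illustration.

```python
# Weights and intercept copied from the regression equation above.
weights = {
    "MYCT": 0.0489, "MMIN": 0.0153, "MMAX": 0.0056,
    "CACH": 0.6410, "CHMIN": -0.2700, "CHMAX": 1.480,
}
intercept = -55.9

def predict_prp(machine):
    """Predicted performance: intercept plus the weighted attribute values."""
    return intercept + sum(w * machine[name] for name, w in weights.items())

# Hypothetical attribute values, chosen only for illustration.
machine = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(machine), 1))
```

The prediction is linear in every attribute, which is exactly why, as noted below, this method cannot capture nonlinear relationships on its own.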

(The abbreviated variable names are given in the second row of the table.) This is called a regression equation, and the process of determining the weights is called regression, a well-known procedure in statistics that we will review in Chapter 4. However, the basic regression method is incapable of discovering nonlinear relationships (although variants do exist; indeed, one will be described in Section 6.3), and in Chapter 3 we will examine different representations that can be used for predicting numeric quantities.

In the iris and central processing unit (CPU) performance data, all the attributes have numeric values. Practical situations frequently present a mixture of numeric and nonnumeric attributes.

Labor negotiations: A more realistic example

The labor negotiations dataset in Table 1.6 summarizes the outcome of Canadian contract negotiations in 1987 and 1988. It includes all collective agreements reached in the business and personal services sector for organizations with at least 500 members (teachers, nurses, university staff, police, etc.). Each case concerns one contract, and the outcome is whether the contract is deemed acceptable or unacceptable. The acceptable contracts are ones in which agreements were accepted by both labor and management. The unacceptable ones are either known offers that fell through because one party would not accept them or acceptable contracts that had been significantly perturbed to the extent that, in the view of experts, they would not have been accepted.

There are 40 examples in the dataset (plus another 17 that are normally reserved for test purposes). Unlike the other tables here, Table 1.6 presents the examples as columns rather than as rows; otherwise, it would have to be stretched over several pages. Many of the values are unknown or missing, as indicated by question marks.

This is a much more realistic dataset than the others we have seen. It contains many missing values, and it seems unlikely that an exact classification can be obtained.

Figure 1.3 shows two decision trees that represent the dataset. Figure 1.3(a) is simple and approximate: it doesn't represent the data exactly. For example, it will predict bad for some contracts that are actually marked good. But it does make intuitive sense: a contract is bad (for the employee!) if the wage increase in the first year is too small (less than 2.5%). If the first-year wage increase is larger than this, it is good if there are lots of statutory holidays (more than 10 days). Even if there are fewer statutory holidays, it is good if the first-year wage increase is large enough (more than 4%).
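The nested tests of the simple tree in Figure 1.3(a) can be sketched directly as code. This is an illustrative rendering, not the book's implementation; the thresholds come from the description above, and the parameter names are invented for readability.

```python
def classify(wage_increase_first_year, statutory_holidays):
    """Mimics the tree of Figure 1.3(a): each 'if' is one internal node."""
    if wage_increase_first_year < 2.5:
        return "bad"      # first-year raise too small
    if statutory_holidays > 10:
        return "good"     # plenty of statutory holidays
    if wage_increase_first_year > 4:
        return "good"     # a large raise compensates for few holidays
    return "bad"

print(classify(2.0, 12))  # -> bad
print(classify(3.0, 12))  # -> good
print(classify(3.0, 9))   # -> bad
print(classify(4.5, 9))   # -> good
```

Reading a tree this way, as a chain of conditionals, makes it easy to see why the approximate tree is intuitively plausible even though it misclassifies some training examples.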



Figure 1.3(b) is a more complex decision tree that represents the same dataset. In fact, this is a more accurate representation of the actual dataset that was used to create the tree. But it is not necessarily a more accurate representation of the underlying concept of good versus bad contracts. Look down the left branch. It doesn't seem to make sense intuitively that, if the working hours exceed 36, a contract is bad if there is no health-plan contribution or a full health-plan contribution but is good if there is a half health-plan contribution. It is certainly reasonable that the health-plan contribution plays a role in the decision, but not if half is good and both full and none are bad. It seems likely that this is an artifact of the particular values used to create the decision tree rather than a genuine feature of the good versus bad distinction.

The tree in Figure 1.3(b) is more accurate on the data that was used to train the classifier but will probably perform less well on an independent set of test data. It is "overfitted" to the training data: it follows the data too slavishly. The tree in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of pruning, which we will learn more about in Chapter 6.



Soybean classification: A classic machine learning success

An often-quoted early success story in the application of machine learning to practical problems is the identification of rules for diagnosing soybean diseases. The data is taken from questionnaires describing plant diseases. There are about



Table 1.6  The labor negotiations data.

Attribute                        Type                         1      2      3      ...  40
duration                         years                        1      2      3           2
wage increase 1st year           percentage                   2%     4%     4.3%        4.5
wage increase 2nd year           percentage                   ?      5%     4.4%        4.0
wage increase 3rd year           percentage                   ?      ?      ?           ?
cost of living adjustment        {none, tcf, tc}              none   tcf    ?           none
working hours per week           hours                        28     35     38          40
pension                          {none, ret-allw, empl-cntr}  none   ?      ?           ?
standby pay                      percentage                   ?      13%    ?           ?
shift-work supplement            percentage                   ?      5%     4%          4
education allowance              {yes, no}                    yes    ?      ?           ?
statutory holidays               days                         11     15     12          12
vacation                         {below-avg, avg, gen}        avg    gen    gen         avg
long-term disability assistance  {yes, no}                    no     ?      ?           yes
dental plan contribution         {none, half, full}           none   ?      full        full
bereavement assistance           {yes, no}                    no     ?      ?           yes
health plan contribution         {none, half, full}           none   ?      full        half
acceptability of contract        {good, bad}                  bad    good   good        good
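One plausible way to hold instances like those in Table 1.6 in memory, sketched here as an assumption rather than anything the book prescribes, is a mapping from attribute names to values in which the table's question marks become None, so that missing values can be tested for explicitly. The values below are those of contract 1 (a few attributes only).

```python
# Hypothetical encoding of contract 1 from Table 1.6; '?' becomes None.
contract_1 = {
    "duration": 1,
    "wage increase 1st year": 2.0,
    "wage increase 2nd year": None,   # '?' in the table
    "cost of living adjustment": "none",
    "working hours per week": 28,
    "statutory holidays": 11,
    "acceptability of contract": "bad",
}

# Which attributes are missing for this contract?
missing = [name for name, value in contract_1.items() if value is None]
print(missing)
```

Making missingness explicit like this matters because, as the text notes, realistic datasets such as this one are full of unknown values, and learning methods must cope with them rather than assume complete data.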




