PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
            + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX.
(The abbreviated variable names are given in the second row of the table.) This
is called a regression equation, and the process of determining the weights is
called regression, a well-known procedure in statistics that we will review in
Chapter 4. However, the basic regression method is incapable of discovering
nonlinear relationships (although variants do exist—indeed, one will be
described in Section 6.3), and in Chapter 3 we will examine different represen-
tations that can be used for predicting numeric quantities.
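To make the regression equation concrete, the following sketch applies it directly in code. The weights and intercept are the ones given above; the example attribute values are purely illustrative, not drawn from the actual CPU performance dataset.

```python
# Weights from the regression equation for PRP given in the text.
weights = {
    "MYCT": 0.0489, "MMIN": 0.0153, "MMAX": 0.0056,
    "CACH": 0.6410, "CHMIN": -0.2700, "CHMAX": 1.480,
}
intercept = -55.9

def predict_prp(attrs):
    """Linear combination: PRP = intercept + sum of weight * attribute value."""
    return intercept + sum(weights[name] * value for name, value in attrs.items())

# Illustrative machine: 125 ns cycle time, 256-6000 KB memory, 16 KB cache,
# 4-16 channels (these values are made up for demonstration).
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 16, "CHMIN": 4, "CHMAX": 16}
print(round(predict_prp(example), 1))
```

The prediction is simply a weighted sum, which is why such a model cannot capture nonlinear effects: each attribute contributes independently and proportionally to the output.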
In the iris and central processing unit (CPU) performance data, all the
attributes have numeric values. Practical situations frequently present a mixture
of numeric and nonnumeric attributes.
Labor negotiations: A more realistic example
The labor negotiations dataset in Table 1.6 summarizes the outcome of Cana-
dian contract negotiations in 1987 and 1988. It includes all collective agreements
reached in the business and personal services sector for organizations with at
least 500 members (teachers, nurses, university staff, police, etc.). Each case
concerns one contract, and the outcome is whether the contract is deemed
acceptable or unacceptable. The acceptable contracts are ones in which agreements
were accepted by both labor and management. The unacceptable ones are either
known offers that fell through because one party would not accept them or
acceptable contracts that had been significantly perturbed to the extent that, in
the view of experts, they would not have been accepted.
There are 40 examples in the dataset (plus another 17 which are normally
reserved for test purposes). Unlike the other tables here, Table 1.6 presents the
examples as columns rather than as rows; otherwise, it would have to be
stretched over several pages. Many of the values are unknown or missing, as
indicated by question marks.
This is a much more realistic dataset than the others we have seen. It con-
tains many missing values, and it seems unlikely that an exact classification can
be obtained.
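In practice, the question marks in such a dataset must be mapped to an explicit missing-value representation before any learning scheme can be applied. A minimal sketch (the field values here are illustrative, taken from the first contract in Table 1.6):

```python
# Convert the dataset's "?" marker to None so downstream code can
# distinguish "value is missing" from any legitimate attribute value.
def parse_value(token):
    """Map '?' to None; pass every other token through unchanged."""
    return None if token == "?" else token

# First few fields of contract 1 (duration, wage increases, COLA).
record = ["1", "2%", "?", "?", "none"]
parsed = [parse_value(v) for v in record]
print(parsed)
```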
Figure 1.3 shows two decision trees that represent the dataset. Figure 1.3(a)
is simple and approximate: it doesn’t represent the data exactly. For example, it
will predict bad for some contracts that are actually marked good. But it does
make intuitive sense: a contract is bad (for the employee!) if the wage increase
in the first year is too small (less than 2.5%). If the first-year wage increase is
larger than this, it is good if there are lots of statutory holidays (more than 10
days). Even if there are fewer statutory holidays, it is good if the first-year wage
increase is large enough (more than 4%).
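The simple tree of Figure 1.3(a) can be transcribed almost directly as a small function. The exact comparison directions at each threshold (strict versus non-strict) are an assumption here, since the text only describes the tests informally:

```python
# A transcription of the approximate decision tree of Figure 1.3(a):
# small first-year raise -> bad; otherwise many holidays -> good;
# otherwise good only if the raise is large enough.
def classify(wage_increase_1st_year, statutory_holidays):
    if wage_increase_1st_year <= 2.5:       # raise too small
        return "bad"
    if statutory_holidays > 10:             # lots of statutory holidays
        return "good"
    return "good" if wage_increase_1st_year > 4 else "bad"

print(classify(2.0, 11))   # small raise
print(classify(4.5, 12))   # decent raise, many holidays
print(classify(3.0, 9))    # few holidays, raise not large enough
```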
Figure 1.3(b) is a more complex decision tree that represents the same
dataset. In fact, this is a more accurate representation of the actual dataset that
was used to create the tree. But it is not necessarily a more accurate representa-
tion of the underlying concept of good versus bad contracts. Look down the left
branch. It doesn’t seem to make sense intuitively that, if the working hours
exceed 36, a contract is bad if there is no health-plan contribution or a full
health-plan contribution but is good if there is a half health-plan contribution.
It is certainly reasonable that the health-plan contribution plays a role in the
decision but not if half is good and both full and none are bad. It seems likely
that this is an artifact of the particular values used to create the decision tree
rather than a genuine feature of the good versus bad distinction.
The tree in Figure 1.3(b) is more accurate on the data that was used to train
the classifier but will probably perform less well on an independent set of test
data. It is “overfitted” to the training data—it follows it too slavishly. The tree
in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of
pruning, which we will learn more about in Chapter 6.
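The trade-off can be illustrated with a toy experiment (the contract data below is invented for the purpose, not taken from Table 1.6): a classifier that memorizes every training case is perfect on the training set but generalizes poorly, while the simpler pruned rule sacrifices a little training accuracy for better performance on unseen cases.

```python
# Toy illustration of overfitting. Each case is ((wage increase, holidays), label).
# All values are fabricated for demonstration.
train = [((2.0, 11), "bad"), ((4.5, 12), "good"), ((3.0, 9), "bad"),
         ((5.0, 9), "good"), ((2.0, 14), "good")]
test  = [((2.2, 12), "bad"), ((4.0, 13), "good")]

# "Unpruned" extreme: memorize every training case exactly,
# falling back to an arbitrary default for unseen cases.
memory = dict(train)
def memorizer(x):
    return memory.get(x, "bad")

# "Pruned": the simple rule in the spirit of Figure 1.3(a).
def pruned(x):
    wage, holidays = x
    if wage <= 2.5:
        return "bad"
    return "good" if holidays > 10 or wage > 4 else "bad"

def accuracy(clf, data):
    return sum(clf(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train), accuracy(memorizer, test))
print(accuracy(pruned, train), accuracy(pruned, test))
```

On this toy data the memorizer scores higher on the training set but lower on the test set than the pruned rule, which is exactly the pattern described above for Figures 1.3(b) and 1.3(a).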
Soybean classification: A classic machine learning success
An often-quoted early success story in the application of machine learning to
practical problems is the identification of rules for diagnosing soybean diseases.
The data is taken from questionnaires describing plant diseases. There are about
Table 1.6  The labor negotiations data.

Attribute                        Type                         1      2      3      ...   40
-------------------------------------------------------------------------------------------
duration                         years                        1      2      3            2
wage increase 1st year           percentage                   2%     4%     4.3%         4.5
wage increase 2nd year           percentage                   ?      5%     4.4%         4.0
wage increase 3rd year           percentage                   ?      ?      ?            ?
cost of living adjustment        {none, tcf, tc}              none   tcf    ?            none
working hours per week           hours                        28     35     38           40
pension                          {none, ret-allw, empl-cntr}  none   ?      ?            ?
standby pay                      percentage                   ?      13%    ?            ?
shift-work supplement            percentage                   ?      5%     4%           4
education allowance              {yes, no}                    yes    ?      ?            ?
statutory holidays               days                         11     15     12           12
vacation                         {below-avg, avg, gen}        avg    gen    gen          avg
long-term disability assistance  {yes, no}                    no     ?      ?            yes
dental plan contribution         {none, half, full}           none   ?      full         full
bereavement assistance           {yes, no}                    no     ?      ?            yes
health plan contribution         {none, half, full}           none   ?      full         half
acceptability of contract        {good, bad}                  bad    good   good         good