PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
            + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX.
(The abbreviated variable names are given in the second row of the table.) This
is called a regression equation, and the process of determining the weights is
called regression, a well-known procedure in statistics that we will review in
Chapter 4. However, the basic regression method is incapable of discovering
nonlinear relationships (although variants do exist—indeed, one will be
described in Section 6.3), and in Chapter 3 we will examine different represen-
tations that can be used for predicting numeric quantities.
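To make the regression equation concrete, the following sketch applies it directly in code. The weights and intercept are the ones given above; the example attribute values are purely illustrative, not drawn from the actual CPU performance dataset.

```python
# Weights from the regression equation for PRP given in the text.
weights = {
    "MYCT": 0.0489, "MMIN": 0.0153, "MMAX": 0.0056,
    "CACH": 0.6410, "CHMIN": -0.2700, "CHMAX": 1.480,
}
intercept = -55.9

def predict_prp(attrs):
    """Linear combination: PRP = intercept + sum of weight * attribute value."""
    return intercept + sum(weights[name] * value for name, value in attrs.items())

# Illustrative machine: 125 ns cycle time, 256-6000 KB memory, 16 KB cache,
# 4-16 channels (these values are made up for demonstration).
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 16, "CHMIN": 4, "CHMAX": 16}
print(round(predict_prp(example), 1))
```

The prediction is simply a weighted sum, which is why such a model cannot capture nonlinear effects: each attribute contributes independently and proportionally to the output.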
In the iris and central processing unit (CPU) performance data, all the
attributes have numeric values. Practical situations frequently present a mixture
of numeric and nonnumeric attributes.
Labor negotiations: A more realistic example
The labor negotiations dataset in Table 1.6 summarizes the outcome of Cana-
dian contract negotiations in 1987 and 1988. It includes all collective agreements
reached in the business and personal services sector for organizations with at
least 500 members (teachers, nurses, university staff, police, etc.). Each case
concerns one contract, and the outcome is whether the contract is deemed
acceptable or unacceptable. The acceptable contracts are ones in which agreements
were accepted by both labor and management. The unacceptable ones are either
known offers that fell through because one party would not accept them or
acceptable contracts that had been significantly perturbed to the extent that, in
the view of experts, they would not have been accepted.
There are 40 examples in the dataset (plus another 17 which are normally
reserved for test purposes). Unlike the other tables here, Table 1.6 presents the
examples as columns rather than as rows; otherwise, it would have to be
stretched over several pages. Many of the values are unknown or missing, as
indicated by question marks.
This is a much more realistic dataset than the others we have seen. It con-
tains many missing values, and it seems unlikely that an exact classification can
be obtained.
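In practice, the question marks in such a dataset must be mapped to an explicit missing-value representation before any learning scheme can be applied. A minimal sketch (the field values here are illustrative, taken from the first contract in Table 1.6):

```python
# Convert the dataset's "?" marker to None so downstream code can
# distinguish "value is missing" from any legitimate attribute value.
def parse_value(token):
    """Map '?' to None; pass every other token through unchanged."""
    return None if token == "?" else token

# First few fields of contract 1 (duration, wage increases, COLA).
record = ["1", "2%", "?", "?", "none"]
parsed = [parse_value(v) for v in record]
print(parsed)
```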
Figure 1.3 shows two decision trees that represent the dataset. Figure 1.3(a)
is simple and approximate: it doesn’t represent the data exactly. For example, it
will predict bad for some contracts that are actually marked good. But it does
make intuitive sense: a contract is bad (for the employee!) if the wage increase
in the first year is too small (less than 2.5%). If the first-year wage increase is
larger than this, it is good if there are lots of statutory holidays (more than 10
days). Even if there are fewer statutory holidays, it is good if the first-year wage
increase is large enough (more than 4%).
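The simple tree of Figure 1.3(a) can be transcribed almost directly as a small function. The exact comparison directions at each threshold (strict versus non-strict) are an assumption here, since the text only describes the tests informally:

```python
# A transcription of the approximate decision tree of Figure 1.3(a):
# small first-year raise -> bad; otherwise many holidays -> good;
# otherwise good only if the raise is large enough.
def classify(wage_increase_1st_year, statutory_holidays):
    if wage_increase_1st_year <= 2.5:       # raise too small
        return "bad"
    if statutory_holidays > 10:             # lots of statutory holidays
        return "good"
    return "good" if wage_increase_1st_year > 4 else "bad"

print(classify(2.0, 11))   # small raise
print(classify(4.5, 12))   # decent raise, many holidays
print(classify(3.0, 9))    # few holidays, raise not large enough
```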
Figure 1.3(b) is a more complex decision tree that represents the same
dataset. In fact, this is a more accurate representation of the actual dataset that
was used to create the tree. But it is not necessarily a more accurate representa-
tion of the underlying concept of good versus bad contracts. Look down the left
branch. It doesn’t seem to make sense intuitively that, if the working hours
exceed 36, a contract is bad if there is no health-plan contribution or a full
health-plan contribution but is good if there is a half health-plan contribution.
It is certainly reasonable that the health-plan contribution plays a role in the
decision but not if half is good and both full and none are bad. It seems likely
that this is an artifact of the particular values used to create the decision tree
rather than a genuine feature of the good versus bad distinction.
The tree in Figure 1.3(b) is more accurate on the data that was used to train
the classifier but will probably perform less well on an independent set of test
data. It is “overfitted” to the training data—it follows it too slavishly. The tree
in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of
pruning, which we will learn more about in Chapter 6.
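The trade-off can be illustrated with a toy experiment (the contract data below is invented for the purpose, not taken from Table 1.6): a classifier that memorizes every training case is perfect on the training set but generalizes poorly, while the simpler pruned rule sacrifices a little training accuracy for better performance on unseen cases.

```python
# Toy illustration of overfitting. Each case is ((wage increase, holidays), label).
# All values are fabricated for demonstration.
train = [((2.0, 11), "bad"), ((4.5, 12), "good"), ((3.0, 9), "bad"),
         ((5.0, 9), "good"), ((2.0, 14), "good")]
test  = [((2.2, 12), "bad"), ((4.0, 13), "good")]

# "Unpruned" extreme: memorize every training case exactly,
# falling back to an arbitrary default for unseen cases.
memory = dict(train)
def memorizer(x):
    return memory.get(x, "bad")

# "Pruned": the simple rule in the spirit of Figure 1.3(a).
def pruned(x):
    wage, holidays = x
    if wage <= 2.5:
        return "bad"
    return "good" if holidays > 10 or wage > 4 else "bad"

def accuracy(clf, data):
    return sum(clf(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train), accuracy(memorizer, test))
print(accuracy(pruned, train), accuracy(pruned, test))
```

On this toy data the memorizer scores higher on the training set but lower on the test set than the pruned rule, which is exactly the pattern described above for Figures 1.3(b) and 1.3(a).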
Soybean classification: A classic machine learning success
An often-quoted early success story in the application of machine learning to
practical problems is the identification of rules for diagnosing soybean diseases.
The data is taken from questionnaires describing plant diseases. There are about
Table 1.6  The labor negotiations data.

Attribute                        Type                         1      2      3      ...   40
-------------------------------------------------------------------------------------------
duration                         years                        1      2      3            2
wage increase 1st year           percentage                   2%     4%     4.3%         4.5
wage increase 2nd year           percentage                   ?      5%     4.4%         4.0
wage increase 3rd year           percentage                   ?      ?      ?            ?
cost of living adjustment        {none, tcf, tc}              none   tcf    ?            none
working hours per week           hours                        28     35     38           40
pension                          {none, ret-allw, empl-cntr}  none   ?      ?            ?
standby pay                      percentage                   ?      13%    ?            ?
shift-work supplement            percentage                   ?      5%     4%           4
education allowance              {yes, no}                    yes    ?      ?            ?
statutory holidays               days                         11     15     12           12
vacation                         {below-avg, avg, gen}        avg    gen    gen          avg
long-term disability assistance  {yes, no}                    no     ?      ?            yes
dental plan contribution         {none, half, full}           none   ?      full         full
bereavement assistance           {yes, no}                    no     ?      ?            yes
health plan contribution         {none, half, full}           none   ?      full         half
acceptability of contract        {good, bad}                  bad    good   good         good