may be associated with the rules themselves to indicate that some are more
important, or more reliable, than others.
You might be wondering whether there is a smaller rule set that performs as
well. If so, would you be better off using the smaller rule set, and why?
These are exactly the kinds of questions that will occupy us in this book. Because
the examples form a complete set for the problem space, the rules do no more
than summarize all the information that is given, expressing it in a different and
more concise way. Even though it involves no generalization, this is often a very
useful thing to do! People frequently use machine learning techniques to gain
insight into the structure of their data rather than to make predictions for new
cases. In fact, a prominent and successful line of research in machine learning
began as an attempt to compress a huge database of possible chess endgames
and their outcomes into a data structure of reasonable size. The data structure
chosen for this enterprise was not a set of rules but a decision tree.
Figure 1.2 shows a structural description for the contact lens data in the form
of a decision tree, which for many purposes is a more concise and perspicuous
representation of the rules and has the advantage that it can be visualized more
easily. (However, this decision tree—in contrast to the rule set given in Figure
1.1—classifies two examples incorrectly.) The tree calls first for a test on tear
production rate, and the first two branches correspond to the two possible
outcomes. If tear production rate is reduced (the left branch), the outcome is
none. If it is normal (the right branch), a second test is made, this time on
astigmatism. Eventually, whatever the outcome of the tests, a leaf of the tree is reached
[Figure 1.2 Decision tree for the contact lens data: the root tests tear production rate (reduced → none; normal → test astigmatism: no → soft; yes → test spectacle prescription: myope → hard; hypermetrope → none).]
that dictates the contact lens recommendation for that case. The question of
what is the most natural and easily understood format for the output from a
machine learning scheme is one that we will return to in Chapter 3.
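To make the traversal just described concrete, the tree of Figure 1.2 can be sketched as a short piece of code. This is a minimal illustration, not something from the book itself; the function name and the lowercase string encoding of attribute values are assumptions.

```python
def classify(tear_production_rate, astigmatism, spectacle_prescription):
    """Recommend a contact lens type by walking the decision tree of Figure 1.2.

    Attribute values are assumed to be lowercase strings, e.g. "reduced"/"normal".
    """
    # Root test: tear production rate.
    if tear_production_rate == "reduced":
        return "none"
    # Tear production rate is normal: test astigmatism next.
    if astigmatism == "no":
        return "soft"
    # Astigmatic: the recommendation depends on the spectacle prescription.
    if spectacle_prescription == "myope":
        return "hard"
    return "none"  # hypermetrope
```

Each nested `if` corresponds to one internal node of the tree, and each `return` to a leaf, which is exactly how a learned decision tree is executed on a new case.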
Irises: A classic numeric dataset
The iris dataset, which dates back to seminal work by the eminent statistician
R.A. Fisher in the mid-1930s and is arguably the most famous dataset used in
data mining, contains 50 examples each of three types of plant: Iris setosa,
Iris versicolor, and Iris virginica. It is excerpted in Table 1.4. There are four
attributes: sepal length, sepal width, petal length, and petal width (all measured
in centimeters). Unlike previous datasets, all attributes have values that are numeric.
The following set of rules might be learned from this dataset:
If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
If sepal width < 2.45 and petal length < 4.55 then Iris versicolor
If sepal width < 2.95 and petal width < 1.35 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.45 then Iris versicolor
If sepal length ≥ 5.85 and petal length < 4.75 then Iris versicolor
Table 1.4 The iris data.

        Sepal         Sepal        Petal         Petal
        length (cm)   width (cm)   length (cm)   width (cm)   Type
1       5.1           3.5          1.4           0.2          Iris setosa
2       4.9           3.0          1.4           0.2          Iris setosa
3       4.7           3.2          1.3           0.2          Iris setosa
4       4.6           3.1          1.5           0.2          Iris setosa
5       5.0           3.6          1.4           0.2          Iris setosa
. . .
51      7.0           3.2          4.7           1.4          Iris versicolor
52      6.4           3.2          4.5           1.5          Iris versicolor
53      6.9           3.1          4.9           1.5          Iris versicolor
54      5.5           2.3          4.0           1.3          Iris versicolor
55      6.5           2.8          4.6           1.5          Iris versicolor
. . .
101     6.3           3.3          6.0           2.5          Iris virginica
102     5.8           2.7          5.1           1.9          Iris virginica
103     7.1           3.0          5.9           2.1          Iris virginica
104     6.3           2.9          5.6           1.8          Iris virginica
105     6.5           3.0          5.8           2.2          Iris virginica
. . .
If sepal width < 2.55 and petal length < 4.95 and petal width < 1.55 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.95 and petal width < 1.55 then Iris versicolor
If sepal length ≥ 6.55 and petal length < 5.05 then Iris versicolor
If sepal width < 2.75 and petal width < 1.65 and sepal length < 6.05 then Iris versicolor
If sepal length ≥ 5.85 and sepal length < 5.95 and petal length < 4.85 then Iris versicolor
If petal length ≥ 5.15 then Iris virginica
If petal width ≥ 1.85 then Iris virginica
If petal width ≥ 1.75 and sepal width < 3.05 then Iris virginica
If petal length ≥ 4.95 and petal width < 1.55 then Iris virginica
These rules are very cumbersome, and we will see in Chapter 3 how more
compact rules can be expressed that convey the same information.
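To see how such a rule set would actually be applied to a new flower, the rules can be encoded as an ordered list of conditions tried from top to bottom. First-match semantics and all names in this sketch are assumptions; the text does not specify how overlapping rules are resolved.

```python
# Each rule pairs a predicate over (sepal_length, sepal_width, petal_length,
# petal_width) with the class it predicts. Trying rules in order and returning
# the first match is an assumed conflict-resolution strategy, not one the
# text prescribes.
RULES = [
    (lambda sl, sw, pl, pw: pl < 2.45, "Iris setosa"),
    (lambda sl, sw, pl, pw: sw < 2.10, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.45 and pl < 4.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.95 and pw < 1.35, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.45, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sl >= 5.85 and pl < 4.75, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.55 and pl < 4.95 and pw < 1.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.95 and pw < 1.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sl >= 6.55 and pl < 5.05, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.75 and pw < 1.65 and sl < 6.05, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 5.85 <= sl < 5.95 and pl < 4.85, "Iris versicolor"),
    (lambda sl, sw, pl, pw: pl >= 5.15, "Iris virginica"),
    (lambda sl, sw, pl, pw: pw >= 1.85, "Iris virginica"),
    (lambda sl, sw, pl, pw: pw >= 1.75 and sw < 3.05, "Iris virginica"),
    (lambda sl, sw, pl, pw: pl >= 4.95 and pw < 1.55, "Iris virginica"),
]

def classify_iris(sl, sw, pl, pw):
    """Return the class of the first rule whose condition fires, else None."""
    for condition, iris_type in RULES:
        if condition(sl, sw, pl, pw):
            return iris_type
    return None  # no rule covers this instance
```

Running this on the excerpted rows of Table 1.4 (for example, instance 1 with measurements 5.1, 3.5, 1.4, 0.2) reproduces the listed classes, which illustrates how the rules summarize the dataset.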
CPU performance: Introducing numeric prediction
Although the iris dataset involves numeric attributes, the outcome—the type of
iris—is a category, not a numeric value. Table 1.5 shows some data for which
the outcome and the attributes are numeric. It concerns the relative performance
of computer processors, predicted on the basis of a number of relevant
attributes; each row represents 1 of 209 different computer configurations.
The classic way of dealing with continuous prediction is to write the outcome
as a linear sum of the attribute values with appropriate weights, for example:
Table 1.5 The CPU performance data.

       Cycle       Main memory (KB)     Cache    Channels
       time (ns)   Min.      Max.       (KB)     Min.    Max.    Performance
       MYCT        MMIN      MMAX       CACH     CHMIN   CHMAX   PRP
1      125         256       6000       256      16      128     198
2      29          8000      32000      32       8       32      269
3      29          8000      32000      32       8       32      220
4      29          8000      32000      32       8       32      172
5      29          8000      16000      32       8       16      132
. . .
207    125         2000      8000       0        2       14      52
208    480         512       8000       32       0       0       67
209    480         1000      4000       0        0       0       45
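The linear-sum approach can be sketched in code using the attribute names from Table 1.5. The weights and intercept below are arbitrary placeholders chosen for illustration only; they are not coefficients fitted to these data.

```python
# A linear model predicts performance (PRP) as a weighted sum of the
# attribute values plus a constant. These weights are made-up placeholders,
# NOT coefficients learned from the CPU performance data.
WEIGHTS = {"MYCT": -0.05, "MMIN": 0.02, "MMAX": 0.005,
           "CACH": 0.6, "CHMIN": -0.3, "CHMAX": 1.4}
INTERCEPT = -50.0

def predict_prp(config):
    """config maps attribute names (MYCT, MMIN, ...) to numeric values."""
    return INTERCEPT + sum(w * config[name] for name, w in WEIGHTS.items())

# Example: predicted performance for configuration 1 of Table 1.5.
machine_1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
             "CACH": 256, "CHMIN": 16, "CHMAX": 128}
prediction = predict_prp(machine_1)
```

Learning such a model amounts to choosing the weights and intercept that make the predicted values match the observed performance figures as closely as possible, a task taken up later in the text.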