standard datasets that we will come back to repeatedly. Different
datasets tend
to expose new issues and challenges, and it is interesting and instructive to have
in mind a variety of problems when considering learning methods. In fact, the
need to work with different datasets is so important that a corpus containing
around 100 example problems has been gathered together so that different algo-
rithms can be tested and compared on the same set of problems.
The illustrations in this section are all unrealistically simple. Serious appli-
cation of data mining involves thousands, hundreds of thousands, or even mil-
lions of individual cases. But when explaining what algorithms do and how they
work, we need simple examples that capture the essence of the problem but are
small enough to be comprehensible in every detail. We will be working with the
illustrations in this section throughout the book, and they are intended to be
“academic” in the sense that they will help us to understand what is going on.
Some actual fielded applications of learning techniques are discussed in Section
1.3, and many more are covered in the books mentioned in the Further reading
section at the end of the chapter.
Another problem with actual real-life datasets is that they are often propri-
etary. No one is going to share their customer and product choice database with
you so that you can understand the details of their data mining application and
how it works. Corporate data is a valuable asset, one whose value has increased
enormously with the development of data mining techniques such as those
described in this book. Yet we are concerned here with understanding how the
methods used for data mining work and understanding the details of these
methods so that we can trace their operation on actual data. That is why our
illustrations are simple ones. But they are not simplistic: they exhibit the fea-
tures of real datasets.
The weather problem
The weather problem is a tiny dataset that we will use repeatedly to illustrate
machine learning methods. Entirely fictitious, it supposedly concerns the con-
ditions that are suitable for playing some unspecified game. In general, instances
in a dataset are characterized by the values of features, or attributes, that measure
different aspects of the instance. In this case there are four attributes: outlook,
temperature, humidity, and
windy. The outcome is whether to play or not.
In its simplest form, shown in Table 1.2, all four attributes have values that
are symbolic categories rather than numbers. Outlook can be sunny, overcast, or
rainy; temperature can be hot, mild, or cool; humidity can be high or normal;
and windy can be true or false. This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.
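The combination count is easy to verify with a few lines of Python (a sketch; the value lists simply transcribe the attribute descriptions above):

```python
# Enumerate every possible combination of nominal attribute values
# for the weather data and confirm there are 3 * 3 * 2 * 2 = 36.
from itertools import product

outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = [True, False]

combinations = list(product(outlook, temperature, humidity, windy))
print(len(combinations))  # -> 36
```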
A set of rules learned from this information—not necessarily a very good
one—might look as follows:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
These rules are meant to be interpreted in order: the first one, then if it doesn’t
apply the second, and so on. A set of rules that are intended to be interpreted
in sequence is called a decision list. Interpreted as a decision list, the rules
correctly classify all of the examples in the table, whereas taken individually, out
of context, some of the rules are incorrect. For example, the rule
if humidity = normal then play = yes gets one of the examples wrong (check which one).
The meaning of a set of rules depends on how it is interpreted—not
surprisingly!
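Sequential interpretation is straightforward to express in code. The following sketch encodes the five rules above as a Python function (the function name and instance encoding are illustrative, not from the book); because each if returns as soon as its condition matches, later rules are only reached when all earlier ones fail to apply, exactly as a decision list requires:

```python
def classify(instance):
    """Apply the weather rules in sequence; the first rule whose
    condition matches determines the class."""
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    if instance["outlook"] == "rainy" and instance["windy"]:
        return "no"
    if instance["outlook"] == "overcast":
        return "yes"
    if instance["humidity"] == "normal":
        return "yes"
    return "yes"  # default: none of the above

# Example: the first row of Table 1.2.
print(classify({"outlook": "sunny", "temperature": "hot",
                "humidity": "high", "windy": False}))  # -> no
```

Note that the offending instance for the humidity rule is caught by an earlier rule when the list is applied in order, which is why the list as a whole classifies every example correctly even though that rule, taken alone, does not.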
In the slightly more complex form shown in Table 1.3, two of the attributes—
temperature and humidity—have numeric values. This means that any learn-
ing method must create inequalities involving these attributes rather than
simple equality tests, as in the former case. This is called a numeric-attribute
problem—in this case, a mixed-attribute problem because not all attributes are
numeric.
Now the first rule given earlier might take the following form:
If outlook = sunny and humidity > 83 then play = no
A slightly more complex process is required to come up with rules that involve
numeric tests.
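One common way a learner finds such a threshold (a sketch of the general idea, not the specific algorithm used by any method in this book) is to sort the training values of the numeric attribute and consider a candidate split point midway between each pair of adjacent values where the class changes:

```python
def candidate_thresholds(values, classes):
    """Return midpoints between adjacent sorted values whose
    classes differ -- the only places a split can gain anything."""
    pairs = sorted(zip(values, classes))
    thresholds = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

# Illustrative humidity readings (invented, not the actual Table 1.3 data).
humidity = [65, 70, 70, 75, 80, 85, 90, 95]
play = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no"]
print(candidate_thresholds(humidity, play))  # -> [82.5]
```

A learner would then score each candidate threshold (for instance by how cleanly it separates the classes) and keep the best one, which is the "slightly more complex process" referred to above.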
Table 1.2  The weather data.

Outlook    Temperature  Humidity  Windy  Play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no