standard datasets that we will come back to repeatedly. Different
datasets tend
to expose new issues and challenges, and it is interesting and instructive to have
in mind a variety of problems when considering learning methods. In fact, the
need to work with different datasets is so important that a corpus containing
around 100 example problems has been gathered together so that different algo-
rithms can be tested and compared on the same set of problems.
The illustrations in this section are all unrealistically simple. Serious appli-
cation of data mining involves thousands, hundreds of thousands, or even mil-
lions of individual cases. But when explaining what algorithms do and how they
work, we need simple examples that capture the essence of the problem but are
small enough to be comprehensible in every detail. We will be working with the
illustrations in this section throughout the book, and they are intended to be
“academic” in the sense that they will help us to understand what is going on.
Some actual fielded applications of learning techniques are discussed in Section
1.3, and many more are covered in the books mentioned in the Further reading
section at the end of the chapter.
Another problem with actual real-life datasets is that they are often propri-
etary. No one is going to share their customer and product choice database with
you so that you can understand the details of their data mining application and
how it works. Corporate data is a valuable asset, one whose value has increased
enormously with the development of data mining techniques such as those
described in this book. Yet we are concerned here with understanding how the
methods used for data mining work and understanding the details of these
methods so that we can trace their operation on actual data. That is why our
illustrations are simple ones. But they are not simplistic: they exhibit the fea-
tures of real datasets.
The weather problem
The weather problem is a tiny dataset that we will use repeatedly to illustrate
machine learning methods. Entirely fictitious, it supposedly concerns the con-
ditions that are suitable for playing some unspecified game. In general, instances
in a dataset are characterized by the values of features, or attributes, that measure
different aspects of the instance. In this case there are four attributes: outlook,
temperature, humidity, and
windy. The outcome is whether to play or not.
In its simplest form, shown in Table 1.2, all four attributes have values that
are symbolic categories rather than numbers. Outlook can be sunny, overcast, or
rainy; temperature can be hot, mild, or cool; humidity can be high or normal;
and windy can be true or false. This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.
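The combination count is easy to verify with a few lines of Python (a sketch; the value lists simply transcribe the attribute descriptions above):

```python
# Enumerate every possible combination of nominal attribute values
# for the weather data and confirm there are 3 * 3 * 2 * 2 = 36.
from itertools import product

outlook = ["sunny", "overcast", "rainy"]
temperature = ["hot", "mild", "cool"]
humidity = ["high", "normal"]
windy = [True, False]

combinations = list(product(outlook, temperature, humidity, windy))
print(len(combinations))  # -> 36
```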
A set of rules learned from this information—not necessarily a very good
one—might look as follows:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
These rules are meant to be interpreted in order: the first one, then if it doesn’t
apply the second, and so on. A set of rules that are intended to be interpreted
in sequence is called a decision list. Interpreted as a decision list, the rules
correctly classify all of the examples in the table, whereas taken individually, out
of context, some of the rules are incorrect. For example, the rule
if humidity = normal then play = yes gets one of the examples wrong (check which one).
The meaning of a set of rules depends on how it is interpreted—not
surprisingly!
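Sequential interpretation is straightforward to express in code. The following sketch encodes the five rules above as a Python function (the function name and instance encoding are illustrative, not from the book); because each if returns as soon as its condition matches, later rules are only reached when all earlier ones fail to apply, exactly as a decision list requires:

```python
def classify(instance):
    """Apply the weather rules in sequence; the first rule whose
    condition matches determines the class."""
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    if instance["outlook"] == "rainy" and instance["windy"]:
        return "no"
    if instance["outlook"] == "overcast":
        return "yes"
    if instance["humidity"] == "normal":
        return "yes"
    return "yes"  # default: none of the above

# Example: the first row of Table 1.2.
print(classify({"outlook": "sunny", "temperature": "hot",
                "humidity": "high", "windy": False}))  # -> no
```

Note that the offending instance for the humidity rule is caught by an earlier rule when the list is applied in order, which is why the list as a whole classifies every example correctly even though that rule, taken alone, does not.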
In the slightly more complex form shown in Table 1.3, two of the attributes—
temperature and humidity—have numeric values. This means that any learn-
ing method must create inequalities involving these attributes rather than
simple equality tests, as in the former case. This is called a numeric-attribute
problem—in this case, a mixed-attribute problem because not all attributes are
numeric.
Now the first rule given earlier might take the following form:
If outlook = sunny and humidity > 83 then play = no
A slightly more complex process is required to come up with rules that involve
numeric tests.
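One common way a learner finds such a threshold (a sketch of the general idea, not the specific algorithm used by any method in this book) is to sort the training values of the numeric attribute and consider a candidate split point midway between each pair of adjacent values where the class changes:

```python
def candidate_thresholds(values, classes):
    """Return midpoints between adjacent sorted values whose
    classes differ -- the only places a split can gain anything."""
    pairs = sorted(zip(values, classes))
    thresholds = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

# Illustrative humidity readings (invented, not the actual Table 1.3 data).
humidity = [65, 70, 70, 75, 80, 85, 90, 95]
play = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no"]
print(candidate_thresholds(humidity, play))  # -> [82.5]
```

A learner would then score each candidate threshold (for instance by how cleanly it separates the classes) and keep the best one, which is the "slightly more complex process" referred to above.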
Table 1.2  The weather data.

Outlook    Temperature  Humidity  Windy  Play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no