Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




standard datasets that we will come back to repeatedly. Different datasets tend to expose new issues and challenges, and it is interesting and instructive to have in mind a variety of problems when considering learning methods. In fact, the need to work with different datasets is so important that a corpus containing around 100 example problems has been gathered together so that different algorithms can be tested and compared on the same set of problems.

The illustrations in this section are all unrealistically simple. Serious application of data mining involves thousands, hundreds of thousands, or even millions of individual cases. But when explaining what algorithms do and how they work, we need simple examples that capture the essence of the problem but are small enough to be comprehensible in every detail. We will be working with the illustrations in this section throughout the book, and they are intended to be "academic" in the sense that they will help us to understand what is going on. Some actual fielded applications of learning techniques are discussed in Section 1.3, and many more are covered in the books mentioned in the Further reading section at the end of the chapter.

Another problem with actual real-life datasets is that they are often proprietary. No one is going to share their customer and product choice database with you so that you can understand the details of their data mining application and how it works. Corporate data is a valuable asset, one whose value has increased enormously with the development of data mining techniques such as those described in this book. Yet we are concerned here with understanding how the methods used for data mining work and understanding the details of these methods so that we can trace their operation on actual data. That is why our illustrations are simple ones. But they are not simplistic: they exhibit the features of real datasets.



The weather problem

The weather problem is a tiny dataset that we will use repeatedly to illustrate machine learning methods. Entirely fictitious, it supposedly concerns the conditions that are suitable for playing some unspecified game. In general, instances in a dataset are characterized by the values of features, or attributes, that measure different aspects of the instance. In this case there are four attributes: outlook, temperature, humidity, and windy. The outcome is whether to play or not.

In its simplest form, shown in Table 1.2, all four attributes have values that are symbolic categories rather than numbers. Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity can be high or normal; and windy can be true or false. This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.
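The count of 36 combinations can be checked directly by enumerating the Cartesian product of the four value sets. The snippet below is an illustrative sketch, not part of the book's own tooling; the dictionary simply mirrors the attribute values listed above.

```python
from itertools import product

# Possible values for each of the four nominal attributes
# in the weather data (Table 1.2).
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}

# The Cartesian product of the value sets yields every possible combination.
combinations = list(product(*attributes.values()))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36
```

Only 14 of these 36 combinations actually appear as rows in Table 1.2.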



A set of rules learned from this information (not necessarily a very good one) might look as follows:




    If outlook = sunny and humidity = high then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
    If humidity = normal then play = yes
    If none of the above then play = yes

These rules are meant to be interpreted in order: the first one, then, if it doesn't apply, the second, and so on. A set of rules that are intended to be interpreted in sequence is called a decision list. Interpreted as a decision list, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect. For example, the rule if humidity = normal then play = yes gets one of the examples wrong (check which one). The meaning of a set of rules depends on how it is interpreted, not surprisingly!
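The in-order interpretation can be sketched as a small evaluator: each rule is a condition paired with an outcome, and the first condition that holds decides the classification. This is an illustrative sketch only; the rule set and attribute names follow the weather data above.

```python
# The decision list from the text, as (condition, outcome) pairs.
rules = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"], "no"),
    (lambda x: x["outlook"] == "overcast", "yes"),
    (lambda x: x["humidity"] == "normal", "yes"),
]

def classify(instance, rules, default="yes"):
    """Apply rules in order; the first condition that holds decides.

    The default covers the final rule, 'if none of the above then play = yes'.
    """
    for condition, outcome in rules:
        if condition(instance):
            return outcome
    return default

# A rainy, windy day with normal humidity: the second rule fires ("no")
# before the fourth rule ("yes") is ever reached. Taken alone, the
# humidity = normal rule would misclassify this very instance.
example = {"outlook": "rainy", "humidity": "normal", "windy": True}
print(classify(example, rules))
```

Reordering the list changes the classifier, which is exactly the point: the meaning of the rules depends on how they are interpreted.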

In the slightly more complex form shown in Table 1.3, two of the attributes, temperature and humidity, have numeric values. This means that any learning method must create inequalities involving these attributes rather than simple equality tests, as in the former case. This is called a numeric-attribute problem; in this case, a mixed-attribute problem, because not all attributes are numeric.


Now the first rule given earlier might take the following form:

    If outlook = sunny and humidity > 83 then play = no

A slightly more complex process is required to come up with rules that involve numeric tests.
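The difference between a nominal equality test and a numeric inequality test can be made concrete with a small sketch. The function name here is hypothetical; the threshold of 83 is the one quoted in the rule above.

```python
def rule_fires(outlook: str, humidity: float) -> bool:
    """True when 'if outlook = sunny and humidity > 83 then play = no' fires.

    The outlook test is a nominal equality check; the humidity test is a
    numeric inequality, which is what a learner must discover for itself
    when an attribute takes numbers rather than symbolic categories.
    """
    return outlook == "sunny" and humidity > 83

print(rule_fires("sunny", 85))  # True: both conditions hold
print(rule_fires("sunny", 80))  # False: humidity is below the threshold
```

Choosing a good threshold such as 83 is the "slightly more complex process": the learner must search over candidate split points rather than simply matching category values.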



Table 1.2  The weather data.

    Outlook     Temperature   Humidity   Windy   Play
    sunny       hot           high       false   no
    sunny       hot           high       true    no
    overcast    hot           high       false   yes
    rainy       mild          high       false   yes
    rainy       cool          normal     false   yes
    rainy       cool          normal     true    no
    overcast    cool          normal     true    yes
    sunny       mild          high       false   no
    sunny       cool          normal     false   yes
    rainy       mild          normal     false   yes
    sunny       mild          normal     true    yes
    overcast    mild          high       true    yes
    overcast    hot           normal     false   yes
    rainy       mild          high       true    no
