Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




may be associated with the rules themselves to indicate that some are more important, or more reliable, than others.

You might be wondering whether there is a smaller rule set that performs as well. If so, would you be better off using the smaller rule set and, if so, why? These are exactly the kinds of questions that will occupy us in this book. Because the examples form a complete set for the problem space, the rules do no more than summarize all the information that is given, expressing it in a different and more concise way. Even though it involves no generalization, this is often a very useful thing to do! People frequently use machine learning techniques to gain insight into the structure of their data rather than to make predictions for new cases. In fact, a prominent and successful line of research in machine learning began as an attempt to compress a huge database of possible chess endgames and their outcomes into a data structure of reasonable size. The data structure chosen for this enterprise was not a set of rules but a decision tree.

Figure 1.2 shows a structural description for the contact lens data in the form of a decision tree, which for many purposes is a more concise and perspicuous representation of the rules and has the advantage that it can be visualized more easily. (However, this decision tree—in contrast to the rule set given in Figure 1.1—classifies two examples incorrectly.) The tree calls first for a test on tear production rate, and the first two branches correspond to the two possible outcomes. If tear production rate is reduced (the left branch), the outcome is none. If it is normal (the right branch), a second test is made, this time on astigmatism. Eventually, whatever the outcome of the tests, a leaf of the tree is reached

Chapter 1 | What's It All About?



Figure 1.2 Decision tree for the contact lens data:

tear production rate
├── reduced → none
└── normal → astigmatism
    ├── no → soft
    └── yes → spectacle prescription
        ├── myope → hard
        └── hypermetrope → none




that dictates the contact lens recommendation for that case. The question of what is the most natural and easily understood format for the output from a machine learning scheme is one that we will return to in Chapter 3.
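The tree-traversal logic described above maps directly onto nested conditionals: each internal node becomes a test, each leaf a recommendation. The following sketch (function and parameter names are my own, not from the book) renders the tree of Figure 1.2 as code:

```python
def recommend_lens(tear_production, astigmatism, prescription):
    """Traverse the contact lens decision tree: each `if` is an
    internal node of the tree, each `return` a leaf."""
    if tear_production == "reduced":
        return "none"          # left branch: reduced tear production
    # tear production is normal, so test astigmatism next
    if astigmatism == "no":
        return "soft"
    # astigmatic: the recommendation depends on the spectacle prescription
    if prescription == "myope":
        return "hard"
    return "none"              # hypermetrope

print(recommend_lens("reduced", "yes", "myope"))  # none
print(recommend_lens("normal", "no", "myope"))    # soft
```

Reading the tree top-down like this also shows why a tree can be more perspicuous than a flat rule list: each case follows exactly one path from root to leaf.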

Irises: A classic numeric dataset

The iris dataset, which dates back to seminal work by the eminent statistician R.A. Fisher in the mid-1930s and is arguably the most famous dataset used in data mining, contains 50 examples each of three types of plant: Iris setosa, Iris versicolor, and Iris virginica. It is excerpted in Table 1.4. There are four attributes: sepal length, sepal width, petal length, and petal width (all measured in centimeters). Unlike previous datasets, all attributes have values that are numeric.

The following set of rules might be learned from this dataset:

If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
If sepal width < 2.45 and petal length < 4.55 then Iris versicolor
If sepal width < 2.95 and petal width < 1.35 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.45 then Iris versicolor
If sepal length ≥ 5.85 and petal length < 4.75 then Iris versicolor



Table 1.4 The iris data.

      Sepal length  Sepal width  Petal length  Petal width
      (cm)          (cm)         (cm)          (cm)         Type
  1   5.1           3.5          1.4           0.2          Iris setosa
  2   4.9           3.0          1.4           0.2          Iris setosa
  3   4.7           3.2          1.3           0.2          Iris setosa
  4   4.6           3.1          1.5           0.2          Iris setosa
  5   5.0           3.6          1.4           0.2          Iris setosa
  ...
 51   7.0           3.2          4.7           1.4          Iris versicolor
 52   6.4           3.2          4.5           1.5          Iris versicolor
 53   6.9           3.1          4.9           1.5          Iris versicolor
 54   5.5           2.3          4.0           1.3          Iris versicolor
 55   6.5           2.8          4.6           1.5          Iris versicolor
  ...
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica
103   7.1           3.0          5.9           2.1          Iris virginica
104   6.3           2.9          5.6           1.8          Iris virginica
105   6.5           3.0          5.8           2.2          Iris virginica
  ...




If sepal width < 2.55 and petal length < 4.95 and petal width < 1.55 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.95 and petal width < 1.55 then Iris versicolor
If sepal length ≥ 6.55 and petal length < 5.05 then Iris versicolor
If sepal width < 2.75 and petal width < 1.65 and sepal length < 6.05 then Iris versicolor
If sepal length ≥ 5.85 and sepal length < 5.95 and petal length < 4.85 then Iris versicolor
If petal length ≥ 5.15 then Iris virginica
If petal width ≥ 1.85 then Iris virginica
If petal width ≥ 1.75 and sepal width < 3.05 then Iris virginica
If petal length ≥ 4.95 and petal width < 1.55 then Iris virginica

These rules are very cumbersome, and we will see in Chapter 3 how more compact rules can be expressed that convey the same information.
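One common way to apply a rule set like the one above is to try the rules in order and return the class of the first rule whose conditions all hold. The sketch below encodes a few of the listed rules under that first-match convention (an assumption on my part; the text does not specify how conflicts between overlapping rules are resolved, and all names here are my own):

```python
from operator import ge, lt

# Each rule: a list of (attribute, comparison, threshold) conditions
# plus the class predicted when every condition holds.
RULES = [
    ([("petal_length", lt, 2.45)], "Iris setosa"),
    ([("sepal_width", lt, 2.10)], "Iris versicolor"),
    ([("sepal_width", lt, 2.45), ("petal_length", lt, 4.55)], "Iris versicolor"),
    ([("petal_length", ge, 5.15)], "Iris virginica"),
]

def classify(instance, rules=RULES, default="unknown"):
    """Return the class of the first rule all of whose conditions hold."""
    for conditions, cls in rules:
        if all(op(instance[attr], threshold) for attr, op, threshold in conditions):
            return cls
    return default

# Row 1 of Table 1.4: petal length 1.4 < 2.45, so the first rule fires.
print(classify({"sepal_length": 5.1, "sepal_width": 3.5,
                "petal_length": 1.4, "petal_width": 0.2}))  # Iris setosa
```

Representing rules as data rather than as hard-coded `if` statements is what lets a learning scheme generate, count, and prune them automatically.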

CPU performance: Introducing numeric prediction

Although the iris dataset involves numeric attributes, the outcome—the type of iris—is a category, not a numeric value. Table 1.5 shows some data for which the outcome and the attributes are numeric. It concerns the relative performance of computer processing power on the basis of a number of relevant attributes; each row represents 1 of 209 different computer configurations. The classic way of dealing with continuous prediction is to write the outcome as a linear sum of the attribute values with appropriate weights, for example:




Table 1.5 The CPU performance data.

      Cycle      Main memory (KB)   Cache  Channels
      time (ns)  Min.     Max.      (KB)   Min.  Max.
      MYCT       MMIN     MMAX      CACH   CHMIN CHMAX  Performance PRP
  1   125        256      6000      256    16    128    198
  2   29         8000     32000     32     8     32     269
  3   29         8000     32000     32     8     32     220
  4   29         8000     32000     32     8     32     172
  5   29         8000     16000     32     8     16     132
  ...
207   125        2000     8000      0      2     14     52
208   480        512      8000      32     0     0      67
209   480        1000     4000      0      0     0      45

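The linear-sum idea can be sketched with a least-squares fit over a few rows of Table 1.5. This is an illustrative sketch, not the book's model: with only three training rows the fitted weights carry no meaning, but the mechanics are the same as for the full 209-row dataset.

```python
import numpy as np

# Three machines from Table 1.5; attribute order:
# MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX.
rows = np.array([
    [125,  256,  6000, 256, 16, 128],   # machine 1
    [ 29, 8000, 32000,  32,  8,  32],   # machine 2
    [480, 1000,  4000,   0,  0,   0],   # machine 209
], dtype=float)
prp = np.array([198.0, 269.0, 45.0])    # performance (PRP), the numeric outcome

# Prepend a column of ones so the linear sum includes a constant term,
# then choose weights w minimising ||X w - prp||^2.
X = np.hstack([np.ones((len(rows), 1)), rows])
w, *_ = np.linalg.lstsq(X, prp, rcond=None)

# Prediction is just the weighted sum of attribute values.
predictions = X @ w
print(predictions)  # close to [198, 269, 45] on these training rows
```

Because there are more weights than rows here, the fit reproduces the training outcomes almost exactly; the interesting questions, taken up later in the book, are how well such a model predicts configurations it has not seen.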



