may be associated with the rules themselves to indicate that some are more
important, or more reliable, than others.
You might be wondering whether there is a smaller rule set that performs as
well. If so, would you be better off using the smaller rule set, and why?
These are exactly the kinds of questions that will occupy us in this book. Because
the examples form a complete set for the problem space, the rules do no more
than summarize all the information that is given, expressing it in a different and
more concise way. Even though it involves no generalization, this is often a very
useful thing to do! People frequently use machine learning techniques to gain
insight into the structure of their data rather than to make predictions for new
cases. In fact, a prominent and successful line of research in machine learning
began as an attempt to compress a huge database of possible chess endgames
and their outcomes into a data structure of reasonable size. The data structure
chosen for this enterprise was not a set of rules but a decision tree.
Figure 1.2 shows a structural description for the contact lens data in the form
of a decision tree, which for many purposes is a more concise and perspicuous
representation of the rules and has the advantage that it can be visualized more
easily. (However, this decision tree—in contrast to the rule set given in Figure
1.1—classifies two examples incorrectly.) The tree calls first for a test on tear
production rate, and the first two branches correspond to the two possible
outcomes. If tear production rate is reduced (the left branch), the outcome is
none. If it is normal (the right branch), a second test is made, this time on
astigmatism. Eventually, whatever the outcome of the tests, a leaf of the tree is reached
[Figure 1.2 Decision tree for the contact lens data: the root tests tear production rate (reduced → none; normal → test astigmatism: no → soft; yes → test spectacle prescription: myope → hard; hypermetrope → none).]
that dictates the contact lens recommendation for that case. The question of
what is the most natural and easily understood format for the output from a
machine learning scheme is one that we will return to in Chapter 3.
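To make the traversal just described concrete, the tree of Figure 1.2 can be sketched as a short piece of code. This is a minimal illustration, not something from the book itself; the function name and the lowercase string encoding of attribute values are assumptions.

```python
def classify(tear_production_rate, astigmatism, spectacle_prescription):
    """Recommend a contact lens type by walking the decision tree of Figure 1.2.

    Attribute values are assumed to be lowercase strings, e.g. "reduced"/"normal".
    """
    # Root test: tear production rate.
    if tear_production_rate == "reduced":
        return "none"
    # Tear production rate is normal: test astigmatism next.
    if astigmatism == "no":
        return "soft"
    # Astigmatic: the recommendation depends on the spectacle prescription.
    if spectacle_prescription == "myope":
        return "hard"
    return "none"  # hypermetrope
```

Each nested `if` corresponds to one internal node of the tree, and each `return` to a leaf, which is exactly how a learned decision tree is executed on a new case.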
Irises: A classic numeric dataset
The iris dataset, which dates back to seminal work by the eminent statistician
R.A. Fisher in the mid-1930s and is arguably the most famous dataset used in
data mining, contains 50 examples each of three types of plant: Iris setosa,
Iris versicolor, and Iris virginica. It is excerpted in Table 1.4. There are four
attributes: sepal length, sepal width, petal length, and petal width (all measured
in centimeters). Unlike previous datasets, all attributes have values that are numeric.
The following set of rules might be learned from this dataset:
If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
If sepal width < 2.45 and petal length < 4.55 then Iris versicolor
If sepal width < 2.95 and petal width < 1.35 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.45 then Iris versicolor
If sepal length ≥ 5.85 and petal length < 4.75 then Iris versicolor
Table 1.4 The iris data.

        Sepal         Sepal        Petal         Petal
        length (cm)   width (cm)   length (cm)   width (cm)   Type
1       5.1           3.5          1.4           0.2          Iris setosa
2       4.9           3.0          1.4           0.2          Iris setosa
3       4.7           3.2          1.3           0.2          Iris setosa
4       4.6           3.1          1.5           0.2          Iris setosa
5       5.0           3.6          1.4           0.2          Iris setosa
. . .
51      7.0           3.2          4.7           1.4          Iris versicolor
52      6.4           3.2          4.5           1.5          Iris versicolor
53      6.9           3.1          4.9           1.5          Iris versicolor
54      5.5           2.3          4.0           1.3          Iris versicolor
55      6.5           2.8          4.6           1.5          Iris versicolor
. . .
101     6.3           3.3          6.0           2.5          Iris virginica
102     5.8           2.7          5.1           1.9          Iris virginica
103     7.1           3.0          5.9           2.1          Iris virginica
104     6.3           2.9          5.6           1.8          Iris virginica
105     6.5           3.0          5.8           2.2          Iris virginica
. . .
If sepal width < 2.55 and petal length < 4.95 and petal width < 1.55 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.95 and petal width < 1.55 then Iris versicolor
If sepal length ≥ 6.55 and petal length < 5.05 then Iris versicolor
If sepal width < 2.75 and petal width < 1.65 and sepal length < 6.05 then Iris versicolor
If sepal length ≥ 5.85 and sepal length < 5.95 and petal length < 4.85 then Iris versicolor
If petal length ≥ 5.15 then Iris virginica
If petal width ≥ 1.85 then Iris virginica
If petal width ≥ 1.75 and sepal width < 3.05 then Iris virginica
If petal length ≥ 4.95 and petal width < 1.55 then Iris virginica
These rules are very cumbersome, and we will see in Chapter 3 how more
compact rules can be expressed that convey the same information.
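To see how such a rule set would actually be applied to a new flower, the rules can be encoded as an ordered list of conditions tried from top to bottom. First-match semantics and all names in this sketch are assumptions; the text does not specify how overlapping rules are resolved.

```python
# Each rule pairs a predicate over (sepal_length, sepal_width, petal_length,
# petal_width) with the class it predicts. Trying rules in order and returning
# the first match is an assumed conflict-resolution strategy, not one the
# text prescribes.
RULES = [
    (lambda sl, sw, pl, pw: pl < 2.45, "Iris setosa"),
    (lambda sl, sw, pl, pw: sw < 2.10, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.45 and pl < 4.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.95 and pw < 1.35, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.45, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sl >= 5.85 and pl < 4.75, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.55 and pl < 4.95 and pw < 1.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.95 and pw < 1.55, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sl >= 6.55 and pl < 5.05, "Iris versicolor"),
    (lambda sl, sw, pl, pw: sw < 2.75 and pw < 1.65 and sl < 6.05, "Iris versicolor"),
    (lambda sl, sw, pl, pw: 5.85 <= sl < 5.95 and pl < 4.85, "Iris versicolor"),
    (lambda sl, sw, pl, pw: pl >= 5.15, "Iris virginica"),
    (lambda sl, sw, pl, pw: pw >= 1.85, "Iris virginica"),
    (lambda sl, sw, pl, pw: pw >= 1.75 and sw < 3.05, "Iris virginica"),
    (lambda sl, sw, pl, pw: pl >= 4.95 and pw < 1.55, "Iris virginica"),
]

def classify_iris(sl, sw, pl, pw):
    """Return the class of the first rule whose condition fires, else None."""
    for condition, iris_type in RULES:
        if condition(sl, sw, pl, pw):
            return iris_type
    return None  # no rule covers this instance
```

Running this on the excerpted rows of Table 1.4 (for example, instance 1 with measurements 5.1, 3.5, 1.4, 0.2) reproduces the listed classes, which illustrates how the rules summarize the dataset.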
CPU performance: Introducing numeric prediction
Although the iris dataset involves numeric attributes, the outcome—the type of
iris—is a category, not a numeric value. Table 1.5 shows some data for which
the outcome and the attributes are numeric. It concerns the relative performance
of computer processors, predicted on the basis of a number of relevant
attributes; each row represents 1 of 209 different computer configurations.
The classic way of dealing with continuous prediction is to write the outcome
as a linear sum of the attribute values with appropriate weights, for example:
Table 1.5 The CPU performance data.

       Cycle       Main memory (KB)     Cache    Channels
       time (ns)   Min.      Max.       (KB)     Min.    Max.    Performance
       MYCT        MMIN      MMAX       CACH     CHMIN   CHMAX   PRP
1      125         256       6000       256      16      128     198
2      29          8000      32000      32       8       32      269
3      29          8000      32000      32       8       32      220
4      29          8000      32000      32       8       32      172
5      29          8000      16000      32       8       16      132
. . .
207    125         2000      8000       0        2       14      52
208    480         512       8000       32       0       0       67
209    480         1000      4000       0        0       0       45
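The linear-sum approach can be sketched in code using the attribute names from Table 1.5. The weights and intercept below are arbitrary placeholders chosen for illustration only; they are not coefficients fitted to these data.

```python
# A linear model predicts performance (PRP) as a weighted sum of the
# attribute values plus a constant. These weights are made-up placeholders,
# NOT coefficients learned from the CPU performance data.
WEIGHTS = {"MYCT": -0.05, "MMIN": 0.02, "MMAX": 0.005,
           "CACH": 0.6, "CHMIN": -0.3, "CHMAX": 1.4}
INTERCEPT = -50.0

def predict_prp(config):
    """config maps attribute names (MYCT, MMIN, ...) to numeric values."""
    return INTERCEPT + sum(w * config[name] for name, w in WEIGHTS.items())

# Example: predicted performance for configuration 1 of Table 1.5.
machine_1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
             "CACH": 256, "CHMIN": 16, "CHMAX": 128}
prediction = predict_prp(machine_1)
```

Learning such a model amounts to choosing the weights and intercept that make the predicted values match the observed performance figures as closely as possible, a task taken up later in the text.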