Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




Chapter 2 | Input: Concepts, Instances, and Attributes



Table 2.1  Iris data as a clustering problem.

       Sepal length   Sepal width   Petal length   Petal width
       (cm)           (cm)          (cm)           (cm)
1      5.1            3.5           1.4            0.2
2      4.9            3.0           1.4            0.2
3      4.7            3.2           1.3            0.2
4      4.6            3.1           1.5            0.2
5      5.0            3.6           1.4            0.2
. . .
51     7.0            3.2           4.7            1.4
52     6.4            3.2           4.5            1.5
53     6.9            3.1           4.9            1.5
54     5.5            2.3           4.0            1.3
55     6.5            2.8           4.6            1.5
. . .
101    6.3            3.3           6.0            2.5
102    5.8            2.7           5.1            1.9
103    7.1            3.0           5.9            2.1
104    6.3            2.9           5.6            1.8
105    6.5            3.0           5.8            2.2
. . .

Table 2.2  Weather data with a numeric class.

Outlook    Temperature   Humidity   Windy   Play time (min.)
sunny      85            85         false   5
sunny      80            90         true    0
overcast   83            86         false   55
rainy      70            96         false   40
rainy      68            80         false   65
rainy      65            70         true    45
overcast   64            65         true    60
sunny      72            95         false   0
sunny      69            70         false   70
rainy      75            80         false   45
sunny      75            70         true    50
overcast   72            90         true    55
overcast   81            75         false   75
rainy      71            91         true    10





data in which what is to be predicted is not play or don’t play but rather is the
time (in minutes) to play. With numeric prediction problems, as with other
machine learning situations, the predicted value for new instances is often of
less interest than the structure of the description that is learned, expressed in
terms of what the important attributes are and how they relate to the numeric
outcome.
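
As a rough illustration of relating an attribute to the numeric outcome, the sketch below averages the play time in Table 2.2 for each value of Outlook. The variable and function names are made up for this example, and a per-value average is only a simple summary, not a learning scheme described in the text.

```python
from collections import defaultdict

# Rows of Table 2.2: (outlook, temperature, humidity, windy, play_minutes)
WEATHER = [
    ("sunny", 85, 85, False, 5),
    ("sunny", 80, 90, True, 0),
    ("overcast", 83, 86, False, 55),
    ("rainy", 70, 96, False, 40),
    ("rainy", 68, 80, False, 65),
    ("rainy", 65, 70, True, 45),
    ("overcast", 64, 65, True, 60),
    ("sunny", 72, 95, False, 0),
    ("sunny", 69, 70, False, 70),
    ("rainy", 75, 80, False, 45),
    ("sunny", 75, 70, True, 50),
    ("overcast", 72, 90, True, 55),
    ("overcast", 81, 75, False, 75),
    ("rainy", 71, 91, True, 10),
]


def mean_play_time_by_outlook(rows):
    """Average the numeric class (play time) for each outlook value."""
    totals = defaultdict(lambda: [0, 0])  # outlook -> [sum of minutes, count]
    for outlook, _temperature, _humidity, _windy, minutes in rows:
        totals[outlook][0] += minutes
        totals[outlook][1] += 1
    return {outlook: total / count for outlook, (total, count) in totals.items()}


means = mean_play_time_by_outlook(WEATHER)
for outlook, mean in means.items():
    print(f"{outlook}: {mean:.2f} min")
```

On these 14 rows the averages differ noticeably between outlook values, which is the kind of attribute-to-outcome relationship the paragraph refers to.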


2.2 What’s in an example?

The input to a machine learning scheme is a set of instances. These instances
are the things that are to be classified, associated, or clustered. Although
until now we have called them examples, henceforth we will use the more specific
term instances to refer to the input. Each instance is an individual,
independent example of the concept to be learned. In addition, each one is
characterized by the values of a set of predetermined attributes. This was the
case in all the sample datasets described in the last chapter (the weather,
contact lens, iris, and labor negotiations problems). Each dataset is
represented as a matrix of instances versus attributes, which in database terms
is a single relation, or a flat file.
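
The instances-versus-attributes matrix can be sketched in a few lines. The example below uses the first three rows of Table 2.2; the names `ATTRIBUTES`, `as_instances`, and `column` are illustrative choices for this sketch, not anything defined in the text.

```python
# A flat file: one fixed list of attributes, one row of values per instance.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play_minutes"]

ROWS = [
    ["sunny", 85, 85, False, 5],
    ["sunny", 80, 90, True, 0],
    ["overcast", 83, 86, False, 55],
]


def as_instances(attributes, rows):
    """Pair each row with the attribute names, rejecting ragged rows."""
    instances = []
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("every instance needs one value per attribute")
        instances.append(dict(zip(attributes, row)))
    return instances


def column(instances, attribute):
    """Return the values of one attribute across all instances."""
    return [instance[attribute] for instance in instances]


instances = as_instances(ATTRIBUTES, ROWS)
print(instances[0])
print(column(instances, "humidity"))
```

Each row stands on its own: nothing in one instance refers to another, which is exactly the property that the relational examples later in this section lack.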

Expressing the input data as a set of independent instances is by far the most
common situation for practical data mining. However, it is a rather restrictive
way of formulating problems, and it is worth spending some time reviewing
why. Problems often involve relationships between objects rather than separate,
independent instances. Suppose, to take a specific situation, a family tree is
given, and we want to learn the concept sister. Imagine your own family tree,
with your relatives (and their genders) placed at the nodes. This tree is the
input to the learning process, along with a list of pairs of people and an
indication of whether they are sisters or not.

Figure 2.1 shows part of a family tree, below which are two tables that each
define sisterhood in a slightly different way. A yes in the third column of the
tables means that the person in the second column is a sister of the person in
the first column (that’s just an arbitrary decision we’ve made in setting up
this example).

The first thing to notice is that there are a lot of nos in the third column of
the table on the left, because there are 12 people and 12 × 12 = 144 pairs of
people in all, and most pairs of people aren’t sisters. The table on the right,
which gives the same information, records only the positive instances and
assumes that all others are negative. The idea of specifying only positive
examples and adopting a standing assumption that the rest are negative is called
the closed world assumption. It is frequently assumed in theoretical studies;
however, it is not of much practical use in real-life problems because they
rarely involve “closed” worlds in which you can be certain that all cases are
covered.
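
The counting argument can be checked with a minimal sketch: start from the six positive pairs listed in the right-hand table of Figure 2.1 and label every other ordered pair as negative. The function name `expand_closed_world` is invented for this illustration, and the 144 pairs include each person paired with themselves, following the 12 × 12 count in the text.

```python
from itertools import product

# The 12 people named in Figure 2.1.
PEOPLE = [
    "Peter", "Peggy", "Grace", "Ray",
    "Steven", "Graham", "Pam", "Ian", "Pippa", "Brian",
    "Anna", "Nikki",
]

# Positive instances only: (first person, second person) means the second
# person is a sister of the first, as in the right-hand table of Figure 2.1.
POSITIVE = {
    ("Steven", "Pam"),
    ("Graham", "Pam"),
    ("Ian", "Pippa"),
    ("Brian", "Pippa"),
    ("Anna", "Nikki"),
    ("Nikki", "Anna"),
}


def expand_closed_world(people, positive_pairs):
    """Label every ordered pair; anything not listed as positive is negative."""
    return {pair: pair in positive_pairs for pair in product(people, repeat=2)}


labels = expand_closed_world(PEOPLE, POSITIVE)
n_yes = sum(labels.values())
n_no = len(labels) - n_yes
print(len(labels), n_yes, n_no)  # 144 6 138
```

The compact table stores 6 rows where the explicit one needs 144, but only because the closed world assumption supplies the other 138 answers.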

Neither table in Figure 2.1 is of any use without the family tree itself. This
tree can also be expressed in the form of a table, part of which is shown in
Table 2.3. Now the problem is expressed in terms of two relationships. But these
tables do not contain independent sets of instances because values in the Name,
Parent1, and Parent2 columns of the sister-of relation refer to rows of the
family tree relation. We can make them into a single set of instances by
collapsing the two tables into the single one of Table 2.4.
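
Collapsing two relations into one amounts to a join: each name in a sister-of row is replaced by that person's row from the family tree relation. The sketch below is hedged throughout. Tables 2.3 and 2.4 are not reproduced here, so the column names are guesses, and the parent values are ASSUMED for illustration only (genders and the sister-of labels follow Figure 2.1).

```python
# Family tree relation: one row per person.
# ASSUMPTION: the parent1/parent2 values are illustrative, not taken from
# Table 2.3; genders match the M/F labels in Figure 2.1.
FAMILY = {
    "Steven": {"gender": "male", "parent1": "Peter", "parent2": "Peggy"},
    "Pam": {"gender": "female", "parent1": "Peter", "parent2": "Peggy"},
    "Ian": {"gender": "male", "parent1": "Grace", "parent2": "Ray"},
    "Pippa": {"gender": "female", "parent1": "Grace", "parent2": "Ray"},
}

# Sister-of relation: (first person, second person, sister of?).
SISTER_OF = [
    ("Steven", "Pam", "yes"),
    ("Ian", "Pippa", "yes"),
    ("Steven", "Pippa", "no"),
]


def flatten(sister_of, family):
    """Join each pair with both people's family rows into one flat instance."""
    rows = []
    for first, second, label in sister_of:
        row = {}
        for prefix, name in (("first", first), ("second", second)):
            row[f"{prefix}_name"] = name
            for attribute, value in family[name].items():
                row[f"{prefix}_{attribute}"] = value
        row["sister_of"] = label
        rows.append(row)
    return rows


flat = flatten(SISTER_OF, FAMILY)
for row in flat:
    print(row)
```

After the join, every instance carries all the attribute values it needs and no longer refers to rows in another table.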

We have at last succeeded in transforming the original relational problem
into the form of instances, each of which is an individual, independent example




[Family tree diagram. People shown: Peter (M) = Peggy (F); Grace (F) = Ray (M);
Steven (M); Graham (M); Pam (F) = Ian (M); Pippa (F); Brian (M); Anna (F);
Nikki (F). An "=" joins a couple; the lines linking parents to children are not
recoverable from the extracted text.]

first person   second person   sister of?
Peter          Peggy           no
Peter          Steven          no
...            ...
Steven         Peter           no
Steven         Graham          no
Steven         Pam             yes
Steven         Grace           no
...            ...
Ian            Pippa           yes
...            ...
Anna           Nikki           yes
...            ...
Nikki          Anna            yes

first person   second person   sister of?
Steven         Pam             yes
Graham         Pam             yes
Ian            Pippa           yes
Brian          Pippa           yes
Anna           Nikki           yes
Nikki          Anna            yes
All the rest                   no

Figure 2.1 A family tree and two ways of expressing the sister-of relation.




