CHAPTER 2 | INPUT: CONCEPTS, INSTANCES, AND ATTRIBUTES
Table 2.1    Iris data as a clustering problem.

       Sepal length   Sepal width   Petal length   Petal width
       (cm)           (cm)          (cm)           (cm)
1      5.1            3.5           1.4            0.2
2      4.9            3.0           1.4            0.2
3      4.7            3.2           1.3            0.2
4      4.6            3.1           1.5            0.2
5      5.0            3.6           1.4            0.2
...
51     7.0            3.2           4.7            1.4
52     6.4            3.2           4.5            1.5
53     6.9            3.1           4.9            1.5
54     5.5            2.3           4.0            1.3
55     6.5            2.8           4.6            1.5
...
101    6.3            3.3           6.0            2.5
102    5.8            2.7           5.1            1.9
103    7.1            3.0           5.9            2.1
104    6.3            2.9           5.6            1.8
105    6.5            3.0           5.8            2.2
...
Table 2.2    Weather data with a numeric class.

Outlook     Temperature   Humidity   Windy   Play time (min.)
sunny       85            85         false   5
sunny       80            90         true    0
overcast    83            86         false   55
rainy       70            96         false   40
rainy       68            80         false   65
rainy       65            70         true    45
overcast    64            65         true    60
sunny       72            95         false   0
sunny       69            70         false   70
rainy       75            80         false   45
sunny       75            70         true    50
overcast    72            90         true    55
overcast    81            75         false   75
rainy       71            91         true    10
data in which what is to be predicted is not play or don’t play but rather the
time (in minutes) to play. With numeric prediction problems, as with other
machine learning situations, the predicted value for new instances is often of
less interest than the structure of the description that is learned, expressed in
terms of what the important attributes are and how they relate to the numeric
outcome.
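As a rough illustration of how an attribute can relate to the numeric outcome, the sketch below averages the play time in Table 2.2 for each value of Outlook. It is only a descriptive summary, not a learning scheme from the text; the names WEATHER and mean_play_time_by_outlook are introduced here for illustration.

```python
from collections import defaultdict

# Rows of Table 2.2: (outlook, temperature, humidity, windy, play_minutes)
WEATHER = [
    ("sunny", 85, 85, False, 5),
    ("sunny", 80, 90, True, 0),
    ("overcast", 83, 86, False, 55),
    ("rainy", 70, 96, False, 40),
    ("rainy", 68, 80, False, 65),
    ("rainy", 65, 70, True, 45),
    ("overcast", 64, 65, True, 60),
    ("sunny", 72, 95, False, 0),
    ("sunny", 69, 70, False, 70),
    ("rainy", 75, 80, False, 45),
    ("sunny", 75, 70, True, 50),
    ("overcast", 72, 90, True, 55),
    ("overcast", 81, 75, False, 75),
    ("rainy", 71, 91, True, 10),
]


def mean_play_time_by_outlook(rows):
    """Return the average play time (minutes) for each outlook value."""
    totals = defaultdict(lambda: [0, 0])  # outlook -> [sum of minutes, count]
    for outlook, _temperature, _humidity, _windy, minutes in rows:
        totals[outlook][0] += minutes
        totals[outlook][1] += 1
    return {outlook: total / count for outlook, (total, count) in totals.items()}


if __name__ == "__main__":
    for outlook, mean in mean_play_time_by_outlook(WEATHER).items():
        print(f"{outlook}: {mean:.2f} minutes")
```

A summary like this says something about the data (overcast days have the longest average play time in this table) even before any prediction is made for a new instance.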
2.2 What’s in an example?
The input to a machine learning scheme is a set of instances. These instances
are the things that are to be classified, associated, or clustered. Although
until now we have called them examples, henceforth we will use the more specific
term instances to refer to the input. Each instance is an individual, independent
example of the concept to be learned. In addition, each one is
characterized by the values of a set of predetermined attributes. This was the
case in all the sample datasets described in the last chapter (the weather, contact
lens, iris, and labor negotiations problems). Each dataset is represented as a
matrix of instances versus attributes, which in database terms is a single
relation, or a flat file.
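A minimal sketch of the instances-versus-attributes matrix, using the first five rows of Table 2.1. The attribute names and the helpers is_flat_file and column are assumptions made for this illustration, not part of any particular tool.

```python
# Attribute names (columns of the matrix); identifiers chosen for illustration.
ATTRIBUTES = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# First five instances of Table 2.1 (rows of the matrix), values in cm.
INSTANCES = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (4.7, 3.2, 1.3, 0.2),
    (4.6, 3.1, 1.5, 0.2),
    (5.0, 3.6, 1.4, 0.2),
]


def is_flat_file(instances, attributes):
    """True if every instance supplies exactly one value per attribute."""
    return all(len(row) == len(attributes) for row in instances)


def column(instances, attributes, name):
    """Return the values of one attribute across all instances."""
    index = attributes.index(name)
    return [row[index] for row in instances]


if __name__ == "__main__":
    print(is_flat_file(INSTANCES, ATTRIBUTES))
    print(column(INSTANCES, ATTRIBUTES, "petal_length"))
```

Each row stands alone: nothing in one instance refers to another, which is what makes the representation a single relation.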
Expressing the input data as a set of independent instances is by far the most
common situation for practical data mining. However, it is a rather restrictive
way of formulating problems, and it is worth spending some time reviewing
why. Problems often involve relationships between objects rather than separate,
independent instances. Suppose, to take a specific situation, a family tree is
given, and we want to learn the concept sister. Imagine your own family tree,
with your relatives (and their genders) placed at the nodes. This tree is the input
to the learning process, along with a list of pairs of people and an indication of
whether they are sisters or not.
Figure 2.1 shows part of a family tree, below which are two tables that each
define sisterhood in a slightly different way. A yes in the third column of the
tables means that the person in the second column is a sister of the person in
the first column (that’s just an arbitrary decision we’ve made in setting up this
example).
The first thing to notice is that there are a lot of nos in the third column of
the table on the left, because there are 12 people and 12 × 12 = 144 pairs of
people in all, and most pairs of people aren’t sisters. The table on the right, which
gives the same information, records only the positive instances and assumes that
all others are negative. The idea of specifying only positive examples and adopting
a standing assumption that the rest are negative is called the closed world
assumption. It is frequently assumed in theoretical studies; however, it is not of
much practical use in real-life problems because they rarely involve “closed”
worlds in which you can be certain that all cases are covered.
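The closed world assumption can be sketched as follows: start from the positive pairs in the right-hand table of Figure 2.1 and label every other ordered pair as negative. The function name expand_closed_world is hypothetical, and the count of 144 includes pairing a person with themselves, as the 12 × 12 figure in the text does.

```python
from itertools import product

# The 12 people in the family tree of Figure 2.1.
PEOPLE = [
    "Peter", "Peggy", "Grace", "Ray",
    "Steven", "Graham", "Pam", "Ian", "Pippa", "Brian",
    "Anna", "Nikki",
]

# Positive instances only: (first person, second person), where the
# second person is a sister of the first.
POSITIVES = {
    ("Steven", "Pam"),
    ("Graham", "Pam"),
    ("Ian", "Pippa"),
    ("Brian", "Pippa"),
    ("Anna", "Nikki"),
    ("Nikki", "Anna"),
}


def expand_closed_world(people, positives):
    """Label every ordered pair; pairs not listed as positive become negative."""
    return {
        (first, second): (first, second) in positives
        for first, second in product(people, repeat=2)
    }


if __name__ == "__main__":
    table = expand_closed_world(PEOPLE, POSITIVES)
    print(len(table), "pairs,", sum(table.values()), "positive")
```

The expansion is only valid if the list of positives is known to be complete, which is exactly the condition that rarely holds in practice.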
Neither table in Figure 2.1 is of any use without the family tree itself. This
tree can also be expressed in the form of a table, part of which is shown in Table
2.3. Now the problem is expressed in terms of two relationships. But these tables
do not contain independent sets of instances because values in the Name,
Parent1, and Parent2 columns of the sister-of relation refer to rows of the family
tree relation. We can make them into a single set of instances by collapsing the
two tables into the single one of Table 2.4.
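Collapsing the two relations amounts to a join: each name in the sister-of relation is replaced by that person's row from the family tree relation. The sketch below assumes the parent links as read from the tree in Figure 2.1 (Table 2.3 shows only part of them), uses None for parents not shown in the tree, and includes only a few sister-of rows. The names FAMILY, SISTER_OF, and flatten are illustrative.

```python
# Family tree relation: name -> (gender, parent1, parent2).
# Parent links are an assumption read from the tree in Figure 2.1;
# None marks a parent that the tree does not show.
FAMILY = {
    "Peter": ("M", None, None),
    "Peggy": ("F", None, None),
    "Grace": ("F", None, None),
    "Ray": ("M", None, None),
    "Steven": ("M", "Peter", "Peggy"),
    "Graham": ("M", "Peter", "Peggy"),
    "Pam": ("F", "Peter", "Peggy"),
    "Ian": ("M", "Grace", "Ray"),
    "Pippa": ("F", "Grace", "Ray"),
    "Brian": ("M", "Grace", "Ray"),
    "Anna": ("F", "Pam", "Ian"),
    "Nikki": ("F", "Pam", "Ian"),
}

# A few rows of the sister-of relation: (first person, second person, sister of?).
SISTER_OF = [
    ("Steven", "Pam", True),
    ("Ian", "Pippa", True),
    ("Anna", "Nikki", True),
    ("Steven", "Grace", False),
]


def flatten(sister_of, family):
    """Join each pair with both people's attributes to form one flat instance."""
    rows = []
    for first, second, is_sister in sister_of:
        rows.append((first, *family[first], second, *family[second], is_sister))
    return rows


if __name__ == "__main__":
    for row in flatten(SISTER_OF, FAMILY):
        print(row)
```

After the join, every row carries all the information it needs and no longer refers to a row in another table.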
We have at last succeeded in transforming the original relational problem
into the form of instances, each of which is an individual, independent example
Family tree (M = male, F = female; "=" joins a couple):

  Peter (M) = Peggy (F)    children: Steven (M), Graham (M), Pam (F)
  Grace (F) = Ray (M)      children: Ian (M), Pippa (F), Brian (M)
  Pam (F) = Ian (M)        children: Anna (F), Nikki (F)

Left table:

first person   second person   sister of?
Peter          Peggy           no
Peter          Steven          no
...            ...
Steven         Peter           no
Steven         Graham          no
Steven         Pam             yes
Steven         Grace           no
...            ...
Ian            Pippa           yes
...            ...
Anna           Nikki           yes
...            ...
Nikki          Anna            yes

Right table:

first person   second person   sister of?
Steven         Pam             yes
Graham         Pam             yes
Ian            Pippa           yes
Brian          Pippa           yes
Anna           Nikki           yes
Nikki          Anna            yes
All the rest                   no

Figure 2.1 A family tree and two ways of expressing the sister-of relation.