Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




Part: Machine Learning Tools and Techniques

Chapter 1: What’s It All About?

Human in vitro fertilization involves collecting several eggs from a woman’s ovaries, which, after fertilization with partner or donor sperm, produce several embryos. Some of these are selected and transferred to the woman’s uterus. The problem is to select the “best” embryos to use, the ones that are most likely to survive. Selection is based on around 60 recorded features of the embryos, characterizing their morphology, oocyte, follicle, and the sperm sample. The number of features is sufficiently large that it is difficult for an embryologist to assess them all simultaneously and correlate historical data with the crucial outcome of whether that embryo did or did not result in a live child. In a research project in England, machine learning is being investigated as a technique for making the selection, using as training data historical records of embryos and their outcome.

Every year, dairy farmers in New Zealand have to make a tough business decision: which cows to retain in their herd and which to sell off to an abattoir. Typically, one-fifth of the cows in a dairy herd are culled each year near the end of the milking season as feed reserves dwindle. Each cow’s breeding and milk production history influences this decision. Other factors include age (a cow is nearing the end of its productive life at 8 years), health problems, history of difficult calving, undesirable temperament traits (kicking or jumping fences), and not being in calf for the following season. About 700 attributes for each of several million cows have been recorded over the years. Machine learning is being investigated as a way of ascertaining what factors are taken into account by successful farmers: not to automate the decision but to propagate their skills and experience to others.

Life and death. From Europe to the antipodes. Family and business. Machine learning is a burgeoning new technology for mining knowledge from data, a technology that a lot of people are starting to take seriously.

1.1 Data mining and machine learning

We are overwhelmed with data. The amount of data in the world, in our lives, seems to go on and on increasing, and there’s no end in sight. Omnipresent personal computers make it too easy to save things that previously we would have trashed. Inexpensive multigigabyte disks make it too easy to postpone decisions about what to do with all this stuff: we simply buy another disk and keep it all. Ubiquitous electronics record our decisions, our choices in the supermarket, our financial habits, our comings and goings. We swipe our way through the world, every swipe a record in a database. The World Wide Web overwhelms us with information; meanwhile, every choice we make is recorded. And all these are just personal choices: they have countless counterparts in the world of commerce and industry. We would all testify to the growing gap between the generation of data and our understanding of it. As the volume of data increases, inexorably, the proportion of it that people understand decreases, alarmingly. Lying hidden in all this data is information, potentially useful information, that is rarely made explicit or taken advantage of.

This book is about looking for patterns in data. There is nothing new about this. People have been seeking patterns in data since human life began. Hunters seek patterns in animal migration behavior, farmers seek patterns in crop growth, politicians seek patterns in voter opinion, and lovers seek patterns in their partners’ responses. A scientist’s job (like a baby’s) is to make sense of data, to discover the patterns that govern how the physical world works and encapsulate them in theories that can be used for predicting what will happen in new situations. The entrepreneur’s job is to identify opportunities, that is, patterns in behavior that can be turned into a profitable business, and exploit them.

In data mining, the data is stored electronically and the search is automated, or at least augmented, by computer. Even this is not particularly new. Economists, statisticians, forecasters, and communication engineers have long worked with the idea that patterns in data can be sought automatically, identified, validated, and used for prediction. What is new is the staggering increase in opportunities for finding patterns in data. The unbridled growth of databases in recent years, databases on such everyday activities as customer choices, brings data mining to the forefront of new business technologies. It has been estimated that the amount of data stored in the world’s databases doubles every 20 months, and although it would surely be difficult to justify this figure in any quantitative sense, we can all relate to the pace of growth qualitatively. As the flood of data swells and machines that can undertake the searching become commonplace, the opportunities for data mining increase. As the world grows in complexity, overwhelming us with the data it generates, data mining becomes our only hope for elucidating the patterns that underlie it. Intelligently analyzed data is a valuable resource. It can lead to new insights and, in commercial settings, to competitive advantages.
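To get a feel for what "doubles every 20 months" would imply, the arithmetic can be sketched as follows. This is only an illustration: it assumes smooth exponential growth, which the text itself cautions is hard to justify quantitatively, and the function name growth_factor is made up for this example.

```python
# Assumption (not from the text): growth is smooth and exponential,
# with the volume of stored data doubling every 20 months.
DOUBLING_MONTHS = 20


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Factor by which the data volume grows over the given number of months."""
    return 2 ** (months / doubling_months)


annual = growth_factor(12)    # 2 ** 0.6, roughly a 52% increase per year
decade = growth_factor(120)   # 2 ** 6, six doublings in ten years

print(f"per year: x{annual:.2f}, per decade: x{decade:.0f}")
```

Under that assumption, the estimate corresponds to growth of roughly half again each year and a 64-fold increase over a decade.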

Data mining is about solving problems by analyzing data already present in databases. Suppose, to take a well-worn example, the problem is fickle customer loyalty in a highly competitive marketplace. A database of customer choices, along with customer profiles, holds the key to this problem. Patterns of behavior of former customers can be analyzed to identify distinguishing characteristics of those likely to switch products and those likely to remain loyal. Once such characteristics are found, they can be put to work to identify present customers who are likely to jump ship. This group can be targeted for special treatment, treatment too costly to apply to the customer base as a whole. More positively, the same techniques can be used to identify customers who might be attracted to another service the enterprise provides, one they are not presently enjoying, to target them for special offers that promote this service. In today’s highly competitive, customer-centered, service-oriented economy, data is the raw material that fuels business growth, if only it can be mined.
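The customer-loyalty scenario can be made concrete with a toy sketch. Everything in it is hypothetical: the attribute names (contract, support_calls), the records, and the customers are invented for illustration, and the learner is just one very simple way of finding a distinguishing characteristic (choose the single attribute whose values best separate the outcomes), not a method prescribed by the text.

```python
from collections import Counter, defaultdict

# Hypothetical records of former customers: attribute values plus the known outcome.
FORMER = [
    ({"contract": "monthly", "support_calls": "many"}, "switched"),
    ({"contract": "monthly", "support_calls": "few"}, "switched"),
    ({"contract": "monthly", "support_calls": "many"}, "switched"),
    ({"contract": "annual", "support_calls": "many"}, "loyal"),
    ({"contract": "annual", "support_calls": "few"}, "loyal"),
    ({"contract": "annual", "support_calls": "few"}, "loyal"),
    ({"contract": "monthly", "support_calls": "few"}, "loyal"),
]

# Hypothetical present customers whose outcome is not yet known.
PRESENT = {
    "alice": {"contract": "monthly", "support_calls": "few"},
    "bob": {"contract": "annual", "support_calls": "many"},
}


def learn_single_attribute_rule(records):
    """Pick the attribute whose values best separate the outcomes.

    For each attribute, map every value to its most common outcome, count how
    many records that mapping gets wrong, and keep the attribute with the
    fewest errors. Returns (attribute, value-to-outcome mapping, error count).
    """
    best = None
    for attr in records[0][0]:
        counts = defaultdict(Counter)
        for features, outcome in records:
            counts[features[attr]][outcome] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(outcome != rule[features[attr]] for features, outcome in records)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


attr, rule, errors = learn_single_attribute_rule(FORMER)

# Put the discovered characteristic to work on present customers.
at_risk = [name for name, features in PRESENT.items()
           if rule.get(features[attr]) == "switched"]

print(attr, rule, errors)
print(at_risk)
```

On this invented data the contract type separates the outcomes best, and the present customers flagged by it are the group that could be singled out for special treatment.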

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern: as a black box whose innards are effectively incomprehensible and as a transparent box whose construction reveals the structure of the pattern. Both, we are assuming, make good predictions. The difference is whether or not the patterns that are mined are represented in terms of a structure that can be examined, reasoned about, and used to inform future decisions. Such patterns we call structural because they capture the decision structure in an explicit way. In other words, they help to explain something about the data.
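The contrast between the two extremes can be sketched in a few lines. The example below is an illustration under stated assumptions, not something taken from the text: the attributes, weights, and rules are all invented, and both predictors are hand-written rather than learned. They make identical predictions, but only the rule list can be read and reasoned about.

```python
from itertools import product

Customer = dict[str, str]  # hypothetical attribute name -> value


def black_box(customer: Customer) -> str:
    """Opaque scorer: the made-up weights predict, but explain little on their own."""
    weights = {
        ("contract", "monthly"): 0.83, ("contract", "annual"): -1.27,
        ("support_calls", "many"): 0.41, ("support_calls", "few"): -0.36,
    }
    score = sum(weights.get(item, 0.0) for item in customer.items())
    return "switched" if score > 0 else "loyal"


# Transparent box: an ordered list of (conditions, prediction) rules.
# An empty condition set acts as the default rule.
STRUCTURAL_RULES = [
    ({"contract": "monthly"}, "switched"),
    ({}, "loyal"),
]


def transparent_box(customer: Customer) -> str:
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, prediction in STRUCTURAL_RULES:
        if all(customer.get(k) == v for k, v in conditions.items()):
            return prediction
    raise ValueError("no rule matched")


def describe(rules) -> list[str]:
    """Render the rule list as readable text, one rule per line."""
    lines = []
    for conditions, prediction in rules:
        if conditions:
            test = " and ".join(f"{k} = {v}" for k, v in conditions.items())
            lines.append(f"if {test} then {prediction}")
        else:
            lines.append(f"otherwise {prediction}")
    return lines


# Every combination of the two hypothetical attributes.
customers = [
    {"contract": c, "support_calls": s}
    for c, s in product(["monthly", "annual"], ["many", "few"])
]
agree = all(black_box(c) == transparent_box(c) for c in customers)

print("\n".join(describe(STRUCTURAL_RULES)))
print("same predictions:", agree)
```

Both functions serve equally well for prediction here; the difference is that the rule list states the decision structure explicitly, which is what makes a pattern structural in the sense described above.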
