Chapter 1
What’s It All About?
Human in vitro fertilization involves collecting several eggs from a woman’s
ovaries, which, after fertilization with partner or donor sperm, produce several
embryos. Some of these are selected and transferred to the woman’s uterus. The
problem is to select the “best” embryos to use: the ones that are most likely to
survive. Selection is based on around 60 recorded features of the embryos,
characterizing their morphology, oocyte, follicle, and the sperm sample. The
number of features is sufficiently large that it is difficult for an embryologist to
assess them all simultaneously and correlate historical data with the crucial
outcome of whether that embryo did or did not result in a live child. In a
research project in England, machine learning is being investigated as a technique
for making the selection, using as training data historical records of
embryos and their outcome.
Every year, dairy farmers in New Zealand have to make a tough business decision:
which cows to retain in their herd and which to sell off to an abattoir. Typically,
one-fifth of the cows in a dairy herd are culled each year near the end of
the milking season as feed reserves dwindle. Each cow’s breeding and milk production
history influences this decision. Other factors include age (a cow is
nearing the end of its productive life at 8 years), health problems, history of difficult
calving, undesirable temperament traits (kicking or jumping fences), and
not being in calf for the following season. About 700 attributes for each of
several million cows have been recorded over the years. Machine learning is
being investigated as a way of ascertaining what factors are taken into account
by successful farmers, not to automate the decision but to propagate their skills
and experience to others.
Life and death. From Europe to the antipodes. Family and business. Machine
learning is a burgeoning new technology for mining knowledge from data, a
technology that a lot of people are starting to take seriously.
1.1 Data mining and machine learning
We are overwhelmed with data. The amount of data in the world, in our lives,
seems to go on and on increasing—and there’s no end in sight. Omnipresent
personal computers make it too easy to save things that previously we would
have trashed. Inexpensive multigigabyte disks make it too easy to postpone decisions
about what to do with all this stuff: we simply buy another disk and keep
it all. Ubiquitous electronics record our decisions, our choices in the supermarket,
our financial habits, our comings and goings. We swipe our way through
the world, every swipe a record in a database. The World Wide Web overwhelms
us with information; meanwhile, every choice we make is recorded. And all these
are just personal choices: they have countless counterparts in the world of commerce
and industry. We would all testify to the growing gap between the generation of data
and our understanding of it. As the volume of data increases, inexorably, the
proportion of it that people understand decreases, alarmingly.
Lying hidden in all this data is information, potentially useful information, that
is rarely made explicit or taken advantage of.
This book is about looking for patterns in data. There is nothing new about
this. People have been seeking patterns in data since human life began. Hunters
seek patterns in animal migration behavior, farmers seek patterns in crop
growth, politicians seek patterns in voter opinion, and lovers seek patterns in
their partners’ responses. A scientist’s job (like a baby’s) is to make sense of data,
to discover the patterns that govern how the physical world works and encapsulate
them in theories that can be used for predicting what will happen in new
situations. The entrepreneur’s job is to identify opportunities, that is, patterns
in behavior that can be turned into a profitable business, and exploit them.
In data mining, the data is stored electronically and the search is automated,
or at least augmented, by computer. Even this is not particularly new. Economists,
statisticians, forecasters, and communication engineers have long worked
with the idea that patterns in data can be sought automatically, identified,
validated, and used for prediction. What is new is the staggering increase in
opportunities for finding patterns in data. The unbridled growth of databases
in recent years, databases on such everyday activities as customer choices, brings
data mining to the forefront of new business technologies. It has been estimated
that the amount of data stored in the world’s databases doubles every 20
months, and although it would surely be difficult to justify this figure in any
quantitative sense, we can all relate to the pace of growth qualitatively. As the
flood of data swells and machines that can undertake the searching become
commonplace, the opportunities for data mining increase. As the world grows
in complexity, overwhelming us with the data it generates, data mining becomes
our only hope for elucidating the patterns that underlie it. Intelligently analyzed
data is a valuable resource. It can lead to new insights and, in commercial settings,
to competitive advantages.
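To get a feel for what a 20-month doubling time would imply, the short calculation below converts it into an annual growth factor and a ten-year multiple. It is only a sketch: the function name is illustrative, and the 20-month figure is the rough estimate quoted above, not a measured value.

```python
def growth_factor(months_elapsed: float, doubling_months: float = 20.0) -> float:
    """Multiplicative growth after months_elapsed, given a fixed doubling time."""
    return 2.0 ** (months_elapsed / doubling_months)


# Annual growth implied by doubling every 20 months (the text's rough estimate).
annual = growth_factor(12)
# Growth over a decade: 120 months is exactly six doublings.
decade = growth_factor(120)

print(f"per year: x{annual:.2f} (about {100 * (annual - 1):.0f}% growth)")
print(f"per decade: x{decade:.0f}")
```

Under that assumption the stored data would grow by roughly half each year and 64-fold in ten years, which is one way to make the qualitative sense of the pace of growth concrete.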
Data mining is about solving problems by analyzing data already present in
databases. Suppose, to take a well-worn example, the problem is fickle customer
loyalty in a highly competitive marketplace. A database of customer choices,
along with customer profiles, holds the key to this problem. Patterns of
behavior of former customers can be analyzed to identify distinguishing characteristics
of those likely to switch products and those likely to remain loyal. Once
such characteristics are found, they can be put to work to identify present customers
who are likely to jump ship. This group can be targeted for special treatment,
treatment too costly to apply to the customer base as a whole. More
positively, the same techniques can be used to identify customers who might be
attracted to another service the enterprise provides, one they are not presently
enjoying, to target them for special offers that promote this service. In today’s
highly competitive, customer-centered, service-oriented economy, data is the
raw material that fuels business growth—if only it can be mined.
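The customer-loyalty scenario can be sketched in a few lines of code. The example below is a minimal illustration, not a method prescribed by the text: it assumes a tiny, made-up table of former customers, and it uses a deliberately simple learner that picks the single attribute whose values best predict switching. The attribute names (`plan`, `support_calls`), the records, and the function name are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records of FORMER customers: profile attributes plus known outcome.
former_customers = [
    {"plan": "basic",   "support_calls": "many", "switched": True},
    {"plan": "basic",   "support_calls": "few",  "switched": False},
    {"plan": "premium", "support_calls": "many", "switched": True},
    {"plan": "premium", "support_calls": "few",  "switched": False},
    {"plan": "basic",   "support_calls": "many", "switched": True},
    {"plan": "premium", "support_calls": "few",  "switched": False},
]


def learn_one_attribute_rule(records, target):
    """Pick the single attribute whose values best predict `target`.

    For each attribute, map every observed value to its majority outcome and
    count how many training records that mapping gets wrong.
    """
    best = None
    attributes = [key for key in records[0] if key != target]
    for attr in attributes:
        outcomes = defaultdict(Counter)
        for record in records:
            outcomes[record[attr]][record[target]] += 1
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in outcomes.items()}
        errors = sum(rule[record[attr]] != record[target] for record in records)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


attr, rule, errors = learn_one_attribute_rule(former_customers, "switched")

# Apply the learned characteristic to PRESENT customers (also hypothetical).
present_customers = [
    {"name": "A", "plan": "basic",   "support_calls": "few"},
    {"name": "B", "plan": "premium", "support_calls": "many"},
]
at_risk = [c["name"] for c in present_customers if rule.get(c[attr], False)]
print(attr, rule, at_risk)
```

On this toy data the learner settles on `support_calls`, and customer B is flagged as likely to jump ship. Real customer databases would need far more records, more attributes, and an evaluation on data held back from training.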
Data mining is defined as the process of discovering patterns in data. The
process must be automatic or (more usually) semiautomatic. The patterns
discovered must be meaningful in that they lead to some advantage, usually
an economic advantage. The data is invariably present in substantial
quantities.
How are the patterns expressed? Useful patterns allow us to make nontrivial
predictions on new data. There are two extremes for the expression of a pattern:
as a black box whose innards are effectively incomprehensible and as a transparent
box whose construction reveals the structure of the pattern. Both, we are
assuming, make good predictions. The difference is whether or not the patterns
that are mined are represented in terms of a structure that can be examined,
reasoned about, and used to inform future decisions. Such patterns we call structural
because they capture the decision structure in an explicit way. In other
words, they help to explain something about the data.
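The contrast between the two extremes can be illustrated with a small sketch. Both predictors below give the same answers on every input, but only the first exposes its decision structure. Everything here is assumed for illustration: the attributes, the single rule, and the numeric weights, which merely stand in for an opaque model.

```python
from itertools import product

# Structural pattern: explicit if-then rules that can be printed and examined.
# Each rule is (attribute, value, outcome); the first matching rule fires.
RULES = [("support_calls", "many", "switch")]
DEFAULT = "stay"


def predict_structural(record):
    for attribute, value, outcome in RULES:
        if record.get(attribute) == value:
            return outcome
    return DEFAULT


def describe_rules():
    """Render the rule list in a form a person can read and reason about."""
    lines = [f"IF {a} = {v} THEN {o}" for a, v, o in RULES]
    lines.append(f"OTHERWISE {DEFAULT}")
    return lines


# Black box stand-in: made-up weights that yield predictions but no explanation.
_WEIGHTS = {
    ("support_calls", "many"): 1.7,
    ("support_calls", "few"): -0.9,
    ("plan", "basic"): 0.2,
    ("plan", "premium"): -0.1,
}


def predict_black_box(record):
    score = sum(_WEIGHTS.get(item, 0.0) for item in record.items())
    return "switch" if score > 0.5 else "stay"


# Enumerate every combination of the two hypothetical attributes.
records = [{"plan": p, "support_calls": s}
           for p, s in product(["basic", "premium"], ["many", "few"])]
agree = all(predict_structural(r) == predict_black_box(r) for r in records)

print("same predictions:", agree)
print("\n".join(describe_rules()))
```

The rule list can be read, questioned, and reused in later decisions, whereas the weights only yield an answer; that difference, not predictive accuracy, is what makes a pattern structural in the sense used here.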