Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


Preface

The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases, information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth.

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. In addition, real data is imperfect: some parts will be garbled, and some will be missing. Anything discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.
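The warning about spurious patterns can be made concrete with a small experiment. The following is only a minimal sketch in Python, not code from the book or from Weka: the function names, the dataset sizes, and the random seed are all assumptions chosen for illustration. It generates attributes and class labels that are pure noise, then searches for the single attribute that best "predicts" the class on that same data.

```python
import random


def best_single_attribute_accuracy(rows, labels):
    """Return (index, training accuracy) of the binary attribute whose
    value -> majority-class rule fits the given examples best."""
    best_index, best_acc = -1, 0.0
    for j in range(len(rows[0])):
        correct = 0
        for value in (0, 1):
            matching = [lab for row, lab in zip(rows, labels) if row[j] == value]
            if matching:
                # The rule predicts the majority class among these examples.
                correct += max(matching.count(0), matching.count(1))
        acc = correct / len(rows)
        if acc > best_acc:
            best_index, best_acc = j, acc
    return best_index, best_acc


def random_dataset(rng, n_rows, n_attrs):
    """Hypothetical noise data: attributes and labels are independent coin flips."""
    rows = [[rng.randint(0, 1) for _ in range(n_attrs)] for _ in range(n_rows)]
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    return rows, labels


rng = random.Random(0)  # fixed seed, chosen arbitrarily
rows, labels = random_dataset(rng, n_rows=10, n_attrs=200)
index, train_acc = best_single_attribute_accuracy(rows, labels)
print(f"attribute {index} fits {train_acc:.0%} of the training examples")
```

Because the labels are independent of every attribute, no rule found this way can be expected to beat chance on new data. With few examples and many attributes, however, some attribute will usually appear to fit the training examples well, which is the kind of accidental coincidence described above.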

Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases, information that is expressed in a comprehensible form and can be used for a variety of purposes. The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. This book is about the tools and techniques of machine learning used in practical data mining for finding, and describing, structural patterns in data.

As with any burgeoning new technology that enjoys intense commercial attention, the use of data mining is surrounded by a great deal of hype in the technical (and sometimes the popular) press. Exaggerated reports appear of the secrets that can be uncovered by setting learning algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead, there is an identifiable body of simple and practical techniques that can often extract useful information from raw data. This book describes these techniques and shows how they work.

We interpret machine learning as the acquisition of structural descriptions from examples. The kind of descriptions found can be used for prediction, explanation, and understanding. Some data mining applications focus on prediction: forecasting what will happen in new situations from data that describe what happened in the past, often by guessing the classification of new examples. But we are equally (perhaps more) interested in applications in which the result of “learning” is an actual description of a structure that can be used to classify examples. This structural description supports explanation, understanding, and prediction. In our experience, insights gained by the applications’ users are of most interest in the majority of practical data mining applications; indeed, this is one of machine learning’s major advantages over classical statistical modeling.
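To illustrate what a structural description learned from examples might look like, here is a minimal sketch in Python. It is an assumption made for illustration, not a method taken from this preface or from the Weka code: it uses a simple one-attribute rule scheme, and the function names and the toy dataset are invented. The learner picks the single attribute whose value-to-class rule makes the fewest errors on the examples, and the result is a small rule that a person can read as well as use for classification.

```python
from collections import Counter, defaultdict


def learn_one_attribute_rule(examples, attributes, target):
    """Return (attribute, rule, errors) for the single attribute whose
    value -> majority-class rule makes the fewest errors on the examples."""
    best = None
    for attr in attributes:
        # Count class labels separately for each value of this attribute.
        counts = defaultdict(Counter)
        for ex in examples:
            counts[ex[attr]][ex[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - c[rule[value]] for value, c in counts.items())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


def classify(attr, rule, example, default):
    """Apply the learned rule; fall back to `default` for unseen values."""
    return rule.get(example[attr], default)


# Hypothetical toy data, invented for this sketch.
examples = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

attr, rule, errors = learn_one_attribute_rule(examples, ["outlook", "windy"], "play")
print(f"rule on '{attr}' with {errors} training error(s):")
for value, label in rule.items():
    print(f"  if {attr} = {value} then play = {label}")

prediction = classify(attr, rule, {"outlook": "sunny", "windy": "no"}, default="yes")
```

The printed rule serves both purposes discussed above: it predicts the class of a new example, and it can be inspected directly, which is where the explanatory value lies. The fallback class passed as `default` is a design choice of this sketch for attribute values that never occurred in the examples.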

The book explains a variety of machine learning methods. Some are pedagogically motivated: simple schemes designed to explain clearly how the basic ideas work. Others are practical: real systems used in applications today. Many are contemporary and have been developed only in the last few years.

A comprehensive software resource, written in the Java language, has been created to illustrate the ideas in the book. Called the Waikato Environment for Knowledge Analysis, or Weka for short, it is available as source code on the World Wide Web at http://www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-strength implementation of essentially all the techniques covered in this book. It includes illustrative code and working implementations of machine learning methods. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of the mechanisms involved. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research. Finally, it contains a framework, in the form of a Java class library, that supports applications that use embedded machine learning and even the implementation of new learning schemes.

The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability. If you wish to experiment with your own data, you will be able to do this easily with the Weka software.


Found only on the islands of New Zealand, the weka (pronounced to rhyme with Mecca) is a flightless bird with an inquisitive nature.


The book spans the gulf between the intensely practical approach taken by trade books that provide case studies on data mining and the more theoretical, principle-driven exposition found in current textbooks on machine learning. (A brief description of these books appears in the Further reading section at the end of Chapter 1.) This gulf is rather wide. To apply machine learning techniques productively, you need to understand something about how they work; this is not a technology that you can apply blindly and expect to get good results. Different problems yield to different techniques, but it is rarely obvious which techniques are suitable for a given situation: you need to know something about the range of possible solutions. We cover an extremely wide range of techniques. We can do this because, unlike many trade books, this volume does not promote any particular commercial software or approach. We include a large number of examples, but they use illustrative datasets that are small enough to allow you to follow what is going on. Real datasets are far too large to show this (and in any case are usually company confidential). Our datasets are chosen not to illustrate actual large-scale practical problems but to help you understand what the different techniques do, how they work, and what their range of application is.

The book is aimed at the technically aware general reader interested in the principles and ideas underlying the current practice of data mining. It will also be of interest to information professionals who need to become acquainted with this new technology and to all those who wish to gain a detailed technical understanding of what machine learning involves. It is written for an eclectic audience of information systems practitioners, programmers, consultants, developers, information technology managers, specification writers, patent examiners, and curious laypeople (as well as students and professors) who need an easy-to-read book with lots of illustrations that describes what the major machine learning techniques are, what they do, how they are used, and how they work. It is practically oriented, with a strong “how to” flavor, and includes algorithms, code, and implementations. All those involved in practical data mining will benefit directly from the techniques described. The book is aimed at people who want to cut through to the reality that underlies the hype about machine learning and who seek a practical, nonacademic, unpretentious approach. We have avoided requiring any specific theoretical or mathematical knowledge except in some sections marked by a light gray bar in the margin. These contain optional material, often for the more technical or theoretically inclined reader, and may be skipped without loss of continuity.

The book is organized in layers that make the ideas accessible to readers who are interested in grasping the basics and to those who would like more depth of treatment, along with full details on the techniques covered. We believe that consumers of machine learning need to have some idea of how the algorithms they use work. It is often observed that data models are only as good as the person
