Data Mining
Practical Machine Learning Tools and Techniques,
Second Edition
Ian H. Witten
Department of Computer Science
University of Waikato
Eibe Frank
Department of Computer Science
University of Waikato
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
MORGAN KAUFMANN PUBLISHERS IS AN IMPRINT OF ELSEVIER
P088407-FM.qxd 4/30/05 10:55 AM Page iii
Publisher: Diane Cerra
Publishing Services Manager: Simon Crump
Project Manager: Brandy Lilly
Editorial Assistant: Asma Stephan
Cover Design: Yvo Riezebos Design
Cover Image: Getty Images
Composition: SNP Best-set Typesetter Ltd., Hong Kong
Technical Illustration: Dartmouth Publishing, Inc.
Copyeditor: Graphic World Inc.
Proofreader: Graphic World Inc.
Indexer: Graphic World Inc.
Interior printer: The Maple-Vail Book Manufacturing Group
Cover printer: Phoenix Color Corp
Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111
This book is printed on acid-free paper.
© 2005 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks
or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a
claim, the product names appear in initial capital or all capital letters. Readers, however, should
contact the appropriate companies for more complete information regarding trademarks and
registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—
without prior written permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in
Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail:
permissions@elsevier.com.uk. You may also complete your request on-line via the Elsevier
homepage (http://elsevier.com) by selecting “Customer Support” and then “Obtaining
Permissions.”
Library of Congress Cataloging-in-Publication Data
Witten, I. H. (Ian H.)
Data mining : practical machine learning tools and techniques / Ian H. Witten, Eibe
Frank. – 2nd ed.
p. cm. – (Morgan Kaufmann series in data management systems)
Includes bibliographical references and index.
ISBN: 0-12-088407-0
1. Data mining.
I. Frank, Eibe.
II. Title.
III. Series.
QA76.9.D343W58 2005
006.3–dc22
2005043385
For information on all Morgan Kaufmann publications,
visit our Web site at www.mkp.com or www.books.elsevier.com
Printed in the United States of America
05 06 07 08 09
5 4 3 2 1
Working together to grow
libraries in developing countries
www.elsevier.com | www.bookaid.org | www.sabre.org
Foreword
Jim Gray, Series Editor
Microsoft Research
Technology now allows us to capture and store vast quantities of data. Finding
patterns, trends, and anomalies in these datasets, and summarizing them
with simple quantitative models, is one of the grand challenges of the
information age: turning data into information and turning information into
knowledge.
There has been stunning progress in data mining and machine learning. The
synthesis of statistics, machine learning, information theory, and computing has
created a solid science, with a firm mathematical base, and with very powerful
tools. Witten and Frank present much of this progress in this book and in the
companion implementation of the key algorithms. As such, this is a milestone
in the synthesis of data mining, data analysis, information theory, and machine
learning. If you have not been following this field for the last decade, this is a
great way to catch up on this exciting progress. If you have, then Witten and
Frank’s presentation and the companion open-source workbench, called Weka,
will be a useful addition to your toolkit.
They present the basic theory of automatically extracting models from data,
and then validating those models. The book does an excellent job of explaining
the various models (decision trees, association rules, linear models, clustering,
Bayes nets, neural nets) and how to apply them in practice. With this basis, they
then walk through the steps and pitfalls of various approaches. They describe
how to safely scrub datasets, how to build models, and how to evaluate a model’s
predictive quality. Most of the book is tutorial, but Part II broadly describes how
commercial systems work and gives a tour of the publicly available data mining
workbench that the authors provide through a website. This Weka workbench
has a graphical user interface that leads you through data mining tasks and has
excellent data visualization tools that help you understand the models. It is a great
companion to the text and a useful and popular tool in its own right.
This book presents this new discipline in a very accessible form: as a text
both to train the next generation of practitioners and researchers and to inform
lifelong learners like myself. Witten and Frank have a passion for simple and
elegant solutions. They approach each topic with this mindset, grounding all
concepts in concrete examples, and urging the reader to consider the simple
techniques first, and then progress to the more sophisticated ones if the simple
ones prove inadequate.
If you are interested in databases, and have not been following the machine
learning field, this book is a great way to catch up on this exciting progress. If
you have data that you want to analyze and understand, this book and the
associated Weka toolkit are an excellent way to start.