in Section 6.3. We have included more recent
material on implementing
nonlinear decision boundaries using both the kernel perceptron and radial basis
function networks. There is a new section on Bayesian networks, again in
response to readers’ requests, with a description of how to learn classifiers based
on these networks and how to implement them efficiently using all-dimensions (AD)
trees.
The Weka machine learning workbench that accompanies the book, a widely used and popular feature of the first edition, has acquired a radical new look in the form of an interactive interface (or rather, three separate interactive interfaces) that makes it far easier to use. The primary one is the Explorer, which gives access to all of Weka's facilities using menu selection and form filling. The others are the Knowledge Flow interface, which allows you to design configurations for streamed data processing, and the Experimenter, with which you set up automated experiments that run selected machine learning algorithms with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests on the results. These interfaces lower the bar for becoming a practicing data miner, and we include a full description of how to use them. However, the book continues to stand alone, independent of Weka, and to underline this we have moved all material on the workbench into a separate Part II at the end of the book.
In addition to becoming far easier to use, Weka has grown over the last 5 years and matured enormously in its data mining capabilities. It now includes an unparalleled range of machine learning algorithms and related techniques. The growth has been partly stimulated by recent developments in the field and partly led by Weka users and driven by demand. As a result, we know a great deal about what actual users of data mining want, and we have capitalized on this experience when deciding what to include in this new edition.
The earlier chapters, containing more general and foundational material, have changed relatively little. We have added more examples of fielded applications to Chapter 1; a new subsection on sparse data, along with brief coverage of string attributes and date attributes, to Chapter 2; and, to Chapter 3, a description of interactive decision tree construction, a useful and revealing technique that helps you grapple with your data by building decision trees manually.
In addition to introducing linear decision boundaries for classification (the infrastructure for neural networks), Chapter 4 includes new material on multinomial Bayes models for document classification and on logistic regression. The last 5 years have seen great interest in data mining for text, and this is reflected in our introduction to string attributes in Chapter 2, multinomial Bayes for document classification in Chapter 4, and text transformations in Chapter 7.
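To give a concrete flavor of the multinomial Bayes model mentioned above, here is a minimal illustrative sketch in Python (not Weka's Java implementation; the function names and toy data are invented for illustration). Each class gets a prior and a Laplace-smoothed distribution over words, and a document is assigned to the class with the highest log-probability under its bag of words:

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model on bag-of-words documents.
    docs: list of token lists; labels: parallel list of class names."""
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    prior, cond = {}, {}
    for c in classes:
        c_docs = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = math.log(len(c_docs) / len(docs))
        counts = Counter(w for d in c_docs for w in d)
        denom = sum(counts.values()) + alpha * len(vocab)
        # Laplace smoothing keeps unseen-in-class words from zeroing the score.
        cond[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Pick the class maximizing log P(c) + sum of log P(w|c); unseen words skipped."""
    def score(c):
        return prior[c] + sum(cond[c][w] for w in doc if w in vocab)
    return max(prior, key=score)
```

On a toy corpus of "spam" and "ham" token lists, the classifier picks whichever class makes the document's words most probable; word frequencies, not just presence, drive the multinomial score.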
Chapter 4 includes a great deal of new material on efficient data structures for
searching the instance space: kD-trees and the recently invented ball trees. These
are used to find nearest neighbors efficiently and to accelerate distance-based
clustering.
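As a rough illustration of why such structures speed up nearest-neighbor search, here is a minimal pure-Python kD-tree sketch (invented for this preface, not Weka's implementation): the search descends toward the query point first, then prunes any subtree whose splitting plane lies farther away than the best match found so far.

```python
import math

def build_kd_tree(points, depth=0):
    """Recursively split the points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kd_tree(points[:mid], depth + 1),
            "right": build_kd_tree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Depth-first search that prunes branches farther than the best so far."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best
```

On well-spread data the pruning test skips most of the tree, which is what turns brute-force linear scans into the fast lookups that distance-based clustering relies on.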
Chapter 5 describes the principles of statistical evaluation of machine learning, which have not changed. The main addition, apart from a note on the Kappa statistic for measuring the success of a predictor, is a more detailed treatment of cost-sensitive learning. We describe how to use a classifier, built without taking costs into consideration, to make predictions that are sensitive to cost; alternatively, we explain how to take costs into account during the training process to build a cost-sensitive model. We also cover the popular new technique of cost curves.
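The first of these ideas, making cost-sensitive predictions from an ordinary probability-emitting classifier, boils down to choosing the class with minimum expected cost rather than maximum probability. A tiny sketch (the cost figures and names here are invented for illustration):

```python
def min_expected_cost_class(probs, cost):
    """probs: class -> predicted probability from an ordinary classifier.
    cost: (actual_class, predicted_class) -> misclassification cost.
    Pick the prediction minimizing expected cost under the model's probabilities."""
    def expected(pred):
        return sum(p * cost[(actual, pred)] for actual, p in probs.items())
    return min(probs, key=expected)

# Hypothetical cost matrix: calling legitimate mail spam is ten times worse
# than letting a spam message through; correct decisions cost nothing.
cost = {("ham", "spam"): 10.0, ("spam", "ham"): 1.0,
        ("ham", "ham"): 0.0, ("spam", "spam"): 0.0}
```

With these costs the decision flips from plain maximum probability: even when the model assigns "spam" the higher probability, the expected cost of a false alarm can tip the prediction to "ham".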
There are several additions to Chapter 6, apart from the previously mentioned material on neural networks and Bayesian network classifiers. More details (gory details) are given of the heuristics used in the successful RIPPER rule learner. We describe how to use model trees to generate rules for numeric prediction. We show how to apply locally weighted regression to classification problems. Finally, we describe the X-means clustering algorithm, which is a big improvement on traditional k-means.
Chapter 7, on engineering the input and output, has changed most, because this is where recent developments in practical machine learning have been concentrated. We describe new attribute selection schemes, such as race search and the use of support vector machines, and new methods for combining models, such as additive regression, additive logistic regression, logistic model trees, and option trees. We give a full account of LogitBoost, which was mentioned in the first edition but not described. There is a new section on useful transformations, including principal components analysis and transformations for text mining and time series. We also cover recent developments in using unlabeled data to improve classification, including the co-training and co-EM methods.
The final chapter of Part I, on new directions and different perspectives, has been reworked to keep up with the times and now includes contemporary challenges such as adversarial learning and ubiquitous data mining.
Acknowledgments
Writing the acknowledgments is always the nicest part! A lot of people have
helped us, and we relish this opportunity to thank them. This book has arisen
out of the machine learning research project in the Computer Science Depart-
ment at the University of Waikato, New Zealand. We have received generous
encouragement and assistance from the academic staff members on that project:
John Cleary, Sally Jo Cunningham, Matt Humphrey, Lyn Hunt, Bob McQueen,
Lloyd Smith, and Tony Smith. Special thanks go to Mark Hall, Bernhard
Pfahringer, and above all Geoff Holmes, the project leader and source of inspi-