I N D E X
numeric prediction (continued)
  model tree, 244–251. See also model tree
  rules, 251
  stacking, 334
  trees, 76, 243
NumericToBinary, 399
NumericTransform, 397
O
O(n), 196
O(n²), 196
Obfuscate, 396, 400
object editor, 366, 381, 393
Occam’s razor, 180, 183
oil slick detection, 23
1R procedure, 84–88, 139
OneR, 408
OneRAttributeEval, 423
one-tailed, 148
online documentation, 368
Open DB, 382
optimizing performance in Weka, 417
OptionHandler, 451, 482
option nodes, 328
option trees, 328–331
orderings
  circular, 349
  partial, 349
order-independent rules, 67, 112
OrdinalClassClassifier, 418
ordinal attributes, 51
ordinal quantities, 50
orthogonal, 307
outer cross-validation, 286
outliers, 313, 342
output
  data engineering, 287–288, 315–341. See also engineering input and output
  knowledge representation, 61–82. See also knowledge representation
overfitting, 86
  Bayesian clustering, 268
  category utility, 261
  forward stagewise additive regression, 326
  MDL principle, 181
  multilayer perceptrons, 233
  1R, 87
  statistical tests, 30
  support vectors, 217–218
overfitting-avoidance bias, 34
overgeneralization, 239, 243
overlapping hyperrectangles, 239
overlay data, 53
P
pace regression in Weka, 410
PaceRegression, 410
paired t-test, 154, 294
pairwise classification, 123, 410
pairwise coupling, 123
pairwise plots, 60
parabola, 240
parallelization, 347
parameter tuning, 286
Part, 409
partial decision tree, 207–210
partial ordering, 51
partitioning instance space, 79
pattern recognition, 39
Percentage split, 377
perceptron
  defined, 126
  kernel, 223
  learning rule, 124, 125
  linear classification, 124–126
  multilayer, 223–226, 233
  voted, 223
perceptron learning rule, 124, 125
permutation tests, 362
PKIDiscretize, 396, 398
Poisson distribution, 268
Polygon, 389
Polyline, 389
polynomial kernel, 218
popular music, 359
postal ZIP code, 57
postpruning, 34, 192
precision, 171
predicate calculus, 82
P088407-INDEX.qxd 4/30/05 11:25 AM Page 518
predicting performance, 146–149. See also evaluation
predicting probabilities, 157–161
PredictionAppender, 431
prediction nodes, 329
predictive accuracy in Weka, 420
PredictiveApriori, 420
Preprocess panel, 372, 380
prepruning, 34, 192
presbyopia, 13
preventive maintenance of electromechanical devices, 25–26
principal components, 307–308
PrincipalComponents, 423
principal components analysis, 306–309
principle of multiple explanations, 183
prior knowledge, 349–351
prior probability, 90
PRISM, 110–111, 112, 213
Prism, 409
privacy, 357–358
probabilistic EM procedure, 265–266
probability-based clustering, 262–265
probability cost function, 175
probability density function, 93
programming. See Weka workbench
programming by demonstration, 360
promotional offers, 27
proportional k-interval discretization, 298
propositional calculus, 73, 82
propositional rules, 69
pruning
  classification rules, 203, 205
  decision tree, 192–193, 312
  massive datasets, 348
  model tree, 245–246
  noisy exemplars, 236–237
  overfitting-avoidance bias, 34
  reduced-error, 203
pruning set, 202
pseudocode
  basic rule learner, 111
  model tree, 247–250
  1R, 85
punctuation conventions, 310
Q
quadratic loss function, 158–159, 161
quadratic optimization, 217
Quinlan, J. Ross, 29, 105, 198
R
R. R. Donnelley, 28
RacedIncrementalLogitBoost, 416
race search, 295
RaceSearch, 424
radial basis function (RBF) kernel, 219, 234
radial basis function (RBF) network, 234
RandomCommittee, 415
RandomForest, 407
random forest metalearner in Weka, 416
randomization, 320–321
Randomize, 400
RandomProjection, 400
random projections, 309
RandomSearch, 424
RandomTree, 407
Ranker, 424–425
RankSearch, 424
ratio quantities, 51
RBF (Radial Basis Function) kernel, 219, 234
RBF (Radial Basis Function) network, 234
RBFNetwork, 410
real-life applications. See fielded applications
real-life datasets, 10
real-world implementations. See
implementation—real-world schemes
recall, 171
recall-precision curves, 171–172
Rectangle, 389
rectangular generalizations, 80
recurrent neural networks, 233
recursion, 48
recursive feature elimination, 291, 341
reduced-error pruning, 194, 203
redundant exemplars, 236
regression, 17, 76
RegressionByDiscretization, 418
regression equation, 17
regression tree, 76, 77, 243
reinforcement learning, 38
relational data, 49
relational rules, 74
relations, 73–75
relative absolute error, 177–179
relative error figures, 177–179
relative squared error, 177, 178
RELIEF, 341
ReliefFAttributeEval, 422
religious discrimination, illegal, 35
remoteEngine.jar, 446
remote.policy, 446
Remove, 382
RemoveFolds, 400
RemovePercentage, 401
RemoveRange, 401
RemoveType, 397
RemoveUseless, 397
RemoveWithValues, 401
repeated holdout, 150
ReplaceMissingValues, 396, 398
replicated subtree problem, 66–68
REPTree, 407–408
Resample, 400, 403
residuals, 325
resubstitution error, 145
Ridor, 409
RIPPER rule learner, 205–214
ripple-down rules, 214
robo-soccer, 358
robust regression, 313–314
ROC curve, 168–171, 172
root mean-squared error, 178, 179
root relative squared error, 178, 179
root squared error measures, 177–179
rote learning, 76, 354
row separation, 336
rule
  antecedent, 65
  association, 69–70, 112–119
  classification. See classification rules
  consequent, 65
  decision lists, 111–112
  double-consequent, 118
  exceptions, with, 70–72, 210–213
  good (worthwhile), 202–205
  nearest-neighbor, 78–79
  numeric prediction, 251
  order of (decision list), 67
  partial decision trees, 207–210
  propositional, 73
  relational, 74
  relations, and, 73–75
  single-consequent, 118
  trees, and, 107, 198
  Weka, 408–409
rule-based programming, 82
rules involving relations, 73–75
rules with exceptions, 70–73, 210–213
S
sample problems. See example problems
sampling with replacement, 152
satellite images, evaluating, 23
ScatterPlotMatrix, 430
schemata search, 295
scheme-independent attribute selection, 290–292
scheme-specific attribute selection, 294–296
scientific applications, 28
scoring networks, 277–280, 283
SDR (Standard Deviation Reduction), 245
search bias, 33–34
search engine spam, 357
search methods in Weka, 421, 423–425
segment-challenge.arff, 389
segment-test.arff, 389
Select attributes panel, 392–393
selective Naïve Bayes, 296
semantic relation, 349
semantic Web, 355
semisupervised learning, 337
sensitivity, 173
separate-and-conquer technique, 112, 200
sequential boosting-like scheme, 347
sequential minimal optimization (SMO) algorithm, 410
setOptions(), 482
sexual discrimination, illegal, 35
shapes problem, 73
sigmoid function, 227, 228
sigmoid kernel, 219
Simple CLI, 371, 449, 450
SimpleKMeans, 418–419
simple linear regression, 326
SimpleLinearRegression, 409
SimpleLogistic, 410
simplest-first ordering, 34
simplicity-first methodology, 83, 183
single-attribute evaluators in Weka, 421, 422–423
single-consequent rules, 118
single holdout procedure, 150
sister-of-relation, 46–47
SMO, 410
smoothing
  locally weighted linear regression, 252
  model tree, 244, 251
SMOreg, 410
software programs. See Weka workbench
sorting, avoiding repeated, 190
soybean data, 18–22
spam, 356–357
sparse data, 55–56
sparse instance in Weka, 401
SparseToNonSparse, 401
specificity, 173
specific-to-general search bias, 34
splitData(), 480
splitter nodes, 329
splitting
  clustering, 254–255, 257
  decision tree, 62–63
  entropy-based discretization, 301
  massive datasets, 347
  model tree, 245, 247
  subexperiments, 447
  surrogate, 247
SpreadSubsample, 403
squared-error loss function, 227
squared error measures, 177–179
stacked generalization, 332
stacking, 332–334
Stacking, 417
StackingC, 417
stale data, 60
standard deviation reduction (SDR), 245
standard deviations from the mean, 148
Standardize, 398
standardizing, 56
statistical modeling, 88–97
  document classification, 94–96
  missing values, 92–94
  normal-distribution assumption, 92
  numeric attributes, 92–94
statistics, 29–30
Status box, 380
step function, 227, 228
stochastic algorithms, 348
stochastic backpropagation, 232
stopping criterion, 293, 300, 326
stopwords, 310, 352
stratification, 149, 151
stratified holdout, 149
StratifiedRemoveFolds, 403
stratified cross-validation, 149
StreamableFilter, 456
string attributes, 54–55
string conversion in Weka, 399
string table, 55
StringToNominal, 399
StringToWordVector, 396, 399, 401, 462
StripChart, 431
structural patterns, 6
structure learning by conditional independence tests, 280
student’s distribution with k–1 degrees of freedom, 155
student’s t-test, 154, 184
subexperiments, 447
subsampling in Weka, 400
subset evaluators in Weka, 421, 422
subtree raising, 193, 197
subtree replacement, 192–193, 197
success rate, 173
supervised attribute filters in Weka, 402–403
supervised discretization, 297, 298
supervised filters in Weka, 401–403
supervised instance filters in Weka, 402, 403
supervised learning, 43
support, 69, 113
support vector, 216
support vector machine, 39, 188, 214, 340
support vector machine (SVM) classifier, 341
support vector machines with Gaussian kernels, 234
support vector regression, 219–222
surrogate splitting, 247
SVMAttributeEval, 423
SVM classifier (Support Vector Machine), 341
SwapValues, 398
SymmetricalUncertAttributeEval, 423
symmetric uncertainty, 291
systematic data errors, 59–60
T
tabular input format, 119
TAN (Tree Augmented Naïve Bayes), 279
television preferences/channels, 28–29
tenfold cross-validation, 150, 151
Tertius, 420
test set, 145
TestSetMaker, 431
text mining, 351–356
text summarization, 352
text to attribute vectors, 309–311
TextViewer, 430
TF × IDF, 311
theory, 180
threat detection systems, 357
3-point average recall, 172
threefold cross-validation, 150
ThresholdSelector, 418
time series, 311
TimeSeriesDelta, 400
TimeSeriesTranslate, 396, 399–400
timestamp, 311
TN (True Negatives), 162
tokenization, 310
tokenization in Weka, 399
top-down induction of decision trees, 105
toSource(), 453
toString(), 453, 481, 483
toy problems. See example problems
TP (True Positives), 162
training and testing, 144–146
training set, 296
TrainingSetMaker, 431
TrainTestSplitMaker, 431
transformations. See attribute transformations
transforming a multiclass problem into a two-class one, 334–335
tree
  AD (All Dimensions), 280–283
  alternating decision, 329, 330, 343
  ball, 133–135
  decision. See decision tree
  logistic model, 331
  metric, 136
  model, 76, 243. See also model tree
  numeric prediction, 76
  option, 328–331
  regression, 76, 243
Tree Augmented Naïve Bayes (TAN), 279
tree classifier in Weka, 404, 406–408
tree diagrams, 82
Trees (subpackages), 451, 453
Tree Visualizer, 389, 390
true negative (TN), 162
true positive (TP), 162
true positive rate, 162–163
True positive rate, 378
t-statistic, 156
t-test, 154
TV preferences/channels, 28–29
two-class mixture model, 264
two-class problem, 73
two-tailed test, 156
two-way split, 63
typographic errors, 59
U
ubiquitous data mining, 358–361
unacceptable contracts, 17
Unclassified instances, 377
Undo, 383
unit, 224
univariate decision tree, 199
universal language, 32
unlabeled data, 337–341
  clustering for classification, 337
  co-training, 339–340
  EM and co-training, 340–341
unmasking, 358
unsupervised attribute filters in Weka, 395–400
unsupervised discretization, 297–298
unsupervised instance filters in Weka, 400–401
unsupervised learning, 84
UpdateableClassifier, 456, 482
updateClassifier(), 482
User Classifier, 63–65, 388–391
UserClassifier, 388
user interfaces, 367–368
Use training set, 377
utility, category, 260–262
V
validation data, 146
variance, 154, 317
Venn diagram, 81
very large datasets, 346–349
“Very simple classification rules perform well on most commonly used datasets” (Holte), 88
VFI, 414
visualization components in Weka, 430–431
Visualize classifier errors, 387
Visualize panel, 393
Visualize threshold curve, 378
Vote, 417
voted perceptron, 223
VotedPerceptron, 410
voting, 315, 321, 347
voting feature intervals, 136
W
weak learners, 325
weather problem example, 10–12
  association rules for, 115–117
  attribute space for, 292–293
  as a classification problem, 42
  as a clustering problem, 43–44
  converting data to ARFF format, 370
  cost matrix for, 457
  evaluating attributes in, 85–86
  infinite rules for, 30
  item sets, 113–115
  as a numeric prediction problem, 43–44
web mining, 355–356
weight decay, 233
weighted instances, 252
WeightedInstancesHandler, 482
weighting attributes, 237–238
weighting models, 316
weka.associations, 455
weka.attributeSelection, 455
weka.classifiers, 453
weka.classifiers.bayes.NaiveBayesSimple, 472
weka.classifiers.Classifier, 453
weka.classifiers.lazy.IB1, 472
weka.classifiers.lazy.IBk, 482, 483
weka.classifiers.rules.Prism, 472
weka.classifiers.trees, 453
weka.classifiers.trees.Id3, 471, 472
weka.clusterers, 455
weka.core, 451, 452, 482–483
weka.estimators, 455
weka.filters, 455
Weka workbench, 365–483
  class hierarchy, 471–483
  classifiers, 366, 471–483
  command-line interface, 449–459. See also command-line interface
  elementary learning schemes, 472
  embedded machine learning, 461–469
  example application (classify text files into two categories), 461–469
  Experimenter, 437–447
  Explorer, 369–425. See also Explorer
  implementing classifiers, 471–483
  introduction, 365–368
  Knowledge Flow interface, 427–435
  neural-network GUI, 411
  object editor, 366
  online documentation, 368
  user interfaces, 367–368
William of Occam, 180
Winnow, 410
Winnow algorithm, 126–128
wisdom, defined, 37
Wolpert, David, 334
word conversions, 310
World Wide Web mining, 354–356
wrapper, 290, 341, 355
wrapper induction, 355
WrapperSubsetEval, 422
writing classifiers in Weka, 471–483
Z
0-1 loss function, 158
0.632 bootstrap, 152
1R method, 84–88
zero-frequency problem, 160
zero point, inherently defined, 51
ZeroR, 409
ZIP code, 57
About the Authors
Ian H. Witten is a professor of computer science at the University
of Waikato in New Zealand. He is a fellow of the Association for Computing
Machinery and the Royal Society of New Zealand. He received the 2004 IFIP
Namur Award, a biennial honor accorded for outstanding contribution with
international impact to the awareness of social implications of information and
communication technology. His books include Managing Gigabytes (1999) and
How to Build a Digital Library (2003), and he has written many journal articles
and conference papers.
Eibe Frank is a senior lecturer in computer science at the University of Waikato.
He has published extensively in the area of machine learning and sits on the editorial
boards of the Machine Learning Journal and the Journal of Artificial Intelligence
Research. He has also served on the programming committees of many
data mining and machine learning conferences. As one of the core developers
of the Weka machine learning software that accompanies this book, he enjoys
maintaining and improving it.