Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

Ian H. Witten

I N D E X

numeric prediction (continued)

    model tree, 244–251. See also model tree

    rules, 251

    stacking, 334

    trees, 76, 243

NumericToBinary, 399

NumericTransform, 397

O

O(n), 196

O(n²), 196

Obfuscate, 396, 400

object editor, 366, 381, 393

Occam’s razor, 180, 183

oil slick detection, 23

1R procedure, 84–88, 139

OneR, 408

OneRAttributeEval, 423

one-tailed, 148

online documentation, 368

Open DB, 382

optimizing performance in Weka, 417

OptionHandler, 451, 482

option nodes, 328

option trees, 328–331

orderings

    circular, 349

    partial, 349

order-independent rules, 67, 112

OrdinalClassClassifier, 418

ordinal attributes, 51

ordinal quantities, 50

orthogonal, 307

outer cross-validation, 286

outliers, 313, 342

output

    data engineering, 287–288, 315–341. See also engineering input and output

    knowledge representation, 61–82. See also knowledge representation

overfitting, 86

    Bayesian clustering, 268

    category utility, 261

    forward stagewise additive regression, 326

    MDL principle, 181

    multilayer perceptrons, 233

    1R, 87

    statistical tests, 30

    support vectors, 217–218

overfitting-avoidance bias, 34

overgeneralization, 239, 243

overlapping hyperrectangles, 239

overlay data, 53

pace regression in Weka, 410

PaceRegression, 410

paired t-test, 154, 294

pairwise classification, 123, 410

pairwise coupling, 123

pairwise plots, 60

parabola, 240

parallelization, 347

parameter tuning, 286

Part, 409

partial decision tree, 207–210

partial ordering, 51

partitioning instance space, 79

pattern recognition, 39

Percentage split, 377

perceptron

    defined, 126

    kernel, 223

    learning rule, 124, 125

    linear classification, 124–126

    multilayer, 223–226, 233

    voted, 223

perceptron learning rule, 124, 125

permutation tests, 362

PKIDiscretize, 396, 398

Poisson distribution, 268

Polygon, 389

Polyline, 389

polynomial kernel, 218

popular music, 359

postal ZIP code, 57

postpruning, 34, 192

precision, 171

predicate calculus, 82

predicting performance, 146–149. See also evaluation

predicting probabilities, 157–161

PredictionAppender, 431

prediction nodes, 329

predictive accuracy in Weka, 420

PredictiveApriori, 420

Preprocess panel, 372, 380

prepruning, 34, 192

presbyopia, 13

preventive maintenance of electromechanical devices, 25–26

principal components, 307–308

PrincipalComponents, 423

principal components analysis, 306–309

principle of multiple explanations, 183

prior knowledge, 349–351

prior probability, 90

PRISM, 110–111, 112, 213

Prism, 409

privacy, 357–358

probabilistic EM procedure, 265–266

probability-based clustering, 262–265

probability cost function, 175

probability density function, 93

programming. See Weka workbench

programming by demonstration, 360

promotional offers, 27

proportional k-interval discretization, 298

propositional calculus, 73, 82

propositional rules, 69

pruning

    classification rules, 203, 205

    decision tree, 192–193, 312

    massive datasets, 348

    model tree, 245–246

    noisy exemplars, 236–237

    overfitting-avoidance bias, 34

    reduced-error, 203

pruning set, 202

pseudocode

    basic rule learner, 111

    model tree, 247–250

    1R, 85

punctuation conventions, 310

quadratic loss function, 158–159, 161

quadratic optimization, 217

Quinlan, J. Ross, 29, 105, 198

R. R. Donnelly, 28

RacedIncrementalLogitBoost, 416

race search, 295

RaceSearch, 424

radial basis function (RBF) kernel, 219, 234

radial basis function (RBF) network, 234

RandomCommittee, 415

RandomForest, 407

random forest metalearner in Weka, 416

randomization, 320–321

Randomize, 400

RandomProjection, 400

random projections, 309

RandomSearch, 424

RandomTree, 407

Ranker, 424–425

RankSearch, 424

ratio quantities, 51

RBF (Radial Basis Function) kernel, 219, 234

RBF (Radial Basis Function) network, 234

RBFNetwork, 410

real-life applications. See fielded applications

real-life datasets, 10

real-world implementations. See

implementation—real-world schemes

recall, 171

recall-precision curves, 171–172

Rectangle, 389

rectangular generalizations, 80

recurrent neural networks, 233

recursion, 48

recursive feature elimination, 291, 341

reduced-error pruning, 194, 203

redundant exemplars, 236

regression, 17, 76

RegressionByDiscretization, 418

regression equation, 17

regression tree, 76, 77, 243

reinforcement learning, 38

relational data, 49

relational rules, 74

relations, 73–75

relative absolute error, 177–179

relative error figures, 177–179

relative squared error, 177, 178

RELIEF, 341

ReliefFAttributeEval, 422

religious discrimination, illegal, 35

remoteEngine.jar, 446

remote.policy, 446

Remove, 382

RemoveFolds, 400

RemovePercentage, 401

RemoveRange, 401

RemoveType, 397

RemoveUseless, 397

RemoveWithValues, 401

repeated holdout, 150

ReplaceMissingValues, 396, 398

replicated subtree problem, 66–68

REPTree, 407–408

Resample, 400, 403

residuals, 325

resubstitution error, 145

Ridor, 409

RIPPER rule learner, 205–214

ripple-down rules, 214

robo-soccer, 358

robust regression, 313–314

ROC curve, 168–171, 172

root mean-squared error, 178, 179

root relative squared error, 178, 179

root squared error measures, 177–179

rote learning, 76, 354

row separation, 336

rule

    antecedent, 65

    association, 69–70, 112–119

    classification. See classification rules

    consequent, 65

    decision lists, 111–112

    double-consequent, 118

    exceptions, with, 70–72, 210–213

    good (worthwhile), 202–205

    nearest-neighbor, 78–79

    numeric prediction, 251

    order of (decision list), 67

    partial decision trees, 207–210

    propositional, 73

    relational, 74

    relations, and, 73–75

    single-consequent, 118

    trees, and, 107, 198

    Weka, 408–409

rule-based programming, 82

rules involving relations, 73–75

rules with exceptions, 70–73, 210–213

S

sample problems. See example problems

sampling with replacement, 152

satellite images, evaluating, 23

ScatterPlotMatrix, 430

schemata search, 295

scheme-independent attribute selection, 290–292

scheme-specific attribute selection, 294–296

scientific applications, 28

scoring networks, 277–280, 283

SDR (Standard Deviation Reduction), 245

search bias, 33–34

search engine spam, 357

search methods in Weka, 421, 423–425

segment-challenge.arff, 389

segment-test.arff, 389

Select attributes panel, 392–393

selective Naïve Bayes, 296

semantic relation, 349

semantic Web, 355

semisupervised learning, 337

sensitivity, 173

separate-and-conquer technique, 112, 200

sequential boosting-like scheme, 347

sequential minimal optimization (SMO) algorithm, 410

setOptions(), 482

sexual discrimination, illegal, 35

shapes problem, 73

sigmoid function, 227, 228

sigmoid kernel, 219

Simple CLI, 371, 449, 450

SimpleKMeans, 418–419

simple linear regression, 326

SimpleLinearRegression, 409

SimpleLogistic, 410

simplest-first ordering, 34

simplicity-first methodology, 83, 183

single-attribute evaluators in Weka, 421, 422–423

single-consequent rules, 118

single holdout procedure, 150

sister-of-relation, 46–47

SMO, 410

smoothing

    locally weighted linear regression, 252

    model tree, 244, 251

SMOreg, 410

software programs. See Weka workbench

sorting, avoiding repeated, 190

soybean data, 18–22

spam, 356–357

sparse data, 55–56

sparse instance in Weka, 401

SparseToNonSparse, 401

specificity, 173

specific-to-general search bias, 34

splitData(), 480

splitter nodes, 329

splitting

    clustering, 254–255, 257

    decision tree, 62–63

    entropy-based discretization, 301

    massive datasets, 347

    model tree, 245, 247

    subexperiments, 447

    surrogate, 247

SpreadSubsample, 403

squared-error loss function, 227

squared error measures, 177–179

stacked generalization, 332

stacking, 332–334

Stacking, 417

StackingC, 417

stale data, 60

standard deviation reduction (SDR), 245

standard deviations from the mean, 148

Standardize, 398

standardizing, 56

statistical modeling, 88–97

    document classification, 94–96

    missing values, 92–94

    normal-distribution assumption, 92

    numeric attributes, 92–94

statistics, 29–30

Status box, 380

step function, 227, 228

stochastic algorithms, 348

stochastic backpropagation, 232

stopping criterion, 293, 300, 326

stopwords, 310, 352

stratification, 149, 151

stratified holdout, 149

StratifiedRemoveFolds, 403

stratified cross-validation, 149

StreamableFilter, 456

string attributes, 54–55

string conversion in Weka, 399

string table, 55

StringToNominal, 399

StringToWordVector, 396, 399, 401, 462

StripChart, 431

structural patterns, 6

structure learning by conditional independence tests, 280

student’s distribution with k–1 degrees of freedom, 155

student’s t-test, 154, 184

subexperiments, 447

subsampling in Weka, 400

subset evaluators in Weka, 421, 422

subtree raising, 193, 197

subtree replacement, 192–193, 197

success rate, 173

supervised attribute filters in Weka, 402–403

supervised discretization, 297, 298

supervised filters in Weka, 401–403

supervised instance filters in Weka, 402, 403

supervised learning, 43

support, 69, 113

support vector, 216

support vector machine, 39, 188, 214, 340

support vector machine (SVM) classifier, 341

support vector machines with Gaussian kernels, 234

support vector regression, 219–222

surrogate splitting, 247

SVMAttributeEval, 423

SVM classifier (Support Vector Machine), 341

SwapValues, 398

SymmetricalUncertAttributeEval, 423

symmetric uncertainty, 291

systematic data errors, 59–60

T

tabular input format, 119

TAN (Tree Augmented Naïve Bayes), 279

television preferences/channels, 28–29

tenfold cross-validation, 150, 151

Tertius, 420

test set, 145

TestSetMaker, 431

text mining, 351–356

text summarization, 352

text to attribute vectors, 309–311

TextViewer, 430

TF × IDF, 311

theory, 180

threat detection systems, 357

3-point average recall, 172

threefold cross-validation, 150

ThresholdSelector, 418

time series, 311

TimeSeriesDelta, 400

TimeSeriesTranslate, 396, 399–400

timestamp, 311

TN (True Negatives), 162

tokenization, 310

tokenization in Weka, 399

top-down induction of decision trees, 105

toSource(), 453

toString(), 453, 481, 483

toy problems. See example problems

TP (True Positives), 162

training and testing, 144–146

training set, 296

TrainingSetMaker, 431

TrainTestSplitMaker, 431

transformations. See attribute transformations

transforming a multiclass problem into a two-class one, 334–335

tree

    AD (All Dimensions), 280–283

    alternating decision, 329, 330, 343

    ball, 133–135

    decision. See decision tree

    logistic model, 331

    metric, 136

    model, 76, 243. See also model tree

    numeric prediction, 76

    option, 328–331

    regression, 76, 243

Tree Augmented Naïve Bayes (TAN), 279

tree classifier in Weka, 404, 406–408

tree diagrams, 82

Trees (subpackages), 451, 453

Tree Visualizer, 389, 390

true negative (TN), 162

true positive (TP), 162

true positive rate, 162–163

True positive rate, 378

t-statistic, 156

t-test, 154

TV preferences/channels, 28–29

two-class mixture model, 264

two-class problem, 73

two-tailed test, 156

two-way split, 63

typographic errors, 59

U

ubiquitous data mining, 358–361

unacceptable contracts, 17

Unclassified instances, 377

Undo, 383

unit, 224

univariate decision tree, 199

universal language, 32

unlabeled data, 337–341

    clustering for classification, 337

    co-training, 339–340

    EM and co-training, 340–341

unmasking, 358

unsupervised attribute filters in Weka, 395–400

unsupervised discretization, 297–298

unsupervised instance filters in Weka, 400–401

unsupervised learning, 84

UpdateableClassifier, 456, 482

updateClassifier(), 482

User Classifier, 63–65, 388–391

UserClassifier, 388

user interfaces, 367–368

Use training set, 377

utility, category, 260–262

validation data, 146

variance, 154, 317

Venn diagram, 81

very large datasets, 346–349

“Very simple classification rules perform well on most commonly used datasets” (Holte), 88

VFI, 414

visualization components in Weka, 430–431

Visualize classifier errors, 387

Visualize panel, 393

Visualize threshold curve, 378

Vote, 417

voted perceptron, 223

VotedPerceptron, 410

voting, 315, 321, 347

voting feature intervals, 136

W

weak learners, 325

weather problem example, 10–12

    association rules for, 115–117

    attribute space for, 292–293

    as a classification problem, 42

    as a clustering problem, 43–44

    converting data to ARFF format, 370

    cost matrix for, 457

    evaluating attributes in, 85–86

    infinite rules for, 30

    item sets, 113–115

    as a numeric prediction problem, 43–44

web mining, 355–356

weight decay, 233

weighted instances, 252

WeightedInstancesHandler, 482

weighting attributes, 237–238

weighting models, 316

weka.associations, 455

weka.attributeSelection, 455

weka.classifiers, 453

weka.classifiers.bayes.NaiveBayesSimple, 472

weka.classifiers.Classifier, 453

weka.classifiers.lazy.IB1, 472

weka.classifiers.lazy.IBk, 482, 483

weka.classifiers.rules.Prism, 472

weka.classifiers.trees, 453

weka.classifiers.trees.Id3, 471, 472

weka.clusterers, 455

weka.core, 451, 452, 482–483

weka.estimators, 455

weka.filters, 455

Weka workbench, 365–483

    class hierarchy, 471–483

    classifiers, 366, 471–483

    command-line interface, 449–459. See also command-line interface

    elementary learning schemes, 472

    embedded machine learning, 461–469

    example application (classify text files into two categories), 461–469

    Experimenter, 437–447

    Explorer, 369–425. See also Explorer

    implementing classifiers, 471–483

    introduction, 365–368

    Knowledge Flow interface, 427–435

    neural-network GUI, 411

    object editor, 366

    online documentation, 368

    user interfaces, 367–368

William of Occam, 180

Winnow, 410

Winnow algorithm, 126–128

wisdom, defined, 37

Wolpert, David, 334

word conversions, 310

World Wide Web mining, 354–356

wrapper, 290, 341, 355

wrapper induction, 355

WrapperSubsetEval, 422

writing classifiers in Weka, 471–483

0-1 loss function, 158

0.632 bootstrap, 152

1R method, 84–88

zero-frequency problem, 160

zero point, inherently defined, 51

ZeroR, 409

ZIP code, 57

About the Authors

Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand. He is a fellow of the Association for Computing Machinery and the Royal Society of New Zealand. He received the 2004 IFIP Namur Award, a biennial honor accorded for outstanding contribution with international impact to the awareness of social implications of information and communication technology. His books include Managing gigabytes (1999) and How to build a digital library (2003), and he has written many journal articles and conference papers.

Eibe Frank is a senior lecturer in computer science at the University of Waikato. He has published extensively in the area of machine learning and sits on the editorial boards of the Machine Learning Journal and the Journal of Artificial Intelligence Research. He has also served on the programming committees of many data mining and machine learning conferences. As one of the core developers of the Weka machine learning software that accompanies this book, he enjoys maintaining and improving it.
