Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




INDEX

numeric prediction (continued)

    model tree, 244–251. See also model tree

    rules, 251

    stacking, 334

    trees, 76, 243



NumericToBinary, 399

NumericTransform, 397

O

O(n), 196

O(n²), 196



Obfuscate, 396, 400

object editor, 366, 381, 393

Occam’s razor, 180, 183

oil slick detection, 23

1R procedure, 84–88, 139

OneR, 408

OneRAttributeEval, 423

one-tailed, 148

online documentation, 368

Open DB, 382

optimizing performance in Weka, 417



OptionHandler, 451, 482

option nodes, 328

option trees, 328–331

orderings

    circular, 349

    partial, 349

order-independent rules, 67, 112

OrdinalClassClassifier, 418

ordinal attributes, 51

ordinal quantities, 50

orthogonal, 307

outer cross-validation, 286

outliers, 313, 342

output

    data engineering, 287–288, 315–341. See also engineering input and output

    knowledge representation, 61–82. See also knowledge representation

overfitting, 86

    Bayesian clustering, 268

    category utility, 261

    forward stagewise additive regression, 326

    MDL principle, 181

    multilayer perceptrons, 233

    1R, 87

    statistical tests, 30

    support vectors, 217–218

overfitting-avoidance bias, 34

overgeneralization, 239, 243

overlapping hyperrectangles, 239

overlay data, 53



P

pace regression in Weka, 410



PaceRegression, 410

paired t-test, 154, 294

pairwise classification, 123, 410

pairwise coupling, 123

pairwise plots, 60

parabola, 240

parallelization, 347

parameter tuning, 286



Part, 409

partial decision tree, 207–210

partial ordering, 51

partitioning instance space, 79

pattern recognition, 39

Percentage split, 377

perceptron

    defined, 126

    kernel, 223

    learning rule, 124, 125

    linear classification, 124–126

    multilayer, 223–226, 233

    voted, 223

perceptron learning rule, 124, 125

permutation tests, 362



PKIDiscretize, 396, 398

Poisson distribution, 268



Polygon, 389

Polyline, 389

polynomial kernel, 218

popular music, 359

postal ZIP code, 57

postpruning, 34, 192

precision, 171

predicate calculus, 82

P088407-INDEX.qxd  4/30/05  11:25 AM  Page 518






predicting performance, 146–149. See also evaluation

predicting probabilities, 157–161

PredictionAppender, 431

prediction nodes, 329

predictive accuracy in Weka, 420

PredictiveApriori, 420

Preprocess panel, 372, 380

prepruning, 34, 192

presbyopia, 13

preventive maintenance of electromechanical devices, 25–26

principal components, 307–308



PrincipalComponents, 423

principal components analysis, 306–309

principle of multiple explanations, 183

prior knowledge, 349–351

prior probability, 90

PRISM, 110–111, 112, 213



Prism, 409

privacy, 357–358

probabilistic EM procedure, 265–266

probability-based clustering, 262–265

probability cost function, 175

probability density function, 93

programming. See Weka workbench

programming by demonstration, 360

promotional offers, 27

proportional k-interval discretization, 298

propositional calculus, 73, 82

propositional rules, 69

pruning

    classification rules, 203, 205

    decision tree, 192–193, 312

    massive datasets, 348

    model tree, 245–246

    noisy exemplars, 236–237

    overfitting-avoidance bias, 34

    reduced-error, 203

pruning set, 202

pseudocode

    basic rule learner, 111

    model tree, 247–250

    1R, 85

punctuation conventions, 310



Q

quadratic loss function, 158–159, 161

quadratic optimization, 217

Quinlan, J. Ross, 29, 105, 198



R

R. R. Donnelly, 28



RacedIncrementalLogitBoost, 416

race search, 295



RaceSearch, 424

radial basis function (RBF) kernel, 219, 234

radial basis function (RBF) network, 234

RandomCommittee, 415

RandomForest, 407

random forest metalearner in Weka, 416

randomization, 320–321

Randomize, 400

RandomProjection, 400

random projections, 309



RandomSearch, 424

RandomTree, 407

Ranker, 424–425

RankSearch, 424

ratio quantities, 51

RBF (Radial Basis Function) kernel, 219, 234

RBF (Radial Basis Function) network, 234



RBFNetwork, 410

real-life applications. See fielded applications

real-life datasets, 10

real-world implementations. See

implementation—real-world schemes

recall, 171

recall-precision curves, 171–172

Rectangle, 389

rectangular generalizations, 80

recurrent neural networks, 233

recursion, 48

recursive feature elimination, 291, 341

reduced-error pruning, 194, 203

redundant exemplars, 236

regression, 17, 76



RegressionByDiscretization, 418

regression equation, 17

regression tree, 76, 77, 243

reinforcement learning, 38


relational data, 49

relational rules, 74

relations, 73–75

relative absolute error, 177–179

relative error figures, 177–179

relative squared error, 177, 178

RELIEF, 341

ReliefFAttributeEval, 422

religious discrimination, illegal, 35



remoteEngine.jar, 446

remote.policy, 446

Remove, 382

RemoveFolds, 400

RemovePercentage, 401

RemoveRange, 401

RemoveType, 397

RemoveUseless, 397

RemoveWithValues, 401

repeated holdout, 150



ReplaceMissingValues, 396, 398

replicated subtree problem, 66–68



REPTree, 407–408

Resample, 400, 403

residuals, 325

resubstitution error, 145

Ridor, 409

RIPPER rule learner, 205–214

ripple-down rules, 214

robo-soccer, 358

robust regression, 313–314

ROC curve, 168–171, 172

root mean-squared error, 178, 179

root relative squared error, 178, 179

root squared error measures, 177–179

rote learning, 76, 354

row separation, 336

rule

    antecedent, 65

    association, 69–70, 112–119

    classification. See classification rules

    consequent, 65

    decision lists, 111–112

    double-consequent, 118

    exceptions, with, 70–72, 210–213

    good (worthwhile), 202–205

    nearest-neighbor, 78–79

    numeric prediction, 251

    order of (decision list), 67

    partial decision trees, 207–210

    propositional, 73

    relational, 74

    relations, and, 73–75

    single-consequent, 118

    trees, and, 107, 198

    Weka, 408–409

rule-based programming, 82

rules involving relations, 73–75

rules with exceptions, 70–73, 210–213

S

sample problems. See example problems

sampling with replacement, 152

satellite images, evaluating, 23



ScatterPlotMatrix, 430

schemata search, 295

scheme-independent attribute selection, 290–292


scheme-specific attribute selection, 294–296

scientific applications, 28

scoring networks, 277–280, 283

SDR (Standard Deviation Reduction), 245

search bias, 33–34

search engine spam, 357

search methods in Weka, 421, 423–425

segment-challenge.arff, 389

segment-test.arff, 389

Select attributes panel, 392–393

selective Naïve Bayes, 296

semantic relation, 349

semantic Web, 355

semisupervised learning, 337

sensitivity, 173

separate-and-conquer technique, 112, 200

sequential boosting-like scheme, 347

sequential minimal optimization (SMO) algorithm, 410



setOptions(), 482

sexual discrimination, illegal, 35

shapes problem, 73

sigmoid function, 227, 228



sigmoid kernel, 219

Simple CLI, 371, 449, 450

SimpleKMeans, 418–419

simple linear regression, 326



SimpleLinearRegression, 409

SimpleLogistic, 410

simplest-first ordering, 34

simplicity-first methodology, 83, 183

single-attribute evaluators in Weka, 421, 422–423

single-consequent rules, 118



single holdout procedure, 150

sister-of-relation, 46–47



SMO, 410

smoothing

    locally weighted linear regression, 252

    model tree, 244, 251



SMOreg, 410

software programs. See Weka workbench

sorting, avoiding repeated, 190

soybean data, 18–22

spam, 356–357

sparse data, 55–56

sparse instance in Weka, 401

SparseToNonSparse, 401

specificity, 173

specific-to-general search bias, 34

splitData(), 480

splitter nodes, 329

splitting

    clustering, 254–255, 257

    decision tree, 62–63

    entropy-based discretization, 301

    massive datasets, 347

    model tree, 245, 247

    subexperiments, 447

    surrogate, 247



SpreadSubsample, 403

squared-error loss function, 227

squared error measures, 177–179

stacked generalization, 332

stacking, 332–334

Stacking, 417

StackingC, 417

stale data, 60

standard deviation reduction (SDR), 245

standard deviations from the mean, 148



Standardize, 398

standardizing, 56

statistical modeling, 88–97

    document classification, 94–96

    missing values, 92–94

    normal-distribution assumption, 92

    numeric attributes, 92–94

statistics, 29–30



Status box, 380

step function, 227, 228

stochastic algorithms, 348

stochastic backpropagation, 232

stopping criterion, 293, 300, 326

stopwords, 310, 352

stratification, 149, 151

stratified holdout, 149



StratifiedRemoveFolds, 403

stratified cross-validation, 149



StreamableFilter, 456

string attributes, 54–55

string conversion in Weka, 399

string table, 55



StringToNominal, 399

StringToWordVector, 396, 399, 401, 462

StripChart, 431

structural patterns, 6

structure learning by conditional independence tests, 280

student’s distribution with k–1 degrees of freedom, 155

student’s t-test, 154, 184

subexperiments, 447

subsampling in Weka, 400

subset evaluators in Weka, 421, 422

subtree raising, 193, 197

subtree replacement, 192–193, 197

success rate, 173

supervised attribute filters in Weka, 402–403

supervised discretization, 297, 298

supervised filters in Weka, 401–403

supervised instance filters in Weka, 402, 403

supervised learning, 43

support, 69, 113


support vector, 216

support vector machine, 39, 188, 214, 340

support vector machine (SVM) classifier, 341

support vector machines with Gaussian kernels, 234

support vector regression, 219–222

surrogate splitting, 247

SVMAttributeEval, 423

SVM classifier (Support Vector Machine), 341



SwapValues, 398

SymmetricalUncertAttributeEval, 423

symmetric uncertainty, 291

systematic data errors, 59–60

T

tabular input format, 119

TAN (Tree Augmented Naïve Bayes), 279

television preferences/channels, 28–29

tenfold cross-validation, 150, 151

Tertius, 420

test set, 145



TestSetMaker, 431

text mining, 351–356

text summarization, 352

text to attribute vectors, 309–311



TextViewer, 430

TF × IDF, 311

theory, 180

threat detection systems, 357

3-point average recall, 172

threefold cross-validation, 150

ThresholdSelector, 418

time series, 311



TimeSeriesDelta, 400

TimeSeriesTranslate, 396, 399–400

timestamp, 311

TN (True Negatives), 162

tokenization, 310

tokenization in Weka, 399

top-down induction of decision trees, 105



toSource(), 453

toString(), 453, 481, 483

toy problems. See example problems

TP (True Positives), 162

training and testing, 144–146

training set, 296

TrainingSetMaker, 431

TrainTestSplitMaker, 431

transformations. See attribute transformations

transforming a multiclass problem into a two-class one, 334–335

tree

    AD (All Dimensions), 280–283

    alternating decision, 329, 330, 343

    ball, 133–135

    decision. See decision tree

    logistic model, 331

    metric, 136

    model, 76, 243. See also model tree

    numeric prediction, 76

    option, 328–331

    regression, 76, 243

Tree Augmented Naïve Bayes (TAN), 279

tree classifier in Weka, 404, 406–408

tree diagrams, 82



Trees (subpackages), 451, 453

Tree Visualizer, 389, 390

true negative (TN), 162

true positive (TP), 162

true positive rate, 162–163



True positive rate, 378

t-statistic, 156

t-test, 154

TV preferences/channels, 28–29

two-class mixture model, 264

two-class problem, 73

two-tailed test, 156

two-way split, 63

typographic errors, 59

U

ubiquitous data mining, 358–361

unacceptable contracts, 17

Unclassified instances, 377

Undo, 383

unit, 224

univariate decision tree, 199

universal language, 32



unlabeled data, 337–341

    clustering for classification, 337

    co-training, 339–340

    EM and co-training, 340–341

unmasking, 358

unsupervised attribute filters in Weka, 395–400

unsupervised discretization, 297–298

unsupervised instance filters in Weka, 400–401

unsupervised learning, 84

UpdateableClassifier, 456, 482

updateClassifier(), 482

User Classifier, 63–65, 388–391



UserClassifier, 388

user interfaces, 367–368



Use training set, 377

utility, category, 260–262



V

validation data, 146

variance, 154, 317

Venn diagram, 81

very large datasets, 346–349

“Very simple classification rules perform well on most commonly used datasets” (Holte), 88

VFI, 414

visualization components in Weka, 430–431

Visualize classifier errors, 387

Visualize panel, 393

Visualize threshold curve, 378

Vote, 417

voted perceptron, 223



VotedPerceptron, 410

voting, 315, 321, 347

voting feature intervals, 136

W

weak learners, 325

weather problem example, 10–12

    association rules for, 115–117

    attribute space for, 292–293

    as a classification problem, 42

    as a clustering problem, 43–44

    converting data to ARFF format, 370

    cost matrix for, 457

    evaluating attributes in, 85–86

    infinite rules for, 30

    item sets, 113–115

    as a numeric prediction problem, 43–44

web mining, 355–356

weight decay, 233

weighted instances, 252



WeightedInstancesHandler, 482

weighting attributes, 237–238

weighting models, 316

weka.associations, 455

weka.attributeSelection, 455

weka.classifiers, 453

weka.classifiers.bayes.NaiveBayesSimple, 472

weka.classifiers.Classifier, 453

weka.classifiers.lazy.IB1, 472

weka.classifiers.lazy.IBk, 482, 483

weka.classifiers.rules.Prism, 472

weka.classifiers.trees, 453

weka.classifiers.trees.Id3, 471, 472

weka.clusterers, 455

weka.core, 451, 452, 482–483

weka.estimators, 455

weka.filters, 455

Weka workbench, 365–483

    class hierarchy, 471–483

    classifiers, 366, 471–483

    command-line interface, 449–459. See also command-line interface

    elementary learning schemes, 472

    embedded machine learning, 461–469

    example application (classify text files into two categories), 461–469

    Experimenter, 437–447

    Explorer, 369–425. See also Explorer

    implementing classifiers, 471–483

    introduction, 365–368

    Knowledge Flow interface, 427–435

    neural-network GUI, 411

    object editor, 366

    online documentation, 368

    user interfaces, 367–368

William of Occam, 180




Winnow, 410

Winnow algorithm, 126–128

wisdom, defined, 37

Wolpert, David, 334

word conversions, 310

World Wide Web mining, 354–356

wrapper, 290, 341, 355

wrapper induction, 355



WrapperSubsetEval, 422

writing classifiers in Weka, 471–483



Z

0-1 loss function, 158

0.632 bootstrap, 152

1R method, 84–88

zero-frequency problem, 160

zero point, inherently defined, 51



ZeroR, 409

ZIP code, 57




About the Authors

Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand. He is a fellow of the Association for Computing Machinery and the Royal Society of New Zealand. He received the 2004 IFIP Namur Award, a biennial honor accorded for outstanding contribution with international impact to the awareness of social implications of information and communication technology. His books include Managing gigabytes (1999) and How to build a digital library (2003), and he has written many journal articles and conference papers.

Eibe Frank is a senior lecturer in computer science at the University of Waikato. He has published extensively in the area of machine learning and sits on the editorial boards of the Machine Learning Journal and the Journal of Artificial Intelligence Research. He has also served on the programming committees of many data mining and machine learning conferences. As one of the core developers of the Weka machine learning software that accompanies this book, he enjoys maintaining and improving it.


