INDEX
CSVLoader, 381
cumulative margin distribution in Weka, 458
curves
  cost, 173
  lift, 166
  recall-precision, 171
  ROC, 168
customer support and service, 28
cutoff parameter, 260
CVParameterSelection, 417
cybersecurity, 29
D
dairy farmers (New Zealand), 3–4, 37, 161–162
data assembly, 52–53
data cleaning, 52–60. See also automatic data cleansing
data engineering. See engineering input and output
data integration, 52
data mining, 4–5, 9
data ownership rights, 35
data preparation, 52–60
data transformation. See attribute transformations
DataVisualizer, 389, 390, 430
data warehouse, 52–53
date attributes, 55
decision list, 11, 67
decision nodes, 328
decision stump, 325
DecisionStump, 407, 453, 454
decision table, 62, 295
DecisionTable, 408
decision tree, 14, 62–65, 97–105
  complexity of induction, 196
  converting to rules, 198
  data cleaning, 312–313
  error rates, 192–196
  highly branching attributes, 102–105
  missing values, 63, 191–192
  multiclass case, 107
  multivariate, 199
  nominal attribute, 62
  numeric attribute, 62, 189–191
  partial, 207–210
  pruning, 192–193, 312
  replicated subtree, 66
  rules, 198
  subtree raising, 193, 197
  subtree replacement, 192–193, 197
  three-way split, 63
  top-down induction, 97–105, 196–198
  two-way split, 62
  univariate, 199
  Weka, 406–408
  Weka’s User Classifier facility, 63–65
Decorate, 416
deduction, 350
default rule, 110
degrees of freedom, 93, 155
delta, 311
dendrograms, 82
denormalization, 47
density function, 93
diagnosis, 25–26
dichotomy, 51
directed acyclic graph, 272
direct marketing, 27
discrete attributes, 50. See also nominal attributes
Discretize, 396, 398, 402
discretizing numeric attributes, 287, 296–305
  chi-squared test, 302
  converting discrete to numeric attributes, 304–305
  entropy-based discretization, 298–302
  error-based discretization, 302–304
  global discretization, 297
  local discretization, 297
  supervised discretization, 297, 298
  unsupervised discretization, 297–298
  Weka, 398
disjunction, 32, 65
disjunctive normal form, 69
distance functions, 128–129, 239–242
distributed experiments in Weka, 445
distribution, 304
distributionForInstance(), 453, 481
divide-and-conquer. See decision tree
P088407-INDEX.qxd 4/30/05 11:25 AM Page 510
document classification, 94–96, 352–353
document clustering, 353
domain knowledge, 20, 33, 349–351
double-consequent rules, 118
duplicate data, 59
dynamic programming, 302
E
early stopping, 233
easy instances, 322
ecological applications, 23, 28
eigenvalue, 307
eigenvector, 307
Einstein, Albert, 180
electricity supply, 24–25
electromechanical diagnosis application, 144
11-point average recall, 172
EM, 418
EM algorithm, 265–266
EM and co-training, 340–341
EM procedure, 337–338
embedded machine learning, 461–469
engineering input and output, 285–343
  attribute selection, 288–296
  combining multiple models, 315–336
  data cleansing, 312–315
  discretizing numeric attributes, 296–305
  unlabeled data, 337–341
  See also individual subject headings
entity extraction, 353
entropy, 102
entropy-based discretization, 298–302
enumerated attributes, 50. See also nominal attributes
enumerating the concept space, 31–32
Epicurus, 183
epoch, 412
equal-frequency binning, 298
equal-interval binning, 298
equal-width binning, 342
erroneous values, 59
error-based discretization, 302–304
error-correcting output codes, 334–336
error log, 378
error rate
  bias, 317
  cost of errors. See cost of errors
  decision tree, 192–196
  defined, 144
  training data, 145
“Essay towards solving a problem in the doctrine of chances, An” (Bayes), 141
ethics, 35–37
Euclidean distance, 78, 128, 129, 237
evaluation, 143–185
  bootstrap procedure, 152–153
  comparing data mining methods, 153–157
  cost of errors, 161–176. See also cost of errors
  cross-validation, 149–152
  leave-one-out cross-validation, 151–152
  MDL principle, 179–184
  numeric prediction, 176–179
  predicting performance, 146–149
  predicting probabilities, 157–161
  training and testing, 144–146
evaluation(), 482
evaluation components in Weka, 430, 431
Evaluation panel, 431
example problems
  contact lens data, 6, 13–15
  CPU performance data, 16–17
  iris dataset, 15–16
  labor negotiations data, 17–18, 19
  soybean data, 18–22
  weather problem, 10–12
exceptions, 70–73, 210–213
exclusive-or problem, 67
exemplar
  defined, 236
  generalized, 238–239
  noisy, 236–237
  redundant, 236
exemplar generalization, 238–239, 243
ExhaustiveSearch, 424
Expand all paths, 408
expectation, 265, 267
expected error, 174
expected success rate, 147