5 Credibility: Evaluating what's been learned 143
5.1 Training and testing 144
5.2 Predicting performance 146
5.3 Cross-validation 149
5.4 Other estimates 151
    Leave-one-out 151
    The bootstrap 152
5.5 Comparing data mining methods 153
5.6 Predicting probabilities 157
    Quadratic loss function 158
    Informational loss function 159
    Discussion 160
5.7 Counting the cost 161
    Cost-sensitive classification 164
    Cost-sensitive learning 165
    Lift charts 166
    ROC curves 168
    Recall–precision curves 171
    Discussion 172
    Cost curves 173
5.8 Evaluating numeric prediction 176
5.9 The minimum description length principle 179
5.10 Applying the MDL principle to clustering 183
5.11 Further reading 184

6 Implementations: Real machine learning schemes 187
6.1 Decision trees 189
    Numeric attributes 189
    Missing values 191
    Pruning 192
    Estimating error rates 193
    Complexity of decision tree induction 196
    From trees to rules 198
    C4.5: Choices and options 198
    Discussion 199
6.2 Classification rules 200
    Criteria for choosing tests 200
    Missing values, numeric attributes 201
    Generating good rules 202
    Using global optimization 205
    Obtaining rules from partial decision trees 207
    Rules with exceptions 210
    Discussion 213
6.3 Extending linear models 214
    The maximum margin hyperplane 215
    Nonlinear class boundaries 217
    Support vector regression 219
    The kernel perceptron 222
    Multilayer perceptrons 223
    Discussion 235
6.4 Instance-based learning 235
    Reducing the number of exemplars 236
    Pruning noisy exemplars 236
    Weighting attributes 237
    Generalizing exemplars 238
    Distance functions for generalized exemplars 239
    Generalized distance functions 241
    Discussion 242
6.5 Numeric prediction 243
    Model trees 244
    Building the tree 245
    Pruning the tree 245
    Nominal attributes 246
    Missing values 246
    Pseudocode for model tree induction 247
    Rules from model trees 250
    Locally weighted linear regression 251
    Discussion 253
6.6 Clustering 254
    Choosing the number of clusters 254
    Incremental clustering 255
    Category utility 260
    Probability-based clustering 262
    The EM algorithm 265
    Extending the mixture model 266
    Bayesian clustering 268
    Discussion 270
6.7 Bayesian networks 271
    Making predictions 272
    Learning Bayesian networks 276
    Specific algorithms 278
    Data structures for fast learning 280
    Discussion 283

7 Transformations: Engineering the input and output 285
7.1 Attribute selection 288
    Scheme-independent selection 290
    Searching the attribute space 292
    Scheme-specific selection 294
7.2 Discretizing numeric attributes 296
    Unsupervised discretization 297
    Entropy-based discretization 298
    Other discretization methods 302
    Entropy-based versus error-based discretization 302
    Converting discrete to numeric attributes 304
7.3 Some useful transformations 305
    Principal components analysis 306
    Random projections 309
    Text to attribute vectors 309
    Time series 311
7.4 Automatic data cleansing 312
    Improving decision trees 312
    Robust regression 313
    Detecting anomalies 314
7.5 Combining multiple models 315
    Bagging 316
    Bagging with costs 319
    Randomization 320
    Boosting 321
    Additive regression 325
    Additive logistic regression 327
    Option trees 328
    Logistic model trees 331
    Stacking 332
    Error-correcting output codes 334
7.6 Using unlabeled data 337
    Clustering for classification 337
    Co-training 339
    EM and co-training 340
7.7 Further reading 341