CHAPTER THIRTEEN:
EVALUATION AND DEPLOYMENT
HOW FAR WE’VE COME
The purpose of this book, which was explained in Chapter 1, is to introduce non-experts and non-
computer scientists to some of the methods and tools of data mining. Certainly there have been a
number of processes, tools, operators, data manipulation techniques, etc., demonstrated in this
book, but perhaps the most important lesson to take away from this broad treatment of data
mining is that the field has become huge, complex, and dynamic. You have learned about the
CRISP-DM process, and had it shown to you numerous times as you have seen data mining
models that classified, predicted and did both. You have seen a number of data processing tools
and techniques, and as you have done this, you have hopefully noticed the myriad other operators
in RapidMiner that we did not use or discuss. Although you may be feeling like you’re getting
good at data mining (and we hope you do), please recognize that there is a world of data mining
that this book has not touched on—so there is still much for you to learn.
This chapter and the next will discuss some cautions that should be taken before putting any real-
world data mining results into practice. This chapter will demonstrate a method for using
RapidMiner to conduct some validation for data mining models, while Chapter 14 will discuss the
choices you will make as a data miner, and some ways to guide those choices in good directions.
Remember from Chapter 1 that CRISP-DM is cyclical—you should always be learning from the
work you are doing, and feeding what you’ve learned from your work back into your next data
mining activity.
For example, suppose you used a Replace Missing Values operator in a data mining model to set all
missing values in a data set to the average for each attribute. Suppose further that you used results
of that data mining model in making decisions for your company, and that those decisions turned
out to be less than ideal. Suppose you then traced those decisions back to your data mining
activities and found that, by using the average, you had made some general assumptions that
weren’t really very realistic. Perhaps you don’t need to throw out the data mining model entirely,
but for the next run
of that model you should be sure to change it to either remove observations with missing values,
or use a more appropriate replacement value based upon what you have learned. Even if you used
your data mining results and had excellent outcomes, remember that your business is constantly
moving, and through the day-to-day operations of your organization, you are gathering more data.
Be sure to add this data to training data sets, compare actual outcomes to predictions, and tune
your data mining models in accordance with your experience and the expertise you are developing.
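The tradeoff just described, replacing missing values with an attribute’s average versus removing those observations entirely, can be sketched in a few lines. The example below uses Python’s pandas library with made-up attribute names and numbers (both are assumptions for illustration; the book’s own examples use RapidMiner’s Replace Missing Values operator):

```python
# A sketch of the two missing-value strategies discussed above.
# Attribute names and values are hypothetical.
import pandas as pd
import numpy as np

data = pd.DataFrame({
    "Insulation": [4, 6, 8, 10],
    "Oil_Usage": [150.0, np.nan, 210.0, 240.0],  # one missing value
})

# Strategy 1: replace the missing value with the attribute's average,
# as a Replace Missing Values operator might be configured to do.
filled = data.fillna({"Oil_Usage": data["Oil_Usage"].mean()})

# Strategy 2: remove observations with missing values instead.
dropped = data.dropna()
```

Which strategy is more realistic depends on the data: an average can mask genuinely unusual homes, while dropping rows shrinks an already small training set.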
Consider Sarah, our hypothetical sales manager from Chapters 4 and 8. Certainly now that we’ve
helped her predict heating oil usage by home through a linear regression model, Sarah can track
these homes’
actual heating oil orders to see how well their actual use matches our predictions.
Once these customers have established several months or years of actual heating oil consumption,
their data can be fed into the training data set for Sarah’s model, helping it become even more accurate in
its predictions.
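Sarah’s feedback loop can also be sketched in code. The snippet below is a hypothetical illustration, not part of Sarah’s actual RapidMiner process: it uses numpy’s polyfit as a stand-in for a linear regression operator, folds newly observed consumption into the training data, and refits.

```python
# Sketch of the feedback loop: append actual outcomes to the training
# data and refit the linear regression. All numbers are hypothetical.
import numpy as np

# Original training data: insulation rating -> annual oil usage.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([260.0, 220.0, 180.0, 140.0])

# New actual consumption observed after deployment.
x_new = np.array([5.0, 7.0])
y_new = np.array([200.0, 160.0])

# Fold the new observations into the training set and refit the line.
slope, intercept = np.polyfit(np.concatenate([x, x_new]),
                              np.concatenate([y, y_new]), deg=1)
```

In a real deployment the refit would happen on the full, growing data set each time the model is re-run, which is exactly what the CRISP-DM cycle encourages.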
One of the benefits of connecting RapidMiner to a database or data warehouse, rather than
importing data via a file (CSV, etc.), is that data can be added to the data sets in real time and fed
straight into the RapidMiner models. If you were to acquire some new training data, as Sarah
could in the scenario just proposed, it could be immediately incorporated into the RapidMiner
model if the data were in a connected database.
With a CSV file,
the new training data would have to be added into the file, and then re-imported into the
RapidMiner repository.
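As a rough illustration of this advantage, consider a model that queries a connected database each time it runs. The sketch below uses Python’s built-in sqlite3 module with hypothetical table and column names; a production data warehouse connection would differ, but the principle is the same: newly inserted rows are visible to the next query with no re-import step.

```python
# Sketch of the database advantage: new training rows written to the
# database appear in the model's next query automatically.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oil_usage (home_id INTEGER, gallons REAL)")
conn.execute("INSERT INTO oil_usage VALUES (1, 180.0), (2, 210.0)")

# Later, a new actual consumption record arrives and is simply inserted...
conn.execute("INSERT INTO oil_usage VALUES (3, 195.0)")

# ...and the next model run's query picks it up with no re-import.
rows = conn.execute("SELECT COUNT(*) FROM oil_usage").fetchone()[0]
```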
As we tune and hone our models, they perform better for us. In addition to using our growing
expertise and adding more training data, there are some built-in ways that we can check a model’s
performance in RapidMiner.
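One such check, cross-validation, is developed in the next section using RapidMiner’s operators. For readers curious how the same idea looks outside RapidMiner, here is a minimal sketch using Python’s scikit-learn library (an assumption for illustration only; it is not part of this book’s toolset). The training data are split into folds, and a decision tree is repeatedly trained on most of the folds and scored on the one held out.

```python
# A minimal cross-validation sketch with scikit-learn, using its
# bundled iris data set as a stand-in for a real training data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: train on 9 folds, score on the 10th,
# rotating until every fold has served as the test set once.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
```

The ten scores give a sense of how the model performs on data it was not trained on, which is the same question RapidMiner’s validation operators answer.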
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what cross-validation is, and discuss its role in the Evaluation and Deployment
phases of CRISP-DM.
Define false positives and explain why their existence is not all bad in data mining.
Perform a cross-validation on a training data set in RapidMiner.
Interpret and discuss the results of a cross-validation matrix.
CROSS-VALIDATION
Cross-validation is the process of partitioning a training data set into complementary subsets,
building the model on some of the subsets and testing it on the ones that were held out, in order
to estimate how well the model will predict values it has not yet seen. Among other things, this
helps us gauge the likelihood of false positives in our predictive models. Most data mining
software products, RapidMiner included, have operators for cross-validation and for other forms
of false positive detection. A false positive is when a model incorrectly predicts that a condition
is present (for example, predicting ‘yes’ for a record whose actual value turns out to be ‘no’). We
will give one example here, using the decision tree we built for our hypothetical client Richard,
back in Chapter 10. Complete the following steps:
1) Open RapidMiner and start a new, blank data mining process.
2) Go to the Repositories tab and locate the Chapter 10 training data set. This was the one
that had attributes regarding people’s buying habits on Richard’s employer’s web site, along
with their category of eReader adoption. Drag this data set into your main process
window. You can rename it if you would like. In Figure 13-1, we have renamed it eReader
Train.
Figure 13-1. Adding the Chapter 10 training data to a new model in order to
cross-validate its predictive capabilities.
3) Add a Set Role operator to the stream. We’ll learn a new trick here with this operator. Set
the User_ID attribute to be ‘id’. We know we still need to set eReader_Adoption to be