CHAPTER THIRTEEN:
EVALUATION AND DEPLOYMENT
HOW FAR WE’VE COME
The purpose of this book, which was explained in Chapter 1, is to introduce non-experts and non-
computer scientists to some of the methods and tools of data mining. Certainly there have been a
number of processes, tools, operators, data manipulation techniques, etc., demonstrated in this
book, but perhaps the most important lesson to take away from this broad treatment of data
mining is that the field has become huge, complex, and dynamic. You have learned about the
CRISP-DM process, and had it shown to you numerous times as you have seen data mining
models that classified, predicted and did both. You have seen a number of data processing tools
and techniques, and as you have done this, you have hopefully noticed the myriad other operators
in RapidMiner that we did not use or discuss. Although you may be feeling like you’re getting
good at data mining (and we hope you do), please recognize that there is a world of data mining
that this book has not touched on—so there is still much for you to learn.
This chapter and the next will discuss some cautions that should be taken before putting any real-
world data mining results into practice. This chapter will demonstrate a method for using
RapidMiner to conduct some validation for data mining models, while Chapter 14 will discuss the
choices you will make as a data miner, and some ways to guide those choices in good directions.
Remember from Chapter 1 that CRISP-DM is cyclical—you should always be learning from the
work you are doing, and feeding what you’ve learned from your work back into your next data
mining activity.
For example, suppose you used a Replace Missing Values operator in a data mining model to set all
missing values in a data set to the average for each attribute. Suppose further that you used results
of that data mining model in making decisions for your company, and that those decisions turned
out to be less than ideal. Suppose you then traced those decisions back to your data mining
activities and found that, by using the average, you had made some general assumptions that
weren’t really very realistic. Perhaps you don’t need to throw out the data mining model entirely,
but for the next run
of that model you should be sure to change it to either remove observations with missing values,
or use a more appropriate replacement value based upon what you have learned. Even if you used
your data mining results and had excellent outcomes, remember that your business is constantly
moving, and through the day-to-day operations of your organization, you are gathering more data.
Be sure to add this data to training data sets, compare actual outcomes to predictions, and tune
your data mining models in accordance with your experience and the expertise you are developing.
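The tradeoff just described, replacing missing values with an attribute’s average versus removing those observations entirely, can be sketched in a few lines. The example below uses Python’s pandas library with made-up attribute names and numbers (both are assumptions for illustration; the book’s own examples use RapidMiner’s Replace Missing Values operator):

```python
# A sketch of the two missing-value strategies discussed above.
# Attribute names and values are hypothetical.
import pandas as pd
import numpy as np

data = pd.DataFrame({
    "Insulation": [4, 6, 8, 10],
    "Oil_Usage": [150.0, np.nan, 210.0, 240.0],  # one missing value
})

# Strategy 1: replace the missing value with the attribute's average,
# as a Replace Missing Values operator might be configured to do.
filled = data.fillna({"Oil_Usage": data["Oil_Usage"].mean()})

# Strategy 2: remove observations with missing values instead.
dropped = data.dropna()
```

Which strategy is more realistic depends on the data: an average can mask genuinely unusual homes, while dropping rows shrinks an already small training set.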
Consider Sarah, our hypothetical sales manager from Chapters 4 and 8. Certainly now that we’ve
helped her predict heating oil usage by home through a linear regression model, Sarah can track
these homes’
actual heating oil orders to see how well their actual use matches our predictions.
Once these customers have established several months or years of actual heating oil consumption,
their data can be fed into the training data set for Sarah’s model, helping it become even more accurate in
its predictions.
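Sarah’s feedback loop can also be sketched in code. The snippet below is a hypothetical illustration, not part of Sarah’s actual RapidMiner process: it uses numpy’s polyfit as a stand-in for a linear regression operator, folds newly observed consumption into the training data, and refits.

```python
# Sketch of the feedback loop: append actual outcomes to the training
# data and refit the linear regression. All numbers are hypothetical.
import numpy as np

# Original training data: insulation rating -> annual oil usage.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([260.0, 220.0, 180.0, 140.0])

# New actual consumption observed after deployment.
x_new = np.array([5.0, 7.0])
y_new = np.array([200.0, 160.0])

# Fold the new observations into the training set and refit the line.
slope, intercept = np.polyfit(np.concatenate([x, x_new]),
                              np.concatenate([y, y_new]), deg=1)
```

In a real deployment the refit would happen on the full, growing data set each time the model is re-run, which is exactly what the CRISP-DM cycle encourages.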
One of the benefits of connecting RapidMiner to a database or data warehouse, rather than
importing data via a file (CSV, etc.), is that data can be added to the data sets in real time and fed
straight into the RapidMiner models. If you were to acquire some new training data, as Sarah
could in the scenario just proposed, it could be immediately incorporated into the RapidMiner
model if the data were in a connected database.
With a CSV file,
the new training data would have to be added into the file, and then re-imported into the
RapidMiner repository.
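As a rough illustration of this advantage, consider a model that queries a connected database each time it runs. The sketch below uses Python’s built-in sqlite3 module with hypothetical table and column names; a production data warehouse connection would differ, but the principle is the same: newly inserted rows are visible to the next query with no re-import step.

```python
# Sketch of the database advantage: new training rows written to the
# database appear in the model's next query automatically.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oil_usage (home_id INTEGER, gallons REAL)")
conn.execute("INSERT INTO oil_usage VALUES (1, 180.0), (2, 210.0)")

# Later, a new actual consumption record arrives and is simply inserted...
conn.execute("INSERT INTO oil_usage VALUES (3, 195.0)")

# ...and the next model run's query picks it up with no re-import.
rows = conn.execute("SELECT COUNT(*) FROM oil_usage").fetchone()[0]
```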
As we tune and hone our models, they perform better for us. In addition to using our growing
expertise and adding more training data, there are some built-in ways that we can check a model’s
performance in RapidMiner.
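One such check, cross-validation, is developed in the next section using RapidMiner’s operators. For readers curious how the same idea looks outside RapidMiner, here is a minimal sketch using Python’s scikit-learn library (an assumption for illustration only; it is not part of this book’s toolset). The training data are split into folds, and a decision tree is repeatedly trained on most of the folds and scored on the one held out.

```python
# A minimal cross-validation sketch with scikit-learn, using its
# bundled iris data set as a stand-in for a real training data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: train on 9 folds, score on the 10th,
# rotating until every fold has served as the test set once.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
```

The ten scores give a sense of how the model performs on data it was not trained on, which is the same question RapidMiner’s validation operators answer.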
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what cross-validation is, and discuss its role in the Evaluation and Deployment
phases of CRISP-DM.
Define false positives and explain why their existence is not all bad in data mining.
Perform a cross-validation on a training data set in RapidMiner.
Interpret and discuss the results of a cross-validation matrix.
CROSS-VALIDATION
Cross-validation is the process of partitioning a training data set into complementary subsets,
building the model on some of the subsets and testing it on the ones that were held out, in order
to estimate how well the model will predict values it has not yet seen. Among other things, this
helps us gauge the likelihood of false positives in our predictive models. Most data mining
software products, RapidMiner included, have operators for cross-validation and for other forms
of false positive detection. A false positive is when a model incorrectly predicts that a condition
is present (for example, predicting ‘yes’ for a record whose actual value turns out to be ‘no’). We
will give one example here, using the decision tree we built for our hypothetical client Richard,
back in Chapter 10. Complete the following steps:
1) Open RapidMiner and start a new, blank data mining process.
2) Go to the Repositories tab and locate the Chapter 10 training data set. This was the one
that had attributes regarding people’s buying habits on Richard’s employer’s web site, along
with their category of eReader adoption. Drag this data set into your main process
window. You can rename it if you would like. In Figure 13-1, we have renamed it eReader
Train.
Figure 13-1. Adding the Chapter 10 training data to a new model in order to
cross-validate its predictive capabilities.
3) Add a Set Role operator to the stream. We’ll learn a new trick here with this operator. Set
the User_ID attribute to be ‘id’. We know we still need to set eReader_Adoption to be