

Data Mining in Market Research

  • What is data mining?

    • Methods for finding interesting structure in large databases
      • E.g. patterns, prediction rules, unusual cases
    • Focus on efficient, scalable algorithms
      • Contrasts with emphasis on correct inference in statistics
    • Related to data warehousing, machine learning
  • Why is data mining important?

    • Well marketed; now a large industry; pays well
    • Handles large databases directly
    • Can make data analysis more accessible to end users
      • Semi-automation of analysis
      • Results can be easier to interpret than e.g. regression models
      • Strong focus on decisions and their implementation

CRISP-DM Process Model

(Figure: the six phases of the CRISP-DM cycle: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.)

Data Mining Software

  • Many providers of data mining software

    • SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner, MS SQL Server, Polyanalyst, KnowledgeSTUDIO, …
    • See http://www.kdnuggets.com/software/suites.html for a list
    • Good algorithms important, but also need good facilities for handling data and meta-data
  • We’ll use:

    • WEKA (Waikato Environment for Knowledge Analysis)
      • Free (GPLed) Java package with GUI
      • Online at www.cs.waikato.ac.nz/ml/weka
      • Witten and Frank, 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.
    • R packages
      • E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree, mlbench, randomForest, subselect


Data Mining Terms

  • Different names for familiar statistical concepts, from database and AI communities

    • Observation = case, record, instance
    • Variable = field, attribute
    • Analysis of dependence vs interdependence = Supervised vs unsupervised learning
    • Relationship = association, concept
    • Dependent variable = response, output
    • Independent variable = predictor, input


Common Data Mining Techniques

  • Predictive modeling

    • Classification
      • Derive classification rules
      • Decision trees
    • Numeric prediction
      • Regression trees, model trees
  • Association rules

  • Meta-learning methods

    • Cross-validation, bagging, boosting
  • Other data mining methods include:

    • artificial neural networks, genetic algorithms, density estimation, clustering, abstraction, discretisation, visualisation, detecting changes in data or models


Classification

  • Methods for predicting a discrete response

    • One kind of supervised learning
    • Note: in biological and other sciences, classification has long had a different meaning, referring to cluster analysis
  • Applications include:

    • Identifying good prospects for specific marketing or sales efforts
      • Cross-selling, up-selling – when to offer products
      • Customers likely to be especially profitable
      • Customers likely to defect
    • Identifying poor credit risks
    • Diagnosing customer problems


Weather/Game-Playing Data

  • Small dataset

    • 14 instances
    • 5 attributes
      • Outlook - nominal
      • Temperature - numeric
      • Humidity - numeric
      • Wind - nominal
      • Play
        • Whether or not a certain game would be played
        • This is what we want to understand and predict


ARFF file for the weather data.
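The figure is not reproduced here; as a sketch, the ARFF file has this structure (attribute declarations follow Witten and Frank; only the first few @data rows are shown, and their values are illustrative):

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes
    ...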



German Credit Risk Dataset

  • 1000 instances (people), 21 attributes

    • “class” attribute describes people as good or bad credit risks
    • Other attributes include financial information and demographics
      • E.g. checking_status, duration, credit_history, purpose, credit_amount, savings_status, employment, age, housing, job, num_dependents, own_telephone, foreign_worker
  • Want to predict credit risk

  • Data available at UCI machine learning data repository

      • http://www.ics.uci.edu/~mlearn/MLRepository.html
    • and on 747 web page
      • http://www.stat.auckland.ac.nz/~reilly/credit-g.arff


Classification Algorithms

  • Many methods available in WEKA

    • 0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM, Instance-based learner (IB1, IBk), C4.5 (J48), PART, Support vector machine (SMO)
  • Usually train on part of the data, test on the rest

  • Simple method – Zero-rule, or 0R

    • Predict the most common category
      • Class ZeroR in WEKA
    • Too simple for practical use, but a useful baseline for evaluating performance of more complex methods
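A minimal R sketch of this baseline (the weather data frame is illustrative):

    # 0R: always predict the most common response category
    zero_r <- function(y) names(which.max(table(y)))
    # e.g. zero_r(weather$play) gives "yes" (9 yes vs 5 no in the training data)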


1-Rule (1R) Algorithm

  • Based on single predictor

    • Predict mode within each value of that predictor
  • Look at error rate for each predictor on training dataset, and choose best predictor

  • Called OneR in WEKA

  • Must group numerical predictor values for this method

    • Common method is to split at each change in the response
    • Collapse buckets until each contains at least 6 instances
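A minimal R sketch of 1R for nominal predictors (numeric attributes would first be discretised as described above; function names are illustrative):

    # 1R: pick the single predictor whose one-level rules make the
    # fewest training errors, predicting the modal response per value
    one_r <- function(data, response) {
      preds <- setdiff(names(data), response)
      errs <- sapply(preds, function(p) {
        tab <- table(data[[p]], data[[response]])
        sum(tab) - sum(apply(tab, 1, max))   # training misclassifications
      })
      best <- names(which.min(errs))
      rules <- apply(table(data[[best]], data[[response]]), 1,
                     function(r) names(which.max(r)))
      list(attribute = best, rules = rules,
           error_rate = min(errs) / nrow(data))
    }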


1R Algorithm (continued)

  • Biased towards predictors with more categories

    • These can result in over-fitting to the training data
  • But found to perform surprisingly well

    • Study on 16 widely used datasets
      • Holte (1993), Machine Learning 11, 63-91
    • Often the error rate is only a few percentage points higher than that of more sophisticated methods (e.g. decision trees)
    • Produced rules that were much simpler and more easily understood


Naïve Bayes Method

  • Calculates probabilities of each response value, assuming independence of attribute effects

  • Response value with highest probability is predicted

  • Numeric attributes are assumed to follow a normal distribution within each response value

    • Contribution to probability calculated from normal density function
    • Instead can use kernel density estimate, or simply discretise the numerical attributes


Naïve Bayes Calculations

  • Observed counts and conditional probabilities are tabulated from the weather data (table not reproduced here)

    • Temperature and humidity have been discretised
  • Consider new day

    • Outlook = sunny, temperature = cool, humidity = high, windy = true
    • Probability(play = yes) ∝ 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
    • Probability(play = no) ∝ 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
    • Probability(play = no) = 0.0206/(0.0053 + 0.0206) = 79.5%
      • “no” is about four times more likely than “yes”
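These products are easy to check in R:

    p_yes <- 2/9 * 3/9 * 3/9 * 3/9 * 9/14   # 0.0053
    p_no  <- 3/5 * 1/5 * 4/5 * 3/5 * 5/14   # 0.0206
    p_no / (p_yes + p_no)                   # 0.795, i.e. 79.5% for "no"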


Naïve Bayes Method

  • If any of the component probabilities are zero, the whole probability is zero

    • Effectively a veto on that response value
    • Add one to each cell’s count to get around this problem
      • Corresponds to weak positive prior information
  • Naïve Bayes effectively assumes that attributes are equally important

    • Several highly correlated attributes could drown out an important variable that would add new information
  • However this method often works well in practice
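As a sketch, R's e1071 package implements this method, including the add-one (Laplace) correction just described; the weather data frame is illustrative:

    library(e1071)
    fit <- naiveBayes(play ~ ., data = weather, laplace = 1)
    predict(fit, newdata = weather, type = "raw")   # class probabilities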



Decision Trees

  • Classification rules can be expressed in a tree structure

    • Move from the top of the tree, down through various nodes, to the leaves
    • At each node, a decision is made using a simple test based on attribute values
    • The leaf you reach holds the appropriate predicted value
  • Decision trees are appealing and easily used

    • However they can be verbose
    • Depending on the tests being used, they may obscure rather than reveal the true pattern
  • More info online at http://recursive-partitioning.com/



Decision tree with a replicated subtree



Problems with Univariate Splits



Constructing Decision Trees

  • Develop tree recursively

    • Start with all data in one root node
    • Need to choose attribute that defines first split
      • For now, we assume univariate splits are used
    • For accurate predictions, want leaf nodes to be as pure as possible
    • Choose the attribute that maximises the average purity of the daughter nodes
      • The measure of purity used is the entropy of the node
      • This is the amount of information needed to specify the value of an instance in that node, measured in bits
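In symbols, for a node whose instances fall into classes with proportions p1, …, pk:

    entropy = -(p1 log2 p1 + p2 log2 p2 + … + pk log2 pk) bits

This is 0 for a pure node and maximal (log2 k bits) when all classes are equally likely.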


Tree stumps for the weather data



Weather Example

  • The first daughter node from the outlook split is “sunny”, with entropy -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971 bits

  • Average entropy of the nodes from the outlook split is

    • 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693
  • Entropy of root node is 0.940 bits

  • Gain of 0.247 bits

  • Other splits yield:

    • Gain(temperature)=0.029 bits
    • Gain(humidity)=0.152 bits
    • Gain(windy)=0.048 bits
  • So “outlook” is the best attribute to split on
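These figures can be reproduced in R:

    entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
    entropy(c(9, 5) / 14)                       # 0.940 bits, root node
    5/14 * entropy(c(2, 3) / 5) + 4/14 * 0 +
      5/14 * entropy(c(3, 2) / 5)               # 0.693 bits after outlook split
    # gain = 0.940 - 0.693 = 0.247 bits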



Expanded tree stumps for weather data



Decision tree for the weather data



Decision Tree Algorithms

  • The algorithm described in the preceding slides is known as ID3

    • Due to Quinlan (1986)
  • Tends to choose attributes with many values

    • Using the information gain ratio helps correct this bias (see the formula after this list)
  • Several more improvements have been made to handle numeric attributes (via univariate splits), missing values and noisy data (via pruning)

    • Resulting algorithm known as C4.5
      • Described by Quinlan (1993)
    • Widely used (as is the commercial version C5.0)
    • WEKA has a version called J4.8
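The gain ratio mentioned above divides the information gain by the information in the split itself, which penalises attributes with many values:

    gain ratio(A) = gain(A) / split info(A)
    split info(A) = -Σj (|Sj|/|S|) log2(|Sj|/|S|)

where the Sj are the subsets into which attribute A splits the node S.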


Classification Trees

  • Described (along with regression trees) in:

    • L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees.
  • More sophisticated method than ID3

    • However Quinlan’s (1993) C4.5 method caught up with CART in most areas
  • CART also incorporates methods for pruning, missing values and numeric attributes

    • Multivariate splits are possible, as well as univariate
      • Split on a linear combination: Σj cj xj > d
    • CART typically uses Gini measure of node purity to determine best splits
      • For class proportions pk, this has the form Σk pk(1 - pk) = 1 - Σk pk²
    • But information/entropy measure also available


Regression Trees

  • Trees can also be used to predict numeric attributes

    • Predict using average value of the response in the appropriate node
      • Implemented in CART and C4.5 frameworks
    • Can use a model at each node instead
      • Implemented in Weka’s M5’ algorithm
      • Harder to interpret than regression trees
  • Classification and regression trees are implemented in R’s rpart package

    • See Ch 10 in Venables and Ripley, MASS 3rd Ed.


Problems with Trees

  • Can be unnecessarily verbose

  • Structure often unstable

    • “Greedy” hierarchical algorithm
      • Small variations can change chosen splits at high level nodes, which then changes subtree below
      • Conclusions about attribute importance can be unreliable
  • Direct methods tend to overfit training dataset

    • This problem can be reduced by pruning the tree
  • Another approach that often works well is to fit the tree, remove all training cases that are not correctly predicted, and refit the tree on the reduced dataset

    • Typically gives a smaller tree
    • This usually works almost as well on the training data
    • But generalises better, e.g. works better on test data
  • Bagging the tree algorithm also gives more stable results

    • Will discuss bagging later


Classification Tree Example

  • Use Weka’s J4.8 algorithm on German credit data (with default options)

    • 1000 instances, 21 attributes
  • Produces a pruned tree with 140 nodes, 103 leaves



=== Run information ===

  Scheme:       weka.classifiers.j48.J48 -C 0.25 -M 2
  Relation:     german_credit
  Instances:    1000
  Attributes:   21

  Number of Leaves  : 103
  Size of the tree  : 140

  === Stratified cross-validation ===
  === Summary ===

  Correctly Classified Instances         739       73.9   %
  Incorrectly Classified Instances       261       26.1   %
  Kappa statistic                          0.3153
  Mean absolute error                      0.3241
  Root mean squared error                  0.4604
  Relative absolute error                 77.134  %
  Root relative squared error            100.4589 %
  Total Number of Instances             1000

  === Detailed Accuracy By Class ===

  TP Rate   FP Rate   Precision   Recall   F-Measure   Class
   0.883     0.597      0.775      0.883     0.826     good
   0.403     0.117      0.596      0.403     0.481     bad

  === Confusion Matrix ===

     a    b   <-- classified as
   618   82 |  a = good
   179  121 |  b = bad



Cross-Validation

  • Due to over-fitting, prediction error cannot be estimated reliably from performance on the training dataset

  • Cross-validation is a simple and widely used method for estimating prediction error

  • Simple approach

    • Set aside a test dataset
    • Train learner on the remainder (the training dataset)
    • Estimate prediction error by using the resulting prediction model on the test dataset
  • This is only feasible where there is enough data to set aside a test dataset and still have enough to reliably train the learning algorithm



k-fold Cross-Validation

  • For smaller datasets, use k-fold cross-validation

    • Split dataset into k roughly equal parts
    • For each part, train on the other k-1 parts and use this part as the test dataset
    • Do this for each of the k parts, and average the resulting prediction errors
  • This method measures the prediction error when training the learner on a fraction (k-1)/k of the data

  • If k is small, this will overestimate the prediction error

    • k=10 is usually enough
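A minimal sketch of k-fold cross-validation in R, using an rpart classification tree as the learner (function and dataset names are illustrative):

    library(rpart)
    cv_error <- function(data, response, k = 10) {
      folds <- sample(rep(1:k, length.out = nrow(data)))   # random fold labels
      errs <- sapply(1:k, function(i) {
        fit  <- rpart(reformulate(".", response), data = data[folds != i, ])
        pred <- predict(fit, data[folds == i, ], type = "class")
        mean(pred != data[folds == i, response])           # error on fold i
      })
      mean(errs)    # average misclassification rate over the k folds
    }
    # e.g. cv_error(credit, "class") for the German credit data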


Regression Tree Example

  library(rpart)                    # rpart also provides the car.test.frame data
  data(car.test.frame)
  z.auto <- rpart(Mileage ~ Weight, car.test.frame)
  post(z.auto, file = "")           # draw the fitted tree on the current device
  summary(z.auto)





Call:
  rpart(formula = Mileage ~ Weight, data = car.test.frame)
  n= 60

            CP nsplit rel error    xerror       xstd
  1 0.59534912      0 1.0000000 1.0322233 0.17981796
  2 0.13452819      1 0.4046509 0.6081645 0.11371656
  3 0.01282843      2 0.2701227 0.4557341 0.09178782
  4 0.01000000      3 0.2572943 0.4659556 0.09134201

  Node number 1: 60 observations,   complexity param=0.5953491
    mean=24.58333, MSE=22.57639
    left son=2 (45 obs) right son=3 (15 obs)
    Primary splits:
        Weight < 2567.5 to the right, improve=0.5953491, (0 missing)

  Node number 2: 45 observations,   complexity param=0.1345282
    mean=22.46667, MSE=8.026667
    left son=4 (22 obs) right son=5 (23 obs)
    Primary splits:
        Weight < 3087.5 to the right, improve=0.5045118, (0 missing)

  …(continued on next page)…



Node number 3: 15 observations
    mean=30.93333, MSE=12.46222

  Node number 4: 22 observations
    mean=20.40909, MSE=2.78719

  Node number 5: 23 observations,   complexity param=0.01282843
    mean=24.43478, MSE=5.115312
    left son=10 (15 obs) right son=11 (8 obs)
    Primary splits:
        Weight < 2747.5 to the right, improve=0.1476996, (0 missing)

  Node number 10: 15 observations
    mean=23.8, MSE=4.026667

  Node number 11: 8 observations
    mean=25.625, MSE=4.984375



Regression Tree Example (continued)

  plotcp(z.auto)                    # cross-validated error vs complexity parameter
  z2.auto <- prune(z.auto, cp = 0.1)
  post(z2.auto, file = "", cex = 1) # draw the pruned tree



Complexity Parameter Plot





Pruned Regression Tree



Classification Methods

  • Partition the attribute space into decision regions

    • Decision trees: piecewise constant approximation
    • Logistic regression: linear log-odds approximation
    • Discriminant analysis and neural nets: linear & non-linear separators
  • Density estimation coupled with a decision rule

    • E.g. Naïve Bayes
  • Define a metric space and decide based on proximity

    • One type of instance-based learning
    • k-nearest-neighbour methods (see the R sketch after this list)
      • IBk algorithm in Weka
    • Would like to drop noisy and unnecessary points
      • Simple algorithm based on success rate confidence intervals available in Weka
        • Compares naïve prediction with predictions using that instance
        • Must choose suitable acceptance and rejection confidence levels
  • Many of these approaches can produce probability distributions as well as predictions

    • Depending on the application, this information may be useful
      • Such as when results reported to expert (e.g. loan officer) as input to their decision
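A sketch of the k-nearest-neighbour approach using R's class package (iris is used here as a stand-in dataset):

    library(class)
    set.seed(1)
    idx  <- sample(nrow(iris), 100)                  # training cases
    pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
                cl = iris$Species[idx], k = 5)
    mean(pred != iris$Species[-idx])                 # test-set error rate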


Numeric Prediction Methods

  • Linear regression

  • Splines, including smoothing splines and multivariate adaptive regression splines (MARS)

  • Generalised additive models (GAM)

  • Locally weighted regression (lowess, loess)

  • Regression and Model Trees

    • CART, C4.5, M5’
  • Artificial neural networks (ANNs)



Artificial Neural Networks (ANNs)

  • An ANN is a network of many simple processors (units) connected by communication channels that carry numeric data

  • ANNs are very flexible, encompassing nonlinear regression models, discriminant models, and data reduction models

    • They do require some expertise to set up
    • An appropriate architecture needs to be selected and tuned for each application
  • They can be useful tools for learning from examples to find patterns in data and predict outputs

    • However on their own, they tend to overfit the training data
    • Meta-learning tools are needed to choose the best fit
  • Various network architectures in common use

    • Multilayer perceptron (MLP)
    • Radial basis functions (RBF)
    • Self-organising maps (SOM)
  • ANNs have been applied to data editing and imputation, but not widely
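As a sketch, a single-hidden-layer MLP for numeric prediction can be fitted with R's nnet package (the settings below are illustrative; size and decay would normally be tuned, e.g. by cross-validation):

    library(nnet)
    library(rpart)     # for the car.test.frame data
    # rescaling the input helps the optimiser converge
    cars <- transform(car.test.frame, Weight = as.numeric(scale(Weight)))
    fit <- nnet(Mileage ~ Weight, data = cars,
                size = 3,          # 3 hidden units
                linout = TRUE,     # linear output for a numeric response
                decay = 0.01, maxit = 500)
    predict(fit, cars)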



Meta-Learning Methods - Bagging

  • General methods for improving the performance of most learning algorithms

  • Bootstrap aggregation, bagging for short

    • Select B bootstrap samples from the data
      • Sampled with replacement, with the same number of instances as the original dataset
        • Can use parametric or non-parametric bootstrap
    • Fit the model/learner on each bootstrap sample
    • The bagged estimate is the average prediction from all these B models
  • E.g. for a tree learner, the bagged estimate is the average prediction from the resulting B trees

  • Note that this is not a tree

    • In general, bagging a model or learner does not produce a model or learner of the same form
  • Bagging reduces the variance of unstable procedures like regression trees, and can greatly improve prediction accuracy

    • However it does not always work for poor 0-1 predictors
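A minimal sketch of bagging a regression tree in R (function names are illustrative):

    library(rpart)
    bagged_predict <- function(formula, data, newdata, B = 50) {
      preds <- replicate(B, {
        boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap sample
        predict(rpart(formula, data = boot), newdata)
      })
      rowMeans(preds)   # bagged estimate: average over the B trees
    }
    # e.g. bagged_predict(Mileage ~ Weight, car.test.frame, car.test.frame)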


Meta-Learning Methods - Boosting

  • Boosting is a powerful technique for improving accuracy

  • The “AdaBoost.M1” method (for classifiers):

    • Give each instance an initial weight of 1/n
    • For m = 1 to M:
      • Fit a model using the current weights, and store the resulting model Gm
      • If the weighted prediction error rate err is zero or >= 0.5, terminate the loop
      • Otherwise calculate αm = log((1 - err)/err)
        • This is the log odds of success
      • Then multiply the weights of incorrectly classified cases by exp(αm), renormalise the weights, and repeat
    • Predict using a weighted majority vote: each model Gm votes for its predicted class with weight αm, i.e. choose the class maximising Σm αm I(Gm(x) = class)
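A minimal R sketch of this loop, using depth-one rpart trees (“stumps”) as the base learner (function names are illustrative):

    library(rpart)
    ada_boost <- function(formula, data, M = 50) {
      environment(formula) <- environment()   # so rpart can find the weights
      n <- nrow(data)
      w <- rep(1 / n, n)                      # initial weights 1/n
      y <- model.frame(formula, data)[[1]]    # observed response
      models <- list(); alpha <- numeric(0)
      for (m in 1:M) {
        fit  <- rpart(formula, data, weights = w, method = "class",
                      maxdepth = 1)
        miss <- predict(fit, data, type = "class") != y
        err  <- sum(w[miss]) / sum(w)         # weighted error rate
        if (err == 0 || err >= 0.5) break     # stopping rule from the slide
        alpha[m]    <- log((1 - err) / err)   # log odds of success
        w           <- w * ifelse(miss, exp(alpha[m]), 1)
        w           <- w / sum(w)             # renormalise
        models[[m]] <- fit
      }
      # to predict: each models[[m]] votes for its predicted class with
      # weight alpha[m]; the class with the largest total weight wins
      list(models = models, alpha = alpha)
    }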


Meta-Learning Methods - Boosting

  • For example, for the German credit dataset:

    • using 100 iterations of AdaBoost.M1 with the DecisionStump algorithm,
    • 10-fold cross-validation gives an error rate of 24.9% (compared to 26.1% for J4.8)


Association Rules

  • Data on n purchase baskets in form (id, item1, item2, …, itemk)

    • For example, purchases from a supermarket
  • Association rules are statements of the form:

    • “When people buy tea, they also often buy coffee.”
  • May be useful for product placement decisions or cross-selling recommendations

  • We say there is an association rule i1 → i2 if

    • i1 and i2 occur together in at least s% of the n baskets (the support)
    • And at least c% of the baskets containing item i1 also contain i2 (the confidence)
  • The confidence criterion ensures that “often” is a large enough proportion of the antecedent cases to be interesting

  • The support criterion should be large enough that the resulting rules have practical importance

    • Also helps to ensure reliability of the conclusions
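In probability terms, the rule i1 → i2 must satisfy both support = P(i1 and i2) >= s/100 and confidence = P(i2 | i1) = support(i1 and i2) / support(i1) >= c/100. The lift discussed on the next slide is confidence divided by P(i2), so a lift above 1 means buying i1 raises the chance of buying i2.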


Association rules

  • The support/confidence approach is widely used

    • Efficiently implemented in the Apriori algorithm
      • First identify item sets with sufficient support
      • Then turn each item set into sets of rules with sufficient confidence
  • This method was originally developed in the database community, so there has been a focus on efficient methods for large databases

    • “Large” means up to around 100 million instances, and about ten thousand binary attributes
  • However this approach can find a vast number of rules, and it can be difficult to make sense of these

  • One useful extension is to identify only the rules with high enough lift (or odds ratio)
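A sketch using R's arules package, which implements Apriori (the thresholds are illustrative):

    library(arules)
    data(Groceries)    # point-of-sale basket data bundled with arules
    rules <- apriori(Groceries,
                     parameter = list(support = 0.01, confidence = 0.5))
    inspect(head(sort(rules, by = "lift"), 5))   # strongest rules by lift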



Classification vs Association Rules

  • Classification rules predict the value of a pre-specified attribute, e.g.

      • If outlook = sunny and humidity = high then play = no
  • Association rules predict the value of an arbitrary attribute (or combination of attributes)

      • E.g. If temperature = cool then humidity = normal
      • If humidity = normal and play = no then windy = true
      • If temperature = high and humidity = high then play = no


Clustering – EM Algorithm

  • Assume that the data is from a mixture of normal distributions

    • I.e. one normal component for each cluster
  • For simplicity, consider one attribute x and two components or clusters

    • Model has five parameters: (p, μ1, σ1, μ2, σ2) = θ
  • Log-likelihood (φ(·; μ, σ) denotes the normal density):

    • L(θ) = Σi log[ p φ(xi; μ1, σ1) + (1 - p) φ(xi; μ2, σ2) ]

  • This is hard to maximise directly

    • Use the expectation-maximisation (EM) algorithm instead


Clustering – EM Algorithm

  • Think of data as being augmented by a latent 0/1 variable di indicating membership of cluster 1

  • If the values of this variable were known, the (complete-data) log-likelihood would be:

    • Lc(θ) = Σi [ di log(p φ(xi; μ1, σ1)) + (1 - di) log((1 - p) φ(xi; μ2, σ2)) ]

  • Starting with initial values for the parameters, calculate the expected value of di (the E-step):

    • E[di] = p φ(xi; μ1, σ1) / [ p φ(xi; μ1, σ1) + (1 - p) φ(xi; μ2, σ2) ]

  • Then substitute this into the above log-likelihood and maximise to obtain new parameter values

    • This will have increased the log-likelihood
  • Repeat until the log-likelihood converges
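A minimal R sketch of this algorithm for the five-parameter model above (fixed number of iterations; real code would stop when the log-likelihood converges):

    em_mix <- function(x, iter = 100) {
      p  <- 0.5
      mu <- unname(quantile(x, c(0.25, 0.75)))   # crude initial centres
      s  <- rep(sd(x), 2)
      for (it in 1:iter) {
        # E-step: expected membership of component 1 for each case
        f1 <- p * dnorm(x, mu[1], s[1])
        f2 <- (1 - p) * dnorm(x, mu[2], s[2])
        g  <- f1 / (f1 + f2)
        # M-step: weighted updates of the five parameters
        p  <- mean(g)
        mu <- c(weighted.mean(x, g), weighted.mean(x, 1 - g))
        s  <- c(sqrt(weighted.mean((x - mu[1])^2, g)),
                sqrt(weighted.mean((x - mu[2])^2, 1 - g)))
      }
      list(p = p, mu = mu, sigma = s)   # E[di] for each case is g
    }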



Clustering – EM Algorithm

  • Resulting estimates may only be a local maximum

    • Run several times with different starting points to find global maximum (hopefully)
  • With parameter estimates, can calculate segment membership probabilities for each case



Clustering – EM Algorithm

  • Extending to more latent classes is easy

    • Information criteria such as AIC and BIC are often used to decide how many are appropriate
  • Extending to multiple attributes is easy if we assume they are independent, at least conditioning on segment membership

    • It is possible to introduce associations, but this can rapidly increase the number of parameters required
  • Nominal attributes can be accommodated by allowing different discrete distributions in each latent class, and assuming conditional independence between attributes

  • Can extend this approach to handle joint clustering and prediction models, as mentioned in the MVA lectures



Clustering - Scalability Issues

  • k-means algorithm is also widely used

  • However this and the EM-algorithm are slow on large databases

  • So is hierarchical clustering, which requires O(n²) time

  • Iterative clustering methods require full DB scan at each iteration

  • Scalable clustering algorithms are an area of active research

  • A few recent algorithms:

    • Distance-based/k-Means
      • Multi-Resolution kd-Tree for K-Means [PM99]
      • CLIQUE [AGGR98]
      • Scalable K-Means [BFR98a]
      • CLARANS [NH94]
    • Probabilistic/EM
      • Multi-Resolution kd-Tree for EM [Moore99]
      • Scalable EM [BRF98b]
      • CF Kernel Density Estimation [ZRL99]


Ethics of Data Mining

  • Data mining and data warehousing raise ethical and legal issues

  • Combining information via data warehousing could violate the Privacy Act

    • Must tell people how their information will be used when the data is obtained
  • Data mining raises ethical issues mainly during application of results

    • E.g. using ethnicity as a factor in loan approval decisions
    • E.g. screening job applications based on age or sex (where not directly relevant)
    • E.g. declining insurance coverage based on neighbourhood if this is related to race (“red-lining” is illegal in much of the US)
  • Whether something is ethical depends on the application

    • E.g. probably ethical to use ethnicity to diagnose and choose treatments for a medical problem, but not to decline medical insurance



