Strategies of Data Mining


Data Mining Model Evaluation






After the data mining model has been processed against the training data set, you should have a useful view of historical data. But how accurate is it?

The easiest way to evaluate a newly created data mining model is to perform a predictive analysis against an evaluation case set. Like a training case set, an evaluation case set is a set of data with a known outcome. The data used for the evaluation case set should be different from the data used in the training case set; otherwise, you will find it difficult to confirm the predictive accuracy of the data mining model. Evaluation case sets are often referred to as holdout case sets, and they are typically created at the same time as the training case set so that both can be drawn from the same random sampling process.
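The split described above can be sketched in a few lines of Python; this is an illustrative helper, not part of Analysis Services, and the case data is invented:

```python
import random

def split_cases(cases, holdout_fraction=0.2, seed=42):
    """Randomly partition labeled cases into a training case set and a
    holdout (evaluation) case set drawn from the same sampling process."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

# 100 cases with known outcomes; 80 go to training, 20 are held out.
cases = [{"id": i, "outcome": i % 2} for i in range(100)]
training, holdout = split_cases(cases)
```

Because the two sets are disjoint, the holdout cases are genuinely unseen by the model, which is what makes the subsequent accuracy estimate meaningful.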

Remove or isolate the outcome attributes from the evaluation case set, and then analyze the case set by performing prediction queries against the data mining model. After the analysis is completed, you have a set of predicted outcomes for the evaluation case set that can be compared directly against the known outcomes for the same set, producing an estimate of prediction accuracy. This comparison, commonly referred to as a confusion matrix, is a very simple way of communicating the benefits of a data mining model to business users. Conversely, the confusion matrix can also reveal problems with a data mining model if the comparison is unfavorable. Because a confusion matrix tallies actual and predicted outcomes on a case-by-case basis, it lets you pinpoint exactly where a data mining model is inaccurate.
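Building the comparison is straightforward; the sketch below tallies a confusion matrix from paired actual and predicted outcomes (the outcome labels are hypothetical):

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) outcome pairs, case by case."""
    return Counter(zip(actual, predicted))

actual    = ["fraud", "ok", "ok", "fraud", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok"]
cm = confusion_matrix(actual, predicted)

# Off-diagonal cells pinpoint the inaccuracies: here one "ok" case was
# predicted as "fraud", and one "fraud" case was missed.
misses = sum(n for (a, p), n in cm.items() if a != p)
```

The diagonal cells (where actual equals predicted) divided by the total case count give the overall accuracy discussed later in this section.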

This step can be divided into two parts, depending on the needs of the data mining model. Before evaluating the data mining model, additional training data can be applied to the model to improve its accuracy. This process, called refinement, uses another training case set, called a test case set, to reinforce similar patterns and dilute the interference of irrelevant patterns. Refinement is particularly effective at improving the efficacy of data mining models built with neural network or genetic algorithms. The evaluation case set can then be used to determine the amount of improvement provided by the refinement.

For more information on how to issue prediction queries against a data mining model in Analysis Services, see "Data Mining Model Feedback" later in this chapter.

Calculating Effectiveness


There are several different ways of calculating the effectiveness of a data mining model, based on analysis of the resulting prediction data as compared with actual data. Several of the most common forms of measurement are described in the following section.



Accuracy 

A brute-force measurement, accuracy is the percentage of total predictions that were correct. "Correct" here means that, for discrete prediction attributes, the exact value was returned, or that, for continuous prediction attributes, a value was returned within a predefined threshold established as a criterion for accuracy. For example, predicting the total amount of store sales within a $5,000 threshold could be considered an accurate prediction. 
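As a sketch, accuracy with a threshold criterion for continuous attributes might look like the following; the sales figures are invented for illustration:

```python
def accuracy(actual, predicted, threshold=None):
    """Percentage of correct predictions.  A continuous prediction counts
    as correct when it falls within `threshold` of the actual value; a
    discrete prediction must match the actual value exactly."""
    if threshold is None:
        correct = sum(a == p for a, p in zip(actual, predicted))
    else:
        correct = sum(abs(a - p) <= threshold for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

# Store sales predicted within a $5,000 threshold count as accurate:
# the first prediction is off by $3,000 (correct), the second by $9,000.
rate = accuracy([52000, 61000], [49000, 70000], threshold=5000)  # 50.0
```

The error rate described next is then simply 100 minus this percentage.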





Error Rate 

Another brute-force measurement, the error rate is the percentage of total predictions that were wrong. Typically calculated as 100 minus the accuracy percentage, error rates are often used when accuracy rates are too high to be compared meaningfully. For instance, if the total amount of store sales was predicted correctly 98 percent of the time for the previous year, but 99 percent of the time for the current year, stating the accuracy does not have as much impact as being able to say that the error rate was reduced by 50 percent, although both measurements are true. 





Mean-Squared Error 

A special form of error rate for predictions involving continuous, ordered attributes, the mean-squared error measures the variation between predicted and actual values. Subtracting the actual value from the predicted value and squaring the result gives the squared error for a single prediction; this value is then averaged over all predictions for the same attribute to provide an estimate of variation for a given prediction. The errors are squared to ensure that they are all positive and can be added together when the average is taken, and to weight widely varying predictions more severely. For example, if the prediction for unit sales (in thousands) for one store is 50 and the actual unit sales (in thousands) were 65, the squared error for that prediction would be 65 - 50, or 15, raised to the power of 2, or 225; the mean-squared error is the average of such values across all predictions. Mean-squared error can be used in an iterative manner to establish the accuracy threshold of continuous ordered attributes. 
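The worked example translates directly into code; a minimal sketch:

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# One store: predicted 50 (thousands of units), actual 65.
# The squared error for that prediction is (65 - 50) ** 2 = 225; with a
# single prediction, the mean-squared error is also 225.
mse = mean_squared_error([65], [50])  # 225.0
```

With more stores, each squared error enters the average, so one wildly wrong prediction raises the result far more than several slightly wrong ones.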





Lift 

Simply put, lift is a measurement of how much better (or worse) the data mining model predicted results for a given case set than random selection would have. Lift is typically calculated by dividing the percentage of expected response predicted by the data mining model by the percentage of expected response predicted by random selection. For example, if the normal density of response to a direct mail campaign for a given case set is 10 percent, but focusing on the top quartile of the case set predicted by the data mining model to respond raises the density of response to 30 percent, lift is calculated as 3, or 30/10. 
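Lift can be sketched as the response rate of a model-ranked segment divided by the overall (random-selection) response rate; the scores and outcomes below are invented to reproduce the 30 percent versus 10 percent example:

```python
def lift_at(scores, outcomes, fraction=0.25):
    """Response rate among the top `fraction` of cases ranked by model
    score, divided by the response rate of the whole case set."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:k]) / k
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate

# 40 cases, 4 responders overall (10 percent); the model's top quartile
# of 10 cases captures 3 responders (30 percent), so lift is about 3.
outcomes = [1, 1, 1] + [0] * 17 + [1] + [0] * 19
scores = [40 - i for i in range(40)]
value = lift_at(scores, outcomes)
```
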





Profit 

While profit, or return on investment (ROI), is the best measurement for any business scenario, it is also the most subjective to calculate, because the variables used to calculate it differ for each business scenario. Many business scenarios involving marketing or sales include an ROI calculation; used in combination with lift, a comparison of ROI between the predicted values of the data mining model and the predicted values of random sampling helps determine which subset of cases should be used for the lift calculation. 


Evaluating an Oversampled Model

The primary drawback of oversampling as a technique for selecting training cases is that the resulting data mining model does not directly correspond to the original data set. It instead provides an exaggerated view of the data, so the exaggerated prediction results must be scaled back to match the actual probability of the original data set. For example, suppose the original data set for credit card transactions, in which 0.001 percent of transactions represent "no card" fraudulent transactions, contains 50 million cases. Statistically speaking, this means only 500 transactions within the original data set are fraudulent. So, a training case set is constructed with 100,000 transactions, in which all 500 fraudulent transactions are placed. The density of the fraudulent data has gone up from 0.001 percent to 0.5 percent – still too low, though, for our purposes. So, the training case set is pared down to just 5,000 transactions, raising the density of fraudulent transactions to 10 percent. The training case set now has a different ratio of representation for the non-fraudulent and fraudulent cases: the fraudulent cases still have a one-to-one relationship with the original data set, but each non-fraudulent case in the training case set now represents roughly 10,000 cases in the original data set. This ratio must be reflected when sampling cases for lift calculation.

For example, the above credit card fraud training case set assumes a binary outcome: either fraudulent or non-fraudulent. Because the density of fraudulent cases was increased from 0.001 percent to 10 percent, this ratio must be taken into account when computing lift. Suppose the top 1 percent of cases within the case set has a predicted density of 90 percent fraudulent cases. Against the 10 percent density of fraudulent cases in the training case set, the lift for this top 1 percent, based on the oversampled training case set, is calculated as 9. The original data set, however, has an actual density of only 0.001 percent for fraudulent cases. To scale the predicted density back, the percentage of non-fraudulent cases in the segment, 10, is multiplied by the oversampling ratio of 10,000 defined earlier, added to the percentage of fraudulent cases, 90, and the sum is divided into the predicted density. This gives a calculated predicted density of about 0.09 percent for this selected 1 percent of cases. This calculation is illustrated below.

90 / (90 + (10 * 10000)) = 0.000899, or about 0.09 percent

Once this calculation is performed, you can then calculate the corresponding lift for the original data set by dividing the calculated density by the density of the original set. Since the density of fraudulent cases for the original data set is 0.001 percent, the lift for this selected 1 percent of cases jumps from 9 to about 90.

The calculated lift value for this selected segment of cases seems abnormally high. However, the selected percentage of cases also changes based on the same ratio of densities. Since the 90 percent predicted response rate occurs for the top 1 percent, then the size of this segment decreases because of the ratio of cases between the training case set and the original data set.

A similar calculation is performed to obtain the new size of the selected segment. The density of the fraudulent cases for the segment, 90 percent, is added to the density of the non-fraudulent cases, 10 percent, multiplied by the ratio of cases between the training case set and the original case set, or 10,000. The sum is then divided by the same ratio and multiplied by the actual size of the segment to give the new relative segment size. This calculation is illustrated below.

0.01 * ((90 + (10 * 10000)) / 10000) = 0.10009

So, the lift figure of about 90 applies only to the top 0.10009 percent, or 50,045 cases, of the original case set of 50 million cases, representing a very narrow band of cases at the high end of the lift curve.
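The scaling arithmetic above can be captured in a short helper. `correct_for_oversampling` is a hypothetical name, and the figures follow the credit card example (densities and segment sizes in percent, with each non-fraudulent training case standing in for 10,000 original cases):

```python
def correct_for_oversampling(predicted_density, ratio, segment_size):
    """Scale a segment's predicted density and size, measured on an
    oversampled training case set, back to the original data set.
    `predicted_density` and `segment_size` are percentages; `ratio` is the
    number of original cases each majority-class training case represents."""
    fraud = predicted_density
    non_fraud = 100 - predicted_density
    weight = fraud + non_fraud * ratio            # segment weight in original-case terms
    true_density = 100 * fraud / weight           # percent fraudulent in the original segment
    true_segment = segment_size * weight / (100 * ratio)  # percent of the original data set
    return true_density, true_segment

density, segment = correct_for_oversampling(90, 10_000, 1)
lift = density / 0.001  # against the original 0.001 percent base rate
# density is about 0.09 percent, the segment about 0.10009 percent
# (50,045 of 50 million cases), and the lift about 90.
```

A direct sanity check: the original segment holds 45 fraudulent cases plus 5 * 10,000 non-fraudulent cases, and 45 / 50,045 gives the same density of about 0.09 percent.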

As you can see, oversampling is very useful for obtaining information about rare occurrences within large data sets, but producing accurate figures from an oversampled model can be quite difficult. Oversampling should be used only in specific situations to model extremely rare cases, but for such situations it is an essential tool.

Visualizing Data Mining Models


The visualization tools supplied with Analysis Services are ideal for the evaluation of data mining models. The Data Mining Model Browser and Dependency Network Browser both display the statistical information contained within a data mining model in an understandable graphic format.

The Data Mining Model Browser is used to inspect the structure of a generated data mining model from the viewpoint of a single predictable attribute, to provide insight into the effects input variables have in predicting output variables. Because the most significant input variables appear early within decision tree data mining models, for example, generating a decision tree model and then viewing the structure can provide insight into the most significant input variables to be used in other data mining models.

For example, using the Data Mining Model Browser to view the Member Card RDBMS data mining model presents the following decision tree.

The decision tree is shown from left to right, or from most significant split to least significant split. Just from looking at this decision tree, you can determine that, when predicting the member card attribute, the most significant attribute is yearly income. The next most significant attribute, however, varies depending on the value of the yearly income attribute. For customers whose yearly income is more than $150,000, the next most significant attribute is marital status; for all others, it is num children at home.

The Dependency Network Browser, by contrast, constructs a network-like depiction of the relationships within a data mining model from the viewpoints of all predictable attributes, providing a better understanding of the relationships between attribute values within the domain of cases depicted by the data mining model. The Dependency Network Browser not only shows the relationships between attributes, but ranks the relationships according to the level of significance to a given attribute. The browser can be adjusted to display relationships of a specified significance level across the domain of the data mining model, allowing an informal exploration of the domain itself.

For example, using the Dependency Network Browser to view the Member Card RDBMS data mining model presents the following network of nodes.



All other attributes tend to predict the member card attribute, as indicated by the direction of the arrows between nodes. The slider in the Dependency Network Browser can be used to determine which attributes most influence the member card attribute. Examined in this fashion, the member card attribute proves to be most strongly influenced by the yearly income attribute, then by the num children at home attribute, and finally by the marital status attribute. Note, too, that this coincides with the view provided earlier by the Data Mining Model Browser, in which the decision tree used to predict the member card attribute illustrates the same ranking of attributes.

The network represented in the previous example is based on only a single predictable attribute. The Dependency Network Browser is best used with very complex data mining models involving multiple predictable attributes to better understand the domain represented by the model. You can use the Dependency Network Browser to focus on a single predictable attribute, study its relationship to other attributes within the domain, and then explore the decision tree used to predict the selected attribute and related attributes using the Data Mining Model Browser.

Used in concert, both tools can provide valuable insight into the rules and patterns stored in a data mining model, allowing you to tune the data mining model to the specific needs of the data set to be modeled.


