Strategies of Data Mining

Data Mining Model Feedback


The true purpose of data mining is to provide information for decision support and, ultimately, for making business decisions based on the provided information. Although data mining is an excellent way to discover information in data, information without action invalidates the purpose of data mining. When designing a data mining model, remember that the goal of the model is to provide insight or predictions for a business scenario.

The use of data mining models to provide information generally falls into two different areas. The most common form of data mining, closed loop data mining, is used to provide long-term business decision support.

There are other business uses for data mining feedback, especially in financial organizations. The process of operational data mining, in which unknown data is viewed through a predictive model to determine the likelihood of a single discrete outcome, is commonly used for loan and credit card applications. In this case, feedback can be reduced to a simple "yes or no" answer. Operational data mining is unique in this respect—it occurs in a real-time situation, often on data that may or may not be first committed to a database.

These actions, however, fall outside the typical scope of the data mining analyst. The goal of the data mining analyst is to make data mining model feedback easily understandable to the business user.

Visualization plays an important role in both the evaluation and feedback of a data mining model: if you cannot relate the information gained from a data mining model to the people who need it, the information might as well not exist. Analysis Services supplies two visualization tools, Data Mining Model Browser and Dependency Network Browser, for data mining model visualization purposes. However, these tools may be incomprehensible to a typical business user and are better suited to the data mining analyst. Numerous visualization tools are available from third-party vendors, and these can provide views of data mining model feedback that are meaningful to the business user. For more information about understanding the information presented in the Data Mining Model Browser and Dependency Network Browser, see "Visualizing Data Mining Models" in this chapter.

Custom client applications developed for data mining visualization have an advantage over external visualization tools in that the method of visualization can be tailored specifically for the intended business audience. For more information about developing custom client applications, see Chapter 25, "Getting Data to the Client."

Predicting with Data Mining Models

The true purpose of a data mining model is to use it as a tool through which data with unknown outcomes can be viewed for the purposes of decision support. Once a data mining model has been constructed and evaluated, a special type of query, known as a prediction query, can be run against it to provide statistical information for unknown data.

However, constructing prediction queries is the least understood step of the data mining process in Analysis Services. The Data Mining Sample Application, shipped with the SQL Server 2000 Resource Kit, is an invaluable tool for constructing and examining prediction queries. You can also use it as an educational tool, because the sample provides access to all of the syntax used for data mining.

Basically, the syntax for a prediction query is similar to that of a standard SQL SELECT query in that the data mining model is queried, from a syntactical point of view, as if it were a typical database view. There are, however, two main differences in the syntax used for a prediction query.

The first difference is the PREDICTION JOIN keyword. A data mining model can only predict on data if data is first supplied to it, and this keyword provides the mechanism used to join unknown data with a data mining model. The SELECT statement performs analysis on the data supplied by the prediction join and returns the results in the form of a recordset. Prediction joins can be used in a variety of ways to support both operational and closed loop data mining.

For example, the following prediction query uses the PREDICTION JOIN keyword to join a rowset, created by the OPENROWSET function from the Customer table in the FoodMart 2000 database, to predict the customers most likely to select a Golden member card.

SELECT
  [MemberData].[customer_id] AS [Customer ID],
  [MemberData].[education] AS [Education],
  [MemberData].[gender] AS [Gender],
  [MemberData].[marital_status] AS [Marital Status],
  [MemberData].[num_children_at_home] AS [Children At Home],
  [MemberData].[total_children] AS [Total Children],
  [MemberData].[yearly_income] AS [Yearly Income]
FROM
  [Member Card RDBMS]
PREDICTION JOIN
  OPENROWSET
  ('Microsoft.Jet.OLEDB.4.0',
   'Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Program Files\Microsoft Analysis Services\samples\FoodMart 2000.mdb;Persist Security Info=False',
   'SELECT
      [customer_id],
      [education],
      [gender],
      [marital_status],
      [num_children_at_home],
      [total_children],
      [yearly_income]
    FROM
      [customer]')
AS
  [MemberData]
ON
  [Member Card RDBMS].[gender] = [MemberData].[gender] AND
  [Member Card RDBMS].[marital status] = [MemberData].[marital_status] AND
  [Member Card RDBMS].[num children at home] = [MemberData].[num_children_at_home] AND
  [Member Card RDBMS].[total children] = [MemberData].[total_children] AND
  [Member Card RDBMS].[yearly income] = [MemberData].[yearly_income] AND
  [Member Card RDBMS].[education] = [MemberData].[education]
WHERE
  [Member Card RDBMS].[member card] = 'Golden' AND
  PREDICTPROBABILITY([Member Card RDBMS].[member card]) > 0.8

The ON keyword links columns from the rowset specified in the PREDICTION JOIN clause to the input attributes defined in the data mining model. In effect, it instructs the data mining model to use the joined columns as input attributes for the prediction process. The WHERE clause restricts the returned cases: in this prediction query, only those cases most likely to select the Golden member card are returned. The PredictProbability data mining function returns the probability that the prediction is correct, also known as the confidence of the prediction, and the query uses it to further restrict the returned cases to those whose confidence level is higher than 80 percent.

The following table represents the results returned from the previous prediction query. The cases represented by the table are the cases most likely to choose the Golden member card, with a confidence level greater than 80 percent.

Customer ID	Education	Gender	Marital Status	Children At Home	Total Children	Yearly Income
105	Bachelor's Degree	M	M	3	3	$150K+
136	Bachelor's Degree	M	M	3	3	$150K+
317	High School Degree	M	M	0	0	$150K+
340	Bachelor's Degree	F	M	0	2	$150K+
343	Bachelor's Degree	F	M	1	2	$150K+
...	...	...	...	...	...	...

This prediction query is a typical example of closed loop data mining. The cases returned by the prediction query can be targeted, for example, for direct promotion of the Golden member card. Alternatively, the actual results of the selected cases can be compared against the predicted results to determine whether the data mining model is indeed achieving an 80 percent or better confidence level of prediction. This provides information that can be used to evaluate the effectiveness of this particular data mining model, by constructing a confusion matrix or by computing the fit of the data mining model against this particular case set. The business decisions that result from reviewing this data affect not just a single case, but a subset of a larger case set, and the effects of such business decisions may take weeks or months to manifest in terms of additional incoming data.
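The comparison of actual against predicted outcomes mentioned above can be sketched in a few lines of Python. This is only an illustration: the outcome lists are hypothetical, the helper names `confusion_matrix` and `accuracy` are invented for this example, and nothing here is part of Analysis Services.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for a set of cases."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of cases whose predicted label equals the actual label."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0


# Hypothetical follow-up data: the card each targeted customer actually chose.
actual = ["Golden", "Golden", "Silver", "Golden", "Bronze"]
# Every targeted case was predicted to choose the Golden card.
predicted = ["Golden"] * 5

matrix = confusion_matrix(actual, predicted)
print(matrix[("Golden", "Golden")], accuracy(matrix))  # 3 0.6
```

If the observed accuracy for the targeted cases falls well below the confidence threshold used in the query, that is a signal to re-examine the model or its training data.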

Data mining models can take data from a variety of sources, provided that the data structure of incoming cases is similar to the data structure of expected cases for the data mining model.

For example, the following prediction query uses the PREDICTION JOIN keyword to link a singleton query (a query that retrieves only one row), with both column and value information explicitly defined within the query, to the Member Card RDBMS data mining model in the FoodMart 2000 database, to predict the type of member card most likely to be selected by a specific customer, as well as the confidence of the prediction.

SELECT
  [Member Card RDBMS].[member card] AS [Member Card],
  (100 * PREDICTPROBABILITY([Member Card RDBMS].[member card])) AS [Confidence Percent]
FROM
  [Member Card RDBMS]
PREDICTION JOIN
  (SELECT 'F' AS Gender, 'M' AS [Marital Status], 3 AS [num children at home],
   '$130K - $150K' AS [yearly income], 'Bachelors Degree' AS education) AS singleton
ON
  [Member Card RDBMS].[gender] = [singleton].[gender] AND
  [Member Card RDBMS].[marital status] = [singleton].[marital status] AND
  [Member Card RDBMS].[num children at home] = [singleton].[num children at home] AND
  [Member Card RDBMS].[yearly income] = [singleton].[yearly income] AND
  [Member Card RDBMS].[education] = [singleton].[education]

The following table shows the result set returned by the previous prediction query. According to the data mining model's analysis of the case defined in the singleton query, the customer is most likely to choose a Golden member card, and the likelihood of that choice is about 63 percent.

Member Card	Confidence Percent
Golden	62.666666666666671

This prediction query is an excellent example of applied prediction in an operational data mining scenario. The case information supplied by the singleton query used in the PREDICTION JOIN clause of the prediction query is not supplied directly from a database; all columns and values are constructed within the singleton query. This information could just as easily have been supplied from the user interface of a client application as from a single database record, and the immediate response of the data mining model allows the client application to respond to this information in real time, immediately affecting incoming data.
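As a rough sketch of how a client application might assemble such a singleton prediction query from values entered in a user interface, consider the following Python helper. The function names are hypothetical, and the escaping rule (doubling single quotes inside string values) is an assumption borrowed from standard SQL rather than something this chapter confirms for Analysis Services, so verify it against the provider before relying on it.

```python
def quote_value(value):
    """Render a value as a query literal.

    Doubling single quotes is an ASSUMED escaping rule (standard SQL style).
    """
    if isinstance(value, bool):
        raise TypeError("boolean inputs are not handled in this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def bracket(name):
    """Wrap an identifier in square brackets, rejecting embedded brackets."""
    if "[" in name or "]" in name:
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"[{name}]"


def build_singleton_query(model, target, case):
    """Build a prediction query for one case given as {column name: value}."""
    if not case:
        raise ValueError("case must contain at least one input column")
    m = bracket(model)
    t = bracket(target)
    columns = ", ".join(
        f"{quote_value(v)} AS {bracket(k)}" for k, v in case.items()
    )
    joins = " AND\n  ".join(
        f"{m}.{bracket(k)} = [singleton].{bracket(k)}" for k in case
    )
    return (
        f"SELECT\n  {m}.{t},\n  PREDICTPROBABILITY({m}.{t})\n"
        f"FROM\n  {m}\n"
        f"PREDICTION JOIN\n  (SELECT {columns}) AS singleton\n"
        f"ON\n  {joins}"
    )


# Values as they might arrive from a form in a client application.
query = build_singleton_query(
    "Member Card RDBMS",
    "member card",
    {"gender": "F", "marital status": "M", "num children at home": 3},
)
print(query)
```

The sketch only builds the query text; sending it to the server and reading the returned rowset is left to whatever data access layer the client application already uses.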

Using Data Mining Functions

In both of the prediction query examples presented earlier, the PredictProbability data mining function is used to provide confidence information on the predictions made by the queries. Other data mining functions are also available, which can be used to provide additional statistical information, such as variance or standard deviation, for cases analyzed through the data mining model.

For example, the previous query can instead use the PredictHistogram function to supply several common statistical measurements about the single case being examined, as demonstrated in the following query.

SELECT
  [Member Card RDBMS].[member card] AS [Predicted Member Card],
  PredictHistogram([Member Card RDBMS].[member card])
FROM
  [Member Card RDBMS]
PREDICTION JOIN
  (SELECT 'F' AS Gender, 'M' AS [Marital Status], 3 AS [num children at home],
   '$130K - $150K' AS [yearly income], 'Bachelors Degree' AS education) AS singleton
ON
  [Member Card RDBMS].[gender] = [singleton].[gender] AND
  [Member Card RDBMS].[marital status] = [singleton].[marital status] AND
  [Member Card RDBMS].[num children at home] = [singleton].[num children at home] AND
  [Member Card RDBMS].[yearly income] = [singleton].[yearly income] AND
  [Member Card RDBMS].[education] = [singleton].[education]

This prediction query returns a recordset that contains the predicted member card, all of the possible member card choices, and the statistical information behind each choice, or histogram, as shown in the following table. The $ADJUSTEDPROBABILITY, $VARIANCE and $STDEV columns, representing the adjusted probability, variance and standard deviation values of the various member card choices, have not been shown in the table due to space limitations.

Predicted Member Card	member card	$SUPPORT	$PROBABILITY	...
Golden	Golden	46	0.62666666666666671
	Silver	14	0.20000000000000001
	Bronze	7	0.10666666666666667
	Normal	3	5.3333333333333337E-2
		0	1.3333333333333334E-2
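A quick arithmetic check on the table above shows how the $SUPPORT and $PROBABILITY columns relate. The Python sketch below is an observation about these particular numbers only: the reported probabilities match (support + 1) / (total support + number of rows) rather than the raw proportion support / total support. Whether Analysis Services computes $PROBABILITY this way in general is not stated in this chapter, so treat the formula as an assumption.

```python
import math

# (member card, $SUPPORT, $PROBABILITY) copied from the table above.
# The empty string stands for the row that has no member card value.
histogram = [
    ("Golden", 46, 0.62666666666666671),
    ("Silver", 14, 0.20000000000000001),
    ("Bronze", 7, 0.10666666666666667),
    ("Normal", 3, 5.3333333333333337e-2),
    ("", 0, 1.3333333333333334e-2),
]

total_support = sum(support for _, support, _ in histogram)  # 70 cases
states = len(histogram)  # 5 rows, including the row with no card value

for card, support, reported in histogram:
    raw = support / total_support
    # Assumed relationship: add one to each count before normalizing.
    smoothed = (support + 1) / (total_support + states)
    label = card or "(missing)"
    print(f"{label:<10} raw={raw:.4f} smoothed={smoothed:.4f} reported={reported:.4f}")
```

One practical consequence is that a choice with zero supporting cases still receives a small nonzero probability, as the last row of the table shows.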

Histogram information can be useful in both operational and closed loop data mining. For example, the previous prediction query indicates that this customer is more than three times as likely to choose the Golden member card as the Silver member card, about twice as likely to select the Silver member card as the Bronze member card, and about four times as likely to select the Silver member card as the Normal member card. The customer service representative, using a client application employing operational data mining, would then be able to rank the various member cards and offer each in turn to the customer based on this histogram information.
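The ranking described above could be done in a client application along the following lines. This is a minimal Python sketch, assuming the histogram has already been read into a dictionary; the probabilities are the table values rounded to four places, and the function names are invented for this example.

```python
# $PROBABILITY values from the histogram table, rounded to four places.
probabilities = {
    "Golden": 0.6267,
    "Silver": 0.2000,
    "Bronze": 0.1067,
    "Normal": 0.0533,
}


def rank_offers(probs):
    """Return card names ordered from most to least likely."""
    return sorted(probs, key=probs.get, reverse=True)


def times_as_likely(probs, first, second):
    """How many times more likely `first` is than `second`."""
    return probs[first] / probs[second]


print(rank_offers(probabilities))
# ['Golden', 'Silver', 'Bronze', 'Normal']
print(round(times_as_likely(probabilities, "Golden", "Silver"), 2))  # 3.13
```

The same ratios for Silver against Bronze and Silver against Normal come out near 1.9 and 3.8, which is where the "about twice" and "about four times" figures come from.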
