Automation of Data Mining Using Integration Services


Run, Debug, and Audit Packages





That’s it – the packages contain all the tools you need to dynamically create and update multiple related data mining models.

The packages for this scenario have been designed so that you can run them individually. We recommend that you run each package on its own at least once, to get a feel for what each package produces. Later you can add logging to the packages to track errors, or build a parent package that runs them in sequence by using Execute Package tasks.


Phase 5 - Analyze and Report


Now that you have a set of predictions for multiple models, you are probably anxious to see the trends, and to analyze differences.

Using the Data Mining Viewer


The quickest way to view individual models is by using the Data Mining viewers. The Microsoft Time Series Viewer (http://technet.microsoft.com/en-us/library/ms175331.aspx) is particularly handy because it combines the historical data with the predictions for each series, and it displays error bars for the predictions.

However, some users are not comfortable with using Business Intelligence Development Studio. Even if you use the Data Mining Add-ins for Microsoft Office, which provide a Microsoft Visio viewer and a time series browser, the amount of detail in the time series viewer can be overwhelming.

In contrast, analysts typically want even more detail, including statistics embedded in the model content, together with the metadata you captured about the source and the model parameters. It’s impossible to please everyone!

Fortunately Reporting Services lets you pick the data you want, add extra data sets and linked reports, filter, and group, so you can create reports that meet the needs of each set of users.


Using Reporting Services for Data Mining Results


Our requirements for the basic report were as follows:

  • All related models should be in one chart, for quick comparison.

  • To simplify comparisons, the predictions for Amount and Quantity should be presented separately.

  • The results should be separable by model and region.

  • We need to compare predictions for the same periods, for multiple models.

  • Rather than working with multiple sources of data, we would like all data in a relational store.

Additional requirements might include:

  • A chart showing historical values along with predictions.

  • Statistics derived from comparison of prediction values.

  • Metadata about each model in a linked report.

As the analyst, you might want even more detail:

  • First and last dates used for training each model

  • List of the algorithm parameters and pattern formulas

  • Descriptive statistics that summarize the variability and range of source data in each of the series, or across series

However, for the purposes of this walkthrough, there is already plenty of detail for comparing models. You can always add this data later and present it in linked reports.
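Much of this metadata is exposed by Analysis Services itself, so it can be collected with a query rather than tracked by hand. The following is only a sketch and is not part of the walkthrough packages; it queries the DMSCHEMA_MINING_MODELS schema rowset, whose columns include the algorithm name, the parameter string, and processing information for every model in the database.

SELECT MODEL_NAME, SERVICE_NAME, LAST_PROCESSED, MINING_PARAMETERS
FROM $system.DMSCHEMA_MINING_MODELS

The results could be saved to the same relational store as the predictions and then surfaced in a linked report.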

The following graphic shows the Reporting Services report that compares the prediction results for each model:



Notice that you can configure a report to show all kinds of information in ToolTips—in this example, as you pause the mouse over a prediction, you see the standard deviation and variance for the predictions.
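In Report Designer, the ToolTip is simply an expression set on the chart series. A minimal sketch, assuming the prediction data set exposes fields named StDevAmount and VarianceAmount (these field names are hypothetical and depend on how you shaped the data):

="StDev: " & FormatNumber(Fields!StDevAmount.Value, 2) & "   Variance: " & FormatNumber(Fields!VarianceAmount.Value, 2)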



The next graphic shows a series of charts that have been placed in a matrix. By using a matrix, you can create a set of filtered charts. This series of graphs shows predictions for Amount for all models.




[Matrix of charts - columns: M200, R750, T1000; rows: Europe, North America, Pacific. Each cell contains the Amount prediction chart for that combination.]

Interpreting the Results


If you are familiar with time series modeling, a few trends jump out at you just from scanning these charts:

  • Some of the ARIMA models produce extreme predictions for certain series; possibly the periodicity hint is to blame.

  • Predictions in ARTXP models cut off at a certain point in many series. This is expected, because ARTXP detects when a model becomes unstable and stops making predictions once they are no longer reliable.

  • You would expect the MIXED models to generally perform better, because they combine the best characteristics of ARTXP and ARIMA. Indeed, they seem more reliable, though you would want to verify that.

The following trend lines are interesting because they illustrate some problems you might see with models. Results like these might indicate that the data is bad, that there is not enough of it, or that it is too variable to fit.

[Charts: R750 Europe (Amount), R750 Europe (Quantity), R250 North America (Amount), R250 North America (Quantity)]

When you see wildly varying trends from models on the same data, you should of course re-examine the model parameters, but you might also use cross-prediction or aggregate your data differently, to avoid being influenced too strongly by a single data series:

  • With cross-prediction, you can build a reliable model from aggregated data or from a series with solid data, and then make predictions based on that model for all series. ARTXP models and mixed models support cross-prediction (a DMX sketch follows this list).

  • If you do not have enough data to meaningfully analyze each region or product line separately, you might get better results by aggregating by product or region (or both) and creating predictions from the aggregate model.
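For reference, cross-prediction is just another DMX prediction query against the model you trust. The following is only a sketch, with hypothetical model, view, and column names; it shows the general shape: PredictTimeSeries with the REPLACE_MODEL_CASES argument, plus a PREDICTION JOIN that supplies the series whose history should be substituted for the model's own cases.

// Hypothetical names - adjust to your own mining model and source view.
SELECT
  [Forecasting MIXED].[Model Region],
  PredictTimeSeries([Forecasting MIXED].[Amount], 6, REPLACE_MODEL_CASES) AS PredictedAmount
FROM
  [Forecasting MIXED]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT [ModelRegion], [TimeIndex], [Amount] FROM vTimeSeries') AS t
ON
  [Forecasting MIXED].[Model Region] = t.[ModelRegion] AND
  [Forecasting MIXED].[Time Index] = t.[TimeIndex] AND
  [Forecasting MIXED].[Amount] = t.[Amount]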

Discussion


Data mining can be a labor-intensive process. From data acquisition and preparation to modeling, testing, and exploration of the results, much effort is needed to ensure that the data supports the intended analysis and that the output of the model is meaningful.

Some parts of the model-building process will always require human intervention – understanding the results, for example, requires careful review by an expert who can assess whether the numbers make sense.

However, by automating some part of the data mining process, Integration Services can not only speed the process, but also potentially improve the results. For example, if you don’t know which mixture of algorithms produces the best results, or what the possible time cycles are in your data, you can use automation to experiment.

Moreover, there are benefits beyond simple time saving.


The Case for Ensemble Models


Automation supports the creation of ensemble models. Ensemble models (roughly speaking) are models that are built on the same data, but use different methods of analysis.

Typically the results of multiple models are compared and/or combined, to yield results that are superior to those of any single model. Assessing multiple models for the same prediction task is now considered a best practice, even with the best data. Some reasons that are often cited for using ensemble models include:



  • Avoiding overfitting. When a model is too closely patterned on a specific training set, you get great predictions on a test data set that matches the training data, and a lot of variability when you try the model out on real-world data.

  • Variable selection. Each algorithm type processes data differently and delivers different insights. Rather than compare predictions, as we did here, you might use one algorithm to identify the most important variables or to prescreen for correlations that can mask other more interesting patterns.

There has been much research in recent years on the best methods for combining the estimates from ensemble models — merging, bagging, voting, averaging, weighting by posterior evidence, gating, and so forth. A discussion of ensemble models is beyond the scope of this paper, and you will note that we did not attempt to combine prediction results in this paper; we only presented them for comparison. However, we encourage you to read the linked resources to learn more about these techniques.

Closing the Loop: Interpreting and Getting Feedback on Models


Now that you have summarized the results in an easy-to-read report, what’s next? Typically, you just think of more questions to answer!

  • What about internal promotions or known events? Have we eliminated known correlations?

  • Local and cultural events can significantly affect sales of particular products. Rather than expect the same seasonality in multiple regions, should we separate the regions for modeling?

  • Should we choose a time series algorithm that can account for the effects of random or cyclical external events, or do we want to smooth data to find overall trends?

  • Can we compare these projection graphs with projections produced by the traditional business method of year-to-date comparison?

  • How do these predictions compare to percentage increases targeted by the business?

Fortunately, because you have created an extensible framework for incorporating data mining in analysis using Integration Services, it will be relatively easy to collect more data, update models, and refine your presentation.

Conclusion


This paper introduced a framework for automation of data mining, with results saved to a relational data store, to encourage a systematic approach to predictive analytics.

This walkthrough showed that it is relatively easy to set up Integration Services packages that create data mining models and generate predictions from them. A framework like the one demonstrated here could be extended to support further parameterization, encourage the use of ensemble models, and incorporate data mining in other analytic workflows.


Resources


[1] Jamie MacLennan: Walkthrough of SQL Server 2005 Integration Services for data mining

http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=96&Id=338

[2] Microsoft Research: Ensemble models

http://academic.research.microsoft.com/Paper/588724.aspx

[3] Reporting Services tutorials

http://msdn.microsoft.com/en-us/library/bb522859.aspx

[4] Michael Ohler: Assessing forecast accuracy

http://www.isixsigma.com/index.php?option=com_k2&view=item&id=1550:assessing-forecast-accuracy-be-prepared-rain-or-shine&Itemid=1&tmpl=component&print=1

[5] Statistical methods for assessing mining models

http://ms-olap.blogspot.com/2010/12/do-you-trust-your-data-mining-results.html

[6] John Maindonald: Data Mining from a Statistical Perspective

http://maths.anu.edu.au/~johnm/dm/dmpaper.html


Acknowledgements


I am indebted to my coworkers for their assistance and encouragement. Carla Sabotta (technical writer, Integration Services) provided invaluable feedback on the steps in each of the SSIS packages, ensuring that I didn’t leave out anything. Ranjeeta Nanda of the Integration Services test team kindly reviewed the code in the Script task. Mary Lingel (technical writer, Reporting Services) took my complex data source and developed a set of reports that made it look simple.

Code for Script Task


The following code can be added to the Script task to change values in the XML model definition. This very simple sample was written using VB.NET, but the Script task supports C# as well.

A number of message boxes have been added to verify that the task is processing the XML as expected; you would eventually comment these out. You would also want to add string length checking and other validation to prevent DMX injection (a sketch of such validation follows the listing). The MessageBox calls require the System.Windows.Forms namespace.

Public Sub Main()

    'Get the base XMLA definition and create a new copy that will hold the output.
    Dim strXMLABaseDef As String = Dts.Variables("strBaseXMLA").Value.ToString
    Dim strXMLANewDef As String = strXMLABaseDef

    'Create local variables and fill them with the values from the SQL query.
    Dim txtModelID As String = Dts.Variables("strModelID").Value.ToString
    Dim txtModelName As String = Dts.Variables("strModelName").Value.ToString
    Dim txtForecastMethod As String = Dts.Variables("strForecastMethod").Value.ToString
    Dim txtPeriodicityHint As String = Dts.Variables("strPeriodicityHint").Value.ToString

    'First, update the base XMLA with the new model ID and model name.
    'The base definition contains these elements:
    '  <ID>ForecastingDefault</ID>
    '  <Name>ForecastingDefault</Name>
    Dim txtNewID As String = "<ID>" & txtModelID & "</ID>"
    Dim txtNewName As String = "<Name>" & txtModelName & "</Name>"

    'Insert the new values.
    strXMLANewDef = strXMLANewDef.Replace("<ID>ForecastingDefault</ID>", txtNewID)
    strXMLANewDef = strXMLANewDef.Replace("<Name>ForecastingDefault</Name>", txtNewName)

    'Display the new model ID and name - for troubleshooting only.
    MessageBox.Show(strXMLANewDef, "Verify new model ID and name")

    'Create temporary variables for the replacement operations.
    Dim strParameterName As String = ""
    Dim strParameterValue As String = ""

    'Update the value of FORECAST_METHOD. Because all possible values
    '(MIXED, ARIMA, ARTXP) have exactly five characters, a simple replace works.
    strParameterName = "FORECAST_METHOD"
    strParameterValue = "MIXED" 'default value
    If strXMLABaseDef.Contains(strParameterValue) Then
        'Replace the default value MIXED with the value from the SQL Server query.
        strXMLANewDef = strXMLANewDef.Replace(strParameterValue, txtForecastMethod)
        'Display the forecast method - for troubleshooting only.
        MessageBox.Show(strXMLANewDef, "Check Forecast Method", MessageBoxButtons.OK)
    Else
        MessageBox.Show("The XMLA definition does not include the parameter " & strParameterName, _
            "Problem with base XMLA", MessageBoxButtons.OK)
    End If

    'Look for a PERIODICITY_HINT value.
    strParameterName = "PERIODICITY_HINT"
    strParameterValue = "{1}" 'default value
    If strXMLABaseDef.Contains(strParameterName) Then
        'The hint is enclosed in braces, such as {1} or {12}.
        'Extract whatever value the base definition currently contains.
        Dim StartString As Integer = strXMLABaseDef.IndexOf("{")
        Dim EndString As Integer = strXMLABaseDef.IndexOf("}")
        strParameterValue = strXMLABaseDef.Substring(StartString, EndString - StartString + 1)
        'Replace the current hint with the value from the variable.
        strXMLANewDef = strXMLANewDef.Replace(strParameterValue, txtPeriodicityHint)
        'Display the periodicity hint - for troubleshooting only.
        MessageBox.Show(strXMLANewDef, "Check Periodicity Hint", MessageBoxButtons.OK)
    Else
        MessageBox.Show("The XMLA definition does not include the parameter " & strParameterName, _
            "Problem with base XMLA", MessageBoxButtons.OK)
    End If

    'Save the completed definition to the package variable.
    Dts.Variables("strModelXMLA").Value = strXMLANewDef

    Dts.TaskResult = ScriptResults.Success

End Sub
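As a sketch of the validation mentioned earlier, the following lines could be added right after the variables are read from the package. The variable names match the sample above; the allowed character set and the length limit are assumptions that you should adapt to your own naming rules.

'Reject values that are too long or contain characters that should never
'appear in a model name, forecast method, or periodicity hint (assumed rules).
Dim rxSafe As New System.Text.RegularExpressions.Regex("^[A-Za-z0-9 _{},-]{1,100}$")
For Each strValue As String In New String() {txtModelID, txtModelName, txtForecastMethod, txtPeriodicityHint}
    If Not rxSafe.IsMatch(strValue) Then
        Dts.Events.FireError(0, "Script Task", _
            "Unexpected characters or length in parameter value: " & strValue, String.Empty, 0)
        Dts.TaskResult = ScriptResults.Failure
        Return
    End If
Next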

For more information:

http://www.microsoft.com/sqlserver/: SQL Server Web site

http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter

http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter



Did this paper help you? Please give us your feedback. On a scale of 1 (poor) to 5 (excellent), how would you rate this paper, and why? For example:

  • Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?

  • Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?

This feedback will help us improve the quality of white papers we release.

Send feedback.