Strategies of Data Mining


Data Mining Model Construction





The construction of a data mining model consists of selecting a data mining algorithm provider that matches the desired data mining approach, setting its parameters as desired, and executing the algorithm provider against a training case set. This, in turn, generates a set of values that reflects one or more statistical views on the behavior of the training case set. This statistical view is later used to provide insights into similar case sets with unknown outcomes.

This may sound simple, but the act of constructing a data mining model is much more than mere mechanical execution. The approach you use can make the difference between an accurate but useless data mining model and a somewhat accurate but very useful data mining model.

Your domain expert, the business person who provides guidance into the data you are modeling, should be able to give you enough information to decide on an approach to data mining. The approach, in turn, assists in deciding the algorithm and cases to be modeled.

You should view the data mining model construction process as a process of exploration and discovery. There is no one formula for constructing a data mining model; experimentation and evaluation are key steps in the construction process, and a data mining process for a specific business scenario can go through several iterations before an effective data mining model is constructed.


Model-Driven and Data-Driven Data Mining


The two schools of thought on decision support techniques serve as the endpoints of a spectrum, with many decision support techniques incorporating principles from both schools. Data warehousing, OLAP, and data mining break down into multiple components. Depending on the methodology and purpose of the component, each has a place in this spectrum.

This section focuses on the various methods and purposes of data mining. The following diagram illustrates some of these components and their approximate place in this spectrum.



After data has been selected, actual data mining is usually broken down into the following tasks:





Classification 

Classification is the process of using the attributes of a case to assign it to a predefined class. For example, customers can be classified at various risk levels for mortgage loan applications. Classification is best used when a finite set of classes can be defined—classes defined as high risk, medium risk, or low risk can be used to classify all customers in the previous example. 
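Classification can be pictured as a function from case attributes to one of a finite set of predefined classes. The following toy sketch illustrates the shape of the task for the mortgage-risk example; the income and debt-ratio thresholds are invented for illustration and are not drawn from any real scoring model.

```python
# Toy classification sketch: case attributes map to one of a finite,
# predefined set of classes. Thresholds below are invented, not real.

def classify_risk(income, debt_ratio):
    # The class set is finite and predefined: high, medium, or low risk.
    if debt_ratio > 0.5 or income < 20000:
        return "high risk"
    if debt_ratio > 0.3:
        return "medium risk"
    return "low risk"

label = classify_risk(50000, 0.4)  # falls into the "medium risk" class
```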





Estimation 

While classification is used to answer questions from a finite set of classes, estimation is best used when the answer lies within an unknown, continuous set of answers. For example, census tract information can be used to predict household incomes. Classification and estimation techniques are often combined within a data mining model. 





Association 

Association is the process of determining the affinity of cases within a case set, based on similarity of attributes. Simply put, association determines which cases belong together in a case set. Association can be used to determine which products should be grouped together on store shelves, or which services are most useful to package for cross-selling opportunities. 
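In its simplest form, association starts from co-occurrence: counting how often items appear together across cases. The following sketch uses invented basket data to illustrate the idea; real association algorithms also apply support and confidence thresholds rather than raw counts alone.

```python
# Association sketch: count how often pairs of products appear together
# across baskets, to suggest which products belong together.
from itertools import combinations
from collections import Counter

def pair_counts(baskets):
    counts = Counter()
    for basket in baskets:
        # Sort and deduplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [["bread", "butter", "milk"],
           ["bread", "butter"],
           ["milk", "cereal"]]
counts = pair_counts(baskets)
# ("bread", "butter") is the most frequent pair, appearing in two baskets.
```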





Clustering 

Clustering is the process of finding groups in scattered cases, breaking a single, diverse set of cases into several subsets of similar cases based on the similarity of attributes. Clustering is similar to classification, except that clustering does not require a finite set of predefined classes; clustering simply groups data according to the patterns and rules inherent in the data. 



Each of these tasks will be discussed in detail later in this chapter. Classification and estimation are typically represented as model-driven tasks, while association and clustering are associated more often with data-driven tasks. Visualization, the process of viewing data mining results in a meaningful and understandable manner, is used for all data mining techniques, and is discussed in a later section.
Model-Driven Data Mining

Model-driven data mining, also known as directed data mining, is the use of classification and estimation techniques to derive a model from data with a known outcome, which is then used to fulfill a specific business scenario. The model is then compared against data with an unknown outcome to determine the likelihood that such data will satisfy the same business scenario. For example, a common illustration of directed data mining is account "churning," the tendency of users to change or cancel accounts. Generally speaking, the data mining model drives the process in model-driven data mining. Classification and estimation are typically categorized as model-driven data mining techniques.

This approach is best employed when a clear business scenario can be employed against a large body of known historical data to construct a predictive data mining model. This tends to be the "I know what I don't know" approach: you have a good idea of the business scenarios to be modeled, and have solid data illustrating such scenarios, but are not sure about the outcome itself or the relationships that lead to this outcome. Model-driven data mining is treated as a "black box" operation, in which the user cares less about the model and more about the predictive results that can be obtained by viewing data through the model.


Data-Driven Data Mining

Data-driven data mining is used to discover the relationships between attributes in unknown data, with or without known data with which to compare the outcome. There may or may not be a specific business scenario. Clustering and association, for example, are primarily data-driven data mining techniques. In data-driven data mining, the data itself drives the data mining process.

This approach is best employed in situations in which true data discovery is needed to uncover rules and patterns in unknown data. This tends to be the "I don't know what I don't know" approach: you can discover significant attributes and patterns in a diverse set of data without using training data or a predefined business scenario. Data-driven data mining is treated as a "white box" operation, in which the user is concerned about both the process used by the data mining algorithm to create the model and the results generated by viewing data through the model.


Which One Is Better?

Asking this question is akin to asking whether a hammer is better than a wrench; the answer depends on the job. Data mining depends on both data-driven and model-driven data mining techniques to be truly effective, depending on what questions are asked and what data is analyzed. For example, a data-driven approach may be used on fraudulent credit card transactions to isolate clusters of similar transactions. Clustering uses a self-comparison approach to find significant groups, or clusters, of data elements. The attributes of each data element are matched across the attributes of all the other data elements in the same set, and are grouped with records that are most similar to the sampled data element. After they are discovered, these individual clusters of data can be modeled using a model-driven data mining technique to construct a data mining model of fraudulent credit card transactions that fit a certain set of attributes. The model can then be used as part of an estimation process, also model-driven, to predict the possibility of fraud in other, unknown credit card transactions.

The various tasks are not completely locked into either model-driven or data-driven data mining. For example, a decision tree data mining model can be used for either model-driven data mining, to predict unknown data from known data, or data-driven data mining, to discover new patterns relating to a specific data attribute.

Data-driven and model-driven data mining can be employed separately or together, in varying amounts, depending on your business requirements. There is no set formula for mining data; each data set has its own patterns and rules.

Data Mining Algorithm Provider Selection


In Analysis Services, a data mining model is a flexible structure that is designed to support the nearly infinite number of ways data can be modeled. The data mining algorithm gives the data mining model shape, form, and behavior.

The two algorithms included in Analysis Services, Microsoft® Decision Trees and Microsoft Clustering, are very different in behavior and produce very different models, as described below.

Both algorithms can be used together to select and model data for business scenarios. For more information on using both algorithms in concert, see "Model-Driven and Data-Driven Data Mining" earlier in this chapter.

Microsoft Decision Trees

The Microsoft Decision Trees algorithm is typically employed in classification and estimation tasks, because it focuses on providing histogram information for paths of rules and patterns within data. One of the benefits of this algorithm is the generation of easily understandable rules. By following the nodes along a single series of branches, a rule can be constructed to derive a single classification of cases.

One of the criteria used for evaluating the success of a data mining algorithm is referred to as fit. Fit is typically represented as a value between 0 and 1, calculated by taking the covariance between the predicted and actual values of evaluated cases, dividing it by the product of the standard deviations of those values, and squaring the result. This measurement, also referred to as r-squared, ranges from 0, meaning the model provides no predictive value at all because none of the predicted values were even close to the actual values, to 1, meaning the model is a perfect fit because the predicted values completely match the actual values of evaluated cases.
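As a concrete illustration, the fit calculation described above can be sketched as follows. This is a generic r-squared computation, not code from Analysis Services.

```python
# Sketch of the "fit" (r-squared) measurement: covariance between predicted
# and actual values, divided by the product of their standard deviations,
# gives the correlation r; squaring r gives the fit value.
from math import sqrt

def r_squared(actual, predicted):
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p)
              for a, p in zip(actual, predicted)) / n
    sd_a = sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    sd_p = sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    r = cov / (sd_a * sd_p)
    return r * r  # 0 = no predictive value, 1 = perfect fit

fit = r_squared([1, 2, 3, 4], [1, 2, 3, 4])  # close to 1.0: a perfect fit
```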

However, a perfect fit is not as desirable as it sounds. One of the difficulties encountered with data mining algorithms in general is the tendency to perfectly classify every single case in a training case set, referred to as overfitting. The goal of a data mining model, generally speaking, is to build a statistical model of the business scenario that generates the data, not to build an exact representation of the training data itself. An overfitted data mining model performs well when evaluating training data extracted from a particular data set, but performs poorly when evaluating other cases from the same data set. Even well-prepared training case sets can fall victim to overfitting, because of the nature of random selection.

For example, the following table illustrates a training case set with five cases, representing customers with cancelled accounts, extracted from a larger domain containing thousands of cases.



Customer Name   Gender   Age   Account Months
Beth            Female   28     6
Dennis          Male     45    12
Elaine          Female   45    24
Michelle        Female   47    18
John            Male     37    36

The following diagram illustrates a highly overfitted decision tree, generated by a data mining model from this training case set.

The decision tree perfectly describes the training case set, with a single leaf node per customer. Because the Age and Gender columns were used for input and the Account Months column was used as output, the model correctly predicted for this training case set that every female customer with an age of 45 would close her account in 24 months, while every male customer with an age of 45 would close his account in 12 months. This model would be practically useless for predictive analysis: the training set has too few cases to model effectively, and the decision tree generated for it has far too many branches for the data.

There are two sets of techniques used to prevent such superfluous branches in a data mining model while maintaining a good fit for the model. The first set of techniques, referred to as pruning techniques, allows the decision tree to completely overfit the model and then removes branches within the decision tree to make the model more generalized. This set of techniques is knowledge-intensive, typically requiring both a data mining analyst and a domain expert to properly perform pruning techniques.

The second set of techniques, referred to as bonsai or stunting techniques, is used to stunt the growth of the tree by applying tests at each node to determine if a split is statistically significant. The Microsoft Decision Trees data mining algorithm automatically employs stunting techniques on data mining models, guided by adjustable data mining parameters, and prevents overfitting training case sets in data mining models that use the algorithm.

There are two data mining parameters that can be adjusted to fine-tune the stunting techniques used by the Microsoft Decision Trees algorithm. The first, MINIMUM_LEAF_CASES, determines how many leaf cases are needed to generate a new split in the decision tree. To generate the data mining model in the above example, this parameter was set to 1, so that each case could be represented as a leaf node in the decision tree. Running the same training case set against the same data mining model, but with the MINIMUM_LEAF_CASES parameter set to 2, produces the following decision tree.

 

The above decision tree diagram is less overfitted; one leaf node is used to predict two cases, while the other leaf node is used to predict the other three cases in the training data set. The algorithm was instructed not to make a decision unless two or more leaf cases would result from the decision. This is a "brute force" way of ensuring that not every case ends up as a leaf case in a data mining model, in that it has obvious and easily understood effects on the model.

Using the second parameter, COMPLEXITY_PENALTY, involves more experimentation. The COMPLEXITY_PENALTY parameter adds cumulative weight to each decision made at a specific level in a decision tree, making it more difficult to continue making decisions as the tree grows. The smaller the value provided to the COMPLEXITY_PENALTY parameter, the easier it is for the data mining algorithm to generate a decision. For example, the data mining model examples used to demonstrate the MINIMUM_LEAF_CASES parameter were created using a COMPLEXITY_PENALTY value of just 0.000001, to encourage a highly complex model with so few cases. Setting the value to 0.50, the default used for data mining models with between 1 and 10 attributes, greatly increases the complexity penalty. The following decision tree represents this penalization of model complexity.

 

Because the individual cases do not differ significantly, given the total number of cases included in the training case set, the complexity penalty prevents the algorithm from creating splits. Therefore, the data mining algorithm provider can supply only a single node to represent the training case set; the data mining model is now too generalized. The value used for COMPLEXITY_PENALTY differs from data mining model to data mining model, because of the individuality of the data being modeled. The default values provided in SQL Server Books Online are based on the total number of attributes being modeled, and provide a good basis on which to experiment.

When using data mining parameters to alter the process of generating data mining models, you should create several versions of the same model, each time changing the data mining parameters and observing the reaction in the data mining model. This iterative approach will provide a better understanding of the effects of the data mining parameters on training a data mining model when using the Microsoft Decision Trees algorithm.
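To make the interplay of the two parameters concrete, the following sketch shows a split test in the spirit of the stunting techniques described above. The parameter names mirror MINIMUM_LEAF_CASES and COMPLEXITY_PENALTY, but the scoring rule is invented for illustration; it is not the algorithm's actual internal test.

```python
# Hypothetical stunting ("bonsai") sketch: a split is accepted only if each
# resulting leaf holds enough cases and the split's score still beats a
# penalty that accumulates with tree depth. Scoring rule invented here.

def should_split(left_cases, right_cases, split_score, depth,
                 minimum_leaf_cases=10, complexity_penalty=0.5):
    # Rule 1: both leaves must contain at least minimum_leaf_cases cases.
    if left_cases < minimum_leaf_cases or right_cases < minimum_leaf_cases:
        return False
    # Rule 2: the split's score must exceed a cumulative depth-based
    # penalty, so each additional tree level is harder to justify.
    return split_score > complexity_penalty * (depth + 1)

# A weak split deep in the tree is rejected under the default penalty of
# 0.50, but accepted when the penalty is nearly zero (0.000001).
deep_weak_default = should_split(10, 10, 1.0, 3)
deep_weak_tiny = should_split(10, 10, 1.0, 3, complexity_penalty=0.000001)
```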

The Microsoft Decision Trees algorithm works best with business scenarios involving the classification of cases or the prediction of specific outcomes based on a set of cases encompassing a few broad categories.


Microsoft Clustering

The Microsoft Clustering algorithm provider is typically employed in association and clustering tasks, because it focuses on providing distribution information for subsets of cases within data.

The Microsoft Clustering algorithm provider uses an expectation-maximization (EM) algorithm to segment data into clusters based on the similarity of attributes within cases.

The algorithm iteratively reviews the attributes of each case with respect to the attributes of all other cases, using weighted computation to determine the logical boundaries of each cluster. The algorithm continues this process until all cases belong to one (and only one) cluster, and each cluster is represented as a single node within the data mining model structure.
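The iterative process can be illustrated with a minimal one-dimensional EM sketch. This toy version fits two Gaussian clusters to numeric cases and then assigns each case to the cluster with the higher responsibility; it illustrates the general expectation-maximization technique, not the provider's actual implementation.

```python
# Minimal 1-D expectation-maximization (EM) sketch with two clusters.
from math import exp, sqrt, pi

def em_two_clusters(cases, iterations=50):
    # Initialize the two cluster means from the extremes of the data.
    m = [min(cases), max(cases)]
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: each case's responsibility toward each cluster.
        resp = []
        for x in cases:
            dens = [weight[k] / sqrt(2 * pi * var[k]) *
                    exp(-(x - m[k]) ** 2 / (2 * var[k])) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate means, variances, and weights.
        for k in range(2):
            rk = sum(r[k] for r in resp)
            m[k] = sum(r[k] * x for r, x in zip(resp, cases)) / rk
            var[k] = max(sum(r[k] * (x - m[k]) ** 2
                             for r, x in zip(resp, cases)) / rk, 1e-3)
            weight[k] = rk / len(cases)
    # Hard assignment: each case belongs to one (and only one) cluster.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return m, labels

means, labels = em_two_clusters([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
# The recovered means land near 1.0 and 5.0, splitting the cases 3 and 3.
```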

The Microsoft Clustering algorithm provider is best used in situations where possible natural groupings of cases may exist, but are not readily apparent. This algorithm is often used to identify and separate multiple patterns within large data sets for further data mining; clusters are self-defining, in that the variations of attributes within the domain of the case set determine the clusters themselves. No external data or pattern is applied to discover the clusters internal to the domain.


Creating Data Mining Models


Data mining models can be created in a number of ways in Analysis Services, depending on the location of the data mining model. Data mining models created on the Analysis server can be created only through the Decision Support Objects (DSO) library. Analysis Manager uses DSO through the Mining Model Wizard to create new relational or OLAP data mining models. Custom client applications can also use DSO to create relational or OLAP data mining models on the server.

Relational data mining models can also be created on the client through the use of PivotTable Service and the CREATE MINING MODEL statement. For example, the following statement can be used to recreate the Member Card RDBMS data mining model from the FoodMart 2000 database on the client.

CREATE MINING MODEL [Member Card RDBMS]
   ([customer id] LONG KEY,
    [gender] TEXT DISCRETE,
    [marital status] TEXT DISCRETE,
    [num children at home] LONG CONTINUOUS,
    [total children] LONG DISCRETE,
    [yearly income] TEXT DISCRETE,
    [education] TEXT DISCRETE,
    [member card] TEXT DISCRETE PREDICT)
USING Microsoft_Decision_Trees



This statement can be used to create a temporary data mining model at the session level, or a permanent data mining model stored on the client. To create a permanent data mining model on the client, the Mining Location PivotTable Service property is used to specify the directory in which the data mining model will be stored. The same property is also used to locate existing permanent data mining models for reference.

The Data Mining Sample Application, provided with the SQL Server 2000 Resource Kit, is a great tool for prototyping data mining models. You can test each data mining model at session scope; once a data mining model is approved, the same query can be used to construct it locally.

The CREATE MINING MODEL statement can be issued as an action query through any data access technology capable of supporting PivotTable Service, such as Microsoft ActiveX® Data Objects (ADO). The USING clause is used to assign a data mining algorithm provider to the data mining model.

For more information on the syntax and usage of the CREATE MINING MODEL statement, see PivotTable Service Programmer's Reference in SQL Server Books Online. For more information regarding the details of data mining column definition, see the OLE DB for Data Mining specification in the MSDN® Online Library.
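The column definitions in a CREATE MINING MODEL statement follow a simple pattern: bracketed name, data type, content type, and an optional PREDICT flag. The following hypothetical helper (not part of any Microsoft API) assembles such a statement from a column list, making the pattern explicit.

```python
# Hypothetical helper (for illustration only) that assembles a
# CREATE MINING MODEL statement from (name, data type, content type)
# column definitions; predictable columns end with "PREDICT".

def create_mining_model(name, columns, algorithm="Microsoft_Decision_Trees"):
    cols = ",\n ".join("[%s] %s %s" % c for c in columns)
    return "CREATE MINING MODEL [%s]\n(%s)\nUSING %s" % (name, cols, algorithm)

statement = create_mining_model(
    "Member Card RDBMS",
    [("customer id", "LONG", "KEY"),
     ("gender", "TEXT", "DISCRETE"),
     ("member card", "TEXT", "DISCRETE PREDICT")])
```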


Training Data Mining Models


Once a data mining model is created, the training case set is then supplied to the data mining model through the use of a training query.

Training case sets can be constructed in two ways. In the first approach, the desired training data is physically separated from the larger data set into a different data structure used as a staging area, and a training query then retrieves all of the records in that staging area. In the second approach, the training query itself extracts only the desired training data, querying the larger data set directly. The first approach is recommended for performance reasons, and because the training query used for the data mining model does not need to change when the training case set changes; you can instead place alternate training data into the staging area. However, this approach can be impractical if the volume of data to be transferred is extremely large or sensitive, or if the original data set does not reside in an enterprise data warehouse. In such cases, the second approach is more suitable for data mining purposes.

Once the records are extracted, the data mining model is trained by the use of an INSERT INTO query executed against the data mining model, which instructs the data mining algorithm provider to analyze the extracted records and provide statistical data for the data mining model.

In Analysis Services, the training query of a data mining model is typically constructed automatically, using the first approach. The information used to supply input and predictable columns to the data mining model is also used to construct the training query, and the schema used to construct the data mining model is used to supply the training data as well.

For example, the training query used for the Member Card RDBMS relational data mining model in the FoodMart 2000 database is shown below.

INSERT INTO [Member Card RDBMS]
   (SKIP,
    [gender],
    [marital status],
    [num children at home],
    [total children],
    [yearly income],
    [education],
    [member card])
OPENROWSET
   ('MSDASQL.1',
    'Provider=MSDASQL.1;Persist Security Info=False;Data Source=FoodMart 2000',
    'SELECT DISTINCT
        "Customer"."customer_id" AS 'customer id',
        "Customer"."gender" AS 'gender',
        "Customer"."marital_status" AS 'marital status',
        "Customer"."num_children_at_home" AS 'num children at home',
        "Customer"."total_children" AS 'total children',
        "Customer"."yearly_income" AS 'yearly income',
        "Customer"."education" AS 'education',
        "Customer"."member_card" AS 'member card'
     FROM "Customer"')

The INSERT INTO statement is used to insert the data retrieved by the OPENROWSET clause into the data mining model. The data mining model assumes that all records in the Customer table, which was used to define the data mining model, are to be used as the training case set for the data mining model.

The second approach, the construction of a custom training query, is more difficult to perform in Analysis Services. The property used to supply custom training queries is not directly available through the Analysis Manager or either of the data mining model editors.

There are two methods used to support the second approach. The first method involves the use of the Decision Support Objects (DSO) library in a custom application to change the training query used by the data mining model. The DSO MiningModel object provides the TrainingQuery property specifically for this purpose. If the default training query is used for a data mining model, this property is set to an empty string (""); otherwise, you can supply an alternate training query for use with the mining model.

The second method involves the use of another data access technology, such as ADO, to directly supply a training query to a data mining model. In this case, the training query can be directly executed against the data mining model.

The following example is a custom training query for the Member Card RDBMS data mining model that selects for analysis only those customers who own houses. A WHERE clause is used in the OPENROWSET statement to restrict the selection of records from the Customer table.

INSERT INTO [Member Card RDBMS]
   (SKIP,
    [gender],
    [marital status],
    [num children at home],
    [total children],
    [yearly income],
    [education],
    [member card])
OPENROWSET
   ('MSDASQL.1',
    'Provider=MSDASQL.1;Persist Security Info=False;Data Source=FoodMart 2000',
    'SELECT DISTINCT
        "Customer"."customer_id" AS 'customer id',
        "Customer"."gender" AS 'gender',
        "Customer"."marital_status" AS 'marital status',
        "Customer"."num_children_at_home" AS 'num children at home',
        "Customer"."total_children" AS 'total children',
        "Customer"."yearly_income" AS 'yearly income',
        "Customer"."education" AS 'education',
        "Customer"."member_card" AS 'member card'
     FROM "Customer"
     WHERE "Customer"."houseowner" = "Y"')

The resulting data mining model provides analysis on the same attributes, but with a different training case set. By using custom training queries, the same data mining model structure can be used to provide different outlooks on data without the need to completely redevelop a data mining model.

The Microsoft OLE DB for Data Mining provider supports a number of options in the INSERT INTO statement for selecting training data. The OPENROWSET statement, shown in the previous example, is the most common method used, but other methods are supported. For more information about the various supported options, see the OLE DB for Data Mining specification in the MSDN Online Library.

Also, the Data Mining Sample Application, shipped with the SQL Server 2000 Resource Kit, can be used to construct and examine a wide variety of training queries quickly and effectively.

