Strategies of Data Mining




Data Transformation


Data transformation, in terms of data mining, is the process of changing the form or structure of existing attributes. It is separate from data cleansing and data enrichment because it neither corrects existing attribute data nor adds new attributes; instead, it grooms existing attributes for data mining purposes.

The guidelines for data transformation are similar to both data mining and data warehousing, and a large amount of reference material exists for data transformation in data warehousing environments. For more information about data transformation guidelines in data warehousing, see Chapter 19, "Data Extraction, Transformation, and Loading Techniques."

One of the most common forms of data transformation used in data mining is the conversion of continuous attributes into discrete attributes, referred to as discretization. Many data mining algorithms perform better when working with a small number of discrete attributes, such as salary ranges, rather than continuous attributes, such as actual salaries. This step, as with other data transformation steps, does not add information to the data, nor does it clean the data; instead, it makes data easier to model. Some data mining algorithm providers can discretize data automatically, using a variety of algorithms designed to create discrete ranges based on the distribution of data within a continuous attribute. If you intend to take advantage of such automatic discretization, ensure that your training case set has enough cases for the data mining algorithm to adequately determine representative discrete ranges.
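To make the idea concrete, the following is a minimal Python sketch of manual discretization using pandas; the salary values, bin boundaries, and range labels are purely illustrative and would in practice be chosen from the actual distribution of the continuous attribute.

```python
import pandas as pd

# Hypothetical case set with a continuous "salary" attribute.
cases = pd.DataFrame({"salary": [18000, 32000, 47500, 61000, 89000, 154000]})

# Discretize the continuous attribute into labeled ranges. The boundaries
# here are illustrative; choose them from the attribute's real distribution.
bins = [0, 25000, 50000, 75000, 100000, float("inf")]
labels = ["<25K", "25K-50K", "50K-75K", "75K-100K", "100K+"]
cases["salary_range"] = pd.cut(cases["salary"], bins=bins, labels=labels)

print(cases)
```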

Too many discrete values within a single attribute can overwhelm some data mining algorithms. For example, using postal codes from customer addresses to categorize customers by region is an excellent technique if you plan to examine a small region. If, by contrast, you plan on examining the customer patterns for the entire country, using postal codes can lead to 50,000 or more discrete values within a single attribute; you should use an attribute with a wider scope, such as the city or state information supplied by the address.
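The sketch below illustrates one way to trade a high-cardinality attribute for one with a wider scope; the customer data and the postal-code-to-state lookup table are hypothetical, standing in for whatever reference data accompanies your addresses.

```python
import pandas as pd

# Hypothetical customer cases keyed by postal code.
customers = pd.DataFrame({"postal_code": ["98052", "10001", "98101", "60601"]})

# Hypothetical lookup table; in practice this would come from the address
# data itself or from a reference table with full postal-code coverage.
postal_to_state = {"98052": "WA", "98101": "WA", "10001": "NY", "60601": "IL"}

# Replace the high-cardinality postal code with a wider-scope attribute.
customers["state"] = customers["postal_code"].map(postal_to_state)
print(customers)
```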


Training Case Set Preparation


The training case set is used to construct the initial set of rules and patterns that serve as the basis of a data mining model. Preparing a training case set is essential to the success of the data mining process. Generally, several different data mining models will be constructed from the same training case set, as part of the data mining model construction process. There are several basic guidelines used when selecting cases for the preparation of a training case set, but the usefulness of the selection is almost entirely based on the domain of the data itself.

Sampling and Oversampling


Typically, you want to select as many training cases as possible when creating a data mining model, ensuring that the training case set closely represents the density and distribution of the production case set. Select the largest training case set you can, to smooth the distribution of training case attributes. The process of creating such a representative set of data, called sampling, is best handled by selecting records completely at random. In theory, such random sampling should provide a truly unbiased view of data.
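As a simple illustration, the sketch below draws a uniform random sample of cases with pandas; the function name and parameters are assumptions for the example, not part of any particular data mining toolset.

```python
import pandas as pd

def sample_training_cases(production_cases: pd.DataFrame,
                          n_cases: int,
                          seed: int = 42) -> pd.DataFrame:
    """Draw a uniform random sample of cases for the training case set."""
    # Sampling without replacement gives each production case an equal
    # chance of selection, approximating an unbiased view of the data.
    return production_cases.sample(n=n_cases, random_state=seed)
```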

However, random sampling does not always provide for specific business scenarios, and a large training case set may not always be best. For example, if you are attempting to model a rare situation within your data, you want to ensure that the frequency of occurrences for the desired situation is statistically high enough to provide trend information.

This technique of increasing the density of rare occurrences in a sample, called oversampling, influences the statistical information conveyed by the training case set. Such influence can be of great benefit when attempting to model very rare cases, sensitive cases in which positive confirmation of the existence of a case must first be made, or when the cases to be modeled occur within a very short period of time. For example, "no card" credit card fraud, in which a fraudulent credit card transaction occurs without the use of a credit card, represents about 0.001 percent of all credit card transactions stored in a particular data set. Sampling would theoretically return 1 fraud case per 100,000 transaction cases—while accurate, the model would overwhelmingly provide information on successful transactions, because the standard deviation for fraud cases would be unacceptably high for modeling purposes. The data mining model would be 99.999 percent accurate, but would also be completely useless for the intended business scenario—finding patterns in no-card fraud transactions.

Instead, oversampling would be used to provide a larger number of fraudulent cases within the training case set. A higher number of fraudulent cases can provide better insight into the patterns behind fraudulent transactions. There are a few drawbacks with oversampling, though, so use this technique carefully. Evaluation of a data mining model created with oversampled data must be handled differently because of the change in ratios between rare and common occurrences in the training case set. For example, the credit card fraud training set described above is constructed from five years of transaction data, or approximately 50 million records. This means that, out of the entire data set to be mined, only 500 fraudulent records exist. If random sampling were used to construct a training case set with 1 million records (a 2 percent representative sample), only 10 desired cases would be included. The training case set is therefore oversampled so that the fraudulent cases represent 10 percent of the total number of training cases: all 500 fraudulent cases are extracted, and an additional 4,500 non-fraudulent cases are randomly selected, producing a training case set of 5,000 cases, of which 10 percent are fraudulent transactions. When creating a data mining model involving the probability of two likely outcomes, the training case set should have a ratio of rare outcomes to common outcomes of approximately 10 to 40 percent, with 20 to 30 percent considered ideal. This ratio can be achieved through oversampling, providing a better statistical sample that focuses on the desired rare outcome.
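The following sketch, assuming the cases live in a pandas DataFrame with a hypothetical outcome column, shows the oversampling scheme just described: keep every rare case and randomly select only enough common cases to reach the target ratio.

```python
import pandas as pd

def oversample(cases: pd.DataFrame, target_col: str, rare_value,
               rare_fraction: float = 0.10, seed: int = 42) -> pd.DataFrame:
    """Build a training case set in which the rare outcome makes up
    rare_fraction of the cases, as in the fraud example above."""
    rare = cases[cases[target_col] == rare_value]
    common = cases[cases[target_col] != rare_value]

    # Keep all rare cases; sample just enough common cases so that the
    # rare cases form the requested share of the training case set.
    n_common = int(len(rare) * (1 - rare_fraction) / rare_fraction)
    common_sample = common.sample(n=n_common, random_state=seed)

    # Shuffle the combined set before training.
    return pd.concat([rare, common_sample]).sample(frac=1, random_state=seed)

# With 500 fraudulent cases and rare_fraction=0.10, n_common works out to
# 4,500, giving the 5,000-case training set described above.
```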

The difficulty with this training case set is that one non-fraudulent case, in essence, represents 11,111 cases in the original data set. Evaluating a data mining model using this oversampled training case set means taking this ratio into account when computing, for example, the amount of lift provided by the data mining model when evaluating fraudulent transactions.
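As a rough illustration of how that ratio enters the arithmetic, the sketch below re-weights sampled counts back to population proportions using the figures from the fraud example; the function and constants are illustrative only, not a complete lift calculation.

```python
# Each sampled non-fraudulent case stands in for roughly 11,111 cases in the
# original data set (49,999,500 / 4,500), while each fraudulent case
# represents only itself, because all 500 were extracted.
WEIGHT_COMMON = 49_999_500 / 4_500   # ~11,111
WEIGHT_RARE = 1.0

def estimated_population_fraud_rate(n_rare_sampled: int,
                                    n_common_sampled: int) -> float:
    """Re-weight sampled counts back to population proportions."""
    rare_total = n_rare_sampled * WEIGHT_RARE
    common_total = n_common_sampled * WEIGHT_COMMON
    return rare_total / (rare_total + common_total)

# The 500 fraudulent and 4,500 non-fraudulent sampled cases recover the
# original 0.001 percent fraud rate.
print(f"{estimated_population_fraud_rate(500, 4_500):.5%}")
```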

For more information on how to evaluate an oversampled data mining model, see "Data Mining Model Evaluation" later in this chapter.


Selecting Training Cases

When preparing a training case set, you should select data that is as unambiguous as possible in representing the expected outcome to be modeled. The ambiguity of the selected training cases should be in direct proportion to the breadth of focus for the business scenario to be predicted. For example, if you are attempting to cluster failed products in order to discover possible failure patterns, selecting all products that failed is appropriate for your training case set. By contrast, if you are trying to predict product failure for specific products due to environmental conditions, you should select only those cases in which the specific product failed directly as a result of environmental conditions, not simply all failed products.

This may seem like adding bias to the training case set, but one of the primary reasons for wide variances between predicted and actual results when working with data mining models is that the patterns stored in the data mining model are not relevant to predicting the desired business scenario, and such irrelevant patterns are introduced in part by ambiguous training cases.

One of the difficulties encountered when selecting cases is the definition of a business scenario and desired outcome. For example, a common business scenario involves grouping cases according to a set of known attributes to discover hidden patterns. The clustering algorithm is used in just this way: the clustering of cases based on exposed attributes can reveal a hidden attribute, the key to the clustering behavior. So, the desired outcome may have nothing to do with the clusters themselves, but rather with the hidden attribute discovered through the clustering behavior. Before you select cases, be sure you understand both the business scenario used to create the data mining model and the information produced by the created data mining model.

The training case set is not the only source of stored pattern and rule information for the data mining model. The data mining model evaluation step of the data mining process allows you to refine this stored information with additional case sets. Through refinement, the data mining model can unlearn irrelevant patterns and improve its prediction accuracy. But the data mining model uses the training case set as its first step toward learning information from data, so your model will benefit from careful selection of training cases.


