Strategies of Data Mining

Date	08.10.2017

The Data Mining Process

Analysis Services provides a set of easy-to-use, robust data mining tools. To make the best use of these tools, you should follow a consistent data mining process, such as the one outlined below:

•	Data Selection: The process of locating and identifying data for data mining purposes.
•	Data Cleaning: The process of inspecting data for physical inconsistencies, such as orphan records or required fields set to null, and logical inconsistencies, such as accounts with closing dates earlier than starting dates.
•	Data Enrichment: The process of adding information to data, such as creating calculated fields or adding external data for data mining purposes.
•	Data Transformation: The process of transforming data physically, such as changing the data types of fields, and logically, such as increasing or decreasing granularity, for data mining purposes.
•	Training Case Set Preparation: The process of preparing a case set for data mining. This may include secondary transformation and extract query design.
•	Data Mining Model Construction: The process of choosing a data mining model algorithm and tuning its parameters, then running the algorithm against the training case set to construct a data mining model.
•	Data Mining Model Evaluation: The process of evaluating the created data mining model against a case set of test data. A second data set, also called a holdout set, is viewed through the data mining model, and the resulting predictions are compared against the actual results in the holdout set to determine predictive accuracy.
•	Data Mining Model Feedback: After the data mining model has been evaluated, it can be used to provide analysis of unknown data. The resulting analysis can be used to supply either operational or closed-loop decision support.
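
The evaluation step can be illustrated with a minimal Python sketch. It is not Analysis Services code: the case set, the `churned` outcome attribute, the 30 percent holdout fraction, and the majority-class stand-in model are all assumptions chosen only to show how predictions on a holdout set are compared against actual outcomes.

```python
import random
from collections import Counter

def split_cases(cases, holdout_fraction=0.3, seed=0):
    """Shuffle the cases and split them into a training set and a holdout set."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def train_majority_model(training_cases, outcome):
    """Stand-in 'model' (an assumption): always predicts the most common training outcome."""
    most_common = Counter(c[outcome] for c in training_cases).most_common(1)[0][0]
    return lambda case: most_common

def accuracy(model, holdout_cases, outcome):
    """Fraction of holdout cases whose predicted outcome matches the actual outcome."""
    hits = sum(1 for c in holdout_cases if model(c) == c[outcome])
    return hits / len(holdout_cases)

# Hypothetical case set: 10 customers, 3 of whom churned.
cases = [{"customer_id": i, "churned": i >= 7} for i in range(10)]
training, holdout = split_cases(cases)
model = train_majority_model(training, "churned")
score = accuracy(model, holdout, "churned")
```

The important design point is that the holdout cases are never seen during training, so `score` estimates how the model behaves on data it has not already memorized.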

If you are modeling data from a well-designed data warehouse, the first four steps are generally done for you as part of the process used to populate the data warehouse. However, even data warehouse data may need additional cleaning, enrichment, and transformation, because the data mining process takes a slightly different view of data than either data warehousing or OLAP processes.
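
As an illustration of the additional cleaning that may still be needed, the following Python sketch checks for the physical and logical inconsistencies named in the Data Cleaning step. The account records, field names, and the `find_inconsistencies` helper are hypothetical and serve only as an example.

```python
from datetime import date

def find_inconsistencies(accounts, customer_ids, required_fields):
    """Return (account_id, problem) pairs for physical and logical inconsistencies."""
    problems = []
    for acct in accounts:
        # Physical: orphan record (refers to a customer that does not exist).
        if acct["customer_id"] not in customer_ids:
            problems.append((acct["account_id"], "orphan record"))
        # Physical: required field set to null.
        for field in required_fields:
            if acct.get(field) is None:
                problems.append((acct["account_id"], f"null {field}"))
        # Logical: closing date earlier than starting date.
        opened, closed = acct.get("opened"), acct.get("closed")
        if opened is not None and closed is not None and closed < opened:
            problems.append((acct["account_id"], "closed before opened"))
    return problems

# Hypothetical data: customers 1 and 2 exist; account A2 points at a missing customer.
customer_ids = {1, 2}
accounts = [
    {"account_id": "A1", "customer_id": 1, "opened": date(2001, 1, 5), "closed": None},
    {"account_id": "A2", "customer_id": 9, "opened": date(2001, 2, 1), "closed": None},
    {"account_id": "A3", "customer_id": 2, "opened": None, "closed": None},
    {"account_id": "A4", "customer_id": 2, "opened": date(2001, 6, 1), "closed": date(2001, 3, 1)},
]
problems = find_inconsistencies(accounts, customer_ids, required_fields=["opened"])
```

Here `problems` lists one issue each for A2, A3, and A4, while A1 passes every check.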

Data Selection

There are two parts to selecting data for data mining. The first part, locating data, tends to be more mechanical in nature than the second part, identifying data, which requires significant input by a domain expert for the data. (A domain expert is someone who is intimately familiar with the business purposes and aspects, or domain, of the data to be examined.)

Locating Data

Data mining can be performed on almost every database, but several general database types are typically supported in business environments. Not all of these database types are suitable for data mining.

The recommended database types for data mining are listed below:

•	Enterprise Data Warehouse

For a number of reasons, a data warehouse maintained at the enterprise level is ideal for data mining. The processes used to select, clean, enrich, and transform data for data mining are nearly identical to the processes used on data for data warehousing. The enterprise data warehouse is optimized for high-volume queries and is usually designed to represent business entities in a dimensional format, making it easier to identify and isolate specific business scenarios. By contrast, OLTP databases are generally optimized for high-volume updates and typically use an entity-relationship (E-R) format.

•	Data Mart

A data mart is a subset of the enterprise data warehouse, encapsulated for specific business purposes. For example, a sales and marketing data mart would contain a copy of the dimensional tables and fact tables kept in the enterprise data warehouse that pertain to sales and marketing business purposes. The tables in such a data mart would contain only the data necessary to satisfy sales and marketing research.

Because data marts are aggregated according to the needs of business users, most data marts are not suitable for data mining. However, a data mart designed specifically for data mining can be constructed, giving you the power of data mining in an enterprise data warehouse with the flexibility of additional selection, cleaning, enrichment, and transformation specifically for data mining purposes. Data marts designed for this purpose are known by other terms, but serve the same purpose.

OLAP databases are often modeled as a data mart. Because their functionality and use are similar to other types of data marts, OLAP databases fit into this category neatly. OLAP databases are also aggregated according to the needs of business users, so the same issues apply.

Overaggregation can also cause problems when mining OLAP data. OLAP databases are heavily aggregated; indeed, the point of such data is to reduce the granularity of the typical OLTP or data warehouse database to an understandable level. This involves a great deal of summarization and "blurring" when it comes to viewing detailed information, including the removal of attributes unnecessary to the aggregation process. If there is too much summarization, there will not be enough attributes left to mine for meaningful information. This overaggregation can start well before the data reaches Analysis Services, as data warehouses typically aggregate fact table data. You should carefully review the incoming relational and OLAP data before deciding to mine OLAP data.
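
The effect of overaggregation can be seen in a small Python sketch. The sales rows, attribute names, and the `aggregate` helper are hypothetical; the point is only that grouping by one dimension discards every attribute not used in the aggregation.

```python
from collections import defaultdict

# Hypothetical detailed fact rows: one row per sale.
sales = [
    {"region": "East", "product": "bike", "customer_age": 23, "amount": 300.0},
    {"region": "East", "product": "helmet", "customer_age": 41, "amount": 40.0},
    {"region": "West", "product": "bike", "customer_age": 35, "amount": 320.0},
    {"region": "West", "product": "bike", "customer_age": 52, "amount": 310.0},
]

def aggregate(rows, keys, measure):
    """Sum `measure` grouped by `keys`; every other attribute is discarded."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[measure]
    return [dict(zip(keys, group), **{measure: total}) for group, total in totals.items()]

by_region = aggregate(sales, ["region"], "amount")

# Attributes present in the detail rows but gone after aggregation.
detail_attributes = set(sales[0])
aggregated_attributes = set(by_region[0])
lost_attributes = detail_attributes - aggregated_attributes
```

After aggregation only `region` and `amount` remain, so a model trained on `by_region` can no longer learn anything about products or customer ages.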

Conversely, you should not mine data in the database types listed below.

•	OLTP database

OLTP databases, also known as operational databases, are not optimized for the kind of wholesale retrieval that data mining needs. Bulk retrieval can markedly degrade access and transaction speed for other applications that depend on the high-volume update optimization of such databases. Lack of pre-aggregation can also increase the time needed to train data mining models based on OLTP databases, because of the many joins and high record counts inherent in bulk retrieval queries executed on OLTP databases.

•	Operational data store (ODS) database

The operational data store (ODS) database has come into popular use to process and consolidate the large volumes of data typically handled by OLTP databases. The business definition of an ODS database is fluid, but ODS databases are typically used as a "buffer zone" between raw OLTP data and applications that require access to such high-granularity data for functionality, but need to be isolated from the OLTP database for query performance reasons.

While data mining ODS databases may be useful, ODS databases are known for rapid changes; such databases mirror OLTP data with low latency between updates. The data mining model then becomes a lens on a rapidly moving target, and the user is never sure that the data mining model accurately reflects the true historical view of the data.

Data mining is a search for experience in data, not a search for intelligence in data. Because developing this experience requires a broad, open view of historical data, most volatile transactional databases should be avoided.

When locating data for data mining, ideally you should use well-documented, easily accessible historical data, because many of the steps in the data mining process require free and direct access to data. Security issues, interdepartmental communications, physical network limitations, and so on can restrict free access to historical data. All of the issues that can potentially restrict such access should be reviewed as part of the design process for implementing a data mining solution.

Identifying Data

This step is one of the most important of all steps in the data mining process. The quality of selected data ultimately determines the quality of the data mining models based on the selected data. The process of identifying data for use in data mining roughly parallels the process used for selecting data for data warehousing.

When identifying data for data mining, you should ask the following three questions:

1.	Does this data meet the requirements for the proposed business scenario? The data should not only match the purpose of the business scenario, but also its granularity. For example, attempting to model product performance information requires the product data to represent individual products, because each product becomes a case in a set of cases.
2.	Is this data complete? The data should have all of the attributes needed to accurately describe the business scenario. Remember that a lack of data is itself information. In the product performance scenario above, a lack of performance information about a particular product could indicate a positive performance trend for a family of products: the product may perform so well that no customer has reported any performance issues with it.
3.	Does this data contain the desired outcome attributes? When performing predictive modeling, the data used to construct the data mining model must contain the known desired outcome. Sometimes, to satisfy this requirement, a temporary attribute is constructed to provide a discrete outcome value for each case; this can be done in the data enrichment and data transformation steps.
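
The granularity check and the temporary outcome attribute described in these questions can be sketched in Python as follows. The product records, the `performance` attribute, and the 5 percent return-rate threshold are assumptions made for illustration, not values from any real scenario.

```python
# Hypothetical product records: one row per product, so each row is one case.
products = [
    {"product_id": "P1", "units_sold": 1200, "returns": 12},
    {"product_id": "P2", "units_sold": 800, "returns": 96},
    {"product_id": "P3", "units_sold": 500, "returns": None},  # no return data
]

def is_case_level(rows, key):
    """True when `key` is unique, i.e. the data has exactly one row per case."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))

def add_outcome(rows, threshold=0.05):
    """Add a temporary discrete outcome attribute derived from the return rate."""
    labeled = []
    for row in rows:
        if row["returns"] is None:
            outcome = "unknown"  # missing data is kept visible rather than guessed
        elif row["returns"] / row["units_sold"] > threshold:
            outcome = "poor"
        else:
            outcome = "good"
        labeled.append({**row, "performance": outcome})
    return labeled

case_level = is_case_level(products, "product_id")
labeled = add_outcome(products)
```

Keeping missing return data as its own `unknown` value, instead of dropping the row, preserves the fact that a lack of data is itself information.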

Data that can immediately satisfy these questions is a good place to start for data mining, but you are not limited to such data. The data enrichment and data transformation steps allow you to massage data into a more useful format for data mining, and marginally acceptable data can be made useful through this manipulation.
