Strategies of Data Mining



Date: 08.10.2017

Data Cleaning


Data cleaning is the process of ensuring that, for data mining purposes, the data is uniform in terms of key and attribute usage. Aspects of data cleaning include identifying and correcting missing required information and cleaning up "orphan" records and broken keys.

Data cleaning is separate from data enrichment and data transformation because data cleaning attempts to correct misused or incorrect attributes in existing data. Data enrichment, by contrast, adds new attributes to existing data, while data transformation changes the form or structure of attributes in existing data to meet specific data mining requirements.
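The distinction can be sketched in a few lines of Python. This is only an illustration: the record, its field names, and the correction and bucketing rules are assumptions, not part of the original text.

```python
# Hypothetical record; all field names are illustrative assumptions.
record = {"customer_id": "C-001", "state": "wa ", "birth_year": 1980}

def clean(rec):
    """Cleaning: correct a malformed value in an existing attribute."""
    fixed = dict(rec)
    fixed["state"] = rec["state"].strip().upper()
    return fixed

def enrich(rec, as_of_year):
    """Enrichment: add a new attribute (here, one derived from existing data)."""
    richer = dict(rec)
    richer["age"] = as_of_year - rec["birth_year"]
    return richer

def transform(rec):
    """Transformation: change the form of an attribute (numeric age -> band)."""
    shaped = dict(rec)
    shaped["age_band"] = "under_40" if shaped.pop("age") < 40 else "40_plus"
    return shaped

# Each function returns a copy, so the original record is left untouched.
result = transform(enrich(clean(record), as_of_year=2017))
```

Cleaning leaves the set of attributes unchanged, enrichment adds one, and transformation replaces one representation with another.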

Typically, most data mining is performed on data that has already been processed for data warehousing purposes. However, some general guidelines for data cleaning are useful for situations in which a well-designed data warehouse is not available, and for applications in which business requirements call for cleaning of such data.

When cleaning data for data warehouses, the best place to start is at home; that is, clean data in the OLTP database first, rather than import bad data into a data warehouse and clean it afterward. This rule also applies to data mining, especially if you intend to construct a data mart for data mining purposes. Always try to clean data at the source, rather than try to model unsatisfactory data. Part of the "closed loop" in the decision support process should include data quality improvements, such as data entry guidelines and optimization of validation rules for OLTP data, and the data cleaning effort provides the information needed to enact such improvements.

Ideally, a temporary storage area can be used to handle the data cleaning, data enrichment, and data transformation steps. This gives you the flexibility to change not only the data itself, but also the metadata that frames the data. Data enrichment and transformation in particular, especially for the construction of new keys and relationships or conversion of data types, can benefit from this approach.
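A temporary storage area can be as simple as a copy of the source table that absorbs every change. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table and column names are hypothetical, and SQLite stands in for whatever database actually holds the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table with amounts stored as text.
    CREATE TABLE source_orders (order_id INTEGER, amount TEXT);
    INSERT INTO source_orders VALUES (1, '10.50'), (2, ' 7.25 '), (3, NULL);

    -- Staging copy: all later changes happen here, never in source_orders.
    CREATE TABLE staging_orders AS SELECT * FROM source_orders;

    -- Metadata change: add a column for the converted amount.
    ALTER TABLE staging_orders ADD COLUMN amount_num REAL;

    -- Data change: convert the text amount into the numeric column.
    UPDATE staging_orders
       SET amount_num = CAST(TRIM(amount) AS REAL)
     WHERE amount IS NOT NULL;
""")

staged = conn.execute(
    "SELECT order_id, amount_num FROM staging_orders ORDER BY order_id"
).fetchall()

# The source table keeps its original structure.
source_columns = [row[1] for row in conn.execute("PRAGMA table_info(source_orders)")]
```

Because the type conversion and the new column live only in the staging table, the source data and its structure remain available for comparison or for a fresh start.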

Cleaning data for data mining purposes usually requires the following steps:



1. Key consistency verification

Check that key values are consistent across all pertinent data, because these keys will most likely be used to identify cases or important attributes.



2. Relationship verification

Check that relationships between cases conform to defined business rules. Relationships that do not support defined business rules can skew the results of a data mining model, misleading the model into constructing patterns and rules that may not apply to a defined business scenario. 



3. Attribute usage and scope verification

Generally, the quality and accuracy of a data attribute are in direct proportion to the importance of the data to the business. For a manufacturing business that creates parts and products for the aerospace industry, for example, inventory information is crucial to the successful operation of the business, and will generally be more accurate and of higher quality than the contact information of the vendors that supply the inventory.

Check that the attributes used are being used as intended in the database, and that the scope or domain of selected attributes has meaning to the business scenario to be modeled. 


4. Attribute data analysis

Check that the values stored in attributes reasonably conform to defined business rules. As with attribute usage and scope verification, the data for less business-critical attributes typically requires more cleaning than the data for attributes vital to the successful operation of the business.

You should always be cautious about excluding or substituting values for empty attributes or missing data. Missing data does not always qualify as missing information. The lack of data for a specific cluster in a business scenario can reveal a great deal if you ask the right questions. Consequently, you should be cautious when excluding attributes or data elements from a training case set.
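The steps above can be sketched as a handful of small checks. This is a minimal illustration in plain Python, not a complete cleaning procedure: the sample rows, the attribute names, and the rule that a quantity must be positive are all assumptions. Only one simple kind of relationship rule (every order must reference an existing customer) is shown.

```python
# Hypothetical data; layouts and the business rule are illustrative assumptions.
customers = [
    {"customer_id": "C1", "region": "West"},
    {"customer_id": "C2", "region": "East"},
    {"customer_id": "C2", "region": "East"},  # duplicate key
]
orders = [
    {"order_id": 1, "customer_id": "C1", "quantity": 5, "discount": 0.10},
    {"order_id": 2, "customer_id": "C9", "quantity": 2, "discount": None},  # orphan
    {"order_id": 3, "customer_id": "C2", "quantity": -4, "discount": 0.00},  # bad value
]

def duplicate_keys(rows, key):
    """Step 1: return key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        if row[key] in seen:
            dupes.add(row[key])
        seen.add(row[key])
    return sorted(dupes)

def orphans(children, parents, key):
    """Step 2: return child rows whose key has no matching parent row."""
    parent_keys = {parent[key] for parent in parents}
    return [child for child in children if child[key] not in parent_keys]

def out_of_domain(rows, attribute, is_valid):
    """Steps 3 and 4: return rows whose non-missing value fails a domain rule."""
    return [r for r in rows if r[attribute] is not None and not is_valid(r[attribute])]

def flag_missing(rows, attribute):
    """Keep missing values, but record their absence as an explicit attribute."""
    return [{**r, attribute + "_missing": r[attribute] is None} for r in rows]

dupes = duplicate_keys(customers, "customer_id")
orphan_ids = [o["order_id"] for o in orphans(orders, customers, "customer_id")]
bad_quantity_ids = [o["order_id"] for o in out_of_domain(orders, "quantity", lambda q: q > 0)]
flagged = flag_missing(orders, "discount")
```

Note that `flag_missing` neither drops nor substitutes the empty discount; it keeps the row and exposes the absence as its own attribute, so that a later mining step can still treat "no value recorded" as information.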


Data cleaning efforts directly contribute to the overall success or failure of the data mining process. This step should never be skipped, no matter the cost in time or resources. Although Analysis Services works well with all forms of data, it works best when data is consistent and uniform.

Data Enrichment


Data enrichment is the process of adding new attributes, such as calculated fields or data from external sources, to existing data.

Most references on data mining tend to combine this step with data transformation. Data transformation involves the manipulation of data, but data enrichment involves adding information to existing data. This can include combining internal data with external data obtained from other departments, from other companies, or from vendors that sell standardized industry-relevant data.

Data enrichment is an important step if you are attempting to mine marginally acceptable data. You can add information to such data from standardized external industry sources to make the data mining process more successful and reliable, or provide additional derived attributes for a better understanding of indirect relationships. For example, data warehouses frequently provide preaggregation across business lines that share common attributes for cross-selling analysis purposes.
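Both kinds of enrichment, a derived attribute and an attribute taken from an external source, can be sketched as follows. The store rows, the postal-code lookup, and the income figures are made-up placeholder values used only for illustration.

```python
# Hypothetical internal data; all names and values are illustrative placeholders.
sales = [
    {"store_id": "S1", "postal_code": "98101", "revenue": 1200.0, "units": 40},
    {"store_id": "S2", "postal_code": "10001", "revenue": 900.0, "units": 0},
    {"store_id": "S3", "postal_code": "00000", "revenue": 300.0, "units": 10},
]
# Stand-in for data supplied by a vendor or by another department.
external_demographics = {
    "98101": {"median_income": 75000},
    "10001": {"median_income": 82000},
}

def enrich(rows, reference):
    """Return copies of rows with one derived and one external attribute added."""
    enriched = []
    for row in rows:
        new_row = dict(row)
        # Derived attribute: average price per unit, None when no units were sold.
        new_row["avg_unit_price"] = row["revenue"] / row["units"] if row["units"] else None
        # External attribute: None when the reference data has no match.
        match = reference.get(row["postal_code"], {})
        new_row["median_income"] = match.get("median_income")
        enriched.append(new_row)
    return enriched

enriched_sales = enrich(sales, external_demographics)
```

Rows without a match in the external data are kept with an empty value rather than dropped, which is consistent with the earlier caution about excluding missing data.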

As with data cleaning and data transformation, this step is best handled in a temporary storage area. Data enrichment, in particular the combination of external data sources with data to be mined, can require a number of updates to both data and metadata, and such updates are generally not acceptable in an established data warehouse.


