HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 29
#29
1.7 Major Issues in Data Mining
29
1.7
Major Issues in Data Mining
Life is short but art is long. – Hippocrates
Data mining is a dynamic and fast-expanding field with great strengths. In this section,
we briefly outline the major issues in data mining research, partitioning them into
five groups: mining methodology, user interaction, efficiency and scalability, diversity of
data types, and
data mining and society. Many of these issues have been addressed in
recent data mining research and development to a certain extent and are now consid-
ered data mining requirements; others are still at the research stage. The issues continue
to stimulate further investigation and improvement in data mining.
1.7.1
Mining Methodology
Researchers have been vigorously developing new data mining methodologies. This
involves the investigation of new kinds of knowledge, mining in multidimensional
space, integrating methods from other disciplines, and the consideration of semantic ties
among data objects. In addition, mining methodologies should consider issues such as
data uncertainty, noise, and incompleteness. Some mining methods explore how user-
specified measures can be used to assess the interestingness of discovered patterns as
well as guide the discovery process. Let’s have a look at these various aspects of mining
methodology.
Mining various and new kinds of knowledge: Data mining covers a wide spectrum of
data analysis and knowledge discovery tasks, from data characterization and discrim-
ination to association and correlation analysis, classification, regression, clustering,
outlier analysis, sequence analysis, and trend and evolution analysis. These tasks may
use the same database in different ways and require the development of numerous
data mining techniques. Due to the diversity of applications, new mining tasks con-
tinue to emerge, making data mining a dynamic and fast-growing field. For example,
for effective knowledge discovery in information networks, integrated clustering and
ranking may lead to the discovery of high-quality clusters and object ranks in large
networks.
Mining knowledge in multidimensional space: When searching for knowledge in large
data sets, we can explore the data in multidimensional space. That is, we can search
for interesting patterns among combinations of dimensions (attributes) at varying
levels of abstraction. Such mining is known as (exploratory) multidimensional data
mining. In many cases, data can be aggregated or viewed as a multidimensional data
cube. Mining knowledge in cube space can substantially enhance the power and
flexibility of data mining.
Data mining—an interdisciplinary effort: The power of data mining can be substan-
tially enhanced by integrating new methods from multiple disciplines. For example,
HAN
08-ch01-001-038-9780123814791
2011/6/1
3:12
Page 30
#30
30
Chapter 1 Introduction
to mine data with natural language text, it makes sense to fuse data mining methods
with methods of information retrieval and natural language processing. As another
example, consider the mining of software bugs in large programs. This form of min-
ing, known as bug mining, benefits from the incorporation of software engineering
knowledge into the data mining process.
Boosting the power of discovery in a networked environment: Most data objects reside
in a linked or interconnected environment, whether it be the Web, database rela-
tions, files, or documents. Semantic links across multiple data objects can be used
to advantage in data mining. Knowledge derived in one set of objects can be used
to boost the discovery of knowledge in a “related” or semantically linked set of
objects.
Handling uncertainty, noise, or incompleteness of data: Data often contain noise,
errors, exceptions, or uncertainty, or are incomplete. Errors and noise may confuse
the data mining process, leading to the derivation of erroneous patterns. Data clean-
ing, data preprocessing, outlier detection and removal, and uncertainty reasoning are
examples of techniques that need to be integrated with the data mining process.
Pattern evaluation and pattern- or constraint-guided mining: Not all the patterns gen-
erated by data mining processes are interesting. What makes a pattern interesting
may vary from user to user. Therefore, techniques are needed to assess the inter-
estingness of discovered patterns based on subjective measures. These estimate the
value of patterns with respect to a given user class, based on user beliefs or expec-
tations. Moreover, by using interestingness measures or user-specified constraints to
guide the discovery process, we may generate more interesting patterns and reduce
the search space.
1.7.2
User Interaction
The user plays an important role in the data mining process. Interesting areas of research
include how to interact with a data mining system, how to incorporate a user’s back-
ground knowledge in mining, and how to visualize and comprehend data mining results.
We introduce each of these here.
Interactive mining: The data mining process should be highly
interactive. Thus, it is
important to build flexible user interfaces and an exploratory mining environment,
facilitating the user’s interaction with the system. A user may like to first sample a
set of data, explore general characteristics of the data, and estimate potential min-
ing results. Interactive mining should allow users to dynamically change the focus
of a search, to refine mining requests based on returned results, and to drill, dice,
and pivot through the data and knowledge space interactively, dynamically exploring
“cube space” while mining.
Incorporation of background knowledge: Background knowledge, constraints, rules,
and other information regarding the domain under study should be incorporated