Top 10 algorithms in data mining
item in the order of increasing frequency and extracting frequent itemsets that contain the
chosen item by recursively calling itself on the conditional FP-tree. FP-growth is an order of
magnitude faster than the original Apriori algorithm.
There are several other dimensions along which frequent pattern mining has been extended. The major ones include the following. (1) Incorporating a taxonomy over items [72]: using a taxonomy makes it possible to extract frequent itemsets expressed in terms of higher-level concepts even when the base-level concepts produce only infrequent itemsets. (2) Incremental mining: in this setting the database is not assumed to be stationary, and new transactions keep arriving; the algorithm in [12] updates the frequent itemsets without restarting from scratch. (3) Using numeric variables as items: when an item corresponds to a continuous numeric value, current frequent itemset mining algorithms are not applicable unless the values are discretized. A subspace clustering method can be used to obtain an optimal value interval for each item in each itemset [85]. (4) Using measures other than frequency, such as information gain or the χ² value: these measures are useful for finding discriminative patterns but unfortunately do not satisfy the anti-monotonicity property. However, they have the nice property of being convex with respect to their arguments, so their upper bound over all supersets of a pattern can be estimated and unpromising patterns pruned efficiently. AprioriSMP uses this principle [59]. (5) Using richer expressions than itemsets: many algorithms have been proposed for sequences, trees, and graphs to enable mining from more complex data structures [90,42]. (6) Closed itemsets: a frequent itemset is closed if no proper superset has the same support. Thus, once the closed itemsets and their supports are found, all the frequent itemsets can be derived from them. LCM is the most efficient algorithm for finding the closed itemsets [82].
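To make the closed-itemset idea concrete, here is a minimal brute-force sketch in Python (a didactic illustration, not the LCM algorithm; the transactions and support threshold are invented for the example). An itemset is retained as closed only when no proper superset has the same support.

```python
from itertools import combinations

def closed_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets, then filtering to
    the closed ones (no proper superset with equal support).
    Exponential in the number of items; for illustration only."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= min_support:
                support[cand] = s
    closed = {
        iset: s
        for iset, s in support.items()
        if not any(set(iset) < set(sup) and support[sup] == s
                   for sup in support)
    }
    return support, closed

# Hypothetical toy database: 'b' never occurs without 'a',
# so {'b'} is frequent but not closed ({a, b} has the same support).
transactions = [{"a", "b"}, {"a", "b"}, {"a"}]
frequent, closed = closed_frequent_itemsets(transactions, min_support=2)
```

Here `frequent` contains three itemsets but `closed` only two; the support of the non-closed itemset `('b',)` is recoverable from its closed superset `('a', 'b')`.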
5 The EM algorithm
Finite mixture distributions provide a flexible, mathematically grounded approach to the modeling and clustering of data observed on random phenomena. We focus here on normal mixture models, which can be used to cluster continuous data and to estimate the underlying density function. These mixture models can be fitted by maximum likelihood via the EM (Expectation–Maximization) algorithm.
5.1 Introduction
Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster data sets [57]. Here we consider their application in the context of cluster analysis.
We let the $p$-dimensional vector $y = (y_1, \ldots, y_p)^T$ contain the values of $p$ variables measured on each of $n$ (independent) entities to be clustered, and we let $y_j$ denote the value of $y$ corresponding to the $j$th entity ($j = 1, \ldots, n$). With the mixture approach to clustering, $y_1, \ldots, y_n$ are assumed to be an observed random sample from a mixture of a finite number, say $g$, of groups in some unknown proportions $\pi_1, \ldots, \pi_g$.
The mixture density of $y_j$ is expressed as
$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i) \qquad (j = 1, \ldots, n), \tag{3}$$
where the mixing proportions $\pi_1, \ldots, \pi_g$ sum to one and the group-conditional density $f_i(y_j; \theta_i)$ is specified up to a vector $\theta_i$ of unknown parameters ($i = 1, \ldots, g$). The vector of all the unknown parameters is given by
$$\Psi = (\pi_1, \ldots, \pi_{g-1}, \theta_1^T, \ldots, \theta_g^T)^T,$$
where the superscript “T” denotes vector transpose. Using an estimate of $\Psi$, this approach gives a probabilistic clustering of the data into $g$ clusters in terms of estimates of the posterior probabilities of component membership,
$$\tau_i(y_j; \Psi) = \frac{\pi_i f_i(y_j; \theta_i)}{f(y_j; \Psi)}, \tag{4}$$
where $\tau_i(y_j; \Psi)$ is the posterior probability that $y_j$ (really the entity with observation $y_j$) belongs to the $i$th component of the mixture ($i = 1, \ldots, g$; $j = 1, \ldots, n$).
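For concreteness, the mixture density (3) and the posterior probabilities (4) can be evaluated directly. The sketch below does this for a univariate ($p = 1$) normal mixture; the two-component parameter values are hypothetical, chosen only to illustrate the computation.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Univariate normal density (the p = 1 component density)."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def mixture_density(y, pi, mu, sigma2):
    """f(y; Psi) = sum_i pi_i * f_i(y; theta_i), as in Eq. (3)."""
    return sum(p * normal_pdf(y, m, s2) for p, m, s2 in zip(pi, mu, sigma2))

def posteriors(y, pi, mu, sigma2):
    """tau_i(y; Psi) = pi_i f_i(y; theta_i) / f(y; Psi), as in Eq. (4)."""
    f = mixture_density(y, pi, mu, sigma2)
    return [p * normal_pdf(y, m, s2) / f for p, m, s2 in zip(pi, mu, sigma2)]

# Hypothetical two-component model: proportions 0.4/0.6, means 0 and 3, unit variances.
pi, mu, sigma2 = [0.4, 0.6], [0.0, 3.0], [1.0, 1.0]
tau = posteriors(1.0, pi, mu, sigma2)  # responsibilities for an observation y = 1.0
```

The posteriors sum to one by construction, and an observation at $y = 1.0$ is assigned mostly to the component centered at 0, as expected.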
The parameter vector $\Psi$ can be estimated by maximum likelihood. The maximum likelihood estimate (MLE) of $\Psi$, $\hat{\Psi}$, is given by an appropriate root of the likelihood equation,
$$\partial \log L(\Psi)/\partial \Psi = 0, \tag{5}$$
where
$$\log L(\Psi) = \sum_{j=1}^{n} \log f(y_j; \Psi) \tag{6}$$
is the log likelihood function for $\Psi$. Solutions of (5) corresponding to local maximizers can be obtained via the expectation–maximization (EM) algorithm [17].
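A compact EM loop for a univariate two-component normal mixture shows how iterating the E- and M-steps climbs the log likelihood (6). This is a didactic sketch under invented data and starting values: a real implementation would add a convergence test and safeguards against degenerate (near-zero-variance) components.

```python
import math

def log_likelihood(y, pi, mu, sigma2):
    """log L(Psi) = sum_j log f(y_j; Psi), Eq. (6), with normal components."""
    total = 0.0
    for yj in y:
        f = sum(pi[i] * math.exp(-(yj - mu[i]) ** 2 / (2 * sigma2[i]))
                / math.sqrt(2 * math.pi * sigma2[i]) for i in range(len(pi)))
        total += math.log(f)
    return total

def em_step(y, pi, mu, sigma2):
    """One EM iteration: E-step computes the posterior probabilities tau_ij;
    M-step re-estimates pi_i, mu_i, sigma2_i as tau-weighted averages."""
    n, g = len(y), len(pi)
    tau = []
    for yj in y:
        w = [pi[i] * math.exp(-(yj - mu[i]) ** 2 / (2 * sigma2[i]))
             / math.sqrt(2 * math.pi * sigma2[i]) for i in range(g)]
        s = sum(w)
        tau.append([wi / s for wi in w])
    pi_new, mu_new, s2_new = [], [], []
    for i in range(g):
        ti = sum(tau[j][i] for j in range(n))
        pi_new.append(ti / n)
        mu_new.append(sum(tau[j][i] * y[j] for j in range(n)) / ti)
        s2_new.append(sum(tau[j][i] * (y[j] - mu_new[i]) ** 2 for j in range(n)) / ti)
    return pi_new, mu_new, s2_new

# Two well-separated hypothetical clusters around -1 and 4.
y = [-1.2, -0.8, -1.0, 3.9, 4.1, 4.0]
pi, mu, sigma2 = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
for _ in range(30):
    pi, mu, sigma2 = em_step(y, pi, mu, sigma2)
```

With this separation the component means converge to roughly $-1$ and $4$, and the mixing proportions to $0.5$ each; each iteration is guaranteed not to decrease (6), which is the key property of EM discussed below.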
For the modeling of continuous data, the component-conditional densities are usually taken to belong to the same parametric family, for example, the normal. In this case,
$$f_i(y_j; \theta_i) = \phi(y_j; \mu_i, \Sigma_i), \tag{7}$$
where $\phi(y_j; \mu, \Sigma)$ denotes the $p$-dimensional multivariate normal density with mean vector $\mu$ and covariance matrix $\Sigma$.
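As a sketch of the component density $\phi$ in (7), the following writes it out for $p = 2$ with the $2 \times 2$ determinant and inverse computed by hand; a linear-algebra library would handle general $p$, and the example inputs are arbitrary.

```python
import math

def mvn_density_2d(y, mu, cov):
    """phi(y; mu, Sigma) for p = 2: exp(-(y-mu)^T Sigma^{-1} (y-mu)/2)
    divided by (2*pi)^{p/2} * sqrt(det Sigma), with the 2x2 inverse
    and determinant written out explicitly for illustration."""
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    u = [y[0] - mu[0], y[1] - mu[1]]
    quad = (u[0] * (inv[0][0] * u[0] + inv[0][1] * u[1])
            + u[1] * (inv[1][0] * u[0] + inv[1][1] * u[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# At the mean, the density reduces to 1 / (2*pi*sqrt(det Sigma)).
val = mvn_density_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```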
One attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or $t$ densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.
5.2 Maximum likelihood estimation of normal mixtures
McLachlan and Peel [57, Chap. 3] described the E- and M-steps of the EM algorithm for the maximum likelihood (ML) estimation of multivariate normal components; see also [56]. In the EM framework for this problem, the unobservable component labels $z_{ij}$ are treated as the “missing” data, where $z_{ij}$ is defined to be one or zero according as $y_j$ belongs or does not belong to the $i$th component of the mixture ($i = 1, \ldots, g$; $j = 1, \ldots, n$).
On the $(k+1)$th iteration of the EM algorithm, the E-step requires taking the expectation of the complete-data log likelihood $\log L_c(\Psi)$, given the current estimate $\Psi^{(k)}$ for $\Psi$. As $\log L_c(\Psi)$ is linear in the unobservable $z_{ij}$, this E-step is effected by replacing the $z_{ij}$ by their conditional expectation given the observed data $y_j$, using $\Psi^{(k)}$. That is, $z_{ij}$ is replaced by $\tau_{ij}^{(k)}$, which is the posterior probability that $y_j$ belongs to the $i$th component of the mixture, using the