Top 10 algorithms in data mining
1. Generate C_{k+1}, the candidates of frequent itemsets of size k+1, from the frequent itemsets of size k.
2. Scan the database and calculate the support of each candidate frequent itemset.
3. Add those itemsets that satisfy the minimum support requirement to F_{k+1}.
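These three steps can be sketched as a level-wise loop in Python. This is a minimal illustration with assumed names and a simplified candidate-generation step, not the paper's pseudocode:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (sketch of the three steps above).

    `transactions` is a list of sets of items; `min_support` is an
    absolute count. Candidate generation is a simplified stand-in for
    apriori-gen.
    """
    # F_1: frequent itemsets of size 1
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = set(frequent)

    k = 1
    while frequent:
        # Step 1: generate C_{k+1} from F_k by joining itemsets ...
        candidates = {p | q for p in frequent for q in frequent
                      if len(p | q) == k + 1}
        # ... and pruning: every k-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Step 2: scan the database and count the support of each candidate
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        # Step 3: keep candidates meeting the minimum support as F_{k+1}
        frequent = {c for c, cnt in counts.items() if cnt >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```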
The Apriori algorithm is shown in Fig. 3. Function apriori-gen in line 3 generates C_{k+1} from F_k in the following two-step process:
1. Join step: Generate R_{k+1}, the initial candidates of frequent itemsets of size k+1, by taking the union of two frequent itemsets of size k, P_k and Q_k, that have the first k−1 elements in common:

   R_{k+1} = P_k ∪ Q_k = {item_1, ..., item_{k−1}, item_k, item_k'}
   P_k = {item_1, item_2, ..., item_{k−1}, item_k}
   Q_k = {item_1, item_2, ..., item_{k−1}, item_k'}

   where item_1 < item_2 < ... < item_{k−1} < item_k < item_k'.
2. Prune step: Check whether all the itemsets of size k in R_{k+1} are frequent, and generate C_{k+1} by removing from R_{k+1} those that do not pass this requirement. This is because any itemset of size k+1 that contains a subset of size k that is not frequent cannot itself be a frequent itemset of size k+1.
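A minimal sketch of these join and prune steps, assuming each itemset is kept as a sorted tuple so that item_1 < ... < item_k holds by construction (the function name mirrors apriori-gen, but the details are illustrative):

```python
from itertools import combinations

def apriori_gen(frequent_k, k):
    """Generate C_{k+1} from F_k via the join and prune steps above.

    `frequent_k` is a collection of frequent itemsets of size k, each a
    sorted tuple of items.
    """
    freq = set(frequent_k)
    candidates = []
    for p in freq:
        for q in freq:
            # Join step: P_k and Q_k share the first k-1 items and
            # differ only in the last one; R_{k+1} has size k+1.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                r = p + (q[-1],)
                # Prune step: every k-subset of r must be frequent
                if all(s in freq for s in combinations(r, k)):
                    candidates.append(r)
    return candidates
```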
Function subset in line 5 finds all the candidate frequent itemsets that are included in transaction t. Apriori then calculates frequency only for those candidates generated this way by scanning the database.
It is evident that Apriori scans the database at most k_max + 1 times when the maximum size of frequent itemsets is set at k_max.
Apriori achieves good performance by reducing the size of candidate sets (Fig. 3).
However, in situations with very many frequent itemsets, large itemsets, or very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and scanning the database repeatedly to check a large set of candidate itemsets. In fact, it is necessary to generate 2^100 candidate itemsets to obtain frequent itemsets of size 100.

Fig. 3 Apriori algorithm
4.2 The impact of the algorithm
Many of the pattern-finding algorithms frequently used in data mining, such as decision trees, classification rules, and clustering techniques, were developed in the machine learning research community. Frequent pattern and association rule mining is one of the few exceptions to this tradition. The introduction of this technique boosted data mining research, and its impact has been tremendous. The algorithm is quite simple and easy to implement; experimenting with an Apriori-like algorithm is often the first thing data miners try.
4.3 Current and further research
Since the Apriori algorithm was first introduced, and as experience has accumulated, there have been many attempts to devise more efficient algorithms for frequent itemset mining. Many of them share the same idea as Apriori in that they generate candidates. These include hash-based techniques, partitioning, sampling, and the use of a vertical data format. A hash-based technique can reduce the size of the candidate itemsets: each itemset is hashed into a corresponding bucket by an appropriate hash function, and since a bucket can contain different itemsets, if its count is less than the minimum support, all the itemsets in that bucket can be removed from the candidate sets. Partitioning can be used to divide the entire mining problem into n smaller problems: the dataset is divided into n non-overlapping partitions such that each partition fits into main memory, and each partition is mined separately. Since any itemset that is potentially frequent with respect to the entire dataset must occur as a frequent itemset in at least one of the partitions, all the frequent itemsets found this way are candidates, which can be checked by accessing the entire dataset only once. Sampling simply mines a small random sample of the entire data; since there is no guarantee that all the frequent itemsets will be found, normal practice is to use a lower support threshold, and a trade-off has to be made between accuracy and efficiency. Apriori uses a horizontal data format, i.e., frequent itemsets are associated with each transaction. The vertical data format instead associates transaction IDs (TIDs) with each itemset. With this format, mining can be performed by taking the intersection of TID sets; the support count of an itemset is simply the length of its TID set, and there is no need to scan the database because the TID sets carry the complete information required for computing support.
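The vertical-format support computation can be sketched as follows. The toy TID sets and names here are hypothetical, chosen only to illustrate the intersection idea:

```python
def support_vertical(tidsets, itemset):
    """Support of an itemset in the vertical data format: intersect the
    TID sets of its items and return the size of the result."""
    tids = set.intersection(*(tidsets[item] for item in itemset))
    return len(tids)

# Hypothetical toy database in vertical format: item -> set of TIDs
tidsets = {
    "a": {1, 2, 3, 5},
    "b": {1, 2, 4, 5},
    "c": {1, 3, 4, 5},
}
```

Because each support count is a set intersection on the stored TID sets, no further database scans are needed once the vertical representation has been built.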
The most outstanding improvement over Apriori is a method called FP-growth (frequent pattern growth) that succeeded in eliminating candidate generation [36]. It adopts a divide-and-conquer strategy by (1) compressing the database representing frequent items into a structure called an FP-tree (frequent pattern tree) that retains all the essential information and (2) dividing the compressed database into a set of conditional databases, each associated with one frequent itemset, and mining each one separately. It scans the database only twice. In the first scan, all the frequent items and their support counts (frequencies) are derived, and the items in each transaction are sorted in descending order of support count. In the second scan, the items in each transaction are merged into a prefix tree, and items (nodes) that appear in common in different transactions are counted. Each node is associated with an item and its count. Nodes with the same label are linked by a pointer called a node-link. Since items are sorted in descending order of frequency, nodes closer to the root of the prefix tree are shared by more transactions, thus resulting in a very compact representation that stores all the necessary information. The pattern-growth algorithm works on the FP-tree by choosing an