Top 10 algorithms in data mining
1. Generate C_{k+1}, the candidates of frequent itemsets of size k+1, from the frequent itemsets of size k.
2. Scan the database and calculate the support of each candidate frequent itemset.
3. Add those itemsets that satisfy the minimum support requirement to F_{k+1}.
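These three steps can be sketched as a level-wise loop in Python. This is a minimal illustration with assumed names and a simplified candidate-generation step, not the paper's pseudocode:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (sketch of the three steps above).

    `transactions` is a list of sets of items; `min_support` is an
    absolute count. Candidate generation is a simplified stand-in for
    apriori-gen.
    """
    # F_1: frequent itemsets of size 1
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = set(frequent)

    k = 1
    while frequent:
        # Step 1: generate C_{k+1} from F_k by joining itemsets ...
        candidates = {p | q for p in frequent for q in frequent
                      if len(p | q) == k + 1}
        # ... and pruning: every k-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Step 2: scan the database and count the support of each candidate
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        # Step 3: keep candidates meeting the minimum support as F_{k+1}
        frequent = {c for c, cnt in counts.items() if cnt >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```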
The Apriori algorithm is shown in Fig. 3. Function apriori-gen in line 3 generates C_{k+1} from F_k in the following two-step process:
1. Join step: Generate R_{k+1}, the initial candidates of frequent itemsets of size k+1, by taking the union of two frequent itemsets of size k, P_k and Q_k, that have the first k−1 elements in common:

   R_{k+1} = P_k ∪ Q_k = {item_1, ..., item_{k−1}, item_k, item_k'}
   P_k = {item_1, item_2, ..., item_{k−1}, item_k}
   Q_k = {item_1, item_2, ..., item_{k−1}, item_k'}

   where item_1 < item_2 < ... < item_{k−1} < item_k < item_k'.
2. Prune step: Check whether all the itemsets of size k in R_{k+1} are frequent, and generate C_{k+1} by removing from R_{k+1} those that do not pass this requirement. This is because any itemset of size k+1 that contains a subset of size k that is not frequent cannot itself be a frequent itemset of size k+1.
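A minimal sketch of these join and prune steps, assuming each itemset is kept as a sorted tuple so that item_1 < ... < item_k holds by construction (the function name mirrors apriori-gen, but the details are illustrative):

```python
from itertools import combinations

def apriori_gen(frequent_k, k):
    """Generate C_{k+1} from F_k via the join and prune steps above.

    `frequent_k` is a collection of frequent itemsets of size k, each a
    sorted tuple of items.
    """
    freq = set(frequent_k)
    candidates = []
    for p in freq:
        for q in freq:
            # Join step: P_k and Q_k share the first k-1 items and
            # differ only in the last one; R_{k+1} has size k+1.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                r = p + (q[-1],)
                # Prune step: every k-subset of r must be frequent
                if all(s in freq for s in combinations(r, k)):
                    candidates.append(r)
    return candidates
```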
Function subset in line 5 finds all the candidate frequent itemsets that are included in transaction t. Apriori then calculates frequency only for those candidates generated this way by scanning the database.
It is evident that Apriori scans the database at most k_max + 1 times when the maximum size of frequent itemsets is set at k_max.
Apriori achieves good performance by reducing the size of candidate sets (Fig. 3).
However, in situations with very many frequent itemsets, large itemsets, or very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and scanning the database repeatedly to check a large set of candidate itemsets. In fact, it is necessary to generate 2^100 candidate itemsets to obtain frequent itemsets of size 100.

Fig. 3 Apriori algorithm
4.2 The impact of the algorithm
Many of the pattern-finding algorithms frequently used in data mining, such as decision trees, classification rules, and clustering techniques, were developed in the machine learning research community. Frequent pattern and association rule mining is one of the few exceptions to this tradition. The introduction of this technique boosted data mining research, and its impact has been tremendous. The algorithm is quite simple and easy to implement; experimenting with an Apriori-like algorithm is often the first thing data miners try.
4.3 Current and further research
Since the Apriori algorithm was first introduced, and as experience has accumulated, there have been many attempts to devise more efficient algorithms for frequent itemset mining. Many of them share the same idea as Apriori in that they generate candidates. These include hash-based techniques, partitioning, sampling, and the use of a vertical data format. A hash-based technique can reduce the size of the candidate itemsets: each itemset is hashed into a corresponding bucket by an appropriate hash function, and since a bucket can contain different itemsets, if its count is less than the minimum support, all the itemsets in that bucket can be removed from the candidate sets. Partitioning can be used to divide the entire mining problem into n smaller problems: the dataset is divided into n non-overlapping partitions such that each partition fits into main memory, and each partition is mined separately. Since any itemset that is potentially frequent with respect to the entire dataset must occur as a frequent itemset in at least one of the partitions, all the frequent itemsets found this way are candidates, which can be checked by accessing the entire dataset only once. Sampling simply mines a small random sample of the entire data; since there is no guarantee that all the frequent itemsets will be found, normal practice is to use a lower support threshold, and a trade-off has to be made between accuracy and efficiency. Apriori uses a horizontal data format, i.e., frequent itemsets are associated with each transaction. The vertical data format instead associates transaction IDs (TIDs) with each itemset. With this format, mining can be performed by taking the intersection of TID sets; the support count of an itemset is simply the length of its TID set, and there is no need to scan the database because the TID sets carry the complete information required for computing support.
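The vertical-format support computation can be sketched as follows. The toy TID sets and names here are hypothetical, chosen only to illustrate the intersection idea:

```python
def support_vertical(tidsets, itemset):
    """Support of an itemset in the vertical data format: intersect the
    TID sets of its items and return the size of the result."""
    tids = set.intersection(*(tidsets[item] for item in itemset))
    return len(tids)

# Hypothetical toy database in vertical format: item -> set of TIDs
tidsets = {
    "a": {1, 2, 3, 5},
    "b": {1, 2, 4, 5},
    "c": {1, 3, 4, 5},
}
```

Because each support count is a set intersection on the stored TID sets, no further database scans are needed once the vertical representation has been built.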
The most outstanding improvement over Apriori is a method called FP-growth (frequent pattern growth) that succeeded in eliminating candidate generation [36]. It adopts a divide-and-conquer strategy by (1) compressing the database representing frequent items into a structure called an FP-tree (frequent pattern tree) that retains all the essential information and (2) dividing the compressed database into a set of conditional databases, each associated with one frequent itemset, and mining each one separately. It scans the database only twice. In the first scan, all the frequent items and their support counts (frequencies) are derived, and the items in each transaction are sorted in descending order of support count. In the second scan, the items in each transaction are merged into a prefix tree, and items (nodes) that appear in common in different transactions are counted. Each node is associated with an item and its count. Nodes with the same label are linked by a pointer called a node-link. Since items are sorted in descending order of frequency, nodes closer to the root of the prefix tree are shared by more transactions, thus resulting in a very compact representation that stores all the necessary information. The pattern-growth algorithm works on the FP-tree by choosing an