Top 10 algorithms in data mining
33
Based largely on the prior work of CART co-authors Richard Olshen and Charles Stone, the
final three chapters of the monograph relate CART to theoretical work on nearest neighbors
and show that as the sample size tends to infinity the following hold: (1) the estimates of
the regression function converge to the true function, and (2) the risks of the terminal nodes
converge to the risks of the corresponding Bayes rules. In other words, speaking informally,
with large enough samples the CART tree will converge to the true function relating the
target to its predictors and achieve the smallest cost possible (the Bayes rate). Practically
speaking. such results may only be realized with sample sizes far larger than in common use
today.
10.11 Selected biographical details
CART is often thought to have originated from the field of Statistics but this is only
partially correct. Jerome Friedman completed his PhD in Physics at UC Berkeley and
became leader of the Numerical Methods Group at the Stanford Linear Accelerator Center
in 1972, where he focused on problems in computation. One of his most influential papers
from 1975 presents a state-of-the-art algorithm for high speed searches for nearest neighbors
in a database. Richard Olshen earned his BA at UC Berkeley and PhD in Statistics at Yale
and focused his earliest work on large sample theory for recursive partitioning. He began his
collaboration with Friedman after joining the Stanford Linear Accelerator Center in 1974.
Leo Breiman earned his BA in Physics at the California Institute of Technology, his PhD
in Mathematics at UC Berkeley, and made notable contributions to pure probability theory
(Breiman, 1968) [
7
] while a Professor at UCLA. In 1967 he left academia for 13 years to
work as an industrial consultant; during this time he encountered the military data analysis
problems that inspired his contributions to CART. An interview with Leo Breiman discussing
his career and personal life appears in [
60
].
Charles Stone earned his BA in mathematics at the California Institute of Technology,
and his PhD in Statistics at Stanford. He pursued probability theory in his early years as an
academic and is the author of several celebrated papers in probability theory and nonpara-
metric regression. He worked with Breiman at UCLA and was drawn by Breiman into the
research leading to CART in the early 1970s. Breiman and Friedman first met at an Interface
conference in 1976, which shortly led to collaboration involving all four co-authors. The
first outline of their book was produced in a memo dated 1978 and the completed CART
monograph was published in 1984.
The four co-authors have each been distinguished for their work outside of CART. Stone
and Breiman were elected to the National Academy of Sciences (in 1993 and 2001, respec-
tively) and Friedman was elected to the American Academy of Arts and Sciences in 2006.
The specific work for which they were honored can be found on the respective academy
websites. Olshen is a Fellow of the Institute of Mathematical Statistics, a Fellow of the IEEE,
and Fellow, American Association for the Advancement of Science.
Data mining is a broad area that integrates techniques from several fields including machine
learning, statistics, pattern recognition, artificial intelligence, and database systems, for the
analysis of large volumes of data. There have been a large number of data mining algo-
rithms rooted in these fields to perform different data analysis tasks. The 10 algorithms
identified by the IEEE International Conference on Data Mining (ICDM) and presented in
123
34
X. Wu et al.
this article are among the most influential algorithms for classification [
47
,
51
,
77
], clustering
[
11
,
31
,
40
,
44
–
46
], statistical learning [
28
,
76
,
92
], association analysis [
2
,
6
,
13
,
50
,
54
,
74
],
and link mining.
With a formal tie with the ICDM conference, Knowledge and Information Systems has
been publishing the best papers from ICDM every year, and several of the above papers cited
for classification, clustering, statistical learning, and association analysis were selected by
the previous years’ ICDM program committees for journal publication in Knowledge and
Information Systems after their revisions and expansions. We hope this survey paper can
inspire more researchers in data mining to further explore these top-10 algorithms, including
their impact and new research issues.
Acknowledgments
The initiative of identifying the top-10 data mining algorithms started in May 2006 out
of a discussion between Dr. Jiannong Cao in the Department of Computing at the Hong Kong Polytechnic
University (PolyU) and Dr. Xindong Wu, when Dr. Wu was giving a seminar on 10 Challenging Problems in
Data Mining Research [
89
] at PolyU. Dr. Wu and Dr. Kumar continued this discussion at KDD-06 in August
2006 with various people, and received very enthusiastic support. Naila Elliott in the Department of Computer
Science and Engineering at the University of Minnesota collected and compiled the algorithm nominations
and voting results in the 3-step identification process. Yan Zhang in the Department of Computer Science at
the University of Vermont converted the 10 section submissions in different formats into the same LaTeX
format, which was a time-consuming process.
References
1. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th
VLDB conference, pp 487–499
2. Ahmed S, Coenen F, Leng PH (2006) Tree-based partitioning of date for association rule mining. Knowl
Inf Syst 10(3):315–331
3. Banerjee A, Merugu S, Dhillon I, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn
Res 6:1705–1749
4. Bezdek JC, Chuah SK, Leep D (1986) Generalized k-nearest neighbor rules. Fuzzy Sets Syst 18(3):237–
256.
http://dx.doi.org/10.1016/0165-0114(86)90004-7
5. Bloch DA, Olshen RA, Walker MG (2002) Risk estimation for classification trees. J Comput Graph Stat
11:263–288
6. Bonchi F, Lucchese C (2006) On condensed representations of constrained frequent patterns. Knowl Inf
Syst 9(2):180–201
7. Breiman L (1968) Probability theory. Addison-Wesley, Reading. Republished (1991) in Classics of math-
ematics. SIAM, Philadelphia
8. Breiman L (1999) Prediction games and arcing classifiers. Neural Comput 11(7):1493–1517
9. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth,
Belmont
10. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web Search Sngine. Comput Networks
30(1–7):107–117
11. Chen JR (2007) Making clustering in delay-vector space meaningful. Knowl Inf Syst 11(3):369–385
12. Cheung DW, Han J, Ng V, Wong CY (1996) Maintenance of discovered association rules in large databases:
an incremental updating technique. In: Proceedings of the ACM SIGMOD international conference on
management of data, pp. 13–23
13. Chi Y, Wang H, Yu PS, Muntz RR (2006) Catch the moment: maintaining closed frequent itemsets over
a data stream sliding window. Knowl Inf Syst 10(3):265–294
14. Cost S, Salzberg S (1993) A weighted nearest neighbor algorithm for learning with symbolic features.
Mach Learn 10:57.78 (PEBLS: Parallel Examplar-Based Learning System)
15. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inform Theory 13(1):21–27
16. Dasarathy BV (ed) (1991) Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE
Computer Society Press
17. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algo-
rithm (with discussion). J Roy Stat Soc B 39:1–38
123
Top 10 algorithms in data mining
35
18. Devroye L, Gyorfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York.
ISBN 0-387-94618-7
19. Dhillon IS, Guan Y, Kulis B (2004) Kernel k-means: spectral clustering and normalized cuts. KDD 2004,
pp 551–556
20. Dietterich TG (1997) Machine learning: Four current directions. AI Mag 18(4):97–136
21. Domingos P (1999) MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings
of the fifth international conference on knowledge discovery and data mining, pp 155–164
22. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss.
Mach Learn 29:103–130
23. Fix E, Hodges JL, Jr (1951) Discriminatory analysis, nonparametric discrimination. USAF School of
Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February
1951
24. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application
to boosting. J Comput Syst Sci 55(1):119–139
25. Friedman JH, Bentley JL, Finkel RA (1977) An algorithm for finding best matches in logarithmic time.
ACM Trans. Math. Software 3, 209. Also available as Stanford Linear Accelerator Center Rep. SIX-PUB-
1549, February 1975
26. Friedman JH, Kohavi R, Yun Y (1996) Lazy decision trees. In: Proceedings of the thirteenth national
conference on artificial intelligence, San Francisco, CA. AAAI Press/MIT Press, pp. 717–724
27. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163
28. Fung G, Stoeckel J (2007) SVM feature selection for classification of SPECT images of Alzheimer’s
disease using spatial information. Knowl Inf Syst 11(2):243–258
29. Gates GW (1972) The reduced nearest neighbor rule. IEEE Trans Inform Theory 18:431–433
30. Golub GH, Van Loan CF (1983) Matrix computations. The Johns Hopkins University Press
31. Gondek D, Hofmann T (2007) Non-redundant data clustering. Knowl Inf Syst 12(1):1–24
32. Han E (1999) Text categorization using weight adjusted k-nearest neighbor classification. PhD thesis,
University of Minnesota, October 1999
33. Hand DJ, Yu K (2001) Idiot’s Bayes—not so stupid after all?. Int Stat Rev 69:385–398
34. Gray RM, Neuhoff DL (1998) Quantization. IEEE Trans Inform Theory 44(6):2325–2384
35. Hart P (1968) The condensed nearest neighbor rule. IEEE Trans Inform Theory 14:515–516
36. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of
ACM SIGMOD international conference on management of data, pp 1–12
37. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern
Anal Mach Intell 18(6):607–616
38. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting with
discussions. Ann Stat 28(2):337–407
39. Herbrich R, Graepel T, Obermayer K (2000) Rank boundaries for ordinal regression. Adv Mar Classif
pp 115–132
40. Hu T, Sung SY (2006) Finding centroid clusterings with entropy-based criteria. Knowl Inf Syst
10(4):505–514
41. Hunt EB, Marin J, Stone PJ (1966) Experiments in induction. Academic Press, New York
42. Inokuchi A, Washio T, Motoda H (2005) General framework for mining frequent subgraphs from labeled
graphs. Fundament Inform 66(1-2):53–82
43. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
44. Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering.
Knowl Inf Syst 10(1):17–40
45. Kobayashi M, Aono M (2006) Exploring overlapping clusters using dynamic re-scaling and sampling.
Knowl Inf Syst 10(3):295–313
46. Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using
Locality-Sensitive Hashing. Knowl Inf Syst 12(1):25–53
47. Kukar M (2006) Quality assessment of individual classifications in machine learning and data mining.
Knowl Inf Syst 9(3):364–384
48. Kuramochi M, Karypis G (2005) Gene Classification using Expression Profiles: A Feasibility Study. Int
J Artif Intell Tools 14(4):641–660
49. Langville AN, Meyer CD (2006) Google’s PageRank and beyond: the science of search engine rankings.
Princeton University Press, Princeton
50. Leung CW-k, Chan SC-f, Chung F-L (2006) A collaborative filtering framework based on fuzzy associ-
ation rules and multiple-level similarity. Knowl Inf Syst 10(3):357–381
51. Li T, Zhu S, Ogihara M (2006) Using discriminant analysis for multi-class classification: an experimental
investigation. Knowl Inf Syst 10(4):453–472
123
36
X. Wu et al.
52. Liu B (2007) Web data mining: exploring hyperlinks, contents and usage Data. Springer, Heidelberg
53. Lloyd SP (1957) Least squares quantization in PCM. Unpublished Bell Lab. Tech. Note, portions pre-
sented at the Institute of Mathematical Statistics Meeting Atlantic City, NJ, September 1957. Also, IEEE
Trans Inform Theory (Special Issue on Quantization), vol IT-28, pp 129–137, March 1982
54. Leung CK-S, Khan QI, Li Z, Hoque T (2007) CanTree: a canonical-order tree for incremental frequent-
pattern mining. Knowl Inf Syst 11(3):287–311
55. McLachlan GJ (1987) On bootstrapping the likelihood ratio test statistic for the number of components
in a normal mixture.. Appl Stat 36:318–324
56. McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York
57. McLachlan GJ, Peel D (2000) Finite Mixture Models. Wiley, New York
58. Messenger RC, Mandell ML (1972) A model search technique for predictive nominal scale multivariate
analysis. J Am Stat Assoc 67:768–772
59. Morishita S, Sese J (2000) Traversing lattice itemset with statistical metric pruning. In: Proceedings of
PODS’00, pp 226–236
60. Olshen R (2001) A conversation with Leo Breiman. Stat Sci 16(2):184–198
61. Page L, Brin S, Motwami R, Winograd T (1999) The PageRank citation ranking: bringing order to the
Web. Technical Report 1999–0120, Computer Science Department, Stanford University
62. Quinlan JR (1979) Discovering rules by induction from large collections of examples. In: Michie D (ed),
Expert systems in the micro electronic age. Edinburgh University Press, Edinburgh
63. Quinlan R (1989) Unknown attribute values in induction. In: Proceedings of the sixth international work-
shop on machine learning, pp. 164–168
64. Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo
65. Reyzin L, Schapire RE (2006) How boosting the margin can also boost classifier complexity. In: Pro-
ceedings of the 23rd international conference on machine learning. Pittsburgh, PA, pp. 753–760
66. Ridgeway G, Madigan D, Richardson T (1998) Interpretable boosted naive Bayes classification. In:
Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on
knowledge discovery and data mining.. AAAI Press, Menlo Park pp 101–104
67. Schapire RE (1990) The strength of weak learnability. Mach Learn 5(2):197–227
68. Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: A new explanation for the
effectiveness of voting methods. Ann Stat 26(5):1651–1686
69. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach
Learn 37(3):297–336
70. Scholkopf B, Smola AJ (2002) Learning with kernels. MIT Press
71. Seidl T, Kriegel H (1998) Optimal multi-step k-nearest neighbor search. In: Tiwary A, Franklin M
(eds) Proceedings of the 1998 ACM SIGMOD international conference on management of data, Seattle,
Washington, United States, 1–4 June, 1998. ACM Press, New York pp 154–165
72. Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of the 21st VLDB
conference. pp. 407–419
73. Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: Proceed-
ings of the KDD Workshop on Text Mining
74. Steinbach M, Kumar V (2007) Generalizing the notion of confidence. Knowl Inf Syst 12(3):279–299
75. Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Pearson Addison-Wesley
76. Tao D, Li X, Wu X, Hu W, Maybank SJ (2007) Supervised tensor learning. Knowl Inf Syst 13(1):1–42
77. Thabtah FA, Cowling PI, Peng Y (2006) Multiple labels associative classification. Knowl Inf Syst
9(1):109–129
78. Ting KM (2002) An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data
Eng 14:659–665
79. Toussaint GT (2002) Proximity graphs for nearest neighbor decision rules: recent progress. In: Interface-
2002, 34th symposium on computing and statistics (theme: Geoscience and Remote Sensing). Ritz-Carlton
Hotel, Montreal, Canada, 17–20 April, 2002
80. Toussaint GT (2002) Open problems in geometric methods for instance-based learning. JCDCG 273–283
81. Tsang IW, Kwok JT, Cheung P-M (2005) Core vector machines: Fast SVM training on very large data
sets. J Mach Learn Res 6:363–392
82. Uno T, Asai T, Uchida Y, Arimura H (2004) An efficient algorithm for enumerating frequent closed pat-
terns in transaction databases. In: Proc. of the 7th international conference on discovery science. LNAI
vol 3245, Springe, Heidelberg, pp 16–30
83. Vapnik V (1995) The nature of statistical learning theory. Springer, New York
84. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings
of the IEEE computer society conference on computer vision and pattern recognition. pages 511–518,
Kauai, HI
123
Top 10 algorithms in data mining
37
85. Washio T, Nakanishi K, Motoda H (2005) Association rules based on levelwise subspace clustering. In:
Proceedings. of 9th European conference on principles and practice of knowledge discovery in databases.
LNAI, vol 3721, pp. 692–700 Springer, Heidelberg
86. Wasserman S, Raust K (1994) Social network analysis. Cambridge University Press, Cambridge
87. Wettschereck D, Aha D, Mohri T (1997) A review and empirical evaluation of feature weighting methods
for a class of lazy learning algorithms. Artif Intell Rev 11:273–314
88. Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst
Man Cyberne 2:408–420
89. Yang Q, Wu X (2006) 10 challenging problems in data mining research. Int J Inform Technol Decis
Making 5(4):597–604
90. Yan X, Han J (2002) gSpan: Graph-based substructure pattern mining. In: Proceedings of ICDM’02,
pp 721–724
91. Yu PS, Li X, Liu B (2005) Adding the temporal dimension to search—a case study in publication search.
In: Proceedings of Web Intelligence (WI’05)
92. Zhang J, Kang D-K, Silvescu A, Honavar V (2006) Learning accurate and concise naïve Bayes classifiers
from attribute value taxonomies and data. Knowl Inf Syst 9(2):157–179
123
Document Outline - Top 10 algorithms in data mining
- Abstract
- Introduction
- C4.5 and beyond
- Introduction
- Decision trees
- Ruleset classifiers
- See5/C5.0
- Research issues
- The k-means algorithm
- The algorithm
- Limitations
- Generalizations and connections
- Support vector machines
- The Apriori algorithm
- Description of the algorithm
- The impact of the algorithm
- Current and further research
- The EM algorithm
- Introduction
- Maximum likelihood estimation of normal mixtures
- Number of clusters
- PageRank
- Overview
- The algorithm
- Further references on PageRank
- AdaBoost
- Description of the algorithm
- Impact of the algorithm
- Further research
- kNN: k-nearest neighbor classification
- Description of the algorithm
- Issues
- Impact
- Current and future research
- Naive Bayes
- Introduction
- The basic principle
- Some extensions
- Concluding remarks on naive Bayes
- CART
- Overview
- Splitting rules
- Prior probabilities and class balancing
- Missing value handling
- Attribute importance
- Dynamic feature construction
- Cost-sensitive learning
- Stopping rules, pruning, tree sequences, and tree selection
- Probability trees
- Theoretical foundations
- Selected biographical details
- Concluding remarks
- Acknowledgments
- References
Dostları ilə paylaş: |