S u rv e y pa p e r

Yüklə 471,61 Kb.

Pdf görüntüsü

səhifə	16/16
tarix	08.10.2017
ölçüsü	471,61 Kb.
	#3814

1 ... 8 9 10 11 12 13 14 15 16

11 Concluding remarks
Acknowledgments

Top 10 algorithms in data mining

Based largely on the prior work of CART co-authors Richard Olshen and Charles Stone, the

ﬁnal three chapters of the monograph relate CART to theoretical work on nearest neighbors

and show that as the sample size tends to inﬁnity the following hold: (1) the estimates of

the regression function converge to the true function, and (2) the risks of the terminal nodes

converge to the risks of the corresponding Bayes rules. In other words, speaking informally,

with large enough samples the CART tree will converge to the true function relating the

target to its predictors and achieve the smallest cost possible (the Bayes rate). Practically

speaking. such results may only be realized with sample sizes far larger than in common use

today.

10.11 Selected biographical details

CART is often thought to have originated from the ﬁeld of Statistics but this is only

partially correct. Jerome Friedman completed his PhD in Physics at UC Berkeley and

became leader of the Numerical Methods Group at the Stanford Linear Accelerator Center

in 1972, where he focused on problems in computation. One of his most inﬂuential papers

from 1975 presents a state-of-the-art algorithm for high speed searches for nearest neighbors

in a database. Richard Olshen earned his BA at UC Berkeley and PhD in Statistics at Yale

and focused his earliest work on large sample theory for recursive partitioning. He began his

collaboration with Friedman after joining the Stanford Linear Accelerator Center in 1974.

Leo Breiman earned his BA in Physics at the California Institute of Technology, his PhD

in Mathematics at UC Berkeley, and made notable contributions to pure probability theory

(Breiman, 1968) [

] while a Professor at UCLA. In 1967 he left academia for 13 years to

work as an industrial consultant; during this time he encountered the military data analysis

problems that inspired his contributions to CART. An interview with Leo Breiman discussing

his career and personal life appears in [

Charles Stone earned his BA in mathematics at the California Institute of Technology,

and his PhD in Statistics at Stanford. He pursued probability theory in his early years as an

academic and is the author of several celebrated papers in probability theory and nonpara-

metric regression. He worked with Breiman at UCLA and was drawn by Breiman into the

research leading to CART in the early 1970s. Breiman and Friedman ﬁrst met at an Interface

conference in 1976, which shortly led to collaboration involving all four co-authors. The

ﬁrst outline of their book was produced in a memo dated 1978 and the completed CART

monograph was published in 1984.

The four co-authors have each been distinguished for their work outside of CART. Stone

and Breiman were elected to the National Academy of Sciences (in 1993 and 2001, respec-

tively) and Friedman was elected to the American Academy of Arts and Sciences in 2006.

The speciﬁc work for which they were honored can be found on the respective academy

websites. Olshen is a Fellow of the Institute of Mathematical Statistics, a Fellow of the IEEE,

and Fellow, American Association for the Advancement of Science.

11 Concluding remarks

Data mining is a broad area that integrates techniques from several ﬁelds including machine

learning, statistics, pattern recognition, artiﬁcial intelligence, and database systems, for the

analysis of large volumes of data. There have been a large number of data mining algo-

rithms rooted in these ﬁelds to perform different data analysis tasks. The 10 algorithms

identiﬁed by the IEEE International Conference on Data Mining (ICDM) and presented in

123

X. Wu et al.

this article are among the most inﬂuential algorithms for classiﬁcation [

], clustering

[

–

], statistical learning [

], association analysis [

and link mining.

With a formal tie with the ICDM conference, Knowledge and Information Systems has

been publishing the best papers from ICDM every year, and several of the above papers cited

for classiﬁcation, clustering, statistical learning, and association analysis were selected by

the previous years’ ICDM program committees for journal publication in Knowledge and

Information Systems after their revisions and expansions. We hope this survey paper can

inspire more researchers in data mining to further explore these top-10 algorithms, including

their impact and new research issues.

Acknowledgments

The initiative of identifying the top-10 data mining algorithms started in May 2006 out

of a discussion between Dr. Jiannong Cao in the Department of Computing at the Hong Kong Polytechnic

University (PolyU) and Dr. Xindong Wu, when Dr. Wu was giving a seminar on 10 Challenging Problems in

Data Mining Research [

] at PolyU. Dr. Wu and Dr. Kumar continued this discussion at KDD-06 in August

2006 with various people, and received very enthusiastic support. Naila Elliott in the Department of Computer

Science and Engineering at the University of Minnesota collected and compiled the algorithm nominations

and voting results in the 3-step identiﬁcation process. Yan Zhang in the Department of Computer Science at

the University of Vermont converted the 10 section submissions in different formats into the same LaTeX

format, which was a time-consuming process.

References

1. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th

VLDB conference, pp 487–499

2. Ahmed S, Coenen F, Leng PH (2006) Tree-based partitioning of date for association rule mining. Knowl

Inf Syst 10(3):315–331

3. Banerjee A, Merugu S, Dhillon I, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn

Res 6:1705–1749

4. Bezdek JC, Chuah SK, Leep D (1986) Generalized k-nearest neighbor rules. Fuzzy Sets Syst 18(3):237–

256.

http://dx.doi.org/10.1016/0165-0114(86)90004-7

5. Bloch DA, Olshen RA, Walker MG (2002) Risk estimation for classiﬁcation trees. J Comput Graph Stat

11:263–288

6. Bonchi F, Lucchese C (2006) On condensed representations of constrained frequent patterns. Knowl Inf

Syst 9(2):180–201

7. Breiman L (1968) Probability theory. Addison-Wesley, Reading. Republished (1991) in Classics of math-

ematics. SIAM, Philadelphia

8. Breiman L (1999) Prediction games and arcing classiﬁers. Neural Comput 11(7):1493–1517

9. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classiﬁcation and regression trees. Wadsworth,

Belmont

10. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web Search Sngine. Comput Networks

30(1–7):107–117

11. Chen JR (2007) Making clustering in delay-vector space meaningful. Knowl Inf Syst 11(3):369–385

12. Cheung DW, Han J, Ng V, Wong CY (1996) Maintenance of discovered association rules in large databases:

an incremental updating technique. In: Proceedings of the ACM SIGMOD international conference on

management of data, pp. 13–23

13. Chi Y, Wang H, Yu PS, Muntz RR (2006) Catch the moment: maintaining closed frequent itemsets over

a data stream sliding window. Knowl Inf Syst 10(3):265–294

14. Cost S, Salzberg S (1993) A weighted nearest neighbor algorithm for learning with symbolic features.

Mach Learn 10:57.78 (PEBLS: Parallel Examplar-Based Learning System)

15. Cover T, Hart P (1967) Nearest neighbor pattern classiﬁcation. IEEE Trans Inform Theory 13(1):21–27

16. Dasarathy BV (ed) (1991) Nearest neighbor (NN) norms: NN pattern classiﬁcation techniques. IEEE

Computer Society Press

17. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algo-

rithm (with discussion). J Roy Stat Soc B 39:1–38

123

Top 10 algorithms in data mining

18. Devroye L, Gyorﬁ L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York.

ISBN 0-387-94618-7

19. Dhillon IS, Guan Y, Kulis B (2004) Kernel k-means: spectral clustering and normalized cuts. KDD 2004,

pp 551–556

20. Dietterich TG (1997) Machine learning: Four current directions. AI Mag 18(4):97–136

21. Domingos P (1999) MetaCost: A general method for making classiﬁers cost-sensitive. In: Proceedings

of the ﬁfth international conference on knowledge discovery and data mining, pp 155–164

22. Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classiﬁer under zero-one loss.

Mach Learn 29:103–130

23. Fix E, Hodges JL, Jr (1951) Discriminatory analysis, nonparametric discrimination. USAF School of

Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February

1951

24. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application

to boosting. J Comput Syst Sci 55(1):119–139

25. Friedman JH, Bentley JL, Finkel RA (1977) An algorithm for ﬁnding best matches in logarithmic time.

ACM Trans. Math. Software 3, 209. Also available as Stanford Linear Accelerator Center Rep. SIX-PUB-

1549, February 1975

26. Friedman JH, Kohavi R, Yun Y (1996) Lazy decision trees. In: Proceedings of the thirteenth national

conference on artiﬁcial intelligence, San Francisco, CA. AAAI Press/MIT Press, pp. 717–724

27. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classiﬁers. Mach Learn 29:131–163

28. Fung G, Stoeckel J (2007) SVM feature selection for classiﬁcation of SPECT images of Alzheimer’s

disease using spatial information. Knowl Inf Syst 11(2):243–258

29. Gates GW (1972) The reduced nearest neighbor rule. IEEE Trans Inform Theory 18:431–433

30. Golub GH, Van Loan CF (1983) Matrix computations. The Johns Hopkins University Press

31. Gondek D, Hofmann T (2007) Non-redundant data clustering. Knowl Inf Syst 12(1):1–24

32. Han E (1999) Text categorization using weight adjusted k-nearest neighbor classiﬁcation. PhD thesis,

University of Minnesota, October 1999

33. Hand DJ, Yu K (2001) Idiot’s Bayes—not so stupid after all?. Int Stat Rev 69:385–398

34. Gray RM, Neuhoff DL (1998) Quantization. IEEE Trans Inform Theory 44(6):2325–2384

35. Hart P (1968) The condensed nearest neighbor rule. IEEE Trans Inform Theory 14:515–516

36. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of

ACM SIGMOD international conference on management of data, pp 1–12

37. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classiﬁcation. IEEE Trans Pattern

Anal Mach Intell 18(6):607–616

38. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting with

discussions. Ann Stat 28(2):337–407

39. Herbrich R, Graepel T, Obermayer K (2000) Rank boundaries for ordinal regression. Adv Mar Classif

pp 115–132

40. Hu T, Sung SY (2006) Finding centroid clusterings with entropy-based criteria. Knowl Inf Syst

10(4):505–514

41. Hunt EB, Marin J, Stone PJ (1966) Experiments in induction. Academic Press, New York

42. Inokuchi A, Washio T, Motoda H (2005) General framework for mining frequent subgraphs from labeled

graphs. Fundament Inform 66(1-2):53–82

43. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs

44. Jin R, Goswami A, Agrawal G (2006) Fast and exact out-of-core and distributed k-means clustering.

Knowl Inf Syst 10(1):17–40

45. Kobayashi M, Aono M (2006) Exploring overlapping clusters using dynamic re-scaling and sampling.

Knowl Inf Syst 10(3):295–313

46. Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using

Locality-Sensitive Hashing. Knowl Inf Syst 12(1):25–53

47. Kukar M (2006) Quality assessment of individual classiﬁcations in machine learning and data mining.

Knowl Inf Syst 9(3):364–384

48. Kuramochi M, Karypis G (2005) Gene Classiﬁcation using Expression Proﬁles: A Feasibility Study. Int

J Artif Intell Tools 14(4):641–660

49. Langville AN, Meyer CD (2006) Google’s PageRank and beyond: the science of search engine rankings.

Princeton University Press, Princeton

50. Leung CW-k, Chan SC-f, Chung F-L (2006) A collaborative ﬁltering framework based on fuzzy associ-

ation rules and multiple-level similarity. Knowl Inf Syst 10(3):357–381

51. Li T, Zhu S, Ogihara M (2006) Using discriminant analysis for multi-class classiﬁcation: an experimental

investigation. Knowl Inf Syst 10(4):453–472

123

X. Wu et al.

52. Liu B (2007) Web data mining: exploring hyperlinks, contents and usage Data. Springer, Heidelberg

53. Lloyd SP (1957) Least squares quantization in PCM. Unpublished Bell Lab. Tech. Note, portions pre-

sented at the Institute of Mathematical Statistics Meeting Atlantic City, NJ, September 1957. Also, IEEE

Trans Inform Theory (Special Issue on Quantization), vol IT-28, pp 129–137, March 1982

54. Leung CK-S, Khan QI, Li Z, Hoque T (2007) CanTree: a canonical-order tree for incremental frequent-

pattern mining. Knowl Inf Syst 11(3):287–311

55. McLachlan GJ (1987) On bootstrapping the likelihood ratio test statistic for the number of components

in a normal mixture.. Appl Stat 36:318–324

56. McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York

57. McLachlan GJ, Peel D (2000) Finite Mixture Models. Wiley, New York

58. Messenger RC, Mandell ML (1972) A model search technique for predictive nominal scale multivariate

analysis. J Am Stat Assoc 67:768–772

59. Morishita S, Sese J (2000) Traversing lattice itemset with statistical metric pruning. In: Proceedings of

PODS’00, pp 226–236

60. Olshen R (2001) A conversation with Leo Breiman. Stat Sci 16(2):184–198

61. Page L, Brin S, Motwami R, Winograd T (1999) The PageRank citation ranking: bringing order to the

Web. Technical Report 1999–0120, Computer Science Department, Stanford University

62. Quinlan JR (1979) Discovering rules by induction from large collections of examples. In: Michie D (ed),

Expert systems in the micro electronic age. Edinburgh University Press, Edinburgh

63. Quinlan R (1989) Unknown attribute values in induction. In: Proceedings of the sixth international work-

shop on machine learning, pp. 164–168

64. Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo

65. Reyzin L, Schapire RE (2006) How boosting the margin can also boost classiﬁer complexity. In: Pro-

ceedings of the 23rd international conference on machine learning. Pittsburgh, PA, pp. 753–760

66. Ridgeway G, Madigan D, Richardson T (1998) Interpretable boosted naive Bayes classiﬁcation. In:

Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on

knowledge discovery and data mining.. AAAI Press, Menlo Park pp 101–104

67. Schapire RE (1990) The strength of weak learnability. Mach Learn 5(2):197–227

68. Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: A new explanation for the

effectiveness of voting methods. Ann Stat 26(5):1651–1686

69. Schapire RE, Singer Y (1999) Improved boosting algorithms using conﬁdence-rated predictions. Mach

Learn 37(3):297–336

70. Scholkopf B, Smola AJ (2002) Learning with kernels. MIT Press

71. Seidl T, Kriegel H (1998) Optimal multi-step k-nearest neighbor search. In: Tiwary A, Franklin M

(eds) Proceedings of the 1998 ACM SIGMOD international conference on management of data, Seattle,

Washington, United States, 1–4 June, 1998. ACM Press, New York pp 154–165

72. Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of the 21st VLDB

conference. pp. 407–419

73. Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: Proceed-

ings of the KDD Workshop on Text Mining

74. Steinbach M, Kumar V (2007) Generalizing the notion of conﬁdence. Knowl Inf Syst 12(3):279–299

75. Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Pearson Addison-Wesley

76. Tao D, Li X, Wu X, Hu W, Maybank SJ (2007) Supervised tensor learning. Knowl Inf Syst 13(1):1–42

77. Thabtah FA, Cowling PI, Peng Y (2006) Multiple labels associative classiﬁcation. Knowl Inf Syst

9(1):109–129

78. Ting KM (2002) An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data

Eng 14:659–665

79. Toussaint GT (2002) Proximity graphs for nearest neighbor decision rules: recent progress. In: Interface-

2002, 34th symposium on computing and statistics (theme: Geoscience and Remote Sensing). Ritz-Carlton

Hotel, Montreal, Canada, 17–20 April, 2002

80. Toussaint GT (2002) Open problems in geometric methods for instance-based learning. JCDCG 273–283

81. Tsang IW, Kwok JT, Cheung P-M (2005) Core vector machines: Fast SVM training on very large data

sets. J Mach Learn Res 6:363–392

82. Uno T, Asai T, Uchida Y, Arimura H (2004) An efﬁcient algorithm for enumerating frequent closed pat-

terns in transaction databases. In: Proc. of the 7th international conference on discovery science. LNAI

vol 3245, Springe, Heidelberg, pp 16–30

83. Vapnik V (1995) The nature of statistical learning theory. Springer, New York

84. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings

of the IEEE computer society conference on computer vision and pattern recognition. pages 511–518,

Kauai, HI

123

Top 10 algorithms in data mining

85. Washio T, Nakanishi K, Motoda H (2005) Association rules based on levelwise subspace clustering. In:

Proceedings. of 9th European conference on principles and practice of knowledge discovery in databases.

LNAI, vol 3721, pp. 692–700 Springer, Heidelberg

86. Wasserman S, Raust K (1994) Social network analysis. Cambridge University Press, Cambridge

87. Wettschereck D, Aha D, Mohri T (1997) A review and empirical evaluation of feature weighting methods

for a class of lazy learning algorithms. Artif Intell Rev 11:273–314

88. Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst

Man Cyberne 2:408–420

89. Yang Q, Wu X (2006) 10 challenging problems in data mining research. Int J Inform Technol Decis

Making 5(4):597–604

90. Yan X, Han J (2002) gSpan: Graph-based substructure pattern mining. In: Proceedings of ICDM’02,

pp 721–724

91. Yu PS, Li X, Liu B (2005) Adding the temporal dimension to search—a case study in publication search.

In: Proceedings of Web Intelligence (WI’05)

92. Zhang J, Kang D-K, Silvescu A, Honavar V (2006) Learning accurate and concise naïve Bayes classiﬁers

from attribute value taxonomies and data. Knowl Inf Syst 9(2):157–179

123

Document Outline

Top 10 algorithms in data mining
Abstract
Introduction
C4.5 and beyond
Introduction
Decision trees
Ruleset classifiers
See5/C5.0
Research issues
The k-means algorithm
The algorithm
Limitations
Generalizations and connections
Support vector machines
The Apriori algorithm
Description of the algorithm
The impact of the algorithm
Current and further research
The EM algorithm
Introduction
Maximum likelihood estimation of normal mixtures
Number of clusters
PageRank
Overview
The algorithm
Further references on PageRank
AdaBoost
Description of the algorithm
Impact of the algorithm
Further research
kNN: k-nearest neighbor classification
Description of the algorithm
Issues
Impact
Current and future research
Naive Bayes
Introduction
The basic principle
Some extensions
Concluding remarks on naive Bayes
CART
Overview
Splitting rules
Prior probabilities and class balancing
Missing value handling
Attribute importance
Dynamic feature construction
Cost-sensitive learning
Stopping rules, pruning, tree sequences, and tree selection
Probability trees
Theoretical foundations
Selected biographical details
Concluding remarks
Acknowledgments
References

Yüklə 471,61 Kb.

Dostları ilə paylaş:

1 ... 8 9 10 11 12 13 14 15 16