Gordon Cormack and Thomas Lynam

A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail

Feel free to interrupt whenever you have a question or comment!

What is Spam?

Unofficial Statistics of Spam (Feb. 3 to Feb. 12)

Spam Detection

Text classification alone is not enough

Weather Report Guy

Secret Decoder Ring Dude

Secret Decoder Ring Dude

Diploma Guy

More of Diploma Guy

One Solution to Spam Detection

Naïve Bayes

A Bayesian Approach to Filtering Junk E-Mail 1998 - Sahami, Dumais, Heckerman, Horvitz

A Bayesian Approach to Filtering Junk E-Mail 1998 - Sahami, Dumais, Heckerman, Horvitz

A Plan for Spam 2002 – P. Graham

Algorithms Used in Spam Detection

Which Algorithm is Best?

Overview of the Paper

Problem: Supervised Spam Detection

Methods

Data

Evaluation Measures (1)

Evaluation Measures (2)

Evaluation Measures (3)

Misclassification by Genre

Conclusion

The End

A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail

Gordon Cormack and Thomas Lynam

Presented by Hui Fang

Feel free to interrupt whenever you have a question or comment!

What is Spam?

Typical legal definition

Definition mostly used

Unofficial Statistics of Spam (Feb. 3 to Feb. 12)

Spam Detection

Text classification alone is not enough

Spammers now often try to obscure text.

Special features are necessary.

…

Weather Report Guy

Content in Image

Secret Decoder Ring Dude

Another spam that looks easy

Is it?

Secret Decoder Ring Dude

Character Encoding

HTML word breaking
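To make the two tricks above concrete, here is a minimal Python sketch of how a filter might recover the text a reader actually sees before tokenizing. The function name `visible_text` and the sample message are hypothetical, and a regular expression is only a rough stand-in for a real HTML parser.

```python
import html
import re

# Matches any HTML tag or comment (rough approximation, not a full parser).
TAG_RE = re.compile(r"<[^>]*>")

def visible_text(raw_html: str) -> str:
    """Approximate the text a reader would see in an HTML message body."""
    # Remove tags and comments first, so words split by markup are rejoined.
    without_tags = TAG_RE.sub("", raw_html)
    # Decode character references such as &#102; back into plain characters.
    decoded = html.unescape(without_tags)
    # Collapse runs of whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", decoded).strip()

# Hypothetical obscured message: a comment and empty tags break one word,
# and numeric character references hide another.
sample = "dip<!-- x -->lo<b></b>ma &#102;&#114;ee"
print(visible_text(sample))
```

A plain word tokenizer run on the raw HTML would never see the obscured words as single tokens, which is why this kind of normalization (or features that detect the trick itself) is needed.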

Diploma Guy

Word Obscuring

More of Diploma Guy

Diploma Guy is good at what he does

One Solution to Spam Detection

Machine Learning

Naïve Bayes

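Want: P(spam | words), the probability that a message is spam given the words it contains

Use Bayes Rule: P(spam | words) = P(words | spam) P(spam) / P(words)

Assume independence: the probability of each word is independent of the others given the class, so P(words | spam) = P(w1 | spam) × … × P(wn | spam)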
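The Naïve Bayes idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tuned variants used by real filters: the class name `NaiveBayes`, the add-one smoothing constant `alpha`, and the toy training messages are all assumptions made for the example.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal word-based Naive Bayes for spam vs. ham (illustrative only)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # smoothing constant (an assumption, not from the slides)
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.msg_counts[label] += 1

    def log_score(self, words, label):
        # log P(label) + sum of log P(word | label), using the independence assumption.
        # Requires at least one training message of each class.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        prior = self.msg_counts[label] / sum(self.msg_counts.values())
        score = math.log(prior)
        for w in words:
            count = self.word_counts[label][w]
            score += math.log((count + self.alpha) / (total + self.alpha * len(vocab)))
        return score

    def classify(self, words):
        # P(words) is the same for both classes, so it can be ignored when comparing.
        return max(("spam", "ham"), key=lambda c: self.log_score(words, c))

nb = NaiveBayes()
nb.train(["cheap", "diploma"], "spam")
nb.train(["meeting", "tomorrow"], "ham")
print(nb.classify(["diploma"]))
```

Working in log space avoids numeric underflow when a message contains many words, and smoothing keeps an unseen word from forcing a probability of zero.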

A Bayesian Approach to Filtering Junk E-Mail 1998 - Sahami, Dumais, Heckerman, Horvitz

One of the first papers on using machine learning to combat spam

Used Naïve Bayes

Feature Space: Words, Phrases, Domain-Specific Features

Evaluation Data: ~1700 messages, ~88% spam, from a volunteer’s private e-mail

A Bayesian Approach to Filtering Junk E-Mail 1998 - Sahami, Dumais, Heckerman, Horvitz

Hand Crafted Features

Best collection of heuristics discussed in literature
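As a rough illustration of what hand-crafted, domain-specific features can look like, the Python sketch below turns a few message properties into feature values. The specific features, the function name `handcrafted_features`, and the sample inputs are assumptions chosen for the example; they are not the feature list from the paper.

```python
def handcrafted_features(subject: str, sender: str, has_attachment: bool) -> dict:
    """Compute a few illustrative non-word features for one message."""
    # Count punctuation and symbols in the subject (letters, digits, spaces excluded).
    non_alnum = sum(1 for ch in subject if not ch.isalnum() and not ch.isspace())
    return {
        "subject_non_alnum_ratio": non_alnum / len(subject) if subject else 0.0,
        "subject_all_caps": subject.isupper(),
        "sender_domain_edu": sender.lower().endswith(".edu"),
        "has_attachment": has_attachment,
    }

# Hypothetical message metadata.
print(handcrafted_features("FREE $$$ NOW!!!", "offers@example.com", False))
```

Features like these can be fed to the same classifier as word features, which lets the filter use evidence that survives even when the words themselves are obscured.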

A Plan for Spam 2002 – P. Graham

Widely cited in the open source community

Uses a heavily tuned version of Naïve Bayes

Feature Space: Words in header and body

Feature Selection: ~23,000 features

Evaluation Data: ~8000 messages from author; ~50% spam

Results: Spam precision 100%, Spam recall 99.5%
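For reference, spam precision and spam recall can be computed from true and predicted labels as in the Python sketch below. The function name and the four-message example are hypothetical; the numbers are not taken from any of the cited evaluations.

```python
def spam_precision_recall(true_labels, predicted_labels):
    """Return (precision, recall) for the spam class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == "spam" and p == "spam")  # spam caught
    fp = sum(1 for t, p in pairs if t == "ham" and p == "spam")   # ham wrongly blocked
    fn = sum(1 for t, p in pairs if t == "spam" and p == "ham")   # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

truth = ["spam", "spam", "ham", "spam"]
guess = ["spam", "ham", "ham", "spam"]
print(spam_precision_recall(truth, guess))
```

Spam precision of 100% means no ham was labeled as spam; spam recall below 100% means some spam still reached the inbox.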

Algorithms Used in Spam Detection

Which Algorithm is Best?

Very difficult to tell

Overview of the Paper

Problem: Supervised Spam Detection

Methods

Methods in six open-source spam filters

Data

One person’s e-mail over eight months

Stored in the order received

49,086 messages with judgements

Evaluation Measures (1)

Evaluation Measures (2)

Ham/Spam tradeoff curve, i.e. ROC curve
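A ham/spam tradeoff curve can be traced by sweeping a threshold over the filter's spam scores. The Python sketch below is one simple way to do that, assuming each message has a numeric score where higher means more spam-like; the function name `roc_points` and the sample scores are hypothetical, tied scores are not handled specially, and both classes must be present.

```python
def roc_points(scores, labels):
    """Return (ham misclassification rate, spam detection rate) points,
    one per threshold, from the strictest threshold to the loosest."""
    n_spam = sum(1 for label in labels if label == "spam")
    n_ham = len(labels) - n_spam
    # Rank messages from most to least spam-like.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    spam_caught = 0
    ham_blocked = 0
    for _score, label in ranked:
        # Lowering the threshold past this message classifies it as spam.
        if label == "spam":
            spam_caught += 1
        else:
            ham_blocked += 1
        points.append((ham_blocked / n_ham, spam_caught / n_spam))
    return points

print(roc_points([0.9, 0.8, 0.3, 0.1], ["spam", "ham", "spam", "ham"]))
```

Each point shows how much ham would be lost to catch a given fraction of spam, so two filters can be compared across all thresholds rather than at a single operating point.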

Evaluation Measures (3)

Ham/Spam learning curve
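One simple way to draw a learning curve is to track the error rate as messages are processed in arrival order. The Python sketch below uses a cumulative error rate; that definition, the function name `learning_curve`, and the sample outcomes are assumptions for illustration and may differ from how the paper computes its curves.

```python
def learning_curve(outcomes):
    """outcomes: booleans in arrival order, True meaning the message was misclassified.
    Returns the cumulative error rate after each message."""
    curve = []
    errors = 0
    for seen, wrong in enumerate(outcomes, start=1):
        errors += int(wrong)
        curve.append(errors / seen)
    return curve

# Hypothetical outcomes for four messages in the order received.
print(learning_curve([True, False, False, True]))
```

Computing one such curve for ham and one for spam shows whether the filter keeps improving as it sees more of the user's mail.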

Misclassification by Genre

Not all types of ham are equal

Spam can similarly be classified

Conclusion

Present several possible evaluation measures for spam detection

Compare several spam detection methods

Provide analysis of the experimental results

However, it would be more interesting to compare the performance of different algorithms (e.g. NB vs. SVM).