Gordon Cormack and Thomas Lynam
A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail
Gordon Cormack and Thomas Lynam
Presented by Hui Fang
Feel free to interrupt if you have any questions or comments!
What is Spam?
Typical legal definition
- Unsolicited commercial email from someone without a pre-existing business relationship
Definition mostly used
Unofficial Statistics of Spam (Feb. 3 to Feb. 12)
Text classification alone is not enough
Spammers now often try to obscure text.
Special features are necessary.
- E.g. subject line vs. body text
- E.g. mail in the middle of the night is more likely to be spam than mail in the middle of the day.
…
Weather Report Guy
Another spam that looks easy. Is it?
Secret Decoder Ring Dude
- Character encoding
- HTML word breaking
Diploma Guy
Diploma Guy is good at what he does
One Solution to Spam Detection
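Naïve Bayes
Want: P(spam | words)
Use Bayes Rule: P(spam | words) = P(words | spam) P(spam) / P(words)
Assume independence: the probability of each word is independent of the others, so P(words | spam) = P(word_1 | spam) × ... × P(word_n | spam)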
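A minimal sketch of the Naïve Bayes idea on this slide, not the implementation used by any filter in the study. The function names, the word-list input format, and the add-one (Laplace) smoothing are assumptions made for illustration; the sketch also assumes the training data contains at least one spam and one ham message.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. `messages` is a list of (words, label) pairs,
    where label is "spam" or "ham" (hypothetical input format)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for words, label in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def spam_probability(words, model):
    """Estimate P(spam | words) with Bayes rule and the independence assumption."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        # log P(class)
        score = math.log(class_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for w in words:
            # log P(word | class), with add-one smoothing (an assumption)
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        log_score[label] = score
    # Normalise in log space to avoid underflow on long messages
    m = max(log_score.values())
    exp = {k: math.exp(v - m) for k, v in log_score.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])

training = [
    (["free", "money", "now"], "spam"),
    (["free", "offer"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train(training)
print(spam_probability(["free", "money"], model))
print(spam_probability(["meeting", "notes"], model))
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once messages contain hundreds of words.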
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz
- Used Naïve Bayes
- Feature space: words, phrases, domain-specific features
- Evaluation data: ~1700 messages, ~88% spam, from a volunteer’s private e-mail
A Bayesian Approach to Filtering Junk E-Mail
1998 - Sahami, Dumais, Heckerman, Horvitz
Hand-crafted features
- 35 phrases
  - ‘Free Money’
  - ‘Only $’
  - ‘be over 21’
- 20 domain-specific features
  - Domain type of sender (.edu, .com, etc.)
  - Sender name resolutions (internal mail)
  - Has attachments
  - Time received
  - Percent of non-alphanumeric characters in subject
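To make the feature list concrete, here is a rough sketch of how a few such hand-crafted features might be computed. It is not the feature extractor from the 1998 paper: the function name, the argument list, the feature keys, and the choice to ignore whitespace when counting non-alphanumeric characters are all assumptions for illustration.

```python
def handcrafted_features(subject, body, sender, has_attachment, hour_received):
    """Return a dict of illustrative hand-crafted features for one message."""
    # Three of the phrases named on the slide
    phrases = ("free money", "only $", "be over 21")
    text = (subject + " " + body).lower()
    features = {f"phrase:{p}": (p in text) for p in phrases}

    # Domain type of sender: the part after the last dot (.edu, .com, ...)
    features["sender_tld"] = sender.rsplit(".", 1)[-1].lower() if "." in sender else ""
    features["has_attachment"] = bool(has_attachment)
    features["hour_received"] = hour_received

    # Percent of non-alphanumeric characters in the subject
    # (whitespace is not counted as non-alphanumeric: an assumption)
    non_alnum = sum(1 for c in subject if not c.isalnum() and not c.isspace())
    features["subject_pct_non_alnum"] = 100.0 * non_alnum / len(subject) if subject else 0.0
    return features

example = handcrafted_features(
    "FREE!!!", "Get free money, only $5", "promo@deals.com", False, 3
)
print(example)
```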
Best collection of heuristics discussed in the literature
- Without the hand-crafted features: spam precision 97.1%, spam recall 94.3%
- With them: spam precision 100%, spam recall 98.3%
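For reference, spam precision and spam recall can be computed as in the sketch below. The function name and the "spam"/"ham" label strings are assumptions; the toy labels are made up for illustration and are not data from either paper.

```python
def spam_precision_recall(gold, predicted):
    """Spam precision: fraction of messages flagged as spam that really are spam.
    Spam recall: fraction of real spam that was flagged."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == "spam" and p == "spam")
    fp = sum(1 for g, p in pairs if g == "ham" and p == "spam")
    fn = sum(1 for g, p in pairs if g == "spam" and p == "ham")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["spam", "spam", "spam", "ham", "ham"]
predicted = ["spam", "spam", "ham", "ham", "spam"]
print(spam_precision_recall(gold, predicted))
```

A spam precision of 100% means no ham was flagged as spam, which is usually the more important of the two numbers for a mail filter.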
A Plan for Spam
2002 - P. Graham
- Widely cited in the open source community
- Uses a heavily tuned version of Naïve Bayes
- Feature space: words in header and body
- Feature selection: ~23,000 features, all that appeared more than 5 times
- Evaluation data: ~8000 messages from the author; ~50% spam
- Results: spam precision 100%, spam recall 99.5%
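The frequency-based feature selection mentioned here (keep every token that appears more than 5 times) can be sketched as follows. This is only an illustration: the tokenizer regex and the function name are assumptions, not Graham's actual tokenization rules.

```python
import re
from collections import Counter

def select_features(messages, min_count=5):
    """Keep tokens that occur more than `min_count` times across all messages."""
    counts = Counter()
    for text in messages:
        # Simple tokenizer (an assumption): runs of letters, digits, $, ' and -
        counts.update(re.findall(r"[A-Za-z0-9$'-]+", text.lower()))
    return {token for token, n in counts.items() if n > min_count}

corpus = ["viagra cheap"] * 6 + ["hello"] * 5
print(select_features(corpus))
```

In this toy corpus "hello" occurs exactly 5 times, so it is dropped, while the two tokens that occur 6 times are kept.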
Algorithms Used in Spam Detection
Which Algorithm is Best?
Very difficult to tell
- No consistently used, good data set
- No standard evaluation measures
Overview of the Paper
Methods
Methods in six open-source spam filters
- SpamAssassin
- Bogofilter
- CRM-114
- DSPAM
- SpamBayes
- SpamProbe
Data
Eight months of one person’s e-mail
- From Aug. 2003 to March 2004
- Stored in the order received
49,086 messages with judgements
- 9,038 (18.4%) ham
- 40,048 (81.6%) spam
Evaluation Measures (1)
Evaluation Measures (2)
Ham/spam tradeoff curve, i.e. ROC curve
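One way to trace a ham/spam tradeoff curve is to sweep a threshold over the filter's spam scores, as in this sketch. It assumes each message has a numeric score where higher means more spam-like, and that both classes are present; the function name and the toy scores are made up for illustration and are not results from the study.

```python
def tradeoff_curve(scores, labels):
    """For each threshold t, classify a message as spam when score >= t and
    return (fraction of ham misclassified, fraction of spam misclassified)."""
    n_ham = labels.count("ham")
    n_spam = labels.count("spam")
    points = []
    # One extra threshold above every score, so nothing is classified as spam
    for t in sorted(set(scores)) + [float("inf")]:
        ham_miss = sum(1 for s, l in zip(scores, labels) if l == "ham" and s >= t)
        spam_miss = sum(1 for s, l in zip(scores, labels) if l == "spam" and s < t)
        points.append((ham_miss / n_ham, spam_miss / n_spam))
    return points

scores = [0.9, 0.8, 0.3, 0.1]
labels = ["spam", "ham", "spam", "ham"]
for point in tradeoff_curve(scores, labels):
    print(point)
```

Each point shows how much ham is lost for a given amount of spam let through, so filters can be compared across thresholds rather than at one operating point.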
Evaluation Measures (3)
Misclassification by Genre
Not all types of ham are equal
- Some are more likely to be misclassified
- Some are more likely to be missed if filtered
- Some are more valuable
Spam can similarly be classified by genre
Conclusion
- Compares several spam detection methods
- Provides analysis of the experimental results
- However, it would be more interesting to compare the performance of different algorithms (e.g. NB vs. SVM).
The End