Ian Stuart, Sung-Hyuk Cha, and Charles Tappert
CSIS Student/Faculty Research Day, May 7, 2004
Spam, spam, spam, …
Fighting spam
Several commercial applications exist
- Server-side: expensive
- Client-side: time-consuming
No approach is 100% effective
- Spammers are aggressive and adaptable
- Best solutions are typically hybrids of different approaches and criteria
Common approaches
Simple filters
Blacklisting: “just say NO” (if you can)
- Reject e-mail from known spammers
Whitelisting: “friends only, please”
- Accept e-mail only from known correspondents
Classifiers: examine each e-mail and decide
- Only a few publications on spam classifiers
Naïve Bayesian classifiers
Used in commercial classifiers
Assumes recognition features are independent
- Max likelihood = product of likelihoods of features
E-mail classifier: examines each word
- Training assigns a probability to each word
- Look up each word/probability in a dictionary
- If the product of the probabilities exceeds a given threshold, it is spam
Challenge: creating the “dictionary”
We compare our Neural Network against two published Naïve Bayesian classifiers
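The word-dictionary idea above can be sketched in a few lines of Python. This is only an illustration, not the classifiers compared in the talk: the toy training messages, the function names, the add-one smoothing, and the zero threshold are all assumptions, and the product of per-word ratios is computed as a sum of logarithms to avoid numerical underflow.

```python
import math
from collections import Counter

def train_dictionary(messages):
    """Build a word -> (P(word|spam), P(word|legit)) dictionary.

    `messages` is a list of (text, is_spam) pairs. Add-one smoothing is an
    assumption made here so that no word ever gets a zero probability.
    """
    spam_counts, legit_counts = Counter(), Counter()
    for text, is_spam in messages:
        (spam_counts if is_spam else legit_counts).update(text.lower().split())
    vocab = set(spam_counts) | set(legit_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    legit_total = sum(legit_counts.values()) + len(vocab)
    return {
        w: ((spam_counts[w] + 1) / spam_total, (legit_counts[w] + 1) / legit_total)
        for w in vocab
    }

def spam_score(text, dictionary):
    """Sum of log likelihood ratios (a product of per-word ratios in log space)."""
    score = 0.0
    for word in text.lower().split():
        if word in dictionary:  # words missing from the dictionary are skipped
            p_spam, p_legit = dictionary[word]
            score += math.log(p_spam / p_legit)
    return score

def is_spam(text, dictionary, threshold=0.0):
    """Flag the message as spam when its score exceeds the threshold."""
    return spam_score(text, dictionary) > threshold

# Hypothetical toy training data, for illustration only.
training = [
    ("refinance now money back guarantee", True),
    ("money back guarantee act now", True),
    ("tennis schedule for saturday", False),
    ("meeting schedule for research day", False),
]
dictionary = train_dictionary(training)
print(is_spam("money back now", dictionary))
print(is_spam("tennis schedule", dictionary))
```

Skipping unknown words is one possible answer to the "new words" question; a real filter would need a deliberate policy for them.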
Naïve Bayesian classifier issues
How many features (words), and which ones?
How is degradation avoided as spammers’ vocabulary changes?
What values are assigned to new words?
What are the thresholds?
How can “sabotage” of the classifier be avoided?
Which one isn’t spam? (subject headers) 5 Be a mighty warrior in bed! vcrhwt ygjztyjjh Money Back Guarantee_HGH kindle life pddez liw mzac v a l i u m - D i a z e p a m used to relieve anxiety Fairfield tennis schedule :Dramatic E,nhancement fo=r .Men = f"fumqid ,Refina'nce now. Don't wait
The more they try to hide, the easier it is to see them
Therefore, we use common spammer patterns (instead of vocabulary) as features for classification
Learn these patterns with a Neural Network
Neural Network features
Total of 17 features
- 6 from the subject header
- 2 from priority and content-type headers
- 9 from the e-mail body
Features from subject header
Number of words with no vowels
Number of words with at least 15 characters
Number of words with non-English characters, special characters such as punctuation, or digits at the beginning or middle of the word
Number of words with all letters in uppercase
Binary feature indicating 3 or more repeated characters
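As a rough illustration of how the subject-header features listed above might be extracted, here is a minimal Python sketch. The function name, the whitespace tokenization, and the exact reading of each rule are assumptions: a word counts as "odd" if it contains a non-ASCII character or any non-letter before its last character, and "repeated characters" is read as the same character three or more times in a row.

```python
import re

VOWELS = set("aeiouAEIOU")

def subject_features(subject):
    """Compute the five listed subject-header features (hypothetical helper)."""
    words = subject.split()

    def letters(word):
        return [c for c in word if c.isalpha()]

    no_vowels = sum(
        1 for w in words
        if letters(w) and not any(c in VOWELS for c in letters(w))
    )
    long_words = sum(1 for w in words if len(w) >= 15)
    # Non-ASCII anywhere, or punctuation/digits at the beginning or middle
    # (trailing punctuation such as "bed!" is deliberately not counted).
    odd_words = sum(
        1 for w in words
        if any(ord(c) > 127 for c in w) or any(not c.isalpha() for c in w[:-1])
    )
    all_caps = sum(
        1 for w in words
        if letters(w) and all(c.isupper() for c in letters(w))
    )
    # Binary feature: the same character three or more times in a row.
    repeated = 1 if re.search(r"(.)\1{2,}", subject) else 0
    return {
        "no_vowels": no_vowels,
        "long_words": long_words,
        "odd_words": odd_words,
        "all_caps": all_caps,
        "repeated": repeated,
    }

print(subject_features(",Refina'nce now. Don't wait"))
print(subject_features("FREE vcrhwt offerrr!!!"))
```

Under this reading an ordinary contraction such as "Don't" also counts as an odd word, which shows how much the feature definitions depend on parsing choices.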
Features from priority and content-type headers
Binary feature indicating whether the priority had been set to any level besides normal or medium
Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”
Features from message body
Proportion of alphabetic words with no vowels and at least 7 characters
Proportion of alphabetic words with at least two of the letters J, K, Q, X, Z
Proportion of alphabetic words at least 15 characters long
Binary feature indicating whether the strings “From:” and “To:” were both present
Number of HTML opening comment tags
Number of hyperlinks (“href=”)
Number of clickable images represented in HTML
Binary feature indicating whether a text color was set to white
Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
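Several of the body features above reduce to simple counts and proportions. The sketch below covers seven of the nine (clickable images and suspicious URLs are left out). The function name, the tokenization (whitespace-separated tokens that are purely alphabetic), the reading of "at least two of the letters" as two or more occurrences, and the regular expression for white text are all assumptions.

```python
import re

RARE_LETTERS = set("jkqxz")
VOWELS = set("aeiou")
# Assumed patterns for "text color set to white": color=white, #fff, #ffffff.
WHITE_COLOR = re.compile(r"color\s*[=:]\s*[\"']?\s*(white|#fff(?:fff)?)\b", re.IGNORECASE)

def body_features(body):
    """Compute a subset of the listed body features (hypothetical helper)."""
    words = [w.lower() for w in body.split() if w.isalpha()]
    total = len(words)

    def proportion(predicate):
        return sum(1 for w in words if predicate(w)) / total if total else 0.0

    return {
        "prop_no_vowels_7": proportion(
            lambda w: len(w) >= 7 and not any(c in VOWELS for c in w)
        ),
        "prop_rare_letters": proportion(
            lambda w: sum(1 for c in w if c in RARE_LETTERS) >= 2
        ),
        "prop_long_15": proportion(lambda w: len(w) >= 15),
        "from_and_to": int("From:" in body and "To:" in body),
        "html_comments": body.count("<!--"),
        "hyperlinks": body.lower().count("href="),
        "white_text": int(bool(WHITE_COLOR.search(body))),
    }

sample = ('Hello qzxjkw friend <a href="http://example.com">x</a> '
          '<!-- note --> <font color="white">hi</font>')
print(body_features(sample))
```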
Neural Network spam classifier
3-layer, feed-forward network (Perceptron)
- 17 input units, variable number of hidden-layer units, 1 output unit
Data: 1,654 e-mails (854 spam, 800 legitimate)
Use half of each (spam/non-spam) for training, the other half for testing
Test with variations of hidden nodes (4 to 14) and epochs (100 to 500)
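The shape of such a network and the half/half data split can be sketched as follows. This shows only an untrained forward pass: the sigmoid activation, the weight range, the random seed, and all function names are assumptions, and training (for example by backpropagation over the stated number of epochs) is omitted.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(n_inputs=17, n_hidden=12, seed=0):
    """Random weights for a 17 -> n_hidden -> 1 network; last weight is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(network, features):
    """Feed one feature vector through the hidden layer to the single output."""
    hidden_w, output_w = network
    hidden_out = [
        sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, features)))
        for w in hidden_w
    ]
    return sigmoid(output_w[-1] + sum(wi * hi for wi, hi in zip(output_w, hidden_out)))

def split_half(items):
    """First half for training, second half for testing."""
    mid = len(items) // 2
    return items[:mid], items[mid:]

network = init_network()
print(forward(network, [0.0] * 17))  # a value strictly between 0 and 1

spam_train, spam_test = split_half(list(range(854)))
legit_train, legit_test = split_half(list(range(800)))
print(len(spam_test), len(legit_test))
```

With 854 spam and 800 legitimate e-mails, this split leaves 427 spam and 400 legitimate messages for testing.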
nSS = number of spam classified as spam
nSL = number of spam classified as legitimate
nLL = number of legitimate classified as legitimate
nLS = number of legitimate classified as spam
Measure of success: precision
Precision: the percentage of e-mail labeled spam (or legitimate) by the classifier that was correctly classified
- Spam precision = nSS / (nSS + nLS)
- Legitimate precision = nLL / (nLL + nSL)
Measure of success: accuracy
Accuracy: the percentage of actual spam (or legitimate) e-mail that was correctly classified
- Spam accuracy = nSS / (nSS + nSL)
- Legitimate accuracy = nLL / (nLL + nLS)
Neural Network results
Best overall results with 12 hidden nodes at 500 epochs
- Spam Precision: 92.45%
- Legitimate Precision: 91.32%
- Spam Accuracy: 91.80%
- Legitimate Accuracy: 92.00%
35 spam e-mails misclassified: 8.20%
32 legitimate e-mails misclassified: 8.00%
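The reported figures can be checked from the confusion counts. The sketch below assumes a test set of 427 spam and 400 legitimate e-mails (half of 854 and 800), reads precision as the fraction of e-mails the classifier labeled as a class that truly belong to it, and reads accuracy as the fraction of actual members of a class that were labeled correctly. The function name is hypothetical.

```python
def precision_and_accuracy(nSS, nSL, nLL, nLS):
    """Precision and accuracy for both classes from the four confusion counts."""
    return {
        "spam_precision": nSS / (nSS + nLS),
        "legit_precision": nLL / (nLL + nSL),
        "spam_accuracy": nSS / (nSS + nSL),
        "legit_accuracy": nLL / (nLL + nLS),
    }

# 35 of 427 test spam and 32 of 400 test legitimate e-mails were misclassified.
metrics = precision_and_accuracy(nSS=427 - 35, nSL=35, nLL=400 - 32, nLS=32)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these assumptions the four values come out to 92.45%, 91.32%, 91.80%, and 92.00%, matching the reported results.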
Misclassified e-mails
Most spam e-mails misclassified as legitimate were short, with few hyperlinks
Most legitimate e-mails misclassified as spam had unusual features for personal e-mail (that is, they were “spam-like” in appearance)
Comparing Neural Network and Naïve Bayesian Classifiers
Accuracy of the NN classifier is comparable to that reported for Naïve Bayesian classifiers
NN classifier required fewer features (17 versus 100 in one study and 500 in another)
NN classifier uses descriptive qualities of words and messages similar to those used by human readers
Blacklisting Experiment
Manually entered the IP addresses of e-mails incorrectly tagged by the NN classifier into a website that sends IP addresses to 173 working spam blacklists and returns the number of hits, http://www.declude.com/junkmail/support/ip4r.htm
- Entered first (original) IP address and, when present, second IP address (e.g., mail server or ISP)
Counted only hit counts greater than one as spam, since single-list hits were taken to be anomalies
Of the 32 legitimate e-mails misclassified by the NN, 53% were identified as spam by the blacklists
Of the 35 spam e-mails misclassified by the NN, 97% were identified as spam by the blacklists
These poor results indicate that the blacklisting strategy, at least for these databases, is inadequate
Conclusions
NN is competitive with the Naïve Bayesian studies despite using a much smaller feature set
Room for refinement of parsing for features
Use of descriptive, more human-like features makes the NN less subject to degradation than Naïve Bayesian classifiers
Conclusions (cont.)
Neural Network approach is useful and accurate, but too many legitimate e-mails are classified as spam (legitimate -> spam)
Should be powerful when used in conjunction with a whitelist to reduce legitimate -> spam (nLS), increasing spam precision and legitimate accuracy
Blacklisting strategy is not very helpful
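The whitelist combination suggested above can be pictured as a simple override in front of the classifier. This is only a sketch of the idea, not something evaluated in the talk; the function name, the address matching by lower-cased sender, and the example addresses are hypothetical.

```python
def classify_with_whitelist(sender, nn_says_spam, whitelist):
    """Accept whitelisted senders outright; otherwise defer to the NN decision."""
    if sender.lower() in whitelist:
        return "legitimate"
    return "spam" if nn_says_spam else "legitimate"

# Hypothetical whitelist of known correspondents.
whitelist = {"colleague@example.edu"}
print(classify_with_whitelist("Colleague@example.edu", True, whitelist))
print(classify_with_whitelist("stranger@example.com", True, whitelist))
print(classify_with_whitelist("stranger@example.com", False, whitelist))
```

Because whitelisted mail never reaches the spam label, this can only lower nLS, which is the error the conclusions single out.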