Spam predictors analysis

Yevgeny V. Yorkhov
2016/02/28

Dataset

The dataset was collected at the Spamgun filtering SaaS service. It covers incoming spam over the last 5 days.

The dataset includes 237 spam TAGS/predictors, as defined by the RSPAMD filtering engine.

You can get all these TAGS here

TAGS/predictors look like this:

MISSING_SUBJECT
FORGED_SENDER
SUSPICIOUS_RECEIPS
FAKE_REPLY
FORGED_OUTLOOK_HTML
etc.

What TAGS/predictors mean

Each TAG/predictor has a score that reflects its significance. The higher a TAG's score, the more it raises the probability that the message is spam. The scores of all TAGS that fire on a message are summed, and the result is called hits. If hits is higher than the threshold, the message is marked as spam.
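The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the scores and the threshold below are hypothetical placeholders, not values taken from the filtering engine or from this dataset.

```python
# Hypothetical per-TAG scores and threshold (assumptions for illustration);
# the real values live in the filtering engine's configuration.
TAG_SCORES = {
    "MISSING_SUBJECT": 2.0,
    "FORGED_SENDER": 0.3,
    "FAKE_REPLY": 1.0,
}
SPAM_THRESHOLD = 2.0


def hits(tags):
    """Sum the scores of all TAGS that fired on a message."""
    return sum(TAG_SCORES.get(tag, 0.0) for tag in tags)


def is_spam(tags, threshold=SPAM_THRESHOLD):
    """Mark the message as spam when its hits exceed the threshold."""
    return hits(tags) > threshold


message_tags = ["MISSING_SUBJECT", "FORGED_SENDER"]
print(hits(message_tags), is_spam(message_tags))
```

A message that triggers only low-scoring TAGS stays under the threshold and is not marked as spam.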

The task

The task is to estimate the most significant TAGS/predictors using logistic regression.
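As a rough illustration of the technique, here is a minimal logistic regression fitted by batch gradient ascent on the log-likelihood, in plain Python. It is not the code used for this report. The tiny dataset is synthetic, the column names TAG_A and TAG_B are hypothetical, and coding each TAG as a 0/1 indicator with spam coded as 1 is an assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=3000):
    """Return [intercept, w1, w2, ...] fitted by batch gradient ascent."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))
            err = label - p  # gradient of the log-likelihood for this row
            grad[0] += err
            for j, xj in enumerate(row, start=1):
                grad[j] += err * xj
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Synthetic data: each row is (TAG_A fired, TAG_B fired), label 1 = spam.
# TAG_A is informative by construction; TAG_B carries no information.
X, y = [], []
for a, b, spam_count, ham_count in [(1, 0, 3, 1), (1, 1, 3, 1),
                                    (0, 0, 1, 3), (0, 1, 1, 3)]:
    X += [(a, b)] * (spam_count + ham_count)
    y += [1] * spam_count + [0] * ham_count

coef = fit_logistic(X, y)
print([round(c, 3) for c in coef])
```

On this toy data the coefficient for TAG_A comes out large and positive while the one for TAG_B stays near zero, which is the kind of contrast the task looks for among the real TAGS.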

You can use the App to investigate the dataset on your own.

Residual analysis of logistic regression fitting

[Figure: residual plot of the logistic regression fit]

Accuracy of the prediction

As the output below shows, the prediction accuracy is high: 0.9184 (95% CI 0.8983 to 0.9356), well above the no-information rate of 0.7721.

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 178  49
         1  23 632

               Accuracy : 0.9184          
                 95% CI : (0.8983, 0.9356)
    No Information Rate : 0.7721          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.7781          
 Mcnemar's Test P-Value : 0.003216        

            Sensitivity : 0.8856          
            Specificity : 0.9280          
         Pos Pred Value : 0.7841          
         Neg Pred Value : 0.9649          
             Prevalence : 0.2279          
         Detection Rate : 0.2018          
   Detection Prevalence : 0.2574          
      Balanced Accuracy : 0.9068          

       'Positive' Class : 0
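The headline statistics can be recomputed by hand from the four cell counts, which is a useful sanity check. The sketch below does that in Python, using only the counts printed in the matrix and treating class 0 as the positive class, as the output states. The variable names are illustrative.

```python
# Counts copied from the confusion matrix: rows are predictions,
# columns are reference labels, and class 0 is the 'Positive' class.
tp, fp = 178, 49   # predicted 0: reference 0, reference 1
fn, tn = 23, 632   # predicted 1: reference 0, reference 1
total = tp + fp + fn + tn

metrics = {
    "accuracy": (tp + tn) / total,
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "pos_pred_value": tp / (tp + fp),
    "neg_pred_value": tn / (tn + fn),
    "prevalence": (tp + fn) / total,
    # Share of the larger reference class: accuracy of always guessing it.
    "no_information_rate": max(tp + fn, fp + tn) / total,
}
metrics["balanced_accuracy"] = (
    metrics["sensitivity"] + metrics["specificity"]
) / 2

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the positive class is 0, sensitivity here measures how well class 0 is recovered, and specificity measures how well class 1 is recovered.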

The most significant TAGS/predictors

According to their p-values, the most significant TAGS/predictors are as follows:

                      Estimate Pr(>|z|)
MISSING_SUBJECT         1.8321   0.0022
FORGED_OUTLOOK_TAGS     1.4102   0.0325
FORGED_SENDER           3.5839   0.0055
MIME_HTML_ONLY          2.1592   0.0000
FROM_EXCESS_QP         -2.7457   0.0230
HTML_SHORT_LINK_IMG_1   3.1721   0.0000
HTML_SHORT_LINK_IMG_2  -2.7817   0.0369
BAYES_SPAM              2.9068   0.0000
BAYES_HAM               2.4133   0.0000
HFILTER_URL_ONLY        4.1828   0.0014
DMARC_POLICY_ALLOW      3.0670   0.0002
DMARC_POLICY_SOFTFAIL   3.0428   0.0000
DCC_CHECK              -0.6858   0.0006
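Logistic regression estimates are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The Python sketch below converts a few of the estimates from the table. It assumes each TAG enters the model as a 0/1 indicator. The report does not state which class is coded as 1, so the direction of each effect should be read relative to that coding.

```python
import math

# Estimates copied from the table above (log-odds scale).
estimates = {
    "MISSING_SUBJECT": 1.8321,
    "FORGED_SENDER": 3.5839,
    "HFILTER_URL_ONLY": 4.1828,
    "DCC_CHECK": -0.6858,
}

# exp(estimate) is the multiplicative change in the odds of the outcome
# coded as 1 when the TAG is present, holding the other TAGS fixed.
odds_ratios = {tag: math.exp(beta) for tag, beta in estimates.items()}

for tag, ratio in odds_ratios.items():
    print(f"{tag}: odds ratio {ratio:.2f}")
```

A positive estimate gives an odds ratio above 1, and a negative estimate such as the one for DCC_CHECK gives an odds ratio below 1.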