Yevgeny V. Yorkhov
2016/02/28
The dataset was collected at the Spamgun filtering SaaS service. It covers incoming spam over the last 5 days.
The dataset includes 237 spam TAGS/predictors, as defined by the RSPAMD filtering engine.
You can get all these TAGS here
TAGS/predictors look like this:
Each TAG/predictor has a score that reflects its significance. The higher the score of a TAG/predictor, the more it raises the probability that the message is spam. The scores of all TAGS triggered by a message are summed, and the result is called hits. If hits is higher than the threshold, the message is marked as spam.
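The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the RSPAMD implementation: the scores in `TAG_SCORES`, the `SPAM_THRESHOLD` value, and the function names `compute_hits` and `is_spam` are all hypothetical and chosen only to show the arithmetic.

```python
# Hypothetical scores for illustration only; real RSPAMD scores differ.
TAG_SCORES = {
    "MISSING_SUBJECT": 2.0,
    "MIME_HTML_ONLY": 1.0,
    "BAYES_SPAM": 3.0,
    "BAYES_HAM": -3.0,
}
SPAM_THRESHOLD = 5.0  # assumed threshold, not taken from the dataset


def compute_hits(tags, scores=TAG_SCORES):
    """Sum the scores of all TAGS triggered by one message."""
    # TAGS without a known score contribute nothing.
    return sum(scores.get(tag, 0.0) for tag in tags)


def is_spam(tags, threshold=SPAM_THRESHOLD, scores=TAG_SCORES):
    """Mark the message as spam when hits is strictly higher than the threshold."""
    return compute_hits(tags, scores) > threshold


spammy = ["MISSING_SUBJECT", "MIME_HTML_ONLY", "BAYES_SPAM"]
clean = ["MIME_HTML_ONLY", "BAYES_HAM"]
print(compute_hits(spammy), is_spam(spammy))
print(compute_hits(clean), is_spam(clean))
```

With these assumed scores, the first message sums to 6.0 hits and crosses the threshold, while the second sums to -2.0 and does not.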
The task is to estimate the most significant TAGS/predictors using logistic regression.
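To make the idea concrete, here is a dependency-free Python sketch of fitting a logistic regression on binary TAG indicators. It is not the code used to produce the results in this report: the data is synthetic, the two TAGS (`TAG_A`, `TAG_B`) are hypothetical, and plain batch gradient descent stands in for whatever fitting routine was actually used. It also does not compute p-values, only the coefficient estimates.

```python
import math


def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit an intercept plus one weight per TAG by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)  # weights[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            err = sigmoid(z) - y  # gradient of the log-loss w.r.t. z
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Synthetic toy data, NOT the Spamgun dataset. Each entry is
# ((TAG_A fired, TAG_B fired), spam count, non-spam count).
# TAG_A is informative by construction; TAG_B is not.
cells = [
    ((1, 1), 3, 1),
    ((1, 0), 3, 1),
    ((0, 1), 1, 3),
    ((0, 0), 1, 3),
]
rows, labels = [], []
for tags, n_spam, n_ham in cells:
    rows += [list(tags)] * (n_spam + n_ham)
    labels += [1] * n_spam + [0] * n_ham

intercept, w_tag_a, w_tag_b = fit_logistic(rows, labels)
print(f"intercept={intercept:.3f} TAG_A={w_tag_a:.3f} TAG_B={w_tag_b:.3f}")
```

On this toy data the informative tag ends up with a large positive weight and the uninformative one with a weight close to zero, which is the kind of contrast the analysis looks for across the 237 real TAGS.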
You can use the App to investigate the dataset on your own.
The prediction accuracy is high: 0.9184 (95% CI 0.8983 to 0.9356), well above the no-information rate of 0.7721.
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 178  49
         1  23 632
Accuracy : 0.9184
95% CI : (0.8983, 0.9356)
No Information Rate : 0.7721
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7781
Mcnemar's Test P-Value : 0.003216
Sensitivity : 0.8856
Specificity : 0.9280
Pos Pred Value : 0.7841
Neg Pred Value : 0.9649
Prevalence : 0.2279
Detection Rate : 0.2018
Detection Prevalence : 0.2574
Balanced Accuracy : 0.9068
'Positive' Class : 0
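The headline statistics can be recomputed directly from the four counts in the confusion matrix. The sketch below does that in Python, treating class 0 as the positive class as stated above; the variable names are illustrative and not part of the original analysis.

```python
# Counts copied from the confusion matrix (rows: prediction, columns: reference).
pred0_ref0, pred0_ref1 = 178, 49
pred1_ref0, pred1_ref1 = 23, 632

total = pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1

# Class 0 is the positive class.
accuracy = (pred0_ref0 + pred1_ref1) / total
sensitivity = pred0_ref0 / (pred0_ref0 + pred1_ref0)  # true positive rate
specificity = pred1_ref1 / (pred1_ref1 + pred0_ref1)  # true negative rate
pos_pred_value = pred0_ref0 / (pred0_ref0 + pred0_ref1)
neg_pred_value = pred1_ref1 / (pred1_ref1 + pred1_ref0)
prevalence = (pred0_ref0 + pred1_ref0) / total
balanced_accuracy = (sensitivity + specificity) / 2

# No-information rate: share of the largest reference class.
no_information_rate = max(pred0_ref0 + pred1_ref0, pred0_ref1 + pred1_ref1) / total

# Cohen's kappa: observed agreement corrected for chance agreement.
expected = (
    (pred0_ref0 + pred0_ref1) * (pred0_ref0 + pred1_ref0)
    + (pred1_ref0 + pred1_ref1) * (pred0_ref1 + pred1_ref1)
) / total**2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [
    ("Accuracy", accuracy),
    ("No Information Rate", no_information_rate),
    ("Kappa", kappa),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Pos Pred Value", pos_pred_value),
    ("Neg Pred Value", neg_pred_value),
    ("Prevalence", prevalence),
    ("Balanced Accuracy", balanced_accuracy),
]:
    print(f"{name:<20} {value:.4f}")
```

The recomputed values agree with the reported ones to four decimal places, which confirms that the rows of the matrix are predictions and the columns are reference labels.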
According to the p-values, the most significant TAGS/predictors are as follows:
TAG                     Estimate  Pr(>|z|)
MISSING_SUBJECT           1.8321    0.0022
FORGED_OUTLOOK_TAGS       1.4102    0.0325
FORGED_SENDER             3.5839    0.0055
MIME_HTML_ONLY            2.1592    0.0000
FROM_EXCESS_QP           -2.7457    0.0230
HTML_SHORT_LINK_IMG_1     3.1721    0.0000
HTML_SHORT_LINK_IMG_2    -2.7817    0.0369
BAYES_SPAM                2.9068    0.0000
BAYES_HAM                 2.4133    0.0000
HFILTER_URL_ONLY          4.1828    0.0014
DMARC_POLICY_ALLOW        3.0670    0.0002
DMARC_POLICY_SOFTFAIL     3.0428    0.0000
DCC_CHECK                -0.6858    0.0006
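Logistic regression estimates are on the log-odds scale, so exponentiating an estimate gives the multiplicative change in the odds of the outcome coded 1 when that TAG is present, holding the other TAGS fixed. The short Python sketch below applies this to a subset of the estimates from the table above; the selection of TAGS and the variable names are only for illustration.

```python
import math

# Estimates copied from the table above (a subset, for brevity).
estimates = {
    "MISSING_SUBJECT": 1.8321,
    "FORGED_SENDER": 3.5839,
    "FROM_EXCESS_QP": -2.7457,
    "HFILTER_URL_ONLY": 4.1828,
    "DCC_CHECK": -0.6858,
}

# exp(estimate) is the odds ratio associated with the TAG being present.
odds_ratios = {tag: math.exp(beta) for tag, beta in estimates.items()}

# Print from the largest odds ratio to the smallest.
for tag, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:<20} {ratio:8.3f}")
```

An odds ratio above 1 corresponds to a positive estimate and below 1 to a negative one, so the sign of the estimate shows the direction of the effect and its magnitude shows the strength.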