Spam Dataset Analysis

Goal

Build a classifier for spam messages using the Spambase dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Spambase. Then find the decision threshold on the ROC curve that maximizes Youden's J statistic.

About Spambase

The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography…

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

For more info, visit the dataset web page: https://archive.ics.uci.edu/ml/datasets/Spambase

How does it work

The classifier is a logistic regression model fitted with R's glm function.

First step: set the working directory and read the CSV file.

setwd("~/Documents/RSTUDIO/spambase");
spam = read.csv("spambase.csv");
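
The column names used in the rest of this analysis (X0, X0.64, ..., X1) look like a side effect of how the file is read. The sketch below assumes that spambase.csv has no header row, which the generated names suggest but the text does not state. The three-row sample is made up for illustration.

```r
# Hypothetical 3-row sample shaped like a headerless CSV file.
csv_text <- "0,0.64,0.64,0,1\n0.21,0.28,0.5,0,1\n0,0,0,0.1,0\n"

# Default header = TRUE: the first record is consumed as column names,
# so one observation is lost and the names are derived from its values.
with_header <- read.csv(text = csv_text)
print(names(with_header))
print(nrow(with_header))

# header = FALSE keeps every record and names the columns V1, V2, ...
no_header <- read.csv(text = csv_text, header = FALSE)
print(names(no_header))
print(nrow(no_header))
```

If the first record should be kept as data, header = FALSE is one option, but the formula below would then need the V1, V2, ... names instead.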

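Next, we fit the logistic regression. Here X1 is the target class (the dependent variable) and the remaining columns are the features (the independent variables). The names X0, X0.64, ... appear to come from read.csv treating the first data record as a header row, since the file seems to have no header of its own.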

spamLR = glm(X1 ~ X0 + X0.64 + X0.64.1 + X0.1 + X0.32 + X0.2 + X0.3 + 
               X0.4 + X0.5 + X0.6 + X0.7 + X0.64.2 + X0.8 + X0.9 + 
               X0.10 + X0.32.1 + X0.11 + X1.29 + X1.93 + X0.12 + 
               X0.96 + X0.13 + X0.14 + X0.15 + X0.16 + X0.17 + 
               X0.18 + X0.19 + X0.20 + X0.21 + X0.22 + X0.23 + 
               X0.24 + X0.25 + X0.26 + X0.27 + X0.28 + X0.29 + 
               X0.30 + X0.31 + X0.33 + X0.34 + X0.35 + X0.36 + 
               X0.37 + X0.38 + X0.39 + X0.40 + X0.41 + X0.42 + 
               X0.43 + X0.778 + X0.44 + X0.45 + X3.756 + X61 + X278,
             data=spam, family=binomial);
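
When every column other than the response is used as a feature, the formula can be shortened with the "." placeholder. The sketch below checks this on a small synthetic data frame (toy, a and b are made-up names, not part of the Spambase data).

```r
set.seed(1)
# Synthetic stand-in for the spam data frame: two features and a 0/1 label.
toy <- data.frame(a = rnorm(200), b = rnorm(200))
toy$X1 <- rbinom(200, 1, plogis(0.8 * toy$a - 0.5 * toy$b))

# Explicit formula, in the same style as the call above.
fit_explicit <- glm(X1 ~ a + b, data = toy, family = binomial)

# "." expands to every column of `data` except the response.
fit_dot <- glm(X1 ~ ., data = toy, family = binomial)

# Both fits should produce the same coefficients.
print(all.equal(coef(fit_explicit), coef(fit_dot)))
```

Under that assumption, glm(X1 ~ ., data = spam, family = binomial) would be equivalent to the long formula, as long as spam contains no columns that should be left out.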

With the logistic regression fitted, the next step is to find the threshold that maximizes Youden's J.

A for loop is a simple way to scan candidate thresholds and compare the trade-off between sensitivity and specificity. Each iteration computes the sensitivity, specificity, accuracy and Youden's J for one threshold.

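The thresholds range from 0 to 1 in steps of 0.01, which gives 101 rows.

thresholds = seq(0, 1, by = 0.01)
resultsTable = matrix(NA, nrow = length(thresholds), ncol = 5,
                      dimnames = list(NULL, c("Threshold", "Sensitivity", "Specificity",
                                              "Accuracy", "Youden's J")))
j = 0
for (i in thresholds) {
  j = j + 1
  # keep both levels so the table stays 2x2 even when all predictions agree
  predicted = factor(spamLR$fitted.values > i, levels = c(FALSE, TRUE))
  # column-major cells: [1] TN, [2] FN, [3] FP, [4] TP
  confMatrix = table(true = spam$X1, test = predicted)
  resultsTable[j, 1] = i
  # true positive rate - sensitivity
  resultsTable[j, 2] = confMatrix[4] / (confMatrix[2] + confMatrix[4])
  # true negative rate - specificity
  resultsTable[j, 3] = confMatrix[1] / (confMatrix[1] + confMatrix[3])
  # accuracy
  resultsTable[j, 4] = (confMatrix[4] + confMatrix[1]) / sum(confMatrix)
  # Youden
  resultsTable[j, 5] = resultsTable[j, 2] + resultsTable[j, 3] - 1
}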
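
The same per-threshold metrics can also be computed without indexing into a confusion table. This is a minimal sketch on six hand-made observations. threshold_metrics is a hypothetical helper, not something the original analysis defines.

```r
# Hypothetical helper: metrics for one threshold, given 0/1 labels and
# predicted probabilities.
threshold_metrics <- function(labels, probs, threshold) {
  predicted <- probs > threshold
  tp <- sum(predicted & labels == 1)   # true positives
  fn <- sum(!predicted & labels == 1)  # false negatives
  tn <- sum(!predicted & labels == 0)  # true negatives
  fp <- sum(predicted & labels == 0)   # false positives
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  c(Threshold = threshold,
    Sensitivity = sensitivity,
    Specificity = specificity,
    Accuracy = (tp + tn) / length(labels),
    Youden = sensitivity + specificity - 1)
}

# Made-up data: three positives and three negatives.
labels <- c(1, 1, 1, 0, 0, 0)
probs <- c(0.9, 0.8, 0.3, 0.4, 0.2, 0.1)

# One threshold: 2 of 3 positives are caught, no negatives are flagged.
m <- threshold_metrics(labels, probs, 0.5)
print(m)

# A grid of thresholds, one row per threshold.
grid <- t(sapply(seq(0, 1, by = 0.25),
                 function(th) threshold_metrics(labels, probs, th)))
print(grid)
```

Counting the four cells directly avoids the case where table() returns a single column because every prediction falls on the same side of the threshold.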

The result of this operation is the following table:

##        Threshold Sensitivity Specificity  Accuracy Youden's J
##   [1,]      0.00   1.0000000   0.0000000 0.3939130  0.0000000
##   [2,]      0.01   0.9977925   0.5412482 0.7210870  0.5390407
##   [3,]      0.02   0.9961369   0.5864419 0.7478261  0.5825788
##   [4,]      0.03   0.9950331   0.6269727 0.7719565  0.6220059
##   [5,]      0.04   0.9933775   0.6474175 0.7836957  0.6407950
##   [6,]      0.05   0.9928256   0.6646341 0.7939130  0.6574598
##   [7,]      0.06   0.9911700   0.6800574 0.8026087  0.6712274
##   [8,]      0.07   0.9900662   0.6933286 0.8102174  0.6833948
##   [9,]      0.08   0.9900662   0.7083931 0.8193478  0.6984593
##  [10,]      0.09   0.9884106   0.7137733 0.8219565  0.7021839
##  [11,]      0.10   0.9884106   0.7220230 0.8269565  0.7104336
##  [12,]      0.11   0.9878587   0.7284792 0.8306522  0.7163379
##  [13,]      0.12   0.9873068   0.7395983 0.8371739  0.7269051
##  [14,]      0.13   0.9850993   0.7485653 0.8417391  0.7336646
##  [15,]      0.14   0.9850993   0.7553802 0.8458696  0.7404795
##  [16,]      0.15   0.9845475   0.7618364 0.8495652  0.7463839
##  [17,]      0.16   0.9828918   0.7725968 0.8554348  0.7554887
##  [18,]      0.17   0.9823400   0.7804878 0.8600000  0.7628278
##  [19,]      0.18   0.9806843   0.8142037 0.8797826  0.7948881
##  [20,]      0.19   0.9773731   0.8346485 0.8908696  0.8120216
##  [21,]      0.20   0.9757174   0.8443329 0.8960870  0.8200503
##  [22,]      0.21   0.9735099   0.8525825 0.9002174  0.8260924
##  [23,]      0.22   0.9707506   0.8590387 0.9030435  0.8297893
##  [24,]      0.23   0.9668874   0.8654950 0.9054348  0.8323824
##  [25,]      0.24   0.9635762   0.8741033 0.9093478  0.8376795
##  [26,]      0.25   0.9602649   0.8787661 0.9108696  0.8390310
##  [27,]      0.26   0.9552980   0.8834290 0.9117391  0.8387270
##  [28,]      0.27   0.9530905   0.8877331 0.9134783  0.8408236
##  [29,]      0.28   0.9497792   0.8923960 0.9150000  0.8421752
##  [30,]      0.29   0.9475717   0.8977762 0.9173913  0.8453479
##  [31,]      0.30   0.9453642   0.9024390 0.9193478  0.8478033
##  [32,]      0.31   0.9431567   0.9067432 0.9210870  0.8498999
##  [33,]      0.32   0.9420530   0.9114060 0.9234783  0.8534590
##  [34,]      0.33   0.9387417   0.9139168 0.9236957  0.8526585
##  [35,]      0.34   0.9365342   0.9175036 0.9250000  0.8540378
##  [36,]      0.35   0.9348786   0.9239598 0.9282609  0.8588384
##  [37,]      0.36   0.9310155   0.9264706 0.9282609  0.8574860
##  [38,]      0.37   0.9293598   0.9286227 0.9289130  0.8579825
##  [39,]      0.38   0.9271523   0.9314921 0.9297826  0.8586444
##  [40,]      0.39   0.9254967   0.9336442 0.9304348  0.8591409
##  [41,]      0.40   0.9221854   0.9372310 0.9313043  0.8594164
##  [42,]      0.41   0.9205298   0.9397418 0.9321739  0.8602716
##  [43,]      0.42   0.9177704   0.9422525 0.9326087  0.8600229
##  [44,]      0.43   0.9155629   0.9458393 0.9339130  0.8614022
##  [45,]      0.44   0.9116998   0.9472740 0.9332609  0.8589738
##  [46,]      0.45   0.9067329   0.9479914 0.9317391  0.8547243
##  [47,]      0.46   0.9028698   0.9487088 0.9306522  0.8515785
##  [48,]      0.47   0.9001104   0.9512195 0.9310870  0.8513299
##  [49,]      0.48   0.8984547   0.9526542 0.9313043  0.8511090
##  [50,]      0.49   0.8940397   0.9548063 0.9308696  0.8488460
##  [51,]      0.50   0.8923841   0.9562410 0.9310870  0.8486251
##  [52,]      0.51   0.8907285   0.9583931 0.9317391  0.8491216
##  [53,]      0.52   0.8863135   0.9594692 0.9306522  0.8457826
##  [54,]      0.53   0.8852097   0.9609039 0.9310870  0.8461136
##  [55,]      0.54   0.8830022   0.9623386 0.9310870  0.8453408
##  [56,]      0.55   0.8791391   0.9626973 0.9297826  0.8418363
##  [57,]      0.56   0.8785872   0.9637733 0.9302174  0.8423605
##  [58,]      0.57   0.8763797   0.9641320 0.9295652  0.8405117
##  [59,]      0.58   0.8714128   0.9641320 0.9276087  0.8355448
##  [60,]      0.59   0.8686534   0.9644907 0.9267391  0.8331441
##  [61,]      0.60   0.8609272   0.9655667 0.9243478  0.8264939
##  [62,]      0.61   0.8581678   0.9666428 0.9239130  0.8248105
##  [63,]      0.62   0.8548565   0.9684362 0.9236957  0.8232927
##  [64,]      0.63   0.8526490   0.9684362 0.9228261  0.8210852
##  [65,]      0.64   0.8465784   0.9695122 0.9210870  0.8160906
##  [66,]      0.65   0.8421634   0.9705882 0.9200000  0.8127516
##  [67,]      0.66   0.8399558   0.9713056 0.9195652  0.8112614
##  [68,]      0.67   0.8355408   0.9713056 0.9178261  0.8068464
##  [69,]      0.68   0.8327815   0.9720230 0.9171739  0.8048044
##  [70,]      0.69   0.8283664   0.9720230 0.9154348  0.8003894
##  [71,]      0.70   0.8245033   0.9723816 0.9141304  0.7968849
##  [72,]      0.71   0.8167770   0.9723816 0.9110870  0.7891587
##  [73,]      0.72   0.8151214   0.9730990 0.9108696  0.7882204
##  [74,]      0.73   0.8079470   0.9745337 0.9089130  0.7824807
##  [75,]      0.74   0.8013245   0.9748924 0.9065217  0.7762169
##  [76,]      0.75   0.7941501   0.9756098 0.9041304  0.7697599
##  [77,]      0.76   0.7858720   0.9756098 0.9008696  0.7614817
##  [78,]      0.77   0.7814570   0.9770445 0.9000000  0.7585014
##  [79,]      0.78   0.7764901   0.9774032 0.8982609  0.7538932
##  [80,]      0.79   0.7704194   0.9784792 0.8965217  0.7488986
##  [81,]      0.80   0.7632450   0.9791966 0.8941304  0.7424416
##  [82,]      0.81   0.7577263   0.9791966 0.8919565  0.7369228
##  [83,]      0.82   0.7500000   0.9809900 0.8900000  0.7309900
##  [84,]      0.83   0.7345475   0.9824247 0.8847826  0.7169721
##  [85,]      0.84   0.7201987   0.9835007 0.8797826  0.7036994
##  [86,]      0.85   0.7113687   0.9842181 0.8767391  0.6955867
##  [87,]      0.86   0.6898455   0.9845768 0.8684783  0.6744222
##  [88,]      0.87   0.6738411   0.9845768 0.8621739  0.6584178
##  [89,]      0.88   0.6605960   0.9852941 0.8573913  0.6458901
##  [90,]      0.89   0.6484547   0.9870875 0.8536957  0.6355423
##  [91,]      0.90   0.6368653   0.9881636 0.8497826  0.6250289
##  [92,]      0.91   0.6274834   0.9895983 0.8469565  0.6170817
##  [93,]      0.92   0.6181015   0.9895983 0.8432609  0.6076998
##  [94,]      0.93   0.6037528   0.9906743 0.8382609  0.5944271
##  [95,]      0.94   0.5849890   0.9913917 0.8313043  0.5763806
##  [96,]      0.95   0.5535320   0.9921090 0.8193478  0.5456410
##  [97,]      0.96   0.5275938   0.9924677 0.8093478  0.5200615
##  [98,]      0.97   0.4928256   0.9924677 0.7956522  0.4852933
##  [99,]      0.98   0.4514349   0.9942611 0.7804348  0.4456960
## [100,]      0.99   0.3962472   0.9956958 0.7595652  0.3919431
## [101,]      1.00   0.0000000   1.0000000 0.6060870  0.0000000

Using these values, we can now plot the ROC curve, with 1 - specificity on the X axis and sensitivity on the Y axis.

## [1] "Max value of Youden's J:  0.861402225241575"

## [1] "Th:  0.43"

According to this ROC curve, the maximum Youden's J is 0.8614022, reached at threshold 0.43. The corresponding operating point is Sensitivity: 0.9155629, Specificity: 0.9458393 (that is, 1 - Specificity = 0.0541607).
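
The code that printed the maximum and drew the curve is not shown here. The following is a sketch of one way to do it, using only five rows copied from the results table so that it runs on its own. With the full 101-row table the curve would be smoother.

```r
# A few rows copied from the results table (threshold, sensitivity, specificity).
resultsTable <- rbind(
  c(0.00, 1.0000000, 0.0000000),
  c(0.25, 0.9602649, 0.8787661),
  c(0.43, 0.9155629, 0.9458393),
  c(0.75, 0.7941501, 0.9756098),
  c(1.00, 0.0000000, 1.0000000)
)
colnames(resultsTable) <- c("Threshold", "Sensitivity", "Specificity")

# Youden's J = sensitivity + specificity - 1.
youden <- resultsTable[, "Sensitivity"] + resultsTable[, "Specificity"] - 1

# Row with the largest J.
best <- which.max(youden)
print(paste("Max value of Youden's J: ", youden[best]))
print(paste("Th: ", resultsTable[best, "Threshold"]))

# ROC curve: 1 - specificity on X, sensitivity on Y.
plot(1 - resultsTable[, "Specificity"], resultsTable[, "Sensitivity"],
     type = "l", xlab = "1 - Specificity", ylab = "Sensitivity",
     main = "ROC curve")
abline(0, 1, lty = 2)  # chance line
points(1 - resultsTable[best, "Specificity"],
       resultsTable[best, "Sensitivity"], pch = 19)  # best threshold
```

Youden's J is the vertical distance between the ROC curve and the chance line, so the marked point is the one farthest above the diagonal.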

PCA

This section is still disorganized and is a work in progress.

library('caret')
spam2 = read.csv("spambase.csv");

# Remove zero and near-zero variance columns
nzv = nearZeroVar(spam2, saveMetrics = TRUE)
range(nzv$percentUnique);

# how many columns have no variation at all
print(sum(nzv$zeroVar))
print(paste('Column count before cutoff:',ncol(spam2)))

# how many columns have more than 0.1 percent unique values (the ones to keep)
dim(nzv[nzv$percentUnique > 0.1,])

# keep only those columns, dropping the zero & near-zero variance ones from the original data set
spam2_nzv = spam2[c(rownames(nzv[nzv$percentUnique > 0.1,])) ]
print(paste('Column count after cutoff:',ncol(spam2_nzv)))
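
To make the cutoff concrete, the percentUnique metric (the percentage of distinct values in a column) can be reproduced in base R. The sketch below uses a made-up data frame with three hypothetical columns and the same 0.1 cutoff. It is an illustration of the filter, not a replacement for nearZeroVar.

```r
n <- 4600
# Hypothetical columns: one constant, one binary, one with all values distinct.
toy <- data.frame(constant = rep(1, n),
                  binary = rep(c(0, 1), n / 2),
                  continuous = seq_len(n) / 7)

# Percentage of distinct values per column.
percent_unique <- sapply(toy, function(col) 100 * length(unique(col)) / length(col))
print(percent_unique)

# Keep only the columns above the 0.1 percent cutoff.
kept <- toy[, percent_unique > 0.1, drop = FALSE]
print(names(kept))
```

With this many rows a binary column has well under 0.1 percent unique values, so a 0/1 class label would be filtered out together with the constant column. It is worth checking names(spam2_nzv) to see whether X1 survived the cutoff.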

TO DO: run the model on the original data set.

TO DO: compute the area under the ROC curve for the initial dataset.
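
One possible way to approach the area-under-the-curve item is the trapezoidal rule applied to the (1 - specificity, sensitivity) points. roc_auc below is a hypothetical helper, shown on two tiny hand-checkable curves rather than on the spam results.

```r
# Hypothetical helper: area under the ROC curve by the trapezoidal rule.
# Takes sensitivity and specificity vectors, one entry per threshold.
roc_auc <- function(sensitivity, specificity) {
  fpr <- 1 - specificity
  ord <- order(fpr, sensitivity)  # sort points from left to right
  x <- fpr[ord]
  y <- sensitivity[ord]
  # sum of trapezoid areas between consecutive points
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Perfect classifier: the curve goes (0,0) -> (0,1) -> (1,1).
auc_perfect <- roc_auc(sensitivity = c(0, 1, 1), specificity = c(1, 1, 0))
print(auc_perfect)

# Chance-level classifier: the curve lies on the diagonal.
auc_chance <- roc_auc(sensitivity = c(0, 0.5, 1), specificity = c(1, 0.5, 0))
print(auc_chance)
```

Assuming a results matrix with Sensitivity and Specificity columns, the call would be roc_auc(resultsTable[, "Sensitivity"], resultsTable[, "Specificity"]). The result is an approximation whose quality depends on how fine the threshold grid is.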

PCA

pmatrix = scale(spam2_nzv)

princ = prcomp(pmatrix)

nFeatures = 10;

dfComponents = predict(princ, newdata=pmatrix)[,1:nFeatures]

TO DO: fit a logistic regression on these new data and compute the area under the ROC curve.
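
A sketch of how that last step might look, on synthetic data rather than the spam set. The features and label objects are made up, and the AUC uses the rank-based (Mann-Whitney) formula, which is one of several equivalent ways to compute it. In the real analysis the label would have to come from the original data frame if the variance cutoff removed it from the feature matrix.

```r
set.seed(42)
n <- 300
# Synthetic features and a 0/1 label that depends on the first two of them.
features <- matrix(rnorm(n * 6), ncol = 6)
label <- rbinom(n, 1, plogis(features[, 1] - features[, 2]))

# Same PCA steps as above: scale, prcomp, keep the first components.
pmatrix <- scale(features)
princ <- prcomp(pmatrix)
nFeatures <- 3
dfComponents <- as.data.frame(predict(princ, newdata = pmatrix)[, 1:nFeatures])
dfComponents$label <- label

# Logistic regression on the principal components.
pcaLR <- glm(label ~ ., data = dfComponents, family = binomial)
probs <- fitted(pcaLR)

# Rank-based AUC: the probability that a random positive case gets a
# higher fitted value than a random negative case.
n_pos <- sum(label == 1)
n_neg <- sum(label == 0)
auc <- (sum(rank(probs)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

This AUC is computed on the same data used for fitting, so it is an optimistic estimate. A held-out split would give a fairer comparison between the PCA model and the full model.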

J.P. Oliveira

March 14, 2015
