Create a classifier for spam messages based on the Spambase dataset from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Spambase. Find the classification threshold that maximizes Youden's J on the ROC curve.
The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography…
Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
For more info, visit the dataset web page: https://archive.ics.uci.edu/ml/datasets/Spambase
The classifier is a logistic regression model fitted with R's glm function.
First step: set the working directory and read the CSV file
setwd("~/Documents/RSTUDIO/spambase");
spam = read.csv("spambase.csv");
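The rest of this walkthrough refers to columns such as X0.64 and X1, so it may help to see where those names come from. The sketch below assumes that spambase.csv is the raw UCI data file, which has no header row. The three-column text is a made-up stand-in, not the real file.

```r
# Hypothetical three-column stand-in for a headerless data file.
raw_text <- "0,0.64,1\n0.21,0.28,1\n0.06,0,0"

# Default header = TRUE: the first record becomes the column names
# (made syntactically valid by prefixing "X") and is lost as data.
with_header <- read.csv(text = raw_text)
print(names(with_header))
print(nrow(with_header))

# header = FALSE keeps every record and names the columns V1, V2, ...
no_header <- read.csv(text = raw_text, header = FALSE)
print(names(no_header))
print(nrow(no_header))
```

Reading the real file with header = FALSE would keep the first record, but the columns would then be called V1 to V58 and the model formula would have to use those names instead.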
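Next, fit the logistic regression. X1 is the target class (the dependent variable) and the remaining columns are the features (the independent variables). The column names look like data values because spambase.csv has no header row: read.csv used the first record as the header, so that record is not among the rows being modeled.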
spamLR = glm(X1 ~ X0 + X0.64 + X0.64.1 + X0.1 + X0.32 + X0.2 + X0.3 +
X0.4 + X0.5 + X0.6 + X0.7 + X0.64.2 + X0.8 + X0.9 +
X0.10 + X0.32.1 + X0.11 + X1.29 + X1.93 + X0.12 +
X0.96 + X0.13 + X0.14 + X0.15 + X0.16 + X0.17 +
X0.18 + X0.19 + X0.20 + X0.21 + X0.22 + X0.23 +
X0.24 + X0.25 + X0.26 + X0.27 + X0.28 + X0.29 +
X0.30 + X0.31 + X0.33 + X0.34 + X0.35 + X0.36 +
X0.37 + X0.38 + X0.39 + X0.40 + X0.41 + X0.42 +
X0.43 + X0.778 + X0.44 + X0.45 + X3.756 + X61 + X278,
data=spam, family=binomial);
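The formula above lists all 57 feature columns by hand. As a sketch only, the same kind of model can be written with the "." shorthand, which stands for every column of the data frame other than the response. The data here are synthetic (toy, f1, f2 and y are made-up names), so this fit does not reproduce the Spambase model.

```r
set.seed(1)
n <- 200

# Synthetic features and a 0/1 label that depends on them with noise.
toy <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.8 * toy$f1 - 0.5 * toy$f2))

# "y ~ ." means: model y using all other columns of toy.
toyLR <- glm(y ~ ., data = toy, family = binomial)

print(names(coef(toyLR)))
# fitted.values holds the predicted probability of class 1 for each row.
print(range(toyLR$fitted.values))
```

For the spam data the equivalent call would be glm(X1 ~ ., data=spam, family=binomial), assuming spam contains only the target and the feature columns.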
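With the logistic regression fitted, the next step is to find the threshold that maximizes Youden's J.
A for loop over candidate thresholds is a simple way to do this. At each iteration it builds the confusion matrix for the current threshold and computes sensitivity, specificity, accuracy and Youden's J (sensitivity + specificity - 1).
The thresholds run from 0 to 1 in steps of 0.01.
# one row per threshold: threshold, sensitivity, specificity, accuracy, Youden's J
thresholds = seq(0, 1, by = 0.01);
resultsTable = matrix(NA, nrow = length(thresholds), ncol = 5,
dimnames = list(NULL, c("Threshold", "Sensitivity", "Specificity", "Accuracy", "Youden's J")));
j=0
for(i in thresholds){
j=j+1
# fixed factor levels keep the table 2x2 even when all predictions fall on one side
predicted = factor(spamLR$fitted.values>i, levels=c(FALSE, TRUE));
confMatrix = table(true=spam$X1, test=predicted);
# cells in column-major order: [1]=TN, [2]=FN, [3]=FP, [4]=TP
resultsTable[j,1] = i;
#true positive rate - sensitivity
resultsTable[j,2] = confMatrix[4]/(confMatrix[2]+confMatrix[4]);
#true negative rate - specificity
resultsTable[j,3] = confMatrix[1]/(confMatrix[1]+confMatrix[3]);
#accuracy
resultsTable[j,4] = (confMatrix[4] + confMatrix[1])/(sum(confMatrix));
#Youden
resultsTable[j,5] = resultsTable[j,2] + resultsTable[j,3] - 1;
}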
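To make the quantities computed in the loop concrete, here is a small sketch that evaluates a single threshold from explicit counts instead of indexing into the confusion matrix. The function name youden_at and the four-observation example are made up for illustration.

```r
# y: 0/1 labels, p: predicted probabilities, threshold: cut-off on p
youden_at <- function(y, p, threshold) {
  predicted <- p > threshold
  tp <- sum(predicted & y == 1)    # spam flagged as spam
  fn <- sum(!predicted & y == 1)   # spam missed
  tn <- sum(!predicted & y == 0)   # non-spam left alone
  fp <- sum(predicted & y == 0)    # non-spam flagged as spam
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  c(threshold = threshold, sensitivity = sensitivity,
    specificity = specificity, youden = sensitivity + specificity - 1)
}

y <- c(0, 0, 1, 1)
p <- c(0.10, 0.40, 0.35, 0.80)

# One row per threshold, same idea as the loop but on four observations.
scan <- t(sapply(c(0.30, 0.37, 0.50), function(th) youden_at(y, p, th)))
print(scan)
```

At threshold 0.37 the second and third observations are both misclassified, so sensitivity and specificity are 0.5 each and Youden's J drops to 0, which is the value of a classifier that is no better than chance.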
The result of this operation is the following table:
## Threshold Sensitivity Specificity Accuracy Youden's J
## [1,] 0.00 1.0000000 0.0000000 0.3939130 0.0000000
## [2,] 0.01 0.9977925 0.5412482 0.7210870 0.5390407
## [3,] 0.02 0.9961369 0.5864419 0.7478261 0.5825788
## [4,] 0.03 0.9950331 0.6269727 0.7719565 0.6220059
## [5,] 0.04 0.9933775 0.6474175 0.7836957 0.6407950
## [6,] 0.05 0.9928256 0.6646341 0.7939130 0.6574598
## [7,] 0.06 0.9911700 0.6800574 0.8026087 0.6712274
## [8,] 0.07 0.9900662 0.6933286 0.8102174 0.6833948
## [9,] 0.08 0.9900662 0.7083931 0.8193478 0.6984593
## [10,] 0.09 0.9884106 0.7137733 0.8219565 0.7021839
## [11,] 0.10 0.9884106 0.7220230 0.8269565 0.7104336
## [12,] 0.11 0.9878587 0.7284792 0.8306522 0.7163379
## [13,] 0.12 0.9873068 0.7395983 0.8371739 0.7269051
## [14,] 0.13 0.9850993 0.7485653 0.8417391 0.7336646
## [15,] 0.14 0.9850993 0.7553802 0.8458696 0.7404795
## [16,] 0.15 0.9845475 0.7618364 0.8495652 0.7463839
## [17,] 0.16 0.9828918 0.7725968 0.8554348 0.7554887
## [18,] 0.17 0.9823400 0.7804878 0.8600000 0.7628278
## [19,] 0.18 0.9806843 0.8142037 0.8797826 0.7948881
## [20,] 0.19 0.9773731 0.8346485 0.8908696 0.8120216
## [21,] 0.20 0.9757174 0.8443329 0.8960870 0.8200503
## [22,] 0.21 0.9735099 0.8525825 0.9002174 0.8260924
## [23,] 0.22 0.9707506 0.8590387 0.9030435 0.8297893
## [24,] 0.23 0.9668874 0.8654950 0.9054348 0.8323824
## [25,] 0.24 0.9635762 0.8741033 0.9093478 0.8376795
## [26,] 0.25 0.9602649 0.8787661 0.9108696 0.8390310
## [27,] 0.26 0.9552980 0.8834290 0.9117391 0.8387270
## [28,] 0.27 0.9530905 0.8877331 0.9134783 0.8408236
## [29,] 0.28 0.9497792 0.8923960 0.9150000 0.8421752
## [30,] 0.29 0.9475717 0.8977762 0.9173913 0.8453479
## [31,] 0.30 0.9453642 0.9024390 0.9193478 0.8478033
## [32,] 0.31 0.9431567 0.9067432 0.9210870 0.8498999
## [33,] 0.32 0.9420530 0.9114060 0.9234783 0.8534590
## [34,] 0.33 0.9387417 0.9139168 0.9236957 0.8526585
## [35,] 0.34 0.9365342 0.9175036 0.9250000 0.8540378
## [36,] 0.35 0.9348786 0.9239598 0.9282609 0.8588384
## [37,] 0.36 0.9310155 0.9264706 0.9282609 0.8574860
## [38,] 0.37 0.9293598 0.9286227 0.9289130 0.8579825
## [39,] 0.38 0.9271523 0.9314921 0.9297826 0.8586444
## [40,] 0.39 0.9254967 0.9336442 0.9304348 0.8591409
## [41,] 0.40 0.9221854 0.9372310 0.9313043 0.8594164
## [42,] 0.41 0.9205298 0.9397418 0.9321739 0.8602716
## [43,] 0.42 0.9177704 0.9422525 0.9326087 0.8600229
## [44,] 0.43 0.9155629 0.9458393 0.9339130 0.8614022
## [45,] 0.44 0.9116998 0.9472740 0.9332609 0.8589738
## [46,] 0.45 0.9067329 0.9479914 0.9317391 0.8547243
## [47,] 0.46 0.9028698 0.9487088 0.9306522 0.8515785
## [48,] 0.47 0.9001104 0.9512195 0.9310870 0.8513299
## [49,] 0.48 0.8984547 0.9526542 0.9313043 0.8511090
## [50,] 0.49 0.8940397 0.9548063 0.9308696 0.8488460
## [51,] 0.50 0.8923841 0.9562410 0.9310870 0.8486251
## [52,] 0.51 0.8907285 0.9583931 0.9317391 0.8491216
## [53,] 0.52 0.8863135 0.9594692 0.9306522 0.8457826
## [54,] 0.53 0.8852097 0.9609039 0.9310870 0.8461136
## [55,] 0.54 0.8830022 0.9623386 0.9310870 0.8453408
## [56,] 0.55 0.8791391 0.9626973 0.9297826 0.8418363
## [57,] 0.56 0.8785872 0.9637733 0.9302174 0.8423605
## [58,] 0.57 0.8763797 0.9641320 0.9295652 0.8405117
## [59,] 0.58 0.8714128 0.9641320 0.9276087 0.8355448
## [60,] 0.59 0.8686534 0.9644907 0.9267391 0.8331441
## [61,] 0.60 0.8609272 0.9655667 0.9243478 0.8264939
## [62,] 0.61 0.8581678 0.9666428 0.9239130 0.8248105
## [63,] 0.62 0.8548565 0.9684362 0.9236957 0.8232927
## [64,] 0.63 0.8526490 0.9684362 0.9228261 0.8210852
## [65,] 0.64 0.8465784 0.9695122 0.9210870 0.8160906
## [66,] 0.65 0.8421634 0.9705882 0.9200000 0.8127516
## [67,] 0.66 0.8399558 0.9713056 0.9195652 0.8112614
## [68,] 0.67 0.8355408 0.9713056 0.9178261 0.8068464
## [69,] 0.68 0.8327815 0.9720230 0.9171739 0.8048044
## [70,] 0.69 0.8283664 0.9720230 0.9154348 0.8003894
## [71,] 0.70 0.8245033 0.9723816 0.9141304 0.7968849
## [72,] 0.71 0.8167770 0.9723816 0.9110870 0.7891587
## [73,] 0.72 0.8151214 0.9730990 0.9108696 0.7882204
## [74,] 0.73 0.8079470 0.9745337 0.9089130 0.7824807
## [75,] 0.74 0.8013245 0.9748924 0.9065217 0.7762169
## [76,] 0.75 0.7941501 0.9756098 0.9041304 0.7697599
## [77,] 0.76 0.7858720 0.9756098 0.9008696 0.7614817
## [78,] 0.77 0.7814570 0.9770445 0.9000000 0.7585014
## [79,] 0.78 0.7764901 0.9774032 0.8982609 0.7538932
## [80,] 0.79 0.7704194 0.9784792 0.8965217 0.7488986
## [81,] 0.80 0.7632450 0.9791966 0.8941304 0.7424416
## [82,] 0.81 0.7577263 0.9791966 0.8919565 0.7369228
## [83,] 0.82 0.7500000 0.9809900 0.8900000 0.7309900
## [84,] 0.83 0.7345475 0.9824247 0.8847826 0.7169721
## [85,] 0.84 0.7201987 0.9835007 0.8797826 0.7036994
## [86,] 0.85 0.7113687 0.9842181 0.8767391 0.6955867
## [87,] 0.86 0.6898455 0.9845768 0.8684783 0.6744222
## [88,] 0.87 0.6738411 0.9845768 0.8621739 0.6584178
## [89,] 0.88 0.6605960 0.9852941 0.8573913 0.6458901
## [90,] 0.89 0.6484547 0.9870875 0.8536957 0.6355423
## [91,] 0.90 0.6368653 0.9881636 0.8497826 0.6250289
## [92,] 0.91 0.6274834 0.9895983 0.8469565 0.6170817
## [93,] 0.92 0.6181015 0.9895983 0.8432609 0.6076998
## [94,] 0.93 0.6037528 0.9906743 0.8382609 0.5944271
## [95,] 0.94 0.5849890 0.9913917 0.8313043 0.5763806
## [96,] 0.95 0.5535320 0.9921090 0.8193478 0.5456410
## [97,] 0.96 0.5275938 0.9924677 0.8093478 0.5200615
## [98,] 0.97 0.4928256 0.9924677 0.7956522 0.4852933
## [99,] 0.98 0.4514349 0.9942611 0.7804348 0.4456960
## [100,] 0.99 0.3962472 0.9956958 0.7595652 0.3919431
## [101,] 1.00 0.0000000 1.0000000 0.6060870 0.0000000
Using these values, we can now plot the ROC curve, with 1 - specificity on the X axis and sensitivity on the Y axis.
## [1] "Max value of Youden's J: 0.861402225241575"
## [1] "Th: 0.43"
According to this ROC curve, the maximum Youden's J is 0.8614022, reached at a threshold of 0.43. The best trade-off between sensitivity and specificity is therefore: Sensitivity: 0.9155629, Specificity: 0.9458393 (that is, 1 - Specificity = 0.0541607).
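The code that drew the curve and printed the two lines above is not shown, so the following is only one plausible way to do it. It uses a handful of rows copied from the results table rather than the full resultsTable, and the name rocPoints is made up.

```r
# A few rows copied from the results table above.
rocPoints <- data.frame(
  threshold   = c(0.00, 0.25, 0.43, 0.50, 0.75, 1.00),
  sensitivity = c(1.0000000, 0.9602649, 0.9155629, 0.8923841, 0.7941501, 0.0000000),
  specificity = c(0.0000000, 0.8787661, 0.9458393, 0.9562410, 0.9756098, 1.0000000),
  youden      = c(0.0000000, 0.8390310, 0.8614022, 0.8486251, 0.7697599, 0.0000000)
)

# ROC curve: false positive rate on X, true positive rate on Y.
plot(1 - rocPoints$specificity, rocPoints$sensitivity, type = "b",
     xlab = "1 - Specificity", ylab = "Sensitivity", main = "ROC curve")
abline(0, 1, lty = 2)  # chance diagonal, where Youden's J is 0

# Row with the largest Youden's J.
best <- which.max(rocPoints$youden)
print(paste("Max value of Youden's J:", rocPoints$youden[best]))
print(paste("Th:", rocPoints$threshold[best]))
```

Youden's J at a point of the curve is its vertical distance from the dashed diagonal, so the selected threshold is the point that lies farthest above it.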
Disorganized
library('caret')
spam2 = read.csv("spambase.csv");
Remove zero and close to zero variance
nzv = nearZeroVar(spam2, saveMetrics = TRUE)
range(nzv$percentUnique);
how many have no variation at all
print(sum(nzv$zeroVar))
print(paste('Column count before cutoff:',ncol(spam2)))
how many have more than 0.1 percent unique values (these are the columns we keep)
dim(nzv[nzv$percentUnique > 0.1,])
remove zero & near-zero variance from original data set
spam2_nzv = spam2[c(rownames(nzv[nzv$percentUnique > 0.1,])) ]
print(paste('Column count after cutoff:',ncol(spam2_nzv)))
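The cutoff above relies on the percentUnique column. As we understand the caret documentation, it is the number of distinct values in a column as a percentage of the number of rows. The base R sketch below recomputes that quantity on made-up data (demo and percent_unique are hypothetical names) to show a side effect worth checking: a 0/1 column has 2 distinct values, which in 4600 rows is about 0.04 percent, so a 0.1 percent cutoff would drop it.

```r
# Percentage of distinct values per column, computed in base R.
percent_unique <- function(df) {
  sapply(df, function(column) 100 * length(unique(column)) / nrow(df))
}

set.seed(1)
demo <- data.frame(
  feature = runif(4600),                    # continuous, many distinct values
  label = rep(c(0, 1), length.out = 4600)   # binary, like a class label
)

pu <- percent_unique(demo)
print(pu)

# Same rule as the cutoff above: keep columns with more than 0.1 percent unique values.
kept <- names(pu)[pu > 0.1]
print(kept)
```

If the same happens with the real data, spam2_nzv no longer contains X1, which is fine as input to PCA but means the label has to be taken from spam2 when fitting a model afterwards.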
Run model on original data set TO DO
compute the area under the ROC curve for the initial dataset
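This step is still to do, so the following is only a sketch of one way to get the area: the trapezoidal rule applied to the ROC points. The function name auc_trapezoid is made up, and a dedicated package could be used instead.

```r
# Area under a ROC curve given false positive rates (x) and
# true positive rates (y), using the trapezoidal rule.
auc_trapezoid <- function(fpr, tpr) {
  ord <- order(fpr, tpr)   # sort points from left to right
  fpr <- fpr[ord]
  tpr <- tpr[ord]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

print(auc_trapezoid(c(0, 0, 1), c(0, 1, 1)))   # perfect classifier
print(auc_trapezoid(c(0, 1), c(0, 1)))         # chance diagonal
```

With the table computed earlier, the call would be auc_trapezoid(1 - resultsTable[,3], resultsTable[,2]). A step of 0.01 between thresholds makes this an approximation of the exact area.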
PCA
pmatrix = scale(spam2_nzv)
princ = prcomp(pmatrix)
nFeatures = 10;
dfComponents = predict(princ, newdata=pmatrix)[,1:nFeatures]
fit a logistic regression on these new data and compute the area under the ROC curve
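That last step is not implemented yet. As a sketch under stated assumptions, this is how a logistic regression on the first few principal components could look. The data are synthetic and the names features, label, components and pcaLR are made up, so none of this reflects results on Spambase.

```r
set.seed(2)
n <- 300

# Synthetic feature matrix and a 0/1 label that depends on two of the features.
features <- matrix(rnorm(n * 6), nrow = n,
                   dimnames = list(NULL, paste0("f", 1:6)))
label <- rbinom(n, 1, plogis(features[, 1] - features[, 2]))

# PCA on the scaled features only; the label is kept out of it.
pca <- prcomp(scale(features))
k <- 3
components <- as.data.frame(pca$x[, 1:k])   # scores on PC1..PCk
components$label <- label

# Logistic regression on the component scores.
pcaLR <- glm(label ~ ., data = components, family = binomial)
print(names(coef(pcaLR)))
```

For the spam data, dfComponents would play the role of the component scores and the label would be spam2$X1. The fitted values of that model could then go through the same threshold loop and area calculation as the original model.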