E-mail is one of the most common tools for fast communication. Unfortunately, mostly due to advertising and fraud, a growing number of unwanted messages (spam) are sent and received by e-mail. Minimizing the amount of spam that reaches the inbox has become a real challenge for web security developers. Here we build a spam filter, i.e., a binary classifier for e-mail, based on an ensemble of logistic regression classifiers, with the goal of attaining high accuracy. To build and test the spam filter, data was taken from an open-source database. The analysis presented in this document is the first step of a roadmap towards building NLP and classification tools for bioinformatics.
The original dataset contains 4600 entries and 58 features. It was divided into three groups:

- Train group, containing approximately 60% of the dataset entries, with equal numbers of spam and non-spam entries
- Validation group, containing approximately 20% of the dataset entries, again with equal numbers of spam and non-spam entries
- Test group, containing approximately 20% of the dataset entries
The first group was used to train the classifiers, the second to validate them (choosing thresholds and models), and the last to evaluate the models, notably in terms of their generalizability.
The first step was pre-processing the database. The spam dataset was split into three parts: a train set, a validation set, and a test set. The train and validation sets were made balanced, i.e., they contain the same number of spam and non-spam instances.
We started this pre-processing step by sampling the dataset: 60% of the instances were assigned to the train set, 20% to the validation set, and 20% to the test set.
library(caret)  # provides nearZeroVar(), used in the PCA section below

# Shuffle the row indices and split them 60/20/20
ind = c(1:nrow(spam))
newOrder = sample(ind)
trainSet = sort(newOrder[c(1:(0.6*length(newOrder)))])
valSet = sort(newOrder[c((0.6*length(newOrder)+1):(0.8*length(newOrder)))])
testSet = sort(newOrder[c((0.8*length(newOrder)+1):length(newOrder))])
trainData = spam[trainSet,]
row.names(trainData) = NULL
valData = spam[valSet,]
row.names(valData) = NULL
testData = spam[testSet,]
row.names(testData) = NULL
At this point the dataset is split, but the train and validation sets are not balanced between spam and non-spam instances. As a next step we balance them by removing the surplus of the majority class. The instances removed from the train set are split between the test and validation sets, and the surplus later removed from the validation set is moved to the test set.
# Balance the train set: find the surplus of the majority class
if(length(which(trainData$X1 == 0)) > length(which(trainData$X1 == 1))){ # more non-spam than spam
  toRemove = length(which(trainData$X1 == 0)) - length(which(trainData$X1 == 1))
  nSpam = which(trainData$X1 == 0)
  ySpam = which(trainData$X1 == 1)
  keepData = append(nSpam[c((toRemove+1):length(nSpam))], ySpam)
  addIndexes = nSpam[c(1:toRemove)]
} else if(length(which(trainData$X1 == 1)) > length(which(trainData$X1 == 0))){ # more spam than non-spam
  toRemove = length(which(trainData$X1 == 1)) - length(which(trainData$X1 == 0))
  ySpam = which(trainData$X1 == 1)
  nSpam = which(trainData$X1 == 0)
  keepData = append(ySpam[c((toRemove+1):length(ySpam))], nSpam)
  addIndexes = ySpam[c(1:toRemove)]
}
# Keep the surplus rows before dropping them from the train set
removedRows = trainData[addIndexes,]
trainData = trainData[keepData,]
# Split the surplus between the test and validation sets
addIndTest = c(1:(0.5*length(addIndexes)-1))
addIndVal = c((0.5*length(addIndexes)):length(addIndexes))
testData = rbind(testData, removedRows[addIndTest,])
valData = rbind(valData, removedRows[addIndVal,])
# Balance the validation set in the same way; its surplus goes to the test set
if(length(which(valData$X1 == 0)) > length(which(valData$X1 == 1))){ # more non-spam than spam
  ttoRemove = length(which(valData$X1 == 0)) - length(which(valData$X1 == 1))
  nnSpam = which(valData$X1 == 0)
  yySpam = which(valData$X1 == 1)
  kkeepData = append(nnSpam[c((ttoRemove+1):length(nnSpam))], yySpam)
  aaddIndexes = nnSpam[c(1:ttoRemove)]
} else if(length(which(valData$X1 == 1)) > length(which(valData$X1 == 0))){ # more spam than non-spam
  ttoRemove = length(which(valData$X1 == 1)) - length(which(valData$X1 == 0))
  yySpam = which(valData$X1 == 1)
  nnSpam = which(valData$X1 == 0)
  kkeepData = append(yySpam[c((ttoRemove+1):length(yySpam))], nnSpam)
  aaddIndexes = yySpam[c(1:ttoRemove)]
}
# Keep the surplus rows before dropping them, then move them to the test set
rremovedRows = valData[aaddIndexes,]
valData = valData[kkeepData,]
testData = rbind(testData, rremovedRows)
# Separate the labels (X1) from the features
testClass = testData$X1
testData$X1 = NULL
row.names(testData) = NULL
valClass = valData$X1
valData$X1 = NULL
row.names(valData) = NULL
Finally, the train, validation, and test sets are ready to be used.
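As a quick sanity check (not part of the original analysis), one can confirm the class balance of each split at this point:

table(trainData$X1)  # equal counts of 0 (non-spam) and 1 (spam) expected
table(valClass)      # validation labels, also balanced
table(testClass)     # the test set absorbs the surplus, so it may be imbalanced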
To create a classifier we use three different methods, all based on logistic regression: a simple LR classifier (the "lazy approach"), LR after PCA-based feature reduction, and a bagged ensemble of LR classifiers.
The first approach is the one demanding the least effort, which we shall call the "lazy approach". We create a logistic regression (LR) classifier using the train set. Then we look for the threshold that maximizes Youden's J statistic (and accuracy) for this classifier when applied to the validation set. Finally, we evaluate its performance on the test set. First of all we generate the LR model using the glm R function and a prediction using the predict R function.
# Build the formula "X1 ~ <all other columns>" (the label is the last column)
columnNames = colnames(trainData)
formulaString = paste(paste(columnNames[length(columnNames)],"~"), paste(columnNames[c(1:(length(columnNames)-1))], sep=" ", collapse="+"))
formulaFormula = formula(formulaString)
# Fit the logistic regression and predict spam probabilities on the validation set
spamLR = glm(formulaFormula, data=trainData, family=binomial)
validation = predict(spamLR, newdata=valData, type="response")
To visualize the result, one just needs to run View(data.frame(round(validation), row.names=NULL)). At this point we already have the model, which was created using the train dataset.
Although we could simply round the probability output by the logistic regression (i.e., use a threshold of 0.5) to evaluate the classifier's performance, this threshold may not be the best for this classifier. We need to infer the best threshold to discern between spam and non-spam in the predictions. We start with a threshold of 0.0005 and increase it by 0.0005, up to 0.995, to find the best threshold. To assess which threshold is best, tests are performed over the validation set.
# Candidate thresholds and a table to hold the metrics for each
threshold = seq(0.0005,0.995,0.0005)
resultsTable = matrix(NA , length(threshold) , 5)
colnames(resultsTable) = c("Threshold","Sensitivity","Specificity","Accuracy","Youden's J")
To evaluate the classifier's performance, we find the threshold that best balances sensitivity and specificity, maximizing the Youden index, using a ROC curve.
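For reference, Youden's J statistic for a given threshold combines both quantities into a single value to maximize:

$$J = \text{sensitivity} + \text{specificity} - 1 = \frac{TP}{TP + FN} + \frac{TN}{TN + FP} - 1$$

The loop below computes J for every candidate threshold.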
resultsTableIndex = 0
for(th in threshold){
  TP = 0
  FN = 0
  TN = 0
  FP = 0
  for(kk in c(1:length(validation))){
    if(validation[kk] >= th & valClass[kk] == 1){
      TP = TP + 1            # spam predicted as spam
    }else if(validation[kk] < th & valClass[kk] == 0){
      TN = TN + 1            # non-spam predicted as non-spam
    }else if(validation[kk] < th & valClass[kk] == 1){
      FN = FN + 1            # spam missed
    }else if(validation[kk] >= th & valClass[kk] == 0){
      FP = FP + 1            # non-spam flagged as spam
    }else{
      print('An error has occurred')
    }
  }
  #threshold
  resultsTableIndex = resultsTableIndex+1
  resultsTable[resultsTableIndex,1] = th
  #sensitivity
  resultsTable[resultsTableIndex,2] = TP/(TP+FN)
  #specificity
  resultsTable[resultsTableIndex,3] = TN/(TN+FP)
  #accuracy
  resultsTable[resultsTableIndex,4] = (TP + TN)/(TP + FP + TN + FN)
  #Youden
  resultsTable[resultsTableIndex,5] = resultsTable[resultsTableIndex,2] + resultsTable[resultsTableIndex,3] - 1
}
Finally, we tested the classifier on the test set.
# Pick the threshold with the highest Youden's J (the first one in case of ties)
resultsTableInd = which(resultsTable[,5] == max(resultsTable[,5]))
if(length(resultsTableInd) > 1){ resultsTableInd = resultsTableInd[1]}
classifierThreshold = resultsTable[resultsTableInd,1]
classify = predict(spamLR, newdata=testData, type="response")
classifyResults = data.frame(classify)
TPc = 0
FNc = 0
TNc = 0
FPc = 0
for(kk in c(1:length(classify))){
  if(classifyResults[kk,1] >= classifierThreshold & testClass[kk] == 1){
    TPc = TPc + 1
  }else if(classifyResults[kk,1] < classifierThreshold & testClass[kk] == 0){
    TNc = TNc + 1
  }else if(classifyResults[kk,1] < classifierThreshold & testClass[kk] == 1){
    FNc = FNc + 1
  }else if(classifyResults[kk,1] >= classifierThreshold & testClass[kk] == 0){
    FPc = FPc + 1
  }else{
    print('An error has occurred')
  }
}
LRThreshold = resultsTable[resultsTableInd,1]
LRSens = (TPc/(TPc+FNc))
LRSpec = (TNc/(TNc+FPc))
LRAcc = ((TPc + TNc)/(TPc + FPc + TNc + FNc))
This lazy approach is the zero-point reference for this study; the goal now is to beat its accuracy. To do this, the next step was to perform a Principal Component Analysis (PCA) with feature reduction, assessing how many components to retain in order to obtain optimal results.
To improve the performance of this classification, we performed PCA and feature reduction.
First of all we prepared our dataset for PCA by removing features whose variance was near zero. Then we applied PCA to the train set and extracted the rotation matrix of the transformation. PCA lets us remove redundancy among features in the dataset and improve the classifier's performance.
# Separate the labels and drop near-zero-variance features before PCA
trainDataPCA = trainData
trainClass = trainData$X1
trainDataPCA$X1 = NULL
nzv = nearZeroVar(trainDataPCA, saveMetrics = TRUE)
trainPCA_nzv = trainDataPCA[c(rownames(nzv[nzv$percentUnique > 0.1,]))]
trainPCA = prcomp(trainPCA_nzv)
rotationMatrix = trainPCA$rotation
# Project the test and validation sets with the train-set PCA
# (predict() restricts to the kept columns and applies the same centering)
testDataPCA = data.frame(predict(trainPCA, newdata = testData[, colnames(trainPCA_nzv)]))
valDataPCA = data.frame(predict(trainPCA, newdata = valData[, colnames(trainPCA_nzv)]))
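Before scanning the reductions exhaustively, it can help to inspect how much variance each principal component captures. A short optional check on the trainPCA object (not part of the original pipeline):

# Proportion of variance explained by each principal component
varExplained = trainPCA$sdev^2 / sum(trainPCA$sdev^2)
cumVar = cumsum(varExplained)
# e.g., the number of components needed to retain 95% of the variance
which(cumVar >= 0.95)[1]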
Then we studied which reduction offered the best accuracy, in order to apply that reduction to the test set and evaluate the classifier's performance. For each reduction we generated a new LR and applied it over the validation set, choosing the best threshold, as was done in the lazy approach.
nFeatures = c(57:2)
resumeTestTable = matrix(NA , length(nFeatures) , 5)
colnames(resumeTestTable) = c("Number of Features","Threshold","Sensitivity","Specificity","Accuracy")
## REDUCTION #####
source("./functions/LR2.r")  # helper: trains an LR, tunes the threshold, returns 0/1 labels
k = 0
for(ii in nFeatures){
  k = k+1
  # Keep the first ii principal components and retrain
  dfComponents = predict(trainPCA, newdata=trainPCA_nzv)[,1:ii]
  spamTrain = cbind(dfComponents,trainClass)
  spamTrainPCA = data.frame(spamTrain)
  spamTrainPCA$trainClass = factor(spamTrainPCA$trainClass)
  newPrediction = LR2(spamTrainPCA,valDataPCA,valClass,valDataPCA)
  TPv=0
  FPv=0
  TNv=0
  FNv=0
  for(kk in c(1:length(newPrediction))){
    if(newPrediction[kk] == 1 & valClass[kk] == 1){
      TPv = TPv + 1
    }else if(newPrediction[kk] == 0 & valClass[kk] == 0){
      TNv = TNv + 1
    }else if(newPrediction[kk] == 0 & valClass[kk] == 1){
      FNv = FNv + 1
    }else if(newPrediction[kk] == 1 & valClass[kk] == 0){
      FPv = FPv + 1
    }else{
      print('An error has occurred')
    }
  }
  resumeTestTable[k,1] = ii
  resumeTestTable[k,2] = 0   # the threshold is chosen inside LR2 and not recorded here
  resumeTestTable[k,3] = (TPv/(TPv+FNv))
  resumeTestTable[k,4] = (TNv/(TNv+FPv))
  resumeTestTable[k,5] = ((TPv + TNv)/(TPv + FPv + TNv + FNv))
}
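The helper LR2, sourced from ./functions/LR2.r, is used here and in the bagging section, but its body is not included in this report. Judging from its call sites, it trains an LR on its first argument (whose last column is the class factor), tunes the decision threshold on the validation data, and returns 0/1 labels for its fourth argument. A minimal sketch consistent with that assumed signature (not the original implementation) could be:

# Assumed behaviour of LR2: fit on trainDF, tune the threshold on (valDF, valClass),
# then label targetDF with the Youden-optimal threshold.
LR2 = function(trainDF, valDF, valClass, targetDF){
  classCol = colnames(trainDF)[ncol(trainDF)]
  model = glm(formula(paste(classCol, "~ .")), data = trainDF, family = binomial)
  valProb = predict(model, newdata = valDF, type = "response")
  thresholds = seq(0.0005, 0.995, 0.0005)
  youden = sapply(thresholds, function(th){
    sens = sum(valProb >= th & valClass == 1) / sum(valClass == 1)
    spec = sum(valProb <  th & valClass == 0) / sum(valClass == 0)
    sens + spec - 1
  })
  bestTh = thresholds[which.max(youden)]
  as.numeric(predict(model, newdata = targetDF, type = "response") >= bestTh)
}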
At this point we found the best reduction and applied it to the test set to evaluate the performance of this PCA-based classifier.
# Pick the reduction with the highest validation accuracy (first one, in case of ties)
index = which(resumeTestTable[,5] == max(resumeTestTable[,5]))[1]
featureReducing = resumeTestTable[index,1]
During these fits, glm.fit warned that fitted probabilities numerically 0 or 1 occurred; this is a known symptom of (quasi-)complete separation in logistic regression and does not prevent the fit.
Another strategy to improve the performance of the classifier is bootstrap aggregating, also called bagging. This method combines sampling with replacement, several classifiers, and a voting system. It also reduces variance and helps to avoid overfitting.
This method allows us to generate an ensemble of logistic regression classifiers and attain high accuracy.
Using the previously generated train dataset, we generated 2000 bootstrap resamples of its instances (sampling with replacement, each the same size as the train set).
# One column of test-set predictions per bootstrap classifier
source("./functions/LR2.r")
bagginsData = matrix(NA , dim(testData)[1] , 2000)
for( bb in c(1:2000)){
  # Resample the train set with replacement
  newTrainIndexes = sample(dim(trainData)[1],dim(trainData)[1],replace = TRUE)
  newTrain = trainData[newTrainIndexes,]
  newTrain$X1 = factor(newTrain$X1)
  # Train an LR, tune its threshold on the validation set, label the test set
  bg = LR2(newTrain,valData,valClass,testData)
  bagginsData[,bb] = bg
}
Above, bagginsData is a matrix that contains one column of predicted labels per bootstrap classifier. We then combine these 2000 predictions through a majority-voting system.
# Majority vote across the 2000 bootstrap classifiers
CF = matrix(NA,dim(bagginsData)[1],1)
for(m in c(1:dim(bagginsData)[1])){
  countFalse = sum(bagginsData[m,] == 0)
  countTrue = sum(bagginsData[m,] == 1)
  if(countFalse>countTrue){
    CF[m] = 0
  }else{
    CF[m] = 1
  }
}
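To obtain the bagging figures reported below, the majority-vote labels in CF can be scored against testClass. A minimal sketch (the TPb/FNb/TNb/FPb names are ours, following the convention used above):

# Confusion counts for the bagged classifier on the test set
TPb = sum(CF == 1 & testClass == 1); FNb = sum(CF == 0 & testClass == 1)
TNb = sum(CF == 0 & testClass == 0); FPb = sum(CF == 1 & testClass == 0)
bagSens = TPb / (TPb + FNb)
bagSpec = TNb / (TNb + FPb)
bagAcc  = (TPb + TNb) / length(testClass)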
We could also apply PCA within each bootstrap classifier. We have not done so in this study, but it is quite likely that performance would increase; a sketch of the combination follows.
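As an illustration, the following hypothetical sketch (not run in this study) projects each bootstrap sample with its own PCA before fitting the LR. Here nComponents is an assumed value that would need tuning as in the PCA section, and LR2 is assumed to behave as sketched earlier:

# Hypothetical bagging + PCA combination (not evaluated in this study)
nComponents = 30   # assumed reduction; would need tuning
for(bb in c(1:2000)){
  idx = sample(nrow(trainData), nrow(trainData), replace = TRUE)
  boot = trainData[idx,]
  bootClass = factor(boot$X1); boot$X1 = NULL
  bootPCA = prcomp(boot)   # PCA fitted on this bootstrap sample only
  bootTrain = data.frame(predict(bootPCA, newdata = boot)[,1:nComponents], bootClass)
  valProj  = data.frame(predict(bootPCA, newdata = valData)[,1:nComponents])
  testProj = data.frame(predict(bootPCA, newdata = testData)[,1:nComponents])
  bagginsData[,bb] = LR2(bootTrain, valProj, valClass, testProj)
}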
To find the best Youden's J value for each LR, we used a ROC curve similar to the one below.
# ROC curve over all candidate thresholds; the red point marks the Youden-optimal one
plot(1-resultsTable[,3],resultsTable[,2], type="l", xlab ="1-Specificity", ylab="Sensitivity")
points(1-resultsTable[resultsTableInd,3], resultsTable[resultsTableInd,2], col ="red")
| Classifier | Sensitivity | Specificity | Accuracy  |
|------------|-------------|-------------|-----------|
| LR (lazy)  | 0.9511278   | 0.898917    | 0.9297337 |
| PCA        | 0.9385965   | 0.9043321   | 0.9245562 |
| Bagging    | 0.9385965   | 0.9205776   | 0.931213  |
In spam filters, obtaining high classification accuracy is paramount. In this work we tested different classifiers: a lazy approach (simple logistic regression), feature reduction with PCA, and bagging. The results of the different classifiers are similar: bagging beat the zero-point performance of the simple LR in accuracy and specificity, while the PCA-based classifier improved specificity but fell slightly short of the simple LR in accuracy. The main conclusion of this study is the benefit of using bootstrap methods: they improve the classifier's performance and reduce the risk of overfitting. In the future, we plan to combine bootstrapping and feature reduction and apply these classifiers, together with NLP techniques, to bioinformatics data.