Introduction

Goal

E-mail is one of the most common tools for fast communication. Unfortunately, mostly due to advertising and fraud, an ever greater number of unwanted messages (spam) are sent and received by e-mail. Minimizing the amount of spam that reaches the inbox has become a real challenge for web security developers. Here we build a spam filter, i.e., a binary classifier for e-mail, based on an ensemble of logistic regression classifiers, the goal being to attain high accuracy. To build and test the spam filter, data was taken from an open-source database. The analysis presented in this document is the first step of a roadmap towards building NLP and classification tools for bioinformatics.

Method

The original dataset contains 4600 entries and 58 columns (57 features plus the class label). It was divided into three groups:

  • Train group, containing approximately 60% of the dataset entries, with the same relative amount of spam and non-spam entries

  • Validation group, containing approximately 20% of the dataset entries, again with the same relative amount of spam and non-spam entries

  • Test group, containing approximately 20% of the dataset entries

The first group was used to train the classifier, the second to perform cross-validation, and the last to evaluate the models, notably in terms of their generalizability.


Methodology

Database Pre-Processing

The first step was the pre-processing of the database. The spam dataset was split into three parts: a train set, a validation set and a test set. The train and validation sets were made balanced, i.e., they contain the same number of spam and non-spam instances.

We started this pre-processing step by sampling the dataset: 60% of the instances were assigned to the train set, 20% to the validation set and 20% to the test set.
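
The report itself never shows how the spam data frame is created. A minimal, hypothetical loading step (the file name and column layout are assumptions, not the authors' code) could be:

# Hypothetical loading step, not shown in the original analysis: assumes a
# local CSV with 57 numeric feature columns and a final 0/1 class column X1
spam = read.csv("spambase.csv")
dim(spam)   # expect about 4600 rows and 58 columns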

library(caret)   # provides nearZeroVar(), used in the PCA step below
# Shuffle the row indices and split them 60/20/20
ind = c(1:nrow(spam));

newOrder = sample(ind);
trainSet = sort(newOrder[c(1:floor(0.6*length(newOrder)))]);
valSet = sort(newOrder[c((floor(0.6*length(newOrder))+1):floor(0.8*length(newOrder)))]);
testSet = sort(newOrder[c((floor(0.8*length(newOrder))+1):length(newOrder))]);

trainData = spam[trainSet,]
row.names(trainData) = NULL
valData = spam[valSet,]
row.names(valData) = NULL
testData = spam[testSet,]
row.names(testData) = NULL

At this point the dataset was split, but the train and validation sets are not yet balanced between spam and non-spam instances. As a next step we balance them by removing the excess instances of the majority class: the instances removed from the train set are split between the validation and test sets, and those removed from the validation set go to the test set.

# Balance the train set: remove the excess rows of the majority class
nSpamIdx = which(trainData$X1 == 0);   # non-spam rows
ySpamIdx = which(trainData$X1 == 1);   # spam rows
if(length(nSpamIdx) > length(ySpamIdx)){            # more non-spam than spam
 toRemove = length(nSpamIdx) - length(ySpamIdx);
 addIndexes = nSpamIdx[c(1:toRemove)];              # rows to move out
 keepData = append(nSpamIdx[c((toRemove+1):length(nSpamIdx))], ySpamIdx);
 } else if(length(ySpamIdx) > length(nSpamIdx)){    # more spam than non-spam
  toRemove = length(ySpamIdx) - length(nSpamIdx);
  addIndexes = ySpamIdx[c(1:toRemove)];
  keepData = append(ySpamIdx[c((toRemove+1):length(ySpamIdx))], nSpamIdx);
  }

removedData = trainData[addIndexes,];  # save the removed rows before re-subsetting
trainData = trainData[keepData,];

# Split the removed train rows between the test and validation sets
half = floor(0.5*length(addIndexes));
testData = rbind(testData, removedData[c(1:half),]);
valData = rbind(valData, removedData[c((half+1):length(addIndexes)),]);

# Balance the validation set in the same way; its excess rows all go to the test set
nSpamIdxV = which(valData$X1 == 0);
ySpamIdxV = which(valData$X1 == 1);
if(length(nSpamIdxV) > length(ySpamIdxV)){          # more non-spam than spam
 toRemoveV = length(nSpamIdxV) - length(ySpamIdxV);
 addIndexesV = nSpamIdxV[c(1:toRemoveV)];
 keepDataV = append(nSpamIdxV[c((toRemoveV+1):length(nSpamIdxV))], ySpamIdxV);
 } else if(length(ySpamIdxV) > length(nSpamIdxV)){  # more spam than non-spam
  toRemoveV = length(ySpamIdxV) - length(nSpamIdxV);
  addIndexesV = ySpamIdxV[c(1:toRemoveV)];
  keepDataV = append(ySpamIdxV[c((toRemoveV+1):length(ySpamIdxV))], nSpamIdxV);
  }

removedDataV = valData[addIndexesV,];  # save before re-subsetting
valData = valData[keepDataV,];
testData = rbind(testData, removedDataV)


# Separate the class labels (column X1) from the features
testClass = testData$X1
testData$X1 = NULL
row.names(testData) = NULL
valClass = valData$X1
valData$X1 = NULL;
row.names(valData) = NULL

Finally, the train, validation and test sets are ready to be used.
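
One quick way to confirm that the balancing worked, for instance:

# Check the class balance and the resulting set sizes
table(trainData$X1)                               # spam/non-spam counts, train set
table(valClass)                                   # spam/non-spam counts, validation set
c(nrow(trainData), nrow(valData), nrow(testData))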

To create a classifier we will use three different methods, all based on logistic regression.

Logistic Regression - method #1

The first approach is the one demanding the least effort, which we shall call the “lazy approach”. We create a logistic regression (LR) classifier using the train set. We then look for the threshold that maximizes Youden’s J (and accuracy) for this classifier when applied to the validation set. Finally, we evaluate its performance on the test set. First of all we fit the LR with R’s glm function and generate predictions with the predict function.

# Build the formula "<last column> ~ <all other columns>", i.e., X1 ~ all features
columnNames = colnames(trainData)
formulaString = paste(paste(columnNames[(length(columnNames))],"~"),(paste(columnNames[c(1:(length(columnNames)-1))],sep=" ",collapse="+")))
formulaFormula = formula(formulaString);

# Fit the LR on the train set and predict spam probabilities for the validation set
spamLR = glm(formulaFormula, data=trainData, family=binomial)
validation = predict(spamLR, newdata=valData, type="response");

To visualize the result, one just needs to run View(data.frame(round(validation),row.names=NULL)). At this point we already have the model, created using the train dataset.

Although we could simply round the probability output by the logistic regression to evaluate the classifier's performance, that threshold (0.5) may not be the best one for this classifier. We need to infer the best threshold for discerning spam from non-spam in the predicted dataset. We start with a threshold of 0.0005 and increase it in steps of 0.0005, up to 0.995, to find the best threshold. To assess which threshold is best, the tests are performed on the cross-validation set.

# Candidate thresholds and a table to hold the metrics computed for each one
threshold = seq(0.0005,0.995,0.0005);

resultsTable = matrix(NA , length(threshold) , 5);
colnames(resultsTable) = c("Threshold","Sensitivity","Specificity","Accuracy","Youden's J");
resultsTableIndex = 0;   # row counter, incremented inside the loop below

To evaluate the classifier’s performance we look for the threshold with the best trade-off between sensitivity and specificity, i.e., the one that maximizes Youden’s J (J = sensitivity + specificity - 1), using a ROC curve.

for(th in threshold){
 TP = 0;
 FN = 0;
 TN = 0;
 FP = 0;
 for(kk in c(1:length(validation))){
  if(validation[kk] >= th & valClass[kk] == 1){
   TP = TP + 1;   # spam correctly flagged
  }else if(validation[kk] < th & valClass[kk] == 0){
   TN = TN + 1;   # non-spam correctly passed
  }else if(validation[kk] < th & valClass[kk] == 1){
   FN = FN + 1;   # spam missed
  }else if(validation[kk] >= th & valClass[kk] == 0){
   FP = FP + 1;   # non-spam wrongly flagged
  }else{
   print('An error has occurred')
  }
 }
 #threshold
 resultsTableIndex = resultsTableIndex+1;
 resultsTable[resultsTableIndex,1] = th;
 #sensitivity
 resultsTable[resultsTableIndex,2] = TP/(TP+FN);
 #specificity
 resultsTable[resultsTableIndex,3] = TN/(TN+FP);
 #accuracy
 resultsTable[resultsTableIndex,4] = (TP + TN)/(TP + FP + TN + FN)
 #Youden's J
 resultsTable[resultsTableIndex,5] = resultsTable[resultsTableIndex,2] + resultsTable[resultsTableIndex,3] - 1;
}
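
For reference, the same scan can be written more compactly; this vectorized alternative (not part of the original code) computes only Youden’s J:

# Alternative, vectorized computation of Youden's J over all thresholds
youdenJ = sapply(threshold, function(th){
 sens = sum(validation >= th & valClass == 1)/sum(valClass == 1)
 spec = sum(validation <  th & valClass == 0)/sum(valClass == 0)
 sens + spec - 1
})
threshold[which.max(youdenJ)]   # best threshold, ties broken by the first maximum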

Finally, we tested the classifier on the test set.

# Pick the threshold with the highest Youden's J (first one in case of ties)
resultsTableInd = which(resultsTable[,5] == max(resultsTable[,5]));
if(length(resultsTableInd) > 1){ resultsTableInd = resultsTableInd[1]}
classifierThreshold = resultsTable[resultsTableInd,1];

# Spam probabilities for the test set
classify = predict(spamLR, newdata=testData, type="response");
classifyResults = data.frame(classify);

TPc = 0;
FNc = 0;
TNc = 0;
FPc = 0;

for(kk in c(1:length(classify))){
 if(classifyResults[kk,1] >= classifierThreshold & testClass[kk] == 1){
  TPc = TPc + 1;
 }else if(classifyResults[kk,1] < classifierThreshold & testClass[kk] == 0){
  TNc = TNc + 1;
 }else if(classifyResults[kk,1] < classifierThreshold & testClass[kk] == 1){
  FNc = FNc + 1;
 }else if(classifyResults[kk,1] >= classifierThreshold & testClass[kk] == 0){
  FPc = FPc + 1;
 }else{
  print('An error has occurred')
 }
}


LRThreshold = resultsTable[resultsTableInd,1]
LRSens = (TPc/(TPc+FNc))                      # sensitivity
LRSpec = (TNc/(TNc+FPc))                      # specificity
LRAcc = ((TPc + TNc)/(TPc + FPc + TNc + FNc)) # accuracy

This lazy approach is the zero-point reference for this study; the goal is now to beat its accuracy. To do so, the next step was to perform a Principal Component Analysis (PCA) with feature reduction, assessing how many components to keep in order to obtain optimal results.

Principal Components Analysis - method #2

In order to increase the performance of the classification, we performed a PCA and feature reduction.

First of all we prepared the dataset for the PCA by removing features whose variance was near zero. Then we ran the PCA on the train set and extracted the rotation matrix of the transformation. The PCA allowed us to remove redundancy among features and improve the classifier's performance.

# Separate class labels from features before the PCA
trainDataPCA = trainData
trainClass = trainData$X1;
trainDataPCA$X1 = NULL;


# Drop near-zero-variance features, then run the PCA on the train set
nzv = nearZeroVar(trainDataPCA, saveMetrics = TRUE)
keptFeatures = rownames(nzv[nzv$percentUnique > 0.1,])
trainPCA_nzv = trainDataPCA[keptFeatures]
trainPCA = prcomp(trainPCA_nzv)

rotationMatrix = trainPCA$rotation;


# Project the test and validation sets with the train-set transformation
# (predict() applies the same centering and rotation learned on the train set)
testDataPCA = data.frame(predict(trainPCA, newdata=testData[keptFeatures]));
valDataPCA = data.frame(predict(trainPCA, newdata=valData[keptFeatures]));
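
Before scanning reductions, it can help to inspect how much variance the leading components capture, for instance:

# Proportion of variance explained by the leading principal components
summary(trainPCA)$importance[,1:5]   # first five components, for example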

Then we studied which reduction offers the best accuracy, in order to apply that reduction to the test set and evaluate the classifier's performance. For each reduction we generate a new LR and apply it to the cross-validation set, choosing the best threshold just as in the lazy approach. This is delegated to the external helper LR2.r, sketched below.
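
The file ./functions/LR2.r is not included in this report. The following is a minimal sketch that is consistent with how it is called here, not the authors' actual code: it assumes LR2 fits an LR on its first argument (class label in the last column), tunes the threshold by maximizing Youden's J on the validation features and labels, and returns hard 0/1 predictions for its fourth argument.

# Hypothetical sketch of LR2 (assumptions as described above)
LR2 = function(trainSet, valFeatures, valLabels, newData){
 cols = colnames(trainSet)
 # class label assumed to be the last column, as in the main script
 f = formula(paste(cols[length(cols)], "~", paste(cols[-length(cols)], collapse="+")))
 model = glm(f, data = trainSet, family = binomial)
 pVal = predict(model, newdata = valFeatures, type = "response")
 ths = seq(0.0005, 0.995, 0.0005)
 J = sapply(ths, function(th){
  sum(pVal >= th & valLabels == 1)/sum(valLabels == 1) +
   sum(pVal < th & valLabels == 0)/sum(valLabels == 0) - 1
 })
 best = ths[which.max(J)]
 # hard 0/1 classification of newData at the tuned threshold
 as.numeric(predict(model, newdata = newData, type = "response") >= best)
}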

nFeatures = c(57:2)   # from all components down to 2 (assumes no feature was dropped by the nzv filter)

resumeTestTable = matrix(NA , length(nFeatures) , 5)
colnames(resumeTestTable) = c("Number of Features","Threshold","Sensitivity","Specificity","Accuracy")

## REDUCTION #####

source("./functions/LR2.r")   # LR helper: fit, tune threshold, classify

k=0
for(ii in nFeatures){
 k = k+1

 # Train-set scores on the first ii principal components, plus the class label
 dfComponents = predict(trainPCA, newdata=trainPCA_nzv)[,1:ii]

 spamTrain = cbind(dfComponents,trainClass)
 spamTrainPCA = data.frame(spamTrain);
 spamTrainPCA$trainClass = factor(spamTrainPCA$trainClass)

 # Fit an LR on the reduced train set and classify the validation set
 newPrediction = LR2(spamTrainPCA,valDataPCA,valClass,valDataPCA);
 
 TPv=0
 FPv=0
 TNv=0
 FNv=0


 for(kk in c(1:length(newPrediction))){
  if(newPrediction[kk] == 1 & valClass[kk] == 1){
    TPv = TPv + 1
  }else if(newPrediction[kk] == 0 & valClass[kk] == 0){
    TNv = TNv + 1
  }else if(newPrediction[kk] == 0 & valClass[kk] == 1){
    FNv = FNv + 1
  }else if(newPrediction[kk] == 1 & valClass[kk] == 0){
    FPv = FPv + 1
  }else{
    print('An error has occurred')
  }
 }

 resumeTestTable[k,1] = ii                          # number of components kept
 resumeTestTable[k,2] = 0                           # threshold (chosen inside LR2)
 resumeTestTable[k,3] = TPv/(TPv+FNv)               # sensitivity
 resumeTestTable[k,4] = TNv/(TNv+FPv)               # specificity
 resumeTestTable[k,5] = (TPv+TNv)/(TPv+FPv+TNv+FNv) # accuracy
}

At this point we find the best reduction and apply it to the test set to evaluate the performance of this PCA-based classifier.

index = which(resumeTestTable[,5] == max(resumeTestTable[,5]))
featureReducing = resumeTestTable[index,1]   # best number of components
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

The glm.fit warning indicates quasi-complete separation for some fits (predicted probabilities of exactly 0 or 1); it does not prevent the models from being used.

Bagging - method #3

Another strategy to improve the performance of the classifier is bootstrap aggregating, also called bagging. This method combines sampling with replacement, several classifiers and a voting system. It also reduces variance and helps to avoid overfitting.

This method allows us to generate an ensemble of logistic regression classifiers and attain high accuracy.

Using the previously generated train dataset, we generated 2000 bootstrap datasets from its instances.

# Each column will hold one bootstrap classifier's predictions for the test set
bagginsData = matrix(NA , dim(testData)[1] , 2000);

source("./functions/LR2.r")   # LR helper: fit, tune threshold, classify
for( bb in c(1:2000)){
 # Resample the train set with replacement (bootstrap sample of the same size)
 newTrainIndexes = sample(dim(trainData)[1],dim(trainData)[1],replace = TRUE);
 newTrain = trainData[newTrainIndexes,]
 newTrain$X1 = factor(newTrain$X1)

 # Fit an LR on the resample and classify the test set
 bagginsData[,bb] = LR2(newTrain,valData,valClass,testData);
}

Above, bagginsData is a matrix in which each column contains the classification produced by one of the 2000 classifiers.

After this, we combine the individual classifications through a voting system.

# Majority vote: for each test instance, count the votes of all classifiers
CF = matrix(NA,dim(bagginsData)[1],1);

for(m in c(1:dim(bagginsData)[1])){

  countTrue = 0;
  countFalse = 0;

  for(n in c(1:(dim(bagginsData)[2]))){
    if(bagginsData[m,n] == 0){
      countFalse = countFalse+1
    }else{
      countTrue = countTrue+1
    }
  }
  # Majority decides; ties are counted as spam
  if(countFalse>countTrue){
    CF[m] = 0
  }else{
    CF[m] = 1
  }
}
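
Equivalently, the whole vote can be written in one vectorized line (ties again counted as spam):

# Vectorized majority vote: spam (1) if at least half of the classifiers vote spam
CF = as.numeric(rowMeans(bagginsData) >= 0.5)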

We could also apply this method with a PCA for each classifier. We did not do so in this study, but it would likely increase the performance further, as sketched below.
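
A hypothetical sketch of that combination (not run in this study); it assumes LR2 behaves as sketched earlier, and the ensemble size and the 20-component reduction are illustrative choices that would need tuning:

# Bagging with a per-classifier PCA (illustrative sketch only)
nBags = 200
bagPCAData = matrix(NA, dim(testData)[1], nBags)
for(bb in c(1:nBags)){
 idx = sample(dim(trainData)[1], dim(trainData)[1], replace = TRUE)
 bootTrain = trainData[idx,]
 bootClass = bootTrain$X1
 bootTrain$X1 = NULL
 bootPCA = prcomp(bootTrain)                           # PCA on this resample
 keep = c(1:20)                                        # assumed reduction
 trainProj = data.frame(bootPCA$x[,keep], trainClass = factor(bootClass))
 valProj = data.frame(predict(bootPCA, newdata=valData)[,keep])
 testProj = data.frame(predict(bootPCA, newdata=testData)[,keep])
 bagPCAData[,bb] = LR2(trainProj, valProj, valClass, testProj)
}
CFPCA = as.numeric(rowMeans(bagPCAData) >= 0.5)        # majority vote, as before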

Results

To find the best Youden's J, for each LR we used a ROC curve similar to this one.

# ROC curve: sensitivity vs. (1 - specificity), best threshold marked in red
plot(1-resultsTable[,3],resultsTable[,2], type="l", xlab ="1-Specificity", ylab="Sensitivity");
points(1-resultsTable[resultsTableInd,3], resultsTable[resultsTableInd,2], col ="red");


Performance results

LR Sensitivity: 0.9511278

LR Specificity: 0.898917

LR Accuracy: 0.9297337

PCA Sensitivity: 0.9385965

PCA Specificity: 0.9043321

PCA Accuracy: 0.9245562

Bagging Sensitivity: 0.9385965

Bagging Specificity: 0.9205776

Bagging Accuracy: 0.931213


Conclusion

In spam filters, obtaining a highly accurate classification is paramount. In this work we tested different classifiers: a lazy approach (simple logistic regression), feature reduction with PCA, and bagging. The results of the different classifiers are similar; bagging beat the zero-point performance (the simple LR accuracy), while PCA achieved comparable accuracy with fewer features. The main conclusion of this study is the benefit of using bootstrap methods: they improve the classifier's performance and reduce the risk of overfitting. In the future, we plan to combine bootstrapping and feature reduction and to apply these classifiers, together with NLP techniques, to bioinformatics data.