Email Spambase Data Set Analysis Using KNN

Introduction

This report analyzes the Spambase data set, a real-valued, multivariate data set. Predictions were made using the K-Nearest Neighbors algorithm.

K-Nearest Neighbors Algorithm:

Nearest Neighbor analysis, or Nearest Neighbor search, is an algorithm for classifying n-dimensional objects based on their similarity to other n-dimensional objects (Parsian, 2015). The performance of the algorithm depends largely on the choice of k and on the distance metric used to measure similarity.
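As a rough illustration of the idea, the sketch below classifies one query point by majority vote among its k nearest training points under Euclidean distance. The helper name knn_sketch and the toy points are hypothetical; this is not the class::knn implementation used later in the report.

```r
# Minimal k-NN classifier for illustration only (hypothetical helper).
knn_sketch <- function(train, labels, query, k = 1) {
  # Euclidean distance from the query point to every training row
  dists <- sqrt(rowSums(sweep(train, 2, query)^2))
  # Labels of the k closest training rows
  nearest <- labels[order(dists)[seq_len(k)]]
  # Majority vote; on a tie the label that sorts first wins
  names(which.max(table(nearest)))
}

# Toy training set: two points near the origin, two near (1, 1)
train  <- matrix(c(0.0, 0.1,
                   0.1, 0.0,
                   0.9, 1.0,
                   1.0, 0.9), ncol = 2, byrow = TRUE)
labels <- c("Not Spam", "Not Spam", "Spam", "Spam")

knn_sketch(train, labels, query = c(0.8, 0.8), k = 3)  # "Spam"
```

With k = 3 the two closest points are both labeled "Spam", so they outvote the single "Not Spam" neighbor.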

Dataset Description:

This data set is used to classify emails as spam or non-spam based on the frequency of particular words and characters. It was developed at Hewlett-Packard Labs and donated by George Forman in July 1999 (Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, 1999). The data set contains 4601 instances and 58 variables. The target variable has two classes, “Spam” and “Not Spam”. It is a multivariate, real-valued data set used mainly for classification.

Attributes Information:

Of the 58 variables, 48 are continuous real attributes giving the frequency of words such as “data”, “telnet”, “technology”, and “1999”; 6 are continuous real attributes giving the frequency of characters such as “;”, “(”, and “[”; 1 is a continuous real attribute, capital_run_length_average, giving the average length of uninterrupted sequences of capital letters; 1 is a continuous integer attribute, capital_run_length_longest, giving the length of the longest uninterrupted sequence of capital letters; 1 is a continuous integer attribute, capital_run_length_total, giving the sum of the lengths of uninterrupted sequences of capital letters; and the last attribute is the class, which marks each email as spam (1) or not spam (0).

#Session Information
sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.14       digest_0.6.25  
## [13] rlang_0.4.6     evaluate_0.14

Step 1 - Collecting Data

#Downloading Dataset
#Create the target directory first; download.file() fails if it is missing
if(!dir.exists("Data")) dir.create("Data")
if(!file.exists("Data/spambase.zip")) {
        download.file(url = "http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.zip",
                      destfile = "Data/spambase.zip")
        
        unzip("Data/spambase.zip", exdir = "Data/")
}

#Reading Data
#Data File (the file has no header row)
data_raw <- read.csv("Data/spambase.data", header = FALSE)

#Names File
library(readr)
data_raw_names <- read.delim("Data/spambase.names", header = FALSE)
data_raw_names <- data_raw_names[-(1:30),]
data_raw_names <- as.data.frame(data_raw_names)

library(dplyr)
library(tidyr)
data_raw_names <- data_raw_names %>%
        separate(data_raw_names, c("Variable", "Type"), sep = ":")

#Assigning Name to Dataset
names(data_raw) <- data_raw_names$Variable
names(data_raw)[is.na(names(data_raw))] <- "classes"

Step 2 - Exploring and Preparing Data

As a part of Exploratory Data Analysis, I found that there are no missing values in the dataset.

data <- data_raw

#Checking for missing values
any(is.na(data))

## [1] FALSE

I renamed the levels of the class column from “0” to “Not Spam” and from “1” to “Spam”. Of the 4601 emails, 1813 are spam, which is approximately 39.4%.

#Converting Classes to factors
data$classes <- as.factor(data$classes)
levels(data$classes)

## [1] "0" "1"

#Renaming levels of classes Column
data$classes <- recode(data$classes,
                         "0" = "Not Spam",
                         "1" = "Spam")
levels(data$classes)

## [1] "Not Spam" "Spam"

#Total Number of Not Spam and Spam
summary(data$classes)

## Not Spam     Spam 
##     2788     1813

#Percentage of Not Spam and Spam
round(prop.table(table(data$classes))*100, 2)

## 
## Not Spam     Spam 
##     60.6     39.4

Because the features span very different ranges, and KNN relies on distances between them, the data was rescaled with the following min-max function so that all numeric values lie between 0 and 1.

#Normalizing Function
normalize <- function(x) {
        return( (x - min(x)) / (max(x) - min(x)))
}
NormalizeData <- as.data.frame(lapply(data[1:57], normalize))

#Summary of make word frequency
summary(data$word_freq_make)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1046  0.0000  4.5400

#Summary of Normalized Data make word frequency
summary(NormalizeData$word_freq_make)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02303 0.00000 1.00000

#Cleaning Data
CleanData <- cbind(data[, 58], NormalizeData)
names(CleanData)[names(CleanData) == "data[, 58]"] <- "class"
CleanData$class <- as.character(CleanData$class)
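To make the min-max step concrete, here is a small worked example on a made-up vector whose maximum matches the 4.54 reported for word_freq_make. The vector x and the helper normalize_safe are hypothetical additions for illustration, not part of the original analysis.

```r
# Same min-max rescaling as in the report
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Made-up values; only the maximum (4.54) is taken from the summary above
x <- c(0, 1.135, 4.54)
normalize(x)   # 0.00 0.25 1.00

# A constant column has max(x) == min(x), so the formula returns NaN (0 / 0).
normalize(c(2, 2, 2))

# Hypothetical guard that maps a constant column to 0 instead
normalize_safe <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))
  (x - min(x)) / rng
}
normalize_safe(c(2, 2, 2))   # 0 0 0
```

When the minimum is 0, as it is for word_freq_make, the rescaled value is simply x / max(x), which is why the mean drops from about 0.1046 to about 0.023.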

To train and evaluate the K-Nearest Neighbors model, I split the data into a training set (80%) and a testing set (20%).

##### Splitting the Data into Train and Test Datasets #####
set.seed(1234)
samp <- sample(nrow(CleanData),0.80*nrow(CleanData))
TrainData <- CleanData[samp,]
TestData <- CleanData[-samp,]

TrainLabels <- TrainData[, 1]
TestLabels <- TestData[, 1]
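The sizes of the two sets follow from simple arithmetic, sketched below on row indices alone. The report passes 0.80 * nrow(CleanData) to sample() directly; this sketch assumes the fractional size is truncated, which is consistent with the 921 test observations reported later, and makes that explicit with floor().

```r
n <- 4601                          # rows in the Spambase data
train_size <- floor(0.80 * n)      # 0.80 * 4601 = 3680.8, truncated to 3680
test_size  <- n - train_size       # 921

set.seed(1234)
train_rows <- sample(n, train_size)            # indices for the training set
test_rows  <- setdiff(seq_len(n), train_rows)  # everything else

# The two sets are disjoint and together cover every row
length(train_rows)                        # 3680
length(test_rows)                         # 921
length(intersect(train_rows, test_rows))  # 0
```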

Step 3 - Training a Model

#KNN Algorithm
library(class)
TestPredict <- knn(train = TrainData[, 2:58], 
                   test = TestData[, 2:58],
                   cl = TrainLabels, k = 1)

Step 4 - Evaluating Model Performance

library(gmodels)
CrossTable(x = TestLabels, 
           y = TestPredict,
           prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  921 
## 
##  
##              | TestPredict 
##   TestLabels |  Not Spam |      Spam | Row Total | 
## -------------|-----------|-----------|-----------|
##     Not Spam |       518 |        37 |       555 | 
##              |     0.933 |     0.067 |     0.603 | 
##              |     0.922 |     0.103 |           | 
##              |     0.562 |     0.040 |           | 
## -------------|-----------|-----------|-----------|
##         Spam |        44 |       322 |       366 | 
##              |     0.120 |     0.880 |     0.397 | 
##              |     0.078 |     0.897 |           | 
##              |     0.048 |     0.350 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       562 |       359 |       921 | 
##              |     0.610 |     0.390 |           | 
## -------------|-----------|-----------|-----------|
## 
##
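The overall accuracy for k = 1 can be computed from the four counts in the table above. The sketch below re-enters those counts by hand; the object name cm is an assumption introduced here.

```r
# Counts copied from the CrossTable output (rows = actual, columns = predicted)
cm <- matrix(c(518,  37,
                44, 322),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Not Spam", "Spam"),
                             predicted = c("Not Spam", "Spam")))

correct  <- sum(diag(cm))       # 518 + 322 = 840 correct predictions
total    <- sum(cm)             # 921 test emails
accuracy <- correct / total

round(100 * accuracy, 1)        # 91.2
total - correct                 # 81 misclassified emails (37 + 44)
```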

Performance Table for different values of K

library(png)
knitr::include_graphics("images/NormalizedData.png")

The table shows that the highest accuracy, 91.2%, is achieved with k = 1.
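A table of accuracy against k could be produced by looping over candidate values of k. The sketch below is one way to do that; it uses the built-in iris data as a stand-in so that it runs on its own, and the grid k_values is hypothetical rather than the set of values used for the table in this report.

```r
library(class)  # provides knn()

# Stand-in data: built-in iris, min-max normalized as in the report
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features  <- as.data.frame(lapply(iris[1:4], normalize))
labels    <- iris$Species

# 80/20 split on row indices
set.seed(1234)
idx <- sample(nrow(features), floor(0.80 * nrow(features)))

# Hypothetical grid of k values to compare
k_values <- c(1, 3, 5, 7, 9)

# Fit and score one model per k; accuracy is the share of correct predictions
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = features[idx, ],
              test  = features[-idx, ],
              cl    = labels[idx], k = k)
  mean(pred == labels[-idx])
})

data.frame(k = k_values, accuracy = round(100 * accuracy, 1))
```

Replacing features and labels with CleanData[, 2:58] and CleanData$class would apply the same loop to the Spambase data.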

Step 5 - Improving Model Performance

Z-score standardization is another way to rescale the features and may improve the performance of the model.

# Z - score Transformation
ZScoreData <- as.data.frame(scale(data[-58]))
summary(ZScoreData$word_freq_make)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3424 -0.3424 -0.3424  0.0000 -0.3424 14.5254
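For reference, scale() with its default arguments subtracts the column mean and divides by the sample standard deviation. The sketch below checks this on a made-up vector (x is hypothetical, not taken from the data).

```r
# Made-up values, mostly zeros with one large entry
x <- c(0, 0, 0, 0.5, 4.54)

z_manual <- (x - mean(x)) / sd(x)   # z-score by hand
z_scale  <- as.numeric(scale(x))    # same result via scale()

all.equal(z_manual, z_scale)        # TRUE
mean(z_manual)                      # 0, up to floating-point error
sd(z_manual)                        # 1
```

Unlike min-max normalization, z-scores are not bounded to a fixed range, which is why the maximum above is 14.5254 rather than 1.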

#Updated Data
UpdateData <- cbind(data[, 58], ZScoreData)
names(UpdateData)[names(UpdateData) == "data[, 58]"] <- "class"
UpdateData$class <- as.character(UpdateData$class)

set.seed(1234)
samp1 <- sample(nrow(UpdateData),0.80*nrow(UpdateData))
TrainData1 <- UpdateData[samp1,]
TestData1 <- UpdateData[-samp1,]

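# Take the labels from the z-score split itself
TrainLabels1 <- TrainData1[, 1]
TestLabels1 <- TestData1[, 1]

library(class)
# Note: columns 2:31 select only the first 30 of the 57 predictors;
# 2:58 would use all of them, as in Step 3.
TestPredict1 <- knn(train = TrainData1[, 2:31], 
                    test = TestData1[, 2:31],
                    cl = TrainLabels1, k = 1)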

library(gmodels)
CrossTable(x = TestLabels1, 
           y = TestPredict1,
           prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  921 
## 
##  
##              | TestPredict1 
##  TestLabels1 |  Not Spam |      Spam | Row Total | 
## -------------|-----------|-----------|-----------|
##     Not Spam |       520 |        35 |       555 | 
##              |     0.937 |     0.063 |     0.603 | 
##              |     0.901 |     0.102 |           | 
##              |     0.565 |     0.038 |           | 
## -------------|-----------|-----------|-----------|
##         Spam |        57 |       309 |       366 | 
##              |     0.156 |     0.844 |     0.397 | 
##              |     0.099 |     0.898 |           | 
##              |     0.062 |     0.336 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       577 |       344 |       921 | 
##              |     0.626 |     0.374 |           | 
## -------------|-----------|-----------|-----------|
## 
##

library(png)
knitr::include_graphics("images/ZScoreData.png")

The table shows that with z-score standardization the maximum accuracy achieved is 90.01%, slightly below the 91.2% obtained with min-max normalization.
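Overall accuracy hides how each model treats the spam class. The sketch below recomputes accuracy, spam recall, and spam precision from the counts in the two CrossTable outputs; the helper class_rates and the object names minmax and zscore are assumptions introduced for illustration.

```r
# Summary rates from a 2 x 2 confusion matrix (rows = actual, cols = predicted)
class_rates <- function(cm) {
  c(accuracy       = sum(diag(cm)) / sum(cm),
    spam_recall    = cm["Spam", "Spam"] / sum(cm["Spam", ]),
    spam_precision = cm["Spam", "Spam"] / sum(cm[, "Spam"]))
}

lab <- list(actual = c("Not Spam", "Spam"), predicted = c("Not Spam", "Spam"))

# Counts copied from the two CrossTable outputs in this report
minmax <- matrix(c(518, 37, 44, 322), nrow = 2, byrow = TRUE, dimnames = lab)
zscore <- matrix(c(520, 35, 57, 309), nrow = 2, byrow = TRUE, dimnames = lab)

round(100 * rbind(minmax = class_rates(minmax),
                  zscore = class_rates(zscore)), 1)
```

The two models have similar spam precision (about 89.7% and 89.8%), but the z-score model misses more spam: its recall is about 84.4% against about 88.0% for the min-max model.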

Summary

Unlike many classification algorithms, KNN does not do any learning. It simply stores the training data verbatim.
Unlabeled test examples are then matched to the most similar records in the training set using a distance function, and each unlabeled example is assigned the majority label of its k nearest neighbors.
For the Email Spambase data set, the model trained on min-max normalized data is more accurate than the model trained on z-score data. Note that the z-score model was fit on only the first 30 predictors (columns 2:31), so the comparison is not strictly like-for-like.

References

Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt. (1999, July 01). Spambase Data Set. Retrieved from UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Spambase
Parsian, M. (2015). Data Algorithms. Sebastopol, CA: O’Reilly Media, Inc.
Lantz, B. (2013). Chapter 3: Lazy Learning – Classification Using Nearest Neighbors. In Machine Learning with R (pp. 76-87). Birmingham, UK: Packt Publishing Ltd.