This report analyzes a real-valued, multivariate dataset: the Spambase dataset. Model prediction was carried out using the K-Nearest Neighbors algorithm.
Nearest Neighbor analysis, or Nearest Neighbor search, is an algorithm for classifying n-dimensional objects based on their similarity to other n-dimensional objects (Parsian, 2015). The performance of the algorithm depends heavily on the choice of k and on the distance metric used to measure similarity.
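To make the roles of k and the distance metric concrete, the following is a minimal sketch of the idea in base R, assuming Euclidean distance and a simple majority vote. The helper `knn_predict` and the toy data are hypothetical and are not part of the analysis in this report.

```r
# Minimal k-nearest-neighbours classifier in base R (illustrative only).
# `knn_predict` is a hypothetical helper, not a function from any package.
knn_predict <- function(train, labels, query, k = 1) {
  # Euclidean distance from the query point to every training row
  dists <- sqrt(rowSums(sweep(train, 2, query)^2))
  # Labels of the k closest training points
  nearest <- labels[order(dists)[seq_len(k)]]
  # Majority vote; ties resolve to the first label in table order
  names(which.max(table(nearest)))
}

# Toy 2-dimensional data: two well-separated groups (made-up values)
train  <- rbind(c(0.1, 0.2), c(0.0, 0.3), c(0.9, 0.8), c(1.0, 0.7))
labels <- c("Not Spam", "Not Spam", "Spam", "Spam")

pred_k1 <- knn_predict(train, labels, query = c(0.8, 0.9), k = 1)
pred_k3 <- knn_predict(train, labels, query = c(0.2, 0.1), k = 3)
print(c(pred_k1, pred_k3))
```

With k = 1 the prediction is simply the label of the single closest point; larger values of k smooth the decision by letting several neighbours vote.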
This dataset is used to classify emails as spam or non-spam based on the frequency of words and characters. It was developed at Hewlett-Packard Labs and donated by George Forman in July 1999 (Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, 1999). The dataset contains 4601 instances and 58 variables. The target has two classes, “Spam” and “Not Spam”. It is a multivariate, real-valued dataset used mainly for classification.
Of the 58 variables, 48 are continuous, real attributes giving the frequency of words such as “data”, “telnet”, “technology” and “1999”, and 6 are continuous, real attributes giving the frequency of characters such as “;”, “(” and “[”. Three attributes describe runs of capital letters: “capital_run_length_average” (continuous, real) is the average length of uninterrupted sequences of capital letters, “capital_run_length_longest” (continuous, integer) is the length of the longest such sequence, and “capital_run_length_total” (continuous, integer) is the sum of the lengths of all such sequences. The last attribute is the class, which marks each email as spam (1) or not spam (0).
#Session Information
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.1 magrittr_1.5 tools_3.6.1 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.4.6 stringi_1.4.6 rmarkdown_2.1
## [9] knitr_1.28 stringr_1.4.0 xfun_0.14 digest_0.6.25
## [13] rlang_0.4.6 evaluate_0.14
#Downloading Dataset
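#Create the target folder first; download.file() fails if it does not exist
if(!dir.exists("Data")) dir.create("Data")
if(!file.exists("Data/spambase.zip")) {
  download.file(url = "http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.zip",
                destfile = "Data/spambase.zip")
  unzip("Data/spambase.zip", exdir = "Data/")
}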
#Reading Data
#Data File
data_raw <- read.csv("Data/spambase.data", header = FALSE)
#Names File
library(readr)
data_raw_names <- read.delim("Data/spambase.names", header = FALSE)
data_raw_names <- data_raw_names[-(1:30),]
data_raw_names <- as.data.frame(data_raw_names)
library(dplyr)
library(tidyr)
data_raw_names <- data_raw_names %>%
separate(data_raw_names, c("Variable", "Type"), sep = ":")
#Assigning Name to Dataset
names(data_raw) <- data_raw_names$Variable
names(data_raw)[is.na(names(data_raw))] <- "classes"
As part of the exploratory data analysis, I found that the dataset has no missing values.
data <- data_raw
#Checking for missing values
any(is.na(data))
## [1] FALSE
I converted the class column to a factor and renamed its levels from “0” to “Not Spam” and from “1” to “Spam”. Of the 4601 emails, 1813 are spam, which is approximately 39.4%.
#Converting Classes to factors
data$classes <- as.factor(data$classes)
levels(data$classes)
## [1] "0" "1"
#Renaming levels of classes column
data$classes <- recode(data$classes,
"0" = "Not Spam",
"1" = "Spam")
levels(data$classes)
## [1] "Not Spam" "Spam"
#Total Number of Not Spam and Spam
summary(data$classes)
## Not Spam Spam
## 2788 1813
#Percentage of Not Spam and Spam
round(prop.table(table(data$classes))*100, 2)
##
## Not Spam Spam
## 60.6 39.4
Because kNN is distance-based and the features span very different ranges, the data was normalized using the following min-max function so that all numeric values lie between 0 and 1:
#Normalizing Function
normalize <- function(x) {
return( (x - min(x)) / (max(x) - min(x)))
}
NormalizeData <- as.data.frame(lapply(data[1:57], normalize))
#Summary of make word frequency
summary(data$word_freq_make)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1046 0.0000 4.5400
#Summary of Normalized Data make word frequency
summary(NormalizeData$word_freq_make)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02303 0.00000 1.00000
#Cleaning Data
CleanData <- cbind(data[, 58], NormalizeData)
names(CleanData)[names(CleanData) == "data[, 58]"] <- "class"
CleanData$class <- as.character(CleanData$class)
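As a small worked illustration of the min-max rescaling used above, the sketch below applies the same formula to a short made-up vector. The input values are hypothetical; only the maximum of 4.54 is taken from the summary of word_freq_make.

```r
# Min-max rescaling on a small hypothetical vector
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up values; 4.54 is the observed maximum of word_freq_make
x <- c(0, 1.135, 2.27, 4.54)
scaled <- normalize(x)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between. One caveat of this formula is that a column whose values are all identical gives 0/0, which R evaluates to NaN.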
To build the prediction model with the K-Nearest Neighbors algorithm, I split the data into a training set (80%) and a testing set (20%).
##### Splitting the Data into Train and Test Datasets #####
set.seed(1234)
samp <- sample(nrow(CleanData),0.80*nrow(CleanData))
TrainData <- CleanData[samp,]
TestData <- CleanData[-samp,]
TrainLabels <- TrainData[, 1]
TestLabels <- TestData[, 1]
#KNN Algorithm
library(class)
TestPredict <- knn(train = TrainData[, 2:58],
test = TestData[, 2:58],
cl = TrainLabels, k = 1)
library(gmodels)
CrossTable(x = TestLabels,
y = TestPredict,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 921
##
##
## | TestPredict
## TestLabels | Not Spam | Spam | Row Total |
## -------------|-----------|-----------|-----------|
## Not Spam | 518 | 37 | 555 |
## | 0.933 | 0.067 | 0.603 |
## | 0.922 | 0.103 | |
## | 0.562 | 0.040 | |
## -------------|-----------|-----------|-----------|
## Spam | 44 | 322 | 366 |
## | 0.120 | 0.880 | 0.397 |
## | 0.078 | 0.897 | |
## | 0.048 | 0.350 | |
## -------------|-----------|-----------|-----------|
## Column Total | 562 | 359 | 921 |
## | 0.610 | 0.390 | |
## -------------|-----------|-----------|-----------|
##
##
library(png)
knitr::include_graphics("images/NormalizedData.png")
From the table, the highest accuracy is 91.2%, achieved with k = 1.
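The 91.2% figure can be checked directly against the k = 1 cross table printed earlier. The sketch below re-enters those four counts by hand; the names `conf_mat`, `sensitivity` and `specificity` are introduced here for illustration only.

```r
# Confusion-matrix counts copied from the CrossTable output for k = 1
conf_mat <- matrix(c(518,  37,
                      44, 322),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("Not Spam", "Spam"),
                                   predicted = c("Not Spam", "Spam")))

# Accuracy: correctly classified emails over all test emails, (518 + 322) / 921
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
# Sensitivity: share of actual spam that was predicted as spam
sensitivity <- conf_mat["Spam", "Spam"] / sum(conf_mat["Spam", ])
# Specificity: share of actual non-spam that was predicted as non-spam
specificity <- conf_mat["Not Spam", "Not Spam"] / sum(conf_mat["Not Spam", ])

metrics <- round(c(accuracy = accuracy,
                   sensitivity = sensitivity,
                   specificity = specificity), 3)
print(metrics)   # 0.912 0.880 0.933
```

The sensitivity and specificity values agree with the row proportions (0.880 and 0.933) that CrossTable reports in the same table.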
Z-score transformation is another way of rescaling the features that may improve the performance of the model.
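As a brief illustration of what this transformation does, the sketch below standardizes a short made-up vector by hand and compares the result with base R's `scale()`. The input values are hypothetical.

```r
# Z-score standardisation by hand, compared with base R's scale()
x <- c(2, 4, 4, 4, 5, 5, 7, 9)        # made-up values

z_manual <- (x - mean(x)) / sd(x)     # sd() uses the n - 1 denominator
z_scale  <- as.numeric(scale(x))      # scale() centres on the mean and divides by sd

print(round(z_manual, 3))
print(all.equal(z_manual, z_scale))   # TRUE
```

Each value is expressed as a number of standard deviations from the mean, so the result has mean 0 and standard deviation 1. Unlike min-max normalization, the output is not confined to a fixed range, which is why extreme values remain far from the rest after the transformation.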
# Z - score Transformation
ZScoreData <- as.data.frame(scale(data[-58]))
summary(ZScoreData$word_freq_make)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3424 -0.3424 -0.3424 0.0000 -0.3424 14.5254
#Updated Data
UpdateData <- cbind(data[, 58], ZScoreData)
names(UpdateData)[names(UpdateData) == "data[, 58]"] <- "class"
UpdateData$class <- as.character(UpdateData$class)
set.seed(1234)
samp1 <- sample(nrow(UpdateData),0.80*nrow(UpdateData))
TrainData1 <- UpdateData[samp1,]
TestData1 <- UpdateData[-samp1,]
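#Labels taken from the z-score split (identical to the earlier split, as the seed is the same)
TrainLabels1 <- TrainData1[, 1]
TestLabels1 <- TestData1[, 1]
library(class)
#Note: columns 2:31 cover only the first 30 of the 57 features;
#the output below reflects this subset (2:58 would use all features)
TestPredict1 <- knn(train = TrainData1[, 2:31],
                    test = TestData1[, 2:31],
                    cl = TrainLabels1, k = 1)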
library(gmodels)
CrossTable(x = TestLabels1,
y = TestPredict1,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 921
##
##
## | TestPredict1
## TestLabels1 | Not Spam | Spam | Row Total |
## -------------|-----------|-----------|-----------|
## Not Spam | 520 | 35 | 555 |
## | 0.937 | 0.063 | 0.603 |
## | 0.901 | 0.102 | |
## | 0.565 | 0.038 | |
## -------------|-----------|-----------|-----------|
## Spam | 57 | 309 | 366 |
## | 0.156 | 0.844 | 0.397 |
## | 0.099 | 0.898 | |
## | 0.062 | 0.336 | |
## -------------|-----------|-----------|-----------|
## Column Total | 577 | 344 | 921 |
## | 0.626 | 0.374 | |
## -------------|-----------|-----------|-----------|
##
##
library(png)
knitr::include_graphics("images/ZScoreData.png")
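From the table, the maximum accuracy achieved with the z-score transformation is 90.01%, slightly below the 91.2% obtained with min-max normalization. Note that the kNN call for this run selects columns 2:31, that is, only the first 30 of the 57 features, so the two results are not a like-for-like comparison.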
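Both accuracy figures are described as maxima, which suggests that several values of k were compared, although that comparison appears only in the images. A sweep of this kind could be sketched as follows. This is an illustration under stated assumptions: it uses the built-in iris dataset as a stand-in rather than the spam data, and the candidate values in `k_values` are hypothetical.

```r
library(class)   # provides knn(), as used in this report

# Stand-in data: the built-in iris dataset, NOT the spam data
set.seed(1234)
features <- as.data.frame(scale(iris[, 1:4]))
idx      <- sample(nrow(iris), 0.80 * nrow(iris))

# Hypothetical candidate values of k
k_values <- c(1, 3, 5, 7, 9)

# Test-set accuracy for each candidate k
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = features[idx, ],
              test  = features[-idx, ],
              cl    = iris$Species[idx],
              k     = k)
  mean(pred == iris$Species[-idx])
})

results <- data.frame(k = k_values, accuracy = round(accuracy, 3))
print(results)
```

Choosing k on the same test set that is later used to report accuracy tends to give an optimistic estimate; a separate validation set or cross-validation is a common alternative.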