Spam Emails - Random Forest

Introduction
Random Forest

The steps to predict using Random Forest:
* Step 1
* Step 2

Problem Definition
The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

Our problem statement is therefore:
* Build a model that predicts whether an e-mail is spam or not.
* Test the model on a test dataset.
* Check the accuracy of the model.
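The three tasks above can be sketched end to end on a small synthetic data set. This is only an illustration, not the analysis performed on the Spambase data: the toy data frame, the seed, the 70/30 split and ntree = 100 are assumptions chosen for the example. Only the column names and the randomForest package come from this document.

```r
library(randomForest)

set.seed(42)  # assumed seed, only so the sketch is reproducible

# Synthetic stand-in for the real data: two predictors and a 0/1 class label
n <- 200
toy <- data.frame(
  word_freq_free = c(rexp(n / 2, rate = 2), rexp(n / 2, rate = 20)),
  capital_run_length_average = c(rnorm(n / 2, mean = 6, sd = 2),
                                 rnorm(n / 2, mean = 2, sd = 1)),
  status = factor(rep(c(1, 0), each = n / 2))  # a factor label means classification
)

# 1. Create the model on a training subset (assumed 70/30 split)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
dfrTrain <- toy[train_idx, ]
dfrTest <- toy[-train_idx, ]
model <- randomForest(status ~ ., data = dfrTrain, ntree = 100)

# 2. Test the model on the held-out rows
pred <- predict(model, newdata = dfrTest)

# 3. Check the accuracy from the confusion matrix
conf_mat <- table(predicted = pred, actual = dfrTest$status)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

The label is converted to a factor so that randomForest fits a classifier. With a numeric 0/1 label it would fit a regression forest instead.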

Data Location
http://archive.ics.uci.edu/ml/datasets/Spambase?ref=datanews.io

Data Description

Number of Attributes: 58 (57 continuous, 1 nominal class label)

The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
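
To make the attribute definitions concrete, the following sketch computes each kind of attribute for one made-up message. The example text and the helper names word_freq and char_freq are assumptions for illustration. Case-insensitive word matching is also an assumption, since the description above does not say how case was handled.

```r
# Hypothetical e-mail text, not taken from the data set
email <- "FREE money!!! Visit our site for FREE offers, George."

# A "word" is a run of alphanumeric characters bounded by non-alphanumerics
words <- regmatches(email, gregexpr("[[:alnum:]]+", email))[[1]]

# word_freq_WORD = 100 * (times WORD appears) / total number of words
word_freq <- function(word) {
  100 * sum(tolower(words) == tolower(word)) / length(words)
}

# char_freq_CHAR = 100 * (CHAR occurrences) / total characters
chars <- strsplit(email, "")[[1]]
char_freq <- function(ch) {
  100 * sum(chars == ch) / length(chars)
}

# Uninterrupted runs of capital letters and their lengths
runs <- regmatches(email, gregexpr("[[:upper:]]+", email))[[1]]
run_lengths <- nchar(runs)

capital_run_length_average <- mean(run_lengths)
capital_run_length_longest <- max(run_lengths)
capital_run_length_total <- sum(run_lengths)

print(word_freq("free"))
print(char_freq("!"))
print(c(capital_run_length_average,
        capital_run_length_longest,
        capital_run_length_total))
```

Here the capital runs are "FREE", "V", "FREE" and "G", so the total equals the number of capital letters in the message, as stated in the definition of capital_run_length_total.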

Setup

Load Libs

library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(psych)
library(gridExtra)
#install.packages("caret")
library(caret)
#install.packages("randomForest")
library(randomForest)
#install.packages("gbm")
library(gbm)
#install.packages("corrgram")
library(corrgram)
library(corrplot)

Functions

##Function to Check total NA Records in a column  
detectNA <- function(inp) {
  sum(is.na(inp))
}

##Function to Check Correlation between Status variable & Other variable  
detectCor <- function(x) {
  cor(as.numeric(dfrDataset[, x]), 
    as.numeric(dfrDataset$status), 
    method="spearman")
}

##To check no. of Outliers in a Column  
detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  #inp <- count(inp[is.na(otp)])
  sum(is.na(otp))
}

##Function to Check data except Outliers  
Except_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

##Function to Remove Outliers  
Remove_Outliers <- function(z, na.rm = TRUE) {
  # Outliers are set to NA rather than dropped, so the result keeps the length of z
  Except_outliers(z, na.rm = na.rm)
}

##Function to replace Outliers with Median  
Replace_Outliers_to_Median <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  # Compute the median once; na.rm keeps missing values from turning it into NA
  med <- median(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- med
  y[x > (qnt[2] + H)] <- med
  y
}

##Function to replace Outliers with Mean   
Replace_Outliers_to_Mean <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  # Compute the mean once; na.rm keeps missing values from turning it into NA
  avg <- mean(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- avg
  y[x > (qnt[2] + H)] <- avg
  y
}

##Boxplot Function  
Graph_Boxplot <- function(input, na.rm = TRUE) {
  # Plot the vector itself so the function does not depend on the global dfrDataset
  ggplot(data.frame(value = input), aes(x = "", y = value)) +
    geom_boxplot(color = "green", na.rm = na.rm) +
    labs(title = "Outliers")
}

##Data Imputation Function  
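Data_Impute <- function(input) {
  # Takes a numeric vector and returns it with NAs replaced by the mean of the observed values.
  # Assign the result back to the column, e.g. dfrDataset$col <- Data_Impute(dfrDataset$col)
  input[is.na(input)] <- mean(input, na.rm = TRUE)
  input
}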
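
The outlier helpers above all rely on the same 1.5 * IQR rule. As a quick check of how that rule behaves, the sketch below applies it by hand to a made-up vector. The vector x is an assumption for illustration and is not taken from the data set.

```r
# Hypothetical values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

qnt <- unname(quantile(x, probs = c(0.25, 0.75)))  # 3.25 and 7.75 (default type 7)
H <- 1.5 * IQR(x)                                  # 1.5 * 4.5 = 6.75
lower <- qnt[1] - H                                # -3.5
upper <- qnt[2] + H                                # 14.5

# A value is an outlier when it falls outside [lower, upper]
is_outlier <- x < lower | x > upper
n_outliers <- sum(is_outlier)

# Same idea as the helpers: blank out the outlier, or replace it with the median
flagged <- ifelse(is_outlier, NA, x)
replaced <- ifelse(is_outlier, median(x), x)

print(n_outliers)
print(flagged)
print(replaced)
```

Only the value 100 lies outside the fences. Note that the median used for replacement (5.5) is computed on the full vector, outlier included, which is also what the replacement helpers do.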

Load Dataset

setwd("D:/Welingkar/Trim 4/Machine Learning/Project/Data")
dfrDataset <- read.csv("./Spam_Email_Base.csv", header=TRUE, stringsAsFactors=TRUE)
head(dfrDataset)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 2           0.21              0.28          0.50            0
## 3           0.06              0.00          0.71            0
## 4           0.00              0.00          0.00            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32           0.00             0.00               0.00
## 2          0.14           0.28             0.21               0.07
## 3          1.23           0.19             0.19               0.12
## 4          0.63           0.00             0.31               0.63
## 5          0.63           0.00             0.31               0.63
## 6          1.85           0.00             0.00               1.85
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 2            0.00           0.94              0.21           0.79
## 3            0.64           0.25              0.38           0.45
## 4            0.31           0.63              0.31           0.31
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00             0.00                0.00           0.32
## 2             0.65             0.21                0.14           0.14
## 3             0.12             0.00                1.75           0.06
## 4             0.31             0.00                0.00           0.31
## 5             0.31             0.00                0.00           0.31
## 6             0.00             0.00                0.00           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1               0.00            1.29          1.93             0.00
## 2               0.07            0.28          3.47             0.00
## 3               0.06            1.03          1.36             0.32
## 4               0.00            0.00          3.18             0.00
## 5               0.00            0.00          3.18             0.00
## 6               0.00            0.00          0.00             0.00
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0          0.00            0.00            0
## 2           1.59              0          0.43            0.43            0
## 3           0.51              0          1.16            0.06            0
## 4           0.31              0          0.00            0.00            0
## 5           0.31              0          0.00            0.00            0
## 6           0.00              0          0.00            0.00            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 2             0                0             0             0
## 3             0                0             0             0
## 4             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0              0
## 2              0                0             0              0
## 3              0                0             0              0
## 4              0                0             0              0
## 5              0                0             0              0
## 6              0                0             0              0
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0           0.00
## 2             0            0                    0           0.07
## 3             0            0                    0           0.00
## 4             0            0                    0           0.00
## 5             0            0                    0           0.00
## 6             0            0                    0           0.00
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0             0.00            0
## 2               0            0             0.00            0
## 3               0            0             0.06            0
## 4               0            0             0.00            0
## 5               0            0             0.00            0
## 6               0            0             0.00            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0               0.00                 0         0.00
## 2                 0               0.00                 0         0.00
## 3                 0               0.12                 0         0.06
## 4                 0               0.00                 0         0.00
## 5                 0               0.00                 0         0.00
## 6                 0               0.00                 0         0.00
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1          0.00               0                    0        0.00
## 2          0.00               0                    0        0.00
## 3          0.06               0                    0        0.01
## 4          0.00               0                    0        0.00
## 5          0.00               0                    0        0.00
## 6          0.00               0                    0        0.00
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 2         0.132             0         0.372         0.180         0.048
## 3         0.143             0         0.276         0.184         0.010
## 4         0.137             0         0.137         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 2                      5.114                        101
## 3                      9.821                        485
## 4                      3.537                         40
## 5                      3.537                         40
## 6                      3.000                         15
##   capital_run_length_total status
## 1                      278      1
## 2                     1028      1
## 3                     2259      1
## 4                      191      1
## 5                      191      1
## 6                       54      1

Observations
1. Data has been loaded successfully.
2. There are 4601 records in the data set.

Dataframe Structure

str(dfrDataset)

## 'data.frame':    4601 obs. of  58 variables:
##  $ word_freq_make            : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ word_freq_address         : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ word_freq_all             : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ word_freq_3d              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_our             : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ word_freq_over            : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ word_freq_remove          : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ word_freq_internet        : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ word_freq_order           : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ word_freq_mail            : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ word_freq_receive         : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ word_freq_will            : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ word_freq_people          : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ word_freq_report          : num  0 0.21 0 0 0 0 0 0 0 0 ...
##  $ word_freq_addresses       : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ word_freq_free            : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ word_freq_business        : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_email           : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ word_freq_you             : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ word_freq_credit          : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ word_freq_your            : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ word_freq_font            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_000             : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ word_freq_money           : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ word_freq_hp              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_hpl             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_george          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_650             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_lab             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_labs            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_telnet          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_857             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_data            : num  0 0 0 0 0 0 0 0 0.15 0 ...
##  $ word_freq_415             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_85              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_technology      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_1999            : num  0 0.07 0 0 0 0 0 0 0 0 ...
##  $ word_freq_parts           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_pm              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_direct          : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_cs              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_meeting         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_original        : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ word_freq_project         : num  0 0 0 0 0 0 0 0 0 0.06 ...
##  $ word_freq_re              : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_edu             : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_table           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_conference      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_.               : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ char_freq_..1             : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ char_freq_..2             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_..3             : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ char_freq_..4             : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ char_freq_..5             : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capital_run_length_average: num  3.76 5.11 9.82 3.54 3.54 ...
##  $ capital_run_length_longest: int  61 101 485 40 40 15 4 11 445 43 ...
##  $ capital_run_length_total  : int  278 1028 2259 191 191 54 112 49 1257 749 ...
##  $ status                    : int  1 1 1 1 1 1 1 1 1 1 ...

#RF deals with non numeric as well as numeric data

Observations
1. All columns are either numeric (num) or integer (int); there are no factor or character columns.
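
The columns char_freq_. to char_freq_..5 in the output above look like the result of R sanitising column names on import. The sketch below reproduces that pattern, assuming the CSV header carried the character names used in the UCI documentation (;, (, [, !, $ and #). That header is an assumption, since the CSV file itself is not shown here.

```r
# Assumed original header names for the six char_freq_CHAR attributes
uci_names <- c("char_freq_;", "char_freq_(", "char_freq_[",
               "char_freq_!", "char_freq_$", "char_freq_#")

# read.csv() applies make.names(unique = TRUE) when check.names = TRUE (the default):
# invalid characters become "." and duplicates get a numeric suffix
mangled <- make.names(uci_names, unique = TRUE)
print(mangled)

# With check.names = FALSE the header is kept as written
csv_text <- "char_freq_;,char_freq_!\n0.1,0.778\n"
raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(raw))

# Non-syntactic names are then accessed with [[ ]] or backticks
print(raw[["char_freq_!"]])
```

Either form works for modelling. The sanitised names are easier to use in formulas, while the original names make it clearer which character each column counts.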

Dataframe Summary

lapply(dfrDataset, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1046  0.0000  4.5400 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.213   0.000  14.280 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2807  0.4200  5.1000 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06542  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3122  0.3800 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0959  0.0000  5.8800 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1142  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1053  0.0000 11.1100 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09007 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2394  0.1600 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05982 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1000  0.5417  0.8000  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09393 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05863  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0492  0.0000  4.4100 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2488  0.1000 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1426  0.0000  7.1400 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1847  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.662   2.640  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08558  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2200  0.8098  1.2700 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1212  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1016  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09427  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5495  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2654  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7673  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1248  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09892  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1029  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06475  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04705 0.00000 4.76000 
## 
## $word_freq_data
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09723  0.00000 18.18000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04784 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1054  0.0000 20.0000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09748 0.00000 7.69000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.137   0.000   6.890 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0132  0.0000  8.3300 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07863  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06483 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04367 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1323  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0461  0.0000  3.5700 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0792  0.0000 20.0000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3012  0.1100 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1798  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.005444 0.000000 2.170000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03187  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03857 0.00000 4.38500 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.065   0.139   0.188   9.752 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01698 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2691  0.3150 32.4780 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07581 0.05200 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04424  0.00000 19.82900 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.588    2.276    5.191    3.706 1102.500 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   52.17   43.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    95.0   283.3   266.0 15841.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.394   1.000   1.000

#summary(dfrDataset)
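
The list returned by lapply is long to read. As an alternative view, the same six statistics can be collected into one row per column. The sketch below does this on a tiny made-up data frame: the values in toy are assumptions, only the column names come from the data set. It also shows why the mean of status is informative: for a 0/1 label, the mean is the share of records labelled 1.

```r
# Hypothetical miniature of the data set
toy <- data.frame(
  word_freq_make = c(0, 0.21, 0.06, 0),
  status = c(1, 1, 0, 0)
)

# One row per column, one column per summary statistic
summary_table <- t(sapply(toy, summary))
print(summary_table)

# For a 0/1 label the mean equals the proportion of 1s (spam)
spam_share <- mean(toy$status)
print(spam_share)

# Rough spam count implied by the printed mean of 0.394 over 4601 records
# (approximate, because the printed mean is rounded)
approx_spam <- round(0.394 * 4601)
print(approx_spam)
```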

Exploratory Statistics

lapply(dfrDataset, FUN=describe)

## $word_freq_make
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601  0.1 0.31      0    0.03   0   0 4.54  4.54 5.67    49.23  0
## 
## $word_freq_address
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601 0.21 1.29      0    0.02   0   0 14.28 14.28 10.08   105.48
##      se
## X1 0.02
## 
## $word_freq_all
##    vars    n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 4601 0.28 0.5      0    0.17   0   0 5.1   5.1 3.01    13.29 0.01
## 
## $word_freq_3d
##    vars    n mean  sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601 0.07 1.4      0       0   0   0 42.81 42.81 26.21   725.34
##      se
## X1 0.02
## 
## $word_freq_our
##    vars    n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 4601 0.31 0.67      0    0.16   0   0  10    10 4.74    37.88 0.01
## 
## $word_freq_over
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601  0.1 0.27      0    0.03   0   0 5.88  5.88 5.95    68.34  0
## 
## $word_freq_remove
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601 0.11 0.39      0    0.02   0   0 7.27  7.27 6.76     75.3
##      se
## X1 0.01
## 
## $word_freq_internet
##    vars    n mean  sd median trimmed mad min   max range skew kurtosis
## X1    1 4601 0.11 0.4      0    0.02   0   0 11.11 11.11 9.72    168.9
##      se
## X1 0.01
## 
## $word_freq_order
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601 0.09 0.28      0    0.01   0   0 5.26  5.26 5.22    46.87  0
## 
## $word_freq_mail
##    vars    n mean   sd median trimmed mad min   max range skew kurtosis
## X1    1 4601 0.24 0.64      0    0.09   0   0 18.18 18.18 8.48   160.97
##      se
## X1 0.01
## 
## $word_freq_receive
##    vars    n mean  sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601 0.06 0.2      0    0.01   0   0 2.61  2.61 5.51    39.59  0
## 
## $word_freq_will
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 4601 0.54 0.86    0.1    0.36 0.15   0 9.67  9.67 2.87    12.53
##      se
## X1 0.01
## 
## $word_freq_people
##    vars    n mean  sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601 0.09 0.3      0    0.02   0   0 5.55  5.55 6.95    84.81  0
## 
## $word_freq_report
##    vars    n mean   sd median trimmed mad min max range  skew kurtosis se
## X1    1 4601 0.06 0.34      0       0   0   0  10    10 11.75   228.85  0
## 
## $word_freq_addresses
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601 0.05 0.26      0       0   0   0 4.41  4.41 6.97    57.64  0
## 
## $word_freq_free
##    vars    n mean   sd median trimmed mad min max range  skew kurtosis
## X1    1 4601 0.25 0.83      0    0.08   0   0  20    20 10.76   196.12
##      se
## X1 0.01
## 
## $word_freq_business
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601 0.14 0.44      0    0.03   0   0 7.14  7.14 5.68     45.6
##      se
## X1 0.01
## 
## $word_freq_email
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601 0.18 0.53      0    0.05   0   0 9.09  9.09 5.41    47.89
##      se
## X1 0.01
## 
## $word_freq_you
##    vars    n mean   sd median trimmed  mad min   max range skew kurtosis
## X1    1 4601 1.66 1.78   1.31    1.41 1.94   0 18.75 18.75 1.59     5.25
##      se
## X1 0.03
## 
## $word_freq_credit
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601 0.09 0.51      0       0   0   0 18.18 18.18 14.59   382.42
##      se
## X1 0.01
## 
## $word_freq_your
##    vars    n mean  sd median trimmed  mad min   max range skew kurtosis
## X1    1 4601 0.81 1.2   0.22    0.56 0.33   0 11.11 11.11 2.43     8.99
##      se
## X1 0.02
## 
## $word_freq_font
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601 0.12 1.03      0       0   0   0 17.1  17.1 9.97   108.97
##      se
## X1 0.02
## 
## $word_freq_000
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601  0.1 0.35      0    0.01   0   0 5.45  5.45 5.71    46.73
##      se
## X1 0.01
## 
## $word_freq_money
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis
## X1    1 4601 0.09 0.44      0    0.01   0   0 12.5  12.5 14.68   301.59
##      se
## X1 0.01
## 
## $word_freq_hp
##    vars    n mean   sd median trimmed mad min   max range skew kurtosis
## X1    1 4601 0.55 1.67      0    0.15   0   0 20.83 20.83 5.71    43.53
##      se
## X1 0.02
## 
## $word_freq_hpl
##    vars    n mean   sd median trimmed mad min   max range skew kurtosis
## X1    1 4601 0.27 0.89      0    0.05   0   0 16.66 16.66 6.35     63.8
##      se
## X1 0.01
## 
## $word_freq_george
##    vars    n mean   sd median trimmed mad min   max range skew kurtosis
## X1    1 4601 0.77 3.37      0    0.05   0   0 33.33 33.33 5.74    34.15
##      se
## X1 0.05
## 
## $word_freq_650
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601 0.12 0.54      0       0   0   0 9.09  9.09  6.6    58.28
##      se
## X1 0.01
## 
## $word_freq_lab
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601  0.1 0.59      0       0   0   0 14.28 14.28 11.36   174.98
##      se
## X1 0.01
## 
## $word_freq_labs
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601  0.1 0.46      0       0   0   0 5.88  5.88 6.63    51.93
##      se
## X1 0.01
## 
## $word_freq_telnet
##    vars    n mean  sd median trimmed mad min  max range  skew kurtosis
## X1    1 4601 0.06 0.4      0       0   0   0 12.5  12.5 12.66   253.84
##      se
## X1 0.01
## 
## $word_freq_857
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis se
## X1    1 4601 0.05 0.33      0       0   0   0 4.76  4.76 10.54   127.18  0
## 
## $word_freq_data
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601  0.1 0.56      0       0   0   0 18.18 18.18 13.18   295.64
##      se
## X1 0.01
## 
## $word_freq_415
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis se
## X1    1 4601 0.05 0.33      0       0   0   0 4.76  4.76 10.47   125.75  0
## 
## $word_freq_85
##    vars    n mean   sd median trimmed mad min max range  skew kurtosis
## X1    1 4601 0.11 0.53      0       0   0   0  20    20 15.22   448.69
##      se
## X1 0.01
## 
## $word_freq_technology
##    vars    n mean  sd median trimmed mad min  max range skew kurtosis   se
## X1    1 4601  0.1 0.4      0       0   0   0 7.69  7.69 7.67    81.08 0.01
## 
## $word_freq_1999
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601 0.14 0.42      0    0.03   0   0 6.89  6.89 5.32    42.55
##      se
## X1 0.01
## 
## $word_freq_parts
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis se
## X1    1 4601 0.01 0.22      0       0   0   0 8.33  8.33 28.24   910.66  0
## 
## $word_freq_pm
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601 0.08 0.43      0       0   0   0 11.11 11.11 12.05   215.39
##      se
## X1 0.01
## 
## $word_freq_direct
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis
## X1    1 4601 0.06 0.35      0       0   0   0 4.76  4.76 9.14    99.23
##      se
## X1 0.01
## 
## $word_freq_cs
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis
## X1    1 4601 0.04 0.36      0       0   0   0 7.14  7.14 12.58   193.32
##      se
## X1 0.01
## 
## $word_freq_meeting
##    vars    n mean   sd median trimmed mad min   max range skew kurtosis
## X1    1 4601 0.13 0.77      0       0   0   0 14.28 14.28 9.45   115.53
##      se
## X1 0.01
## 
## $word_freq_original
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601 0.05 0.22      0       0   0   0 3.57  3.57 7.62    78.45  0
## 
## $word_freq_project
##    vars    n mean   sd median trimmed mad min max range  skew kurtosis
## X1    1 4601 0.08 0.62      0       0   0   0  20    20 18.76    479.1
##      se
## X1 0.01
## 
## $word_freq_re
##    vars    n mean   sd median trimmed mad min   max range skew kurtosis
## X1    1 4601  0.3 1.01      0    0.09   0   0 21.42 21.42 9.14   128.67
##      se
## X1 0.01
## 
## $word_freq_edu
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601 0.18 0.91      0       0   0   0 22.05 22.05 10.12   150.67
##      se
## X1 0.01
## 
## $word_freq_table
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis se
## X1    1 4601 0.01 0.08      0       0   0   0 2.17  2.17 19.85   458.73  0
## 
## $word_freq_conference
##    vars    n mean   sd median trimmed mad min max range  skew kurtosis se
## X1    1 4601 0.03 0.29      0       0   0   0  10    10 19.71   536.67  0
## 
## $char_freq_.
##    vars    n mean   sd median trimmed mad min  max range skew kurtosis se
## X1    1 4601 0.04 0.24      0       0   0   0 4.38  4.38 13.7   212.74  0
## 
## $char_freq_..1
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis se
## X1    1 4601 0.14 0.27   0.06    0.09 0.1   0 9.75  9.75 13.57   392.81  0
## 
## $char_freq_..2
##    vars    n mean   sd median trimmed mad min  max range  skew kurtosis se
## X1    1 4601 0.02 0.11      0       0   0   0 4.08  4.08 21.07   617.53  0
## 
## $char_freq_..3
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601 0.27 0.82      0    0.13   0   0 32.48 32.48 18.65   606.53
##      se
## X1 0.01
## 
## $char_freq_..4
##    vars    n mean   sd median trimmed mad min max range  skew kurtosis se
## X1    1 4601 0.08 0.25      0    0.03   0   0   6     6 11.16   199.65  0
## 
## $char_freq_..5
##    vars    n mean   sd median trimmed mad min   max range  skew kurtosis
## X1    1 4601 0.04 0.43      0       0   0   0 19.83 19.83 31.04  1216.64
##      se
## X1 0.01
## 
## $capital_run_length_average
##    vars    n mean    sd median trimmed  mad min    max  range  skew
## X1    1 4601 5.19 31.73   2.28    2.63 1.31   1 1102.5 1101.5 23.75
##    kurtosis   se
## X1   669.35 0.47
## 
## $capital_run_length_longest
##    vars    n  mean     sd median trimmed   mad min  max range  skew
## X1    1 4601 52.17 194.89     15   23.52 16.31   1 9989  9988 30.74
##    kurtosis   se
## X1  1478.39 2.87
## 
## $capital_run_length_total
##    vars    n   mean     sd median trimmed    mad min   max range skew
## X1    1 4601 283.29 606.35     95   156.6 114.16   1 15841 15840  8.7
##    kurtosis   se
## X1   145.61 8.94
## 
## $status
##    vars    n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 4601 0.39 0.49      0    0.37   0   0   1     1 0.43    -1.81 0.01

#describe(dfrDataset)

Observations
1. The summary output shows how each column's values are distributed across the quartiles.
2. The second function reports further statistics for each column, e.g. skewness, kurtosis, and the number of values.
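The skewness and kurtosis figures can be reproduced by hand. The following is a minimal sketch using the plain moment-based definitions; `moment_skew` and `moment_kurtosis` are hypothetical helper names that do not appear elsewhere in this analysis, and packaged implementations may apply a small-sample correction, so the last digits can differ.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment divided by the squared
# second central moment, minus 3 (so a normal distribution scores 0).
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Toy column shaped like the word frequencies: mostly zeros, one large value.
vcnToy <- c(rep(0, 9), 5)
moment_skew(vcnToy)      # positive: long right tail
moment_kurtosis(vcnToy)  # positive: heavier tails than a normal distribution
```

Large positive values for both, as seen for most word_freq columns, indicate that a few e-mails use a word far more often than the rest.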

Missing Data

lapply(dfrDataset, FUN=detectNA)

## $word_freq_make
## [1] 0
## 
## $word_freq_address
## [1] 0
## 
## $word_freq_all
## [1] 0
## 
## $word_freq_3d
## [1] 0
## 
## $word_freq_our
## [1] 0
## 
## $word_freq_over
## [1] 0
## 
## $word_freq_remove
## [1] 0
## 
## $word_freq_internet
## [1] 0
## 
## $word_freq_order
## [1] 0
## 
## $word_freq_mail
## [1] 0
## 
## $word_freq_receive
## [1] 0
## 
## $word_freq_will
## [1] 0
## 
## $word_freq_people
## [1] 0
## 
## $word_freq_report
## [1] 0
## 
## $word_freq_addresses
## [1] 0
## 
## $word_freq_free
## [1] 0
## 
## $word_freq_business
## [1] 0
## 
## $word_freq_email
## [1] 0
## 
## $word_freq_you
## [1] 0
## 
## $word_freq_credit
## [1] 0
## 
## $word_freq_your
## [1] 0
## 
## $word_freq_font
## [1] 0
## 
## $word_freq_000
## [1] 0
## 
## $word_freq_money
## [1] 0
## 
## $word_freq_hp
## [1] 0
## 
## $word_freq_hpl
## [1] 0
## 
## $word_freq_george
## [1] 0
## 
## $word_freq_650
## [1] 0
## 
## $word_freq_lab
## [1] 0
## 
## $word_freq_labs
## [1] 0
## 
## $word_freq_telnet
## [1] 0
## 
## $word_freq_857
## [1] 0
## 
## $word_freq_data
## [1] 0
## 
## $word_freq_415
## [1] 0
## 
## $word_freq_85
## [1] 0
## 
## $word_freq_technology
## [1] 0
## 
## $word_freq_1999
## [1] 0
## 
## $word_freq_parts
## [1] 0
## 
## $word_freq_pm
## [1] 0
## 
## $word_freq_direct
## [1] 0
## 
## $word_freq_cs
## [1] 0
## 
## $word_freq_meeting
## [1] 0
## 
## $word_freq_original
## [1] 0
## 
## $word_freq_project
## [1] 0
## 
## $word_freq_re
## [1] 0
## 
## $word_freq_edu
## [1] 0
## 
## $word_freq_table
## [1] 0
## 
## $word_freq_conference
## [1] 0
## 
## $char_freq_.
## [1] 0
## 
## $char_freq_..1
## [1] 0
## 
## $char_freq_..2
## [1] 0
## 
## $char_freq_..3
## [1] 0
## 
## $char_freq_..4
## [1] 0
## 
## $char_freq_..5
## [1] 0
## 
## $capital_run_length_average
## [1] 0
## 
## $capital_run_length_longest
## [1] 0
## 
## $capital_run_length_total
## [1] 0
## 
## $status
## [1] 0

Observations
1. No column contains NA values.
2. Data imputation is therefore not required.
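For readers without the helper at hand, the check can be sketched as follows. This assumes that `detectNA` simply counts the missing values in one column; `count_na` and `dfrToy` are hypothetical names used only for this illustration.

```r
# Hypothetical stand-in for detectNA: number of missing values in one column.
count_na <- function(x) sum(is.na(x))

# Toy data frame with one missing value in column a.
dfrToy <- data.frame(a = c(1, NA, 3), b = c(0, 0, 0))

# Per-column counts, in the same list form as the output above.
lapply(dfrToy, FUN = count_na)

# The same counts in a single call, as a named vector.
colSums(is.na(dfrToy))
```

The `colSums(is.na(...))` form gives a more compact result when the data frame has many columns.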

Data Imputation

# inline comments
#Data_Impute(dfrDataset$word_freq_make)
#dfrDataset$word_freq_make[is.na(dfrDataset$word_freq_make)] <- mean(dfrDataset$word_freq_make[!is.na(dfrDataset$word_freq_make)])

Box Plot

lapply(dfrDataset, FUN=Graph_Boxplot)

[Box plots: one plot per column, from word_freq_make through status]

Outliers

detect_outliers(dfrDataset$word_freq_make)

## [1] 1053

detect_outliers(dfrDataset$word_freq_address)

## [1] 898

detect_outliers(dfrDataset$word_freq_all)

## [1] 338

detect_outliers(dfrDataset$word_freq_3d)

## [1] 47

detect_outliers(dfrDataset$word_freq_our)

## [1] 501

detect_outliers(dfrDataset$word_freq_over)

## [1] 999

detect_outliers(dfrDataset$word_freq_remove)

## [1] 807

detect_outliers(dfrDataset$word_freq_internet)

## [1] 824

detect_outliers(dfrDataset$word_freq_order)

## [1] 773

detect_outliers(dfrDataset$word_freq_mail)

## [1] 852

detect_outliers(dfrDataset$word_freq_receive)

## [1] 709

detect_outliers(dfrDataset$word_freq_will)

## [1] 270

detect_outliers(dfrDataset$word_freq_people)

## [1] 852

detect_outliers(dfrDataset$word_freq_report)

## [1] 357

detect_outliers(dfrDataset$word_freq_addresses)

## [1] 336

detect_outliers(dfrDataset$word_freq_free)

## [1] 957

detect_outliers(dfrDataset$word_freq_business)

## [1] 963

detect_outliers(dfrDataset$word_freq_email)

## [1] 1038

detect_outliers(dfrDataset$word_freq_you)

## [1] 75

detect_outliers(dfrDataset$word_freq_credit)

## [1] 424

detect_outliers(dfrDataset$word_freq_your)

## [1] 229

detect_outliers(dfrDataset$word_freq_font)

## [1] 117

detect_outliers(dfrDataset$word_freq_000)

## [1] 679

detect_outliers(dfrDataset$word_freq_money)

## [1] 735

detect_outliers(dfrDataset$word_freq_hp)

## [1] 1090

detect_outliers(dfrDataset$word_freq_hpl)

## [1] 811

detect_outliers(dfrDataset$word_freq_george)

## [1] 780

detect_outliers(dfrDataset$word_freq_650)

## [1] 463

detect_outliers(dfrDataset$word_freq_lab)

## [1] 372

detect_outliers(dfrDataset$word_freq_labs)

## [1] 469

detect_outliers(dfrDataset$word_freq_telnet)

## [1] 293

detect_outliers(dfrDataset$word_freq_857)

## [1] 205

detect_outliers(dfrDataset$word_freq_data)

## [1] 405

detect_outliers(dfrDataset$word_freq_415)

## [1] 215

detect_outliers(dfrDataset$word_freq_85)

## [1] 485

detect_outliers(dfrDataset$word_freq_technology)

## [1] 599

detect_outliers(dfrDataset$word_freq_1999)

## [1] 829

detect_outliers(dfrDataset$word_freq_parts)

## [1] 83

detect_outliers(dfrDataset$word_freq_pm)

## [1] 384

detect_outliers(dfrDataset$word_freq_direct)

## [1] 453

detect_outliers(dfrDataset$word_freq_cs)

## [1] 148

detect_outliers(dfrDataset$word_freq_meeting)

## [1] 341

detect_outliers(dfrDataset$word_freq_original)

## [1] 375

detect_outliers(dfrDataset$word_freq_project)

## [1] 327

detect_outliers(dfrDataset$word_freq_re)

## [1] 1001

detect_outliers(dfrDataset$word_freq_edu)

## [1] 517

detect_outliers(dfrDataset$word_freq_table)

## [1] 63

detect_outliers(dfrDataset$word_freq_conference)

## [1] 203

detect_outliers(dfrDataset$char_freq_.)

## [1] 790

detect_outliers(dfrDataset$char_freq_..1)

## [1] 296

detect_outliers(dfrDataset$char_freq_..2)

## [1] 529

detect_outliers(dfrDataset$char_freq_..3)

## [1] 411

detect_outliers(dfrDataset$char_freq_..4)

## [1] 811

detect_outliers(dfrDataset$char_freq_..5)

## [1] 750

detect_outliers(dfrDataset$capital_run_length_average)

## [1] 363

detect_outliers(dfrDataset$capital_run_length_longest)

## [1] 463

detect_outliers(dfrDataset$capital_run_length_total)

## [1] 550

detect_outliers(dfrDataset$status)

## [1] 0

Observations
1. Almost every column contains outliers; only status has none.
2. Random Forest is relatively insensitive to outliers in the predictors, so the outliers are kept for this model.
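The counts above depend on how an outlier is defined. A minimal sketch, assuming `detect_outliers` applies the usual box plot rule (values more than 1.5 times the interquartile range outside the quartiles); `count_outliers` is a hypothetical name, not the helper used above.

```r
# Count values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
count_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Mostly-zero column: both quartiles are 0, so the IQR is 0 and
# every non-zero value is flagged.
count_outliers(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 2))

# Balanced 0/1 column: nothing falls outside the fences.
count_outliers(c(0, 1, 0, 1))
```

Under this rule a column whose quartiles are both zero flags every non-zero entry, which would explain why the sparse word frequency columns report hundreds of outliers.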

Outliers Treatment

#dfrDataset$word_freq_make<- Replace_Outliers_to_Median(dfrDataset$word_freq_make)
#dfrDataset$word_freq_address<- Replace_Outliers_to_Median(dfrDataset$word_freq_address)
#dfrDataset$word_freq_all<- Replace_Outliers_to_Median(dfrDataset$word_freq_all)
#dfrDataset$word_freq_3d<- Replace_Outliers_to_Median(dfrDataset$word_freq_3d)
#dfrDataset$word_freq_our<- Replace_Outliers_to_Median(dfrDataset$word_freq_our)
#dfrDataset$word_freq_over<- Replace_Outliers_to_Median(dfrDataset$word_freq_over)
#dfrDataset$word_freq_remove<- #Replace_Outliers_to_Median(dfrDataset$word_freq_remove)
#dfrDataset$word_freq_internet<- #Replace_Outliers_to_Median(dfrDataset$word_freq_internet)
#dfrDataset$word_freq_order<- #Replace_Outliers_to_Median(dfrDataset$word_freq_order)
#dfrDataset$word_freq_mail<- Replace_Outliers_to_Median(dfrDataset$word_freq_mail)
#dfrDataset$word_freq_receive<- Replace_Outliers_to_Median(dfrDataset$word_freq_receive)
#dfrDataset$word_freq_will<- Replace_Outliers_to_Median(dfrDataset$word_freq_will)
#dfrDataset$word_freq_people<- Replace_Outliers_to_Median(dfrDataset$word_freq_people)
#dfrDataset$word_freq_report<- Replace_Outliers_to_Median(dfrDataset$word_freq_report)
#dfrDataset$word_freq_addresses<- Replace_Outliers_to_Median(dfrDataset$word_freq_addresses)
#dfrDataset$word_freq_free<- Replace_Outliers_to_Median(dfrDataset$word_freq_free)
#dfrDataset$word_freq_business<- Replace_Outliers_to_Median(dfrDataset$word_freq_business)
#dfrDataset$word_freq_email<- Replace_Outliers_to_Median(dfrDataset$word_freq_email)
#dfrDataset$word_freq_you<- Replace_Outliers_to_Median(dfrDataset$word_freq_you)
#dfrDataset$word_freq_credit<- #Replace_Outliers_to_Median(dfrDataset$word_freq_credit)
#dfrDataset$word_freq_your<- Replace_Outliers_to_Median(dfrDataset$word_freq_your)
#dfrDataset$word_freq_font<- Replace_Outliers_to_Median(dfrDataset$word_freq_font)
#dfrDataset$word_freq_000<- Replace_Outliers_to_Median(dfrDataset$word_freq_000)
#dfrDataset$word_freq_money<- Replace_Outliers_to_Median(dfrDataset$word_freq_money)
#dfrDataset$word_freq_hp<- Replace_Outliers_to_Median(dfrDataset$word_freq_hp)
#dfrDataset$word_freq_hpl<- Replace_Outliers_to_Median(dfrDataset$word_freq_hpl)
#dfrDataset$word_freq_george<- Replace_Outliers_to_Median(dfrDataset$word_freq_george)
#dfrDataset$word_freq_650<- Replace_Outliers_to_Median(dfrDataset$word_freq_650)
#dfrDataset$word_freq_lab<- Replace_Outliers_to_Median(dfrDataset$word_freq_lab)
#dfrDataset$word_freq_labs<- Replace_Outliers_to_Median(dfrDataset$word_freq_labs)
#dfrDataset$word_freq_telnet<- Replace_Outliers_to_Median(dfrDataset$word_freq_telnet)
#dfrDataset$word_freq_857<- Replace_Outliers_to_Median(dfrDataset$word_freq_857)
#dfrDataset$word_freq_data<- Replace_Outliers_to_Median(dfrDataset$word_freq_data)
#dfrDataset$word_freq_415<- Replace_Outliers_to_Median(dfrDataset$word_freq_415)
#dfrDataset$word_freq_85<- Replace_Outliers_to_Median(dfrDataset$word_freq_85)
#dfrDataset$word_freq_technology<- Replace_Outliers_to_Median(dfrDataset$word_freq_technology)
#dfrDataset$word_freq_1999<- Replace_Outliers_to_Median(dfrDataset$word_freq_1999)
#dfrDataset$word_freq_parts<- Replace_Outliers_to_Median(dfrDataset$word_freq_parts)
#dfrDataset$word_freq_pm<- Replace_Outliers_to_Median(dfrDataset$word_freq_pm)
#dfrDataset$word_freq_direct<- Replace_Outliers_to_Median(dfrDataset$word_freq_direct)
#dfrDataset$word_freq_cs<- Replace_Outliers_to_Median(dfrDataset$word_freq_cs)
#dfrDataset$word_freq_meeting<- Replace_Outliers_to_Median(dfrDataset$word_freq_meeting)
#dfrDataset$word_freq_original<- Replace_Outliers_to_Median(dfrDataset$word_freq_original)
#dfrDataset$word_freq_project<- Replace_Outliers_to_Median(dfrDataset$word_freq_project)
#dfrDataset$word_freq_re<- Replace_Outliers_to_Median(dfrDataset$word_freq_re)
#dfrDataset$word_freq_edu<- Replace_Outliers_to_Median(dfrDataset$word_freq_edu)
#dfrDataset$word_freq_table<- #Replace_Outliers_to_Median(dfrDataset$word_freq_table)
#dfrDataset$word_freq_conference<- #Replace_Outliers_to_Median(dfrDataset$word_freq_conference)
#dfrDataset$char_freq_.<- Replace_Outliers_to_Median(dfrDataset$char_freq_.)
#dfrDataset$char_freq_..1<- Replace_Outliers_to_Median(dfrDataset$char_freq_..1)
#dfrDataset$char_freq_..2<- Replace_Outliers_to_Median(dfrDataset$char_freq_..2)
#dfrDataset$char_freq_..3<- Replace_Outliers_to_Median(dfrDataset$char_freq_..3)
#dfrDataset$char_freq_..4<- Replace_Outliers_to_Median(dfrDataset$char_freq_..4)
#dfrDataset$char_freq_..5<- Replace_Outliers_to_Median(dfrDataset$char_freq_..5)
#dfrDataset$capital_run_length_average<- #Replace_Outliers_to_Median(dfrDataset$capital_run_length_average)
#dfrDataset$capital_run_length_longest<- Replace_Outliers_to_Median(dfrDataset$capital_run_length_longest)
#dfrDataset$capital_run_length_total<- Replace_Outliers_to_Median(dfrDataset$capital_run_length_total)
#dfrDataset$status<- Replace_Outliers_to_Median(dfrDataset$status)

#dfrDataset <- dfrDataset[complete.cases(dfrDataset), ]
#nrow(dfrDataset)
#head(dfrDataset)

Reasons Why Outlier Handling Is Not Required
1. The dataset has 4601 records. After removing the outliers of every column only 177 records remain, which leaves most of the data out of the analysis.
2. If outliers are replaced with the mean, the same number of outliers remains, because the median is zero for most columns.
3. If outliers are replaced with the median, many columns end up containing only zeroes, because the median is zero for most columns. After that replacement the correlation between variables can no longer be measured.
4. Random Forest is relatively insensitive to outliers.
5. After creating models with different methods, accuracy was found to be good and the outliers did not cause problems in prediction. (This point was added after model creation.)
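Point 3 can be illustrated on a toy column. This is a sketch under the assumption that outliers are flagged with the 1.5 * IQR box plot rule; `vcnToy` and `vcnTreated` are hypothetical names and the values are made up.

```r
# Toy column shaped like a word frequency: eight zeros, two non-zero values.
vcnToy <- c(rep(0, 8), 0.5, 3.0)

# Box plot fences; both quartiles are 0 here, so the fences collapse to 0.
q <- unname(quantile(vcnToy, probs = c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_out <- vcnToy < q[1] - fence | vcnToy > q[2] + fence

# Replace the flagged values with the median, which is also 0.
vcnTreated <- replace(vcnToy, is_out, median(vcnToy))

sum(is_out)           # number of values flagged as outliers
all(vcnTreated == 0)  # TRUE: the column has become constant
sd(vcnTreated)        # 0, so a correlation with status is undefined
```

A constant column carries no information for the model, which is why the treatment is left commented out.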

Spam Distribution

status <- table(dfrDataset$status)
status

## 
##    0    1 
## 2788 1813

prop.table(status)

## 
##         0         1 
## 0.6059552 0.3940448

Observations
1. 1813 e-mails are spam, which is 39.40% of all records; the remaining 2788 (60.60%) are not spam.

Top Words of Spam & non Spam mails

# mean of each frequency column (columns 1 to 54), per status value
dfrDataset_All <- aggregate(dfrDataset[, 1:54], list(dfrDataset$status), mean)

# spam (status == 1): drop the grouping column, keep the three largest means
dfrDataset_spam <- dfrDataset_All[dfrDataset_All$Group.1==1,]
dfrDataset_spam$Group.1 <- NULL
dfrDataset_spam <- sort(unlist(dfrDataset_spam), decreasing=TRUE)[1:3]
barplot(dfrDataset_spam,  main="Spam Emails Top 3 Words",ylab="Percentage", xlab="Words")

# non-spam (status == 0): same steps
dfrDataset_notspam <- dfrDataset_All[dfrDataset_All$Group.1==0,]
dfrDataset_notspam$Group.1 <- NULL
dfrDataset_notspam <- sort(unlist(dfrDataset_notspam), decreasing=TRUE)[1:3]
barplot(dfrDataset_notspam,  main="Non Spam Emails Top 3 Words", ylab="Percentage", xlab="Words")

Output Visualisation

dfrQltyFreq <- summarise(group_by(dfrDataset, status), count=n())
dfrQltyFreq

## # A tibble: 2 x 2
##   status count
##    <int> <int>
## 1      0  2788
## 2      1  1813

# bar chart of record counts by status
ggplot(dfrQltyFreq, aes(x=status, y=count)) +
    geom_bar(stat="identity", aes(fill=count)) +
    labs(title="Status Frequency Distribution") +
    labs(x="Status") +
    labs(y="Counts")

Find Correlations

## find correlations
vcnCorsData <- abs(sapply(colnames(dfrDataset), detectCor)) #absolute value
summary(vcnCorsData)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002525 0.148801 0.253256 0.276879 0.354708 1.000000
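The `detectCor` helper is not shown in this part of the analysis. A minimal sketch of what such a helper might look like, assuming it returns the Pearson correlation between the named column and status; `cor_with_status`, `dfrToy`, and the toy values are hypothetical.

```r
# Toy stand-in for dfrDataset with two features and the class label.
dfrToy <- data.frame(
  word_freq_free = c(0.0, 0.1, 0.9, 1.2, 0.0, 0.8),
  word_freq_hp   = c(1.5, 0.9, 0.0, 0.0, 1.1, 0.0),
  status         = c(0, 0, 1, 1, 0, 1)
)

# Hypothetical detectCor: correlation of one named column with status.
cor_with_status <- function(col_name, data = dfrToy) {
  cor(as.numeric(data[[col_name]]), as.numeric(data$status))
}

# Same call pattern as above: one absolute correlation per column name.
vcnToyCors <- abs(sapply(colnames(dfrToy), cor_with_status))
vcnToyCors
```

Taking the absolute value hides the direction of the relationship: in the toy data word_freq_hp is negatively correlated with status, yet it appears with a positive value. The status entry is always 1 because the column is correlated with itself, which explains the Max. of 1.000000 in the summary.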

Show Correlations

vcnCorsData

##             word_freq_make          word_freq_address 
##                 0.24069974                 0.29750940 
##              word_freq_all               word_freq_3d 
##                 0.33283147                 0.09077776 
##              word_freq_our             word_freq_over 
##                 0.40913946                 0.31864550 
##           word_freq_remove         word_freq_internet 
##                 0.51877779                 0.34379623 
##            word_freq_order             word_freq_mail 
##                 0.30073703                 0.29682394 
##          word_freq_receive             word_freq_will 
##                 0.35496682                 0.14847653 
##           word_freq_people           word_freq_report 
##                 0.21287588                 0.14977533 
##        word_freq_addresses             word_freq_free 
##                 0.26515743                 0.50416922 
##         word_freq_business            word_freq_email 
##                 0.35290749                 0.29909391 
##              word_freq_you           word_freq_credit 
##                 0.36110406                 0.32418657 
##             word_freq_your             word_freq_font 
##                 0.50159062                 0.13797471 
##              word_freq_000            word_freq_money 
##                 0.42580256                 0.47215455 
##               word_freq_hp              word_freq_hpl 
##                 0.39981558                 0.34188069 
##           word_freq_george              word_freq_650 
##                 0.35393063                 0.22619064 
##              word_freq_lab             word_freq_labs 
##                 0.22068802                 0.24580530 
##           word_freq_telnet              word_freq_857 
##                 0.20467400                 0.16983798 
##             word_freq_data              word_freq_415 
##                 0.15756347                 0.15802818 
##               word_freq_85       word_freq_technology 
##                 0.21413087                 0.16680254 
##             word_freq_1999            word_freq_parts 
##                 0.26070752                 0.00252536 
##               word_freq_pm           word_freq_direct 
##                 0.14721389                 0.02813193 
##               word_freq_cs          word_freq_meeting 
##                 0.14453750                 0.19574176 
##         word_freq_original          word_freq_project 
##                 0.10781412                 0.14453744 
##               word_freq_re              word_freq_edu 
##                 0.07176763                 0.19702549 
##            word_freq_table       word_freq_conference 
##                 0.02266674                 0.13903044 
##                char_freq_.              char_freq_..1 
##                 0.05683530                 0.03263555 
##              char_freq_..2              char_freq_..3 
##                 0.11122690                 0.59785363 
##              char_freq_..4              char_freq_..5 
##                 0.56563314                 0.26668614 
## capital_run_length_average capital_run_length_longest 
##                 0.48794983                 0.51515693 
##   capital_run_length_total                     status 
##                 0.44397367                 1.00000000

Plot Correlations

#corrplot(cor(dfrDataset[1:10,1:58])[1:58,58, drop=FALSE], cl.pos='n')
corrplot(cor(dfrDataset[c(1:10,58)]))

corrplot(cor(dfrDataset[c(10:20,58)]))

corrplot(cor(dfrDataset[c(20:30,58)]))

corrplot(cor(dfrDataset[c(30:40,58)]))

corrplot(cor(dfrDataset[c(40:50,58)]))

corrplot(cor(dfrDataset[c(50:58)]))

Observations
1. Each plot shows the correlations among a group of variables together with the status variable (column 58).
2. Only a few variables show a medium to high correlation with status.

More than Medium Correlations

vcnCorsData[vcnCorsData>0.4]

##              word_freq_our           word_freq_remove 
##                  0.4091395                  0.5187778 
##             word_freq_free             word_freq_your 
##                  0.5041692                  0.5015906 
##              word_freq_000            word_freq_money 
##                  0.4258026                  0.4721546 
##              char_freq_..3              char_freq_..4 
##                  0.5978536                  0.5656331 
## capital_run_length_average capital_run_length_longest 
##                  0.4879498                  0.5151569 
##   capital_run_length_total                     status 
##                  0.4439737                  1.0000000

Create Column IsSpam

dfrDataset <- mutate(dfrDataset, IsSpam= ifelse(dfrDataset$status ==0,'Not Spam',
                                'Spam'))
dfrDataset$IsSpam <- as.factor(dfrDataset$IsSpam)
table(dfrDataset$IsSpam)

## 
## Not Spam     Spam 
##     2788     1813

#By now you should have checked the correlations, removed columns if required, and checked for class imbalance.
#Also remove columns with many NA values.

Observations
1. The new variable IsSpam has been created from the status variable.
2. 2788 e-mail records are not spam and 1813 are spam.

Dataset Split

set.seed(707)
vctTrnRecs <- createDataPartition(y=dfrDataset$status, p=0.8, list=FALSE)
dfrTrnData <- dfrDataset[vctTrnRecs,]
dfrTstData <- dfrDataset[-vctTrnRecs,]

Observations
1. The dataset is divided into a training dataset and a test dataset.
2. 80% of the records go to the training data, so the model has more data to learn from, and the remaining 20% are held out to test the model.
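To see what a stratified 80/20 split preserves, here is a minimal base R sketch on a toy label vector. It is an illustration of the idea only, not caret's implementation of `createDataPartition`; `vcnStatus`, `vctTrain`, and `vctTest` are hypothetical names.

```r
set.seed(707)

# Toy labels: 60 non-spam (0) and 40 spam (1).
vcnStatus <- rep(c(0, 1), times = c(60, 40))

# Sample 80% of the row indices within each class separately.
vctTrain <- unlist(lapply(split(seq_along(vcnStatus), vcnStatus), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
vctTest <- setdiff(seq_along(vcnStatus), vctTrain)

# Class proportions are the same in both parts.
prop.table(table(vcnStatus[vctTrain]))
prop.table(table(vcnStatus[vctTest]))
```

Sampling within each class keeps the spam share of the training and test data close to the 39.4% of the full dataset. As far as the caret documentation describes it, `createDataPartition` stratifies on quantile groups when `y` is numeric, so passing the factor IsSpam instead of the integer status would make the stratification by class explicit.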

Training Dataset RowCount & ColCount

dim(dfrTrnData)

## [1] 3681   59

Observations
1. There are 3681 data records in the training data set.

Testing Dataset RowCount & ColCount

dim(dfrTstData)

## [1] 920  59

Observations
1. There are 920 data records in the test data set.

Training Dataset Head

head(dfrTrnData)

##    word_freq_make word_freq_address word_freq_all word_freq_3d
## 2            0.21              0.28          0.50            0
## 3            0.06              0.00          0.71            0
## 4            0.00              0.00          0.00            0
## 10           0.06              0.12          0.77            0
## 11           0.00              0.00          0.00            0
## 12           0.00              0.00          0.25            0
##    word_freq_our word_freq_over word_freq_remove word_freq_internet
## 2           0.14           0.28             0.21               0.07
## 3           1.23           0.19             0.19               0.12
## 4           0.63           0.00             0.31               0.63
## 10          0.19           0.32             0.38               0.00
## 11          0.00           0.00             0.96               0.00
## 12          0.38           0.25             0.25               0.00
##    word_freq_order word_freq_mail word_freq_receive word_freq_will
## 2             0.00           0.94              0.21           0.79
## 3             0.64           0.25              0.38           0.45
## 4             0.31           0.63              0.31           0.31
## 10            0.06           0.00              0.00           0.64
## 11            0.00           1.92              0.96           0.00
## 12            0.00           0.00              0.12           0.12
##    word_freq_people word_freq_report word_freq_addresses word_freq_free
## 2              0.65             0.21                0.14           0.14
## 3              0.12             0.00                1.75           0.06
## 4              0.31             0.00                0.00           0.31
## 10             0.25             0.00                0.12           0.00
## 11             0.00             0.00                0.00           0.00
## 12             0.12             0.00                0.00           0.00
##    word_freq_business word_freq_email word_freq_you word_freq_credit
## 2                0.07            0.28          3.47             0.00
## 3                0.06            1.03          1.36             0.32
## 4                0.00            0.00          3.18             0.00
## 10               0.00            0.12          1.67             0.06
## 11               0.00            0.96          3.84             0.00
## 12               0.00            0.00          1.16             0.00
##    word_freq_your word_freq_font word_freq_000 word_freq_money
## 2            1.59              0          0.43            0.43
## 3            0.51              0          1.16            0.06
## 4            0.31              0          0.00            0.00
## 10           0.71              0          0.19            0.00
## 11           0.96              0          0.00            0.00
## 12           0.77              0          0.00            0.00
##    word_freq_hp word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 2             0             0                0             0             0
## 3             0             0                0             0             0
## 4             0             0                0             0             0
## 10            0             0                0             0             0
## 11            0             0                0             0             0
## 12            0             0                0             0             0
##    word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 2               0                0             0              0
## 3               0                0             0              0
## 4               0                0             0              0
## 10              0                0             0              0
## 11              0                0             0              0
## 12              0                0             0              0
##    word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 2              0            0                    0           0.07
## 3              0            0                    0           0.00
## 4              0            0                    0           0.00
## 10             0            0                    0           0.00
## 11             0            0                    0           0.00
## 12             0            0                    0           0.00
##    word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 2                0            0             0.00            0
## 3                0            0             0.06            0
## 4                0            0             0.00            0
## 10               0            0             0.00            0
## 11               0            0             0.96            0
## 12               0            0             0.00            0
##    word_freq_meeting word_freq_original word_freq_project word_freq_re
## 2                  0               0.00              0.00         0.00
## 3                  0               0.12              0.00         0.06
## 4                  0               0.00              0.00         0.00
## 10                 0               0.00              0.06         0.00
## 11                 0               0.00              0.00         0.00
## 12                 0               0.00              0.00         0.00
##    word_freq_edu word_freq_table word_freq_conference char_freq_.
## 2           0.00               0                    0       0.000
## 3           0.06               0                    0       0.010
## 4           0.00               0                    0       0.000
## 10          0.00               0                    0       0.040
## 11          0.00               0                    0       0.000
## 12          0.00               0                    0       0.022
##    char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 2          0.132             0         0.372         0.180         0.048
## 3          0.143             0         0.276         0.184         0.010
## 4          0.137             0         0.137         0.000         0.000
## 10         0.030             0         0.244         0.081         0.000
## 11         0.000             0         0.462         0.000         0.000
## 12         0.044             0         0.663         0.000         0.000
##    capital_run_length_average capital_run_length_longest
## 2                       5.114                        101
## 3                       9.821                        485
## 4                       3.537                         40
## 10                      1.729                         43
## 11                      1.312                          6
## 12                      1.243                         11
##    capital_run_length_total status IsSpam
## 2                      1028      1   Spam
## 3                      2259      1   Spam
## 4                       191      1   Spam
## 10                      749      1   Spam
## 11                       21      1   Spam
## 12                      184      1   Spam

Testing Dataset Head

head(dfrTstData)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
## 7           0.00              0.00          0.00            0
## 8           0.00              0.00          0.00            0
## 9           0.15              0.00          0.46            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32              0             0.00               0.00
## 5          0.63              0             0.31               0.63
## 6          1.85              0             0.00               1.85
## 7          1.92              0             0.00               0.00
## 8          1.88              0             0.00               1.88
## 9          0.61              0             0.30               0.00
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
## 7            0.00           0.64              0.96           1.28
## 8            0.00           0.00              0.00           0.00
## 9            0.92           0.76              0.76           0.92
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00                0                   0           0.32
## 5             0.31                0                   0           0.31
## 6             0.00                0                   0           0.00
## 7             0.00                0                   0           0.96
## 8             0.00                0                   0           0.00
## 9             0.00                0                   0           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1                  0            1.29          1.93             0.00
## 5                  0            0.00          3.18             0.00
## 6                  0            0.00          0.00             0.00
## 7                  0            0.32          3.85             0.00
## 8                  0            0.00          0.00             0.00
## 9                  0            0.15          1.23             3.53
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0             0            0.00            0
## 5           0.31              0             0            0.00            0
## 6           0.00              0             0            0.00            0
## 7           0.64              0             0            0.00            0
## 8           0.00              0             0            0.00            0
## 9           2.00              0             0            0.15            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
## 7             0                0             0             0
## 8             0                0             0             0
## 9             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0           0.00
## 5              0                0             0           0.00
## 6              0                0             0           0.00
## 7              0                0             0           0.00
## 8              0                0             0           0.00
## 9              0                0             0           0.15
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0              0
## 5             0            0                    0              0
## 6             0            0                    0              0
## 7             0            0                    0              0
## 8             0            0                    0              0
## 9             0            0                    0              0
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0                0            0
## 5               0            0                0            0
## 6               0            0                0            0
## 7               0            0                0            0
## 8               0            0                0            0
## 9               0            0                0            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0                0.0                 0            0
## 5                 0                0.0                 0            0
## 6                 0                0.0                 0            0
## 7                 0                0.0                 0            0
## 8                 0                0.0                 0            0
## 9                 0                0.3                 0            0
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1             0               0                    0           0
## 5             0               0                    0           0
## 6             0               0                    0           0
## 7             0               0                    0           0
## 8             0               0                    0           0
## 9             0               0                    0           0
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
## 7         0.054             0         0.164         0.054         0.000
## 8         0.206             0         0.000         0.000         0.000
## 9         0.271             0         0.181         0.203         0.022
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 5                      3.537                         40
## 6                      3.000                         15
## 7                      1.671                          4
## 8                      2.450                         11
## 9                      9.744                        445
##   capital_run_length_total status IsSpam
## 1                      278      1   Spam
## 5                      191      1   Spam
## 6                       54      1   Spam
## 7                      112      1   Spam
## 8                       49      1   Spam
## 9                     1257      1   Spam

Training Dataset Summary

lapply(dfrTrnData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09885 0.00000 4.54000 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2027  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.276   0.400   5.100 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07702  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3066  0.3700 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09361 0.00000 3.57000 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1128  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1111  0.0000 11.1100 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09092 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2315  0.1400 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06053 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.090   0.533   0.780   7.690 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09264 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05711  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0495  0.0000  4.4100 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2504  0.1000 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.14    0.00    7.14 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1867  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.681   2.670  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08879  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2000  0.8052  1.2800 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1237  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1015  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09806  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5447  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2772  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7443  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1324  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09591  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1049  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06735  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04803 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1003  0.0000 18.1800 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04888 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.111   0.000  20.000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1027  0.0000  7.6900 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1399  0.0000  6.8900 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01477 0.00000 8.33000 
## 
## $word_freq_pm
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0794  0.0000  9.7500 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06296 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04437 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1302  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04856 0.00000 3.57000 
## 
## $word_freq_project
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08274  0.00000 20.00000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3028  0.1200 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1815  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.005186 0.000000 2.170000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.02867  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0403  0.0000  4.3850 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0650  0.1398  0.1890  9.7520 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01525 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2693  0.3110 32.4780 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07428 0.05000 5.30000 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03956  0.00000 13.12900 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.576    2.250    5.180    3.697 1102.500 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   14.00   52.23   43.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    93.0   282.7   266.0 15841.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3953  1.0000  1.0000 
## 
## $IsSpam
## Not Spam     Spam 
##     2226     1455

Testing Dataset Summary

lapply(dfrTstData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1274  0.0000  4.0000 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2542  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2994  0.5000  4.5400 
## 
## $word_freq_3d
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01904 0.00000 7.18000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3349  0.4300  7.1400 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1051  0.0000  5.8800 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1199  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08196 0.00000 4.00000 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08664 0.00000 3.33000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2709  0.2200 11.1100 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.00000 2.00000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1650  0.5763  0.8625  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09909 0.00000 2.94000 
## 
## $word_freq_report
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06469 0.00000 5.55000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04803 0.00000 2.31000 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2427  0.1100 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1529  0.0000  4.8700 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.177   0.000   4.160 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.585   2.500  14.000 
## 
## $word_freq_credit
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07271 0.00000 6.25000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.270   0.828   1.260   8.000 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1112  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1022  0.0000  3.3800 
## 
## $word_freq_money
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07909 0.00000 4.41000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5688  0.0000 20.0000 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2181  0.0000  7.6900 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.8594  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09478 0.00000 4.76000 
## 
## $word_freq_lab
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1109  0.0000 10.0000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09458 0.00000 4.76000 
## 
## $word_freq_telnet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05438 0.00000 4.76000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04312 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08496 0.00000 8.33000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04366 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08305 0.00000 4.76000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07655 0.00000 4.76000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.125   0.000   3.700 
## 
## $word_freq_parts
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.006913 0.000000 1.560000 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07554  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07235 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04086 0.00000 4.75000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1408  0.0000  7.6900 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03627 0.00000 1.69000 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06501 0.00000 4.54000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.295   0.070  16.660 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1731  0.0000  9.0900 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.006478 0.000000 2.120000 
## 
## $word_freq_conference
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04469 0.00000 8.33000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03168 0.00000 3.67200 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0675  0.1360  0.1782  4.2710 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0239  0.0000  2.7770 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2681  0.3352  5.8280 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08193 0.06125 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06296  0.00000 19.82900 
## 
## $capital_run_length_average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.625   2.333   5.236   3.796 443.666 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   16.00   51.94   44.00 1325.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    37.0   101.5   285.5   259.2  9088.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3891  1.0000  1.0000 
## 
## $IsSpam
## Not Spam     Spam 
##      562      358
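
The class counts in the two summaries give a quick check that the split kept the class balance. The sketch below uses only the counts printed above; the variable names are illustrative and not part of the original script. It recomputes the spam proportion in each set, which should agree with the mean of status, and the share of rows that went to training.

```r
# Class counts copied from the IsSpam summaries above
trnCounts <- c("Not Spam" = 2226, "Spam" = 1455)
tstCounts <- c("Not Spam" = 562,  "Spam" = 358)

# Proportion of spam in each set (should match the mean of `status`)
trnSpamRate <- trnCounts[["Spam"]] / sum(trnCounts)
tstSpamRate <- tstCounts[["Spam"]] / sum(tstCounts)

# Share of all rows that was used for training
trnShare <- sum(trnCounts) / (sum(trnCounts) + sum(tstCounts))

print(round(c(train = trnSpamRate, test = tstSpamRate, trainShare = trnShare), 4))
```

Both sets contain roughly 39% spam, so the test set is representative of the training set as far as the class label is concerned.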

Random Forest Approach

Default Method

myCtrl <- trainControl(method="cv", number=10, repeats=3)  # repeats is only used when method="repeatedcv"

Increasing the number of resampling iterations can improve accuracy but makes the model take longer to build. train() requires the caret package, and the random forest itself requires the randomForest package.

m1 <- train(predictor~., data=dataFrame, method="rf",
verbose=F, trControl=myCtrl)
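
To make the resampling settings concrete: with method="cv" and number=10 the rows are split into ten folds and each fold is held out once while the model is fitted on the rest. The base-R sketch below only illustrates that partitioning. It is not caret's own code: caret stratifies the folds by class, so its fold sizes can differ slightly, and nRows, k and folds are illustrative names.

```r
set.seed(707)
nRows <- 3681   # number of rows in the training set used in this report
k <- 10         # number of folds, as in number=10

# Assign every row to one of k folds at random, with sizes as equal as possible
folds <- sample(rep(seq_len(k), length.out = nRows))
foldSizes <- as.integer(table(folds))

# Rows left for fitting when each fold in turn is held out
trainSizes <- nRows - foldSizes
print(trainSizes)
```

Each fit therefore sees about nine tenths of the rows, which is why ten-fold cross-validation costs roughly ten model fits, and three repeats roughly thirty.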

Boost Method

cvCtrl <- trainControl(method="repeatedcv", number=10, repeats=3)
m2 <- train(predictor~., data=dataFrame, method="gbm", verbose=F, trControl=cvCtrl)

Custom Algorithm … notice method is not mentioned here

myCtrl <- trainControl(method="oob", number=10, repeats=3)
m3 <- train(predictor~., data=dataFrame, tuneGrid=data.frame(mtry=10), trControl=myCtrl)

We could also try one of these if required

Support Vector Machines Model

myCtrl <- trainControl(method="cv", number=10, repeats=3)
# caret has no method called "svm"; "svmRadial" is one of its SVM variants
o1 <- train(predictor~., data=dataFrame, method="svmRadial", verbose=F, trControl=myCtrl)

KNN Model

myCtrl <- trainControl(method="cv", number=10, repeats=3)
o2 <- train(predictor~., data=dataFrame, method="knn", verbose=F, trControl=myCtrl)

Bagged Model

myCtrl <- trainControl(method="cv", number=10, repeats=3)
o3 <- train(predictor~., data=dataFrame, method="bag", verbose=F, trControl=myCtrl)

Create Model - Random Forest (Default)

## set seed
set.seed(707)
# mtry: square root of the number of predictor columns
myMtry=sqrt(ncol(dfrTrnData)-2)  # minus 2 because status and IsSpam are not predictors
myNtrees=500
# start time
vctProcStrt <- proc.time()
# proc.time() returns the user, system and elapsed times of the running R process

# random forest (default)
#mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData)
mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData, 
                             mtry=myMtry, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 17.37"

Observations
1. 17.37 seconds of user CPU time were taken to create the model by the Default method.
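
The printed figure is the difference of the first element of proc.time(), which is user CPU time rather than wall-clock time. The base-R sketch below shows how to read the other elements. The workload function is only a placeholder for the model-fitting call, and the names are illustrative.

```r
# Placeholder workload; replace with the model-fitting call being timed
workload <- function() invisible(sort(runif(2e5)))

# Manual timing with proc.time(), as in the script above
tStart <- proc.time()
workload()
tDiff <- proc.time() - tStart

# Named elements: user CPU, system CPU and wall-clock (elapsed) seconds
print(tDiff[c("user.self", "sys.self", "elapsed")])

# system.time() wraps the same measurement in a single call
tSys <- system.time(workload())
print(tSys[["elapsed"]])
```

When comparing models, the elapsed element is usually the one to report, because it is the time a user actually waits.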

View Model - Default Random Forest

mdlRndForDef

## 
## Call:
##  randomForest(formula = IsSpam ~ . - status, data = dfrTrnData,      mtry = myMtry, ntree = myNtrees) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 4.54%
## Confusion matrix:
##          Not Spam Spam class.error
## Not Spam     2168   58  0.02605571
## Spam          109 1346  0.07491409

# As a rule of thumb, accuracy above 70% is considered good.

Observations
1. The OOB error rate is only 4.54%, which means the accuracy is around 95.46%. This is a very high accuracy.
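
The OOB error rate and the class.error column can be recomputed from the confusion matrix in the printout. The sketch below copies the four counts by hand (rows are the actual class, columns the OOB prediction); the variable names are illustrative.

```r
# OOB confusion matrix copied from the model printout (rows = actual class)
oobCmx <- matrix(c(2168,   58,
                    109, 1346),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual    = c("Not Spam", "Spam"),
                                 predicted = c("Not Spam", "Spam")))

# Overall OOB error: misclassified rows divided by all rows
oobError <- (sum(oobCmx) - sum(diag(oobCmx))) / sum(oobCmx)

# Per-class error: share of each actual class that was misclassified
classError <- 1 - diag(oobCmx) / rowSums(oobCmx)

print(round(100 * oobError, 2))
print(round(classError, 4))
```

The per-class figures show that spam is misclassified more often (about 7.5%) than non-spam (about 2.6%), which the single overall error rate hides.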

View Model Summary - Default Random Forest

summary(mdlRndForDef)

##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted       3681   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           7362   matrix numeric  
## oob.times       3681   -none- numeric  
## classes            2   -none- character
## importance        57   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               3681   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

# mdlRndForDef[3]
# summary() lists the components of the fitted model.
# The Length column gives the length of each component.
# To see the value stored in a component, access it by name,
# e.g. type mdlRndForDef$ntree in the console.

Prediction - Test Data - Random Forest (Default)

vctRndForDef <- predict(mdlRndForDef, newdata=dfrTstData)
cmxRndForDef  <- confusionMatrix(vctRndForDef, dfrTstData$IsSpam)
cmxRndForDef

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      534   21
##   Spam           28  337
##                                           
##                Accuracy : 0.9467          
##                  95% CI : (0.9302, 0.9603)
##     No Information Rate : 0.6109          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8884          
##  Mcnemar's Test P-Value : 0.3914          
##                                           
##             Sensitivity : 0.9502          
##             Specificity : 0.9413          
##          Pos Pred Value : 0.9622          
##          Neg Pred Value : 0.9233          
##              Prevalence : 0.6109          
##          Detection Rate : 0.5804          
##    Detection Prevalence : 0.6033          
##       Balanced Accuracy : 0.9458          
##                                           
##        'Positive' Class : Not Spam        
##

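# Look at the accuracy and its 95% CI: here the accuracy is 0.9467 with a CI of (0.9302, 0.9603).
# The p-value [Acc > NIR] is below 0.05, so the model is significantly better than always predicting the majority class.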

Observations
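1. The p-value [Acc > NIR] is less than 0.05, so we reject the null hypothesis that the model is no more accurate than the no-information rate (0.6109) at the 5% significance level. The predictor variables therefore carry real information about the class.
2. Accuracy on the test data is also very high, around 95%, and close to the OOB accuracy on the training data, so the model generalizes well.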
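
The headline statistics in the confusionMatrix output follow directly from the four cell counts. The base-R sketch below recomputes them by hand so that each definition is visible. The counts are copied from the output above, 'Not Spam' is treated as the positive class as in that output, and the variable names are illustrative.

```r
# Test-set confusion matrix copied from the output (rows = prediction, columns = reference)
cmx <- matrix(c(534,  21,
                 28, 337),
              nrow = 2, byrow = TRUE,
              dimnames = list(prediction = c("Not Spam", "Spam"),
                              reference  = c("Not Spam", "Spam")))
n <- sum(cmx)

# Accuracy: correctly classified rows over all rows
accuracy <- sum(diag(cmx)) / n
# Sensitivity: share of actual 'Not Spam' (the positive class) predicted as 'Not Spam'
sensitivity <- cmx["Not Spam", "Not Spam"] / sum(cmx[, "Not Spam"])
# Specificity: share of actual 'Spam' predicted as 'Spam'
specificity <- cmx["Spam", "Spam"] / sum(cmx[, "Spam"])
# No-information rate: accuracy of always predicting the majority class
nir <- max(colSums(cmx)) / n
# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected  <- sum(rowSums(cmx) * colSums(cmx)) / n^2
kappaStat <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, NIR = nir, Kappa = kappaStat), 4))
```

Because 'Not Spam' is the positive class, the 21 spam messages that slipped through lower the specificity, and the 28 legitimate messages flagged as spam lower the sensitivity.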

Create Model - Random Forest (RFM)

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest via caret train (method "rf")
myControl <- trainControl(method="cv", number=10, repeats=3)  # tune these to improve accuracy: method can be "repeatedcv" (repeats only applies then), and number and repeats can be raised
myMetric <- "Accuracy"   # metric used to select the best model
myMtry <- sqrt(ncol(dfrTrnData)-2) # number of variables tried at each split; can also be changed
myNtrees <- 500  # more trees may improve accuracy; change only one thing at a time
myTuneGrid <-  expand.grid(.mtry=myMtry)  # this is how mtry is passed to train()
mdlRndForRfm <- train(IsSpam~.-status, data=dfrTrnData, method="rf",
                        verbose=F, metric=myMetric, trControl=myControl,
                        tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 174.72"

Observations
1. 174.72 seconds of user CPU time were taken to create the model by the Random Forest (RFM) method, roughly ten times longer than the default method.

View Model - Random Forest (RFM)

mdlRndForRfm

## Random Forest 
## 
## 3681 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9521966  0.8995836
## 
## Tuning parameter 'mtry' was held constant at a value of 7.549834

Observations
1. The cross-validated accuracy is around 95.22%, which is very high but slightly lower than the 95.46% OOB accuracy of the default method.
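
The mtry value reported above is the square root of the 57 predictors, which is not a whole number. The sketch below works through the arithmetic. It assumes that randomForest rounds a fractional mtry to the nearest integer, which would explain the 8 variables per split printed for the default model, and that the package default for classification is floor(sqrt(p)). Both points should be checked against the randomForest documentation.

```r
p <- 57                        # number of predictor columns (59 columns minus status and IsSpam)

mtryRaw     <- sqrt(p)         # value passed to the models in this report
mtryRounded <- round(mtryRaw)  # assumption: what randomForest actually uses
mtryDefault <- floor(sqrt(p))  # assumption: package default for classification

print(c(raw = mtryRaw, rounded = mtryRounded, default = mtryDefault))
```

Under those assumptions the models here try 8 variables per split instead of the default 7, so passing an integer such as floor(sqrt(p)) would make the choice explicit.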

View Model Summary - Random Forest (RFM)

summary(mdlRndForRfm)

##                 Length Class      Mode     
## call               6   -none-     call     
## type               1   -none-     character
## predicted       3681   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           7362   matrix     numeric  
## oob.times       3681   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3681   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              2   -none-     list

Prediction - Test Data - Random Forest (RFM)

vctRndForRfm <- predict(mdlRndForRfm, newdata=dfrTstData)
cmxRndForRfm  <- confusionMatrix(vctRndForRfm, dfrTstData$IsSpam)
cmxRndForRfm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      534   22
##   Spam           28  336
##                                          
##                Accuracy : 0.9457         
##                  95% CI : (0.929, 0.9594)
##     No Information Rate : 0.6109         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.886          
##  Mcnemar's Test P-Value : 0.4795         
##                                          
##             Sensitivity : 0.9502         
##             Specificity : 0.9385         
##          Pos Pred Value : 0.9604         
##          Neg Pred Value : 0.9231         
##              Prevalence : 0.6109         
##          Detection Rate : 0.5804         
##    Detection Prevalence : 0.6043         
##       Balanced Accuracy : 0.9444         
##                                          
##        'Positive' Class : Not Spam       
##

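Create Model - Random Forest (GBM)

GBM here is caret's gradient boosting machine (reported below as Stochastic Gradient Boosting), not a bagging method. Like a random forest it builds a group of decision trees, but it fits them one after another, each tree correcting the errors of the previous ones, whereas a random forest grows its trees independently on bootstrap samples.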

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# gradient boosting via caret train (method "gbm")
myControl <- trainControl(method="repeatedcv", number=10, repeats=3)
myMetric <- "Accuracy"
myMtry <- sqrt(ncol(dfrTrnData)-2)       # not used by gbm; kept from the random forest block
myNtrees <- 500                          # not used by gbm
myTuneGrid <- expand.grid(.mtry=myMtry)  # not used by gbm
mdlRndForGbm <- train(IsSpam~.-status, data=dfrTrnData, method="gbm",
                 verbose=F, metric=myMetric, trControl=myControl)
# ntree and tuneGrid are left out so that caret uses its default gbm tuning grid
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 154.41"

Observations
1. 154.41 seconds of user CPU time were taken to create the model by the GBM method, which is much longer than the default method.

View Model - Random Forest (GBM)

mdlRndForGbm

## Stochastic Gradient Boosting 
## 
## 3681 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9125264  0.8130164
##   1                  100      0.9307222  0.8532164
##   1                  150      0.9367051  0.8663083
##   2                   50      0.9312693  0.8546294
##   2                  100      0.9405110  0.8744321
##   2                  150      0.9432274  0.8803238
##   3                   50      0.9363428  0.8654668
##   3                  100      0.9449460  0.8839671
##   3                  150      0.9471209  0.8886790
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
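
The selection rule in the output ("largest value" of Accuracy) can be checked by hand. A minimal sketch in base R that rebuilds the 3 x 3 tuning grid and picks the best row; the accuracies are copied from the table above, and the data frame names are illustrative:

```r
# Tuning grid in the same order as the table above
# (n.trees varies fastest within each interaction.depth)
dfrGrid <- expand.grid(n.trees = c(50, 100, 150),
                       interaction.depth = 1:3)

# Cross-validated accuracies copied from the output above
dfrGrid$Accuracy <- c(0.9125264, 0.9307222, 0.9367051,
                      0.9312693, 0.9405110, 0.9432274,
                      0.9363428, 0.9449460, 0.9471209)

# "largest value" rule: keep the row with the highest accuracy
dfrBest <- dfrGrid[which.max(dfrGrid$Accuracy), ]
print(dfrBest)
```

Accuracy is still rising at the edge of the grid (150 trees, depth 3), so a wider grid might find a slightly better model at the cost of more training time.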

Observations
1. Cross-validated accuracy of the selected model (n.trees = 150, interaction.depth = 3) is about 94.7%, which is high but slightly below the default method (95.46%).

View Model Summary - Random Forest (GBM)

summary(mdlRndForGbm)

##                                                   var     rel.inf
## char_freq_..3                           char_freq_..3 21.71328125
## char_freq_..4                           char_freq_..4 19.17273719
## word_freq_remove                     word_freq_remove 11.66807935
## word_freq_hp                             word_freq_hp  9.17395240
## word_freq_free                         word_freq_free  7.94587755
## capital_run_length_longest capital_run_length_longest  5.19101659
## capital_run_length_average capital_run_length_average  4.55363849
## word_freq_your                         word_freq_your  3.03793007
## word_freq_our                           word_freq_our  2.49322457
## word_freq_money                       word_freq_money  2.43499996
## word_freq_george                     word_freq_george  2.39065969
## word_freq_edu                           word_freq_edu  1.74627237
## word_freq_1999                         word_freq_1999  1.28704265
## capital_run_length_total     capital_run_length_total  1.13584861
## word_freq_meeting                   word_freq_meeting  0.61307150
## word_freq_you                           word_freq_you  0.59182671
## word_freq_000                           word_freq_000  0.56908488
## word_freq_receive                   word_freq_receive  0.51898504
## word_freq_re                             word_freq_re  0.44280485
## word_freq_650                           word_freq_650  0.40545253
## char_freq_.                               char_freq_.  0.40499187
## word_freq_business                 word_freq_business  0.35117130
## word_freq_hpl                           word_freq_hpl  0.33315065
## word_freq_will                         word_freq_will  0.31637881
## word_freq_internet                 word_freq_internet  0.29897659
## word_freq_email                       word_freq_email  0.23535307
## word_freq_technology             word_freq_technology  0.18272811
## char_freq_..1                           char_freq_..1  0.15921637
## word_freq_over                         word_freq_over  0.13902156
## word_freq_font                         word_freq_font  0.12361794
## word_freq_mail                         word_freq_mail  0.05960259
## word_freq_report                     word_freq_report  0.05752083
## word_freq_pm                             word_freq_pm  0.05072411
## word_freq_project                   word_freq_project  0.03704862
## word_freq_credit                     word_freq_credit  0.03502091
## word_freq_order                       word_freq_order  0.03055389
## word_freq_conference             word_freq_conference  0.02965122
## word_freq_3d                             word_freq_3d  0.02071226
## word_freq_original                 word_freq_original  0.01781143
## word_freq_address                   word_freq_address  0.01701609
## word_freq_make                         word_freq_make  0.01394553
## word_freq_all                           word_freq_all  0.00000000
## word_freq_people                     word_freq_people  0.00000000
## word_freq_addresses               word_freq_addresses  0.00000000
## word_freq_lab                           word_freq_lab  0.00000000
## word_freq_labs                         word_freq_labs  0.00000000
## word_freq_telnet                     word_freq_telnet  0.00000000
## word_freq_857                           word_freq_857  0.00000000
## word_freq_data                         word_freq_data  0.00000000
## word_freq_415                           word_freq_415  0.00000000
## word_freq_85                             word_freq_85  0.00000000
## word_freq_parts                       word_freq_parts  0.00000000
## word_freq_direct                     word_freq_direct  0.00000000
## word_freq_cs                             word_freq_cs  0.00000000
## word_freq_table                       word_freq_table  0.00000000
## char_freq_..2                           char_freq_..2  0.00000000
## char_freq_..5                           char_freq_..5  0.00000000
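
The relative influence values sum to 100, so a cumulative sum shows how concentrated the model is on a few predictors. A short sketch using the five largest values copied from the table above (the vector names are illustrative):

```r
# Five largest relative influence values from the summary above
vctRelInf <- c(char_freq_..3    = 21.71328125,
               char_freq_..4    = 19.17273719,
               word_freq_remove = 11.66807935,
               word_freq_hp     =  9.17395240,
               word_freq_free   =  7.94587755)

# Running total of influence, in percent of the whole model
vctCum <- cumsum(vctRelInf)
print(round(vctCum, 2))
```

The last element of the running total is the share of total influence carried by these five predictors.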

Prediction - Test Data - Random Forest (GBM)

vctRndForGbm <- predict(mdlRndForGbm, newdata=dfrTstData)
cmxRndForGbm  <- confusionMatrix(vctRndForGbm, dfrTstData$IsSpam)
cmxRndForGbm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      532   19
##   Spam           30  339
##                                           
##                Accuracy : 0.9467          
##                  95% CI : (0.9302, 0.9603)
##     No Information Rate : 0.6109          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8886          
##  Mcnemar's Test P-Value : 0.1531          
##                                           
##             Sensitivity : 0.9466          
##             Specificity : 0.9469          
##          Pos Pred Value : 0.9655          
##          Neg Pred Value : 0.9187          
##              Prevalence : 0.6109          
##          Detection Rate : 0.5783          
##    Detection Prevalence : 0.5989          
##       Balanced Accuracy : 0.9468          
##                                           
##        'Positive' Class : Not Spam        
##
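
Kappa in the output above compares the observed agreement with the agreement expected by chance from the row and column totals. A minimal sketch of that calculation in base R, using only the cell counts printed above (the variable names are illustrative):

```r
# Cell counts from the GBM confusion matrix above
# (rows = prediction, columns = reference)
cmx <- matrix(c(532, 30, 19, 339), nrow = 2)

n  <- sum(cmx)
po <- sum(diag(cmx)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(cmx) * colSums(cmx)) / n^2  # agreement expected by chance

kappaStat <- (po - pe) / (1 - pe)
print(round(c(Accuracy = po, Kappa = kappaStat), 4))
```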

Create Model - Random Forest (OOB)

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest with out-of-bag (OOB) error estimation
# (no resampling is performed with method="oob", so number and repeats are omitted)
myControl <- trainControl(method="oob")
myMetric <- "Accuracy"
myMtry <- 10 #sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
# method="rf" is the default of train(); it is stated here for clarity
mdlRndForOob <- train(IsSpam~.-status, data=dfrTrnData, method="rf",
                    verbose=FALSE, metric=myMetric, trControl=myControl, 
                    tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print user CPU time in seconds (element 1 of proc.time())
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 33.93"

Observations
1. Building the OOB random forest took about 34 seconds (33.93 in the output above), roughly twice as long as the default method (15.87 seconds).

View Model - Random Forest (OOB)

mdlRndForOob

## Random Forest 
## 
## 3681 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9527302  0.9007209
## 
## Tuning parameter 'mtry' was held constant at a value of 10

Observations
1. OOB accuracy is about 95.27%, which is high but slightly below the default method (95.46%).
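
The OOB estimate needs no cross-validation because each tree in a random forest is grown on a bootstrap sample of the training rows, and the rows that a tree never saw (its out-of-bag rows) act as a test set for that tree. A small sketch in base R of how many rows one bootstrap sample leaves out; the seed and variable names are illustrative:

```r
set.seed(707)   # illustrative seed
nRows <- 3681   # size of the training set reported above

# One bootstrap sample: draw nRows row indices with replacement
vctBoot <- sample(nRows, nRows, replace = TRUE)

# Fraction of rows never drawn, i.e. out-of-bag for this tree
fracOob <- 1 - length(unique(vctBoot)) / nRows

# For large samples the expected fraction approaches exp(-1), about 0.368
print(c(simulated = fracOob, theoretical = exp(-1)))
```

With roughly a third of the rows held out per tree, every training row is scored by many trees that did not see it.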

View Model Summary - Random Forest (OOB)

summary(mdlRndForOob)

##                 Length Class      Mode     
## call               6   -none-     call     
## type               1   -none-     character
## predicted       3681   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           7362   matrix     numeric  
## oob.times       3681   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3681   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              2   -none-     list

Prediction - Test Data - Random Forest (OOB)

vctRndForOob <- predict(mdlRndForOob, newdata=dfrTstData)
cmxRndForOob <- confusionMatrix(vctRndForOob, dfrTstData$IsSpam)
cmxRndForOob

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      533   22
##   Spam           29  336
##                                           
##                Accuracy : 0.9446          
##                  95% CI : (0.9278, 0.9585)
##     No Information Rate : 0.6109          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8838          
##  Mcnemar's Test P-Value : 0.4008          
##                                           
##             Sensitivity : 0.9484          
##             Specificity : 0.9385          
##          Pos Pred Value : 0.9604          
##          Neg Pred Value : 0.9205          
##              Prevalence : 0.6109          
##          Detection Rate : 0.5793          
##    Detection Prevalence : 0.6033          
##       Balanced Accuracy : 0.9435          
##                                           
##        'Positive' Class : Not Spam        
##

SUMMARY

Data has been loaded successfully.
There are 4601 records in the data set.
All the columns are either numeric or integer.
Summary statistics have been computed for all columns as part of exploratory analysis.
Box plots have been created for all the columns to show the data distribution and statistics such as the 1st quartile, median, 3rd quartile and outliers.
There are no NA records in the dataset, so no data imputation is required.
There are outliers in the data set, but because Random Forest is insensitive to outliers, they have not been removed.

Reasons Why Outlier Handling Is Not Required
- The data set has 4601 records; after removing the outliers of every column only 177 records remain, which would exclude most of the data from the analysis.
- If outliers are replaced with the mean, the same number of outliers remains, because the median is zero for most of the columns.
- If outliers are replaced with the median, many columns will contain nothing but zeroes, because the median is zero for most of the columns, and correlations between variables can no longer be computed.
- Random Forest is insensitive to outliers.
- The models built with the different methods all reach high accuracy, so the outliers are not causing problems in prediction.
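
The mean and median points above can be illustrated on a synthetic zero-heavy column. This is a sketch only: the data are made up, and the 1.5 * IQR box plot rule is an assumption, since the rule used for the original outlier count is not shown here.

```r
# Synthetic zero-heavy column: 80 zeros and 20 positive values
x <- c(rep(0, 80), seq(0.1, 2, length.out = 20))

# Box plot rule: points beyond 1.5 * IQR from the hinges
vctOut <- boxplot.stats(x)$out
print(median(x))       # the median of a zero-heavy column is zero
print(length(vctOut))  # every non-zero value is flagged as an outlier

# Replace outliers with the mean: the replaced values are still flagged
xMean <- ifelse(x %in% vctOut, mean(x), x)
print(length(boxplot.stats(xMean)$out))

# Replace outliers with the median: the column collapses to all zeros
xMed <- ifelse(x %in% vctOut, median(x), x)
print(length(unique(xMed)))
```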

Correlation has been plotted between the status variable and the other variables.
Only a few variables show medium to high correlation with the status variable.
A new variable, IsSpam, has been created from the status variable.
There are 2788 email records that are not spam and 1813 that are spam.
The data set is divided into training and test data sets.
80% of the data is used for training, to build a better model, and 20% is held out to test the model.
There are 3681 records in the training data set and 920 records in the test data set.
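
The split sizes follow from the 80/20 ratio. A minimal sketch of a simple random split in base R that reproduces them; the seed and index names are illustrative, and this is an assumption about the approach rather than the split code actually used:

```r
set.seed(707)   # illustrative seed
nRows <- 4601   # records in the full data set

# Row indices for training: 80% of the rows, rounded to a whole number
vctTrn <- sample(nRows, round(0.8 * nRows))

# Remaining rows form the test set
vctTst <- setdiff(seq_len(nRows), vctTrn)

print(c(train = length(vctTrn), test = length(vctTst)))
```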

Default Method
Time Taken 15.87 Sec
Train Data Accuracy 95.46%
P-value <0.05
Test Data Accuracy 94.67%

Random Forest
Time Taken 56.97 Sec
Train Data Accuracy 95.22%
P-value <0.05
Test Data Accuracy 94.57%

GBM
Time Taken 154.41 Sec
Train Data Accuracy 94.71%
P-value <0.05
Test Data Accuracy 94.67%

OOB
Time Taken 33.93 Sec
Train Data Accuracy 95.27%
P-value <0.05
Test Data Accuracy 94.46%

The default method has the highest training accuracy (95.46%) and, tied with GBM, the highest test accuracy (94.67%). It is also the fastest to build, at about 16 seconds.
The default method also produces the fewest false positives (good mail marked as spam), 28.
On both counts, the default method is the best choice for predicting spam mail.