Random Forest Assignment

Introduction
Random forest algorithm is a supervised classification algorithm.As the name suggest, this algorithm creates the forest with a number of trees.

Problem Definition
Predict whether the Emails are spam or not Using Random Forest.

Data Location
The dataset can be found on
https://archive.ics.uci.edu/ml/datasets/spambase

Data Description
Number of Attributes: 58 (57 continuous, 1 nominal class label)
The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0).
The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.
Type word_freq_WORD = percentage of words in the e-mail that match WORD.
Type char_freq_CHAR= percentage of characters in the e-mail that match CHAR.
Continuous real [1,…] attribute of type capital_run_length_average= average length of uninterrupted sequences of capital letters. Continuous integer [1,…] attribute of type capital_run_length_longest= length of longest uninterrupted sequence of capital letters.
Continuous integer [1,…] attribute of type capital_run_length_total= sum of length of uninterrupted 8- sequences of capital letters= total number of capital letters in the e-mail.

Setup

Load Libs

library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(caret)
library(randomForest)
library(gbm)
library(corrgram)

Functions

detectNA <- function(inp) {
  sum(is.na(inp))
}
detectCor <- function(x) {
  cor(as.numeric(dfrDataset[, x]), 
    as.numeric(dfrDataset$status), 
    method="spearman")
}

Load Dataset

dfrDataset <- read.csv("D://PGDM//Trim 4//MachineLearning//Data//spambase.csv", header=T, stringsAsFactors=T)
head(dfrDataset)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 2           0.21              0.28          0.50            0
## 3           0.06              0.00          0.71            0
## 4           0.00              0.00          0.00            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32           0.00             0.00               0.00
## 2          0.14           0.28             0.21               0.07
## 3          1.23           0.19             0.19               0.12
## 4          0.63           0.00             0.31               0.63
## 5          0.63           0.00             0.31               0.63
## 6          1.85           0.00             0.00               1.85
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 2            0.00           0.94              0.21           0.79
## 3            0.64           0.25              0.38           0.45
## 4            0.31           0.63              0.31           0.31
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00             0.00                0.00           0.32
## 2             0.65             0.21                0.14           0.14
## 3             0.12             0.00                1.75           0.06
## 4             0.31             0.00                0.00           0.31
## 5             0.31             0.00                0.00           0.31
## 6             0.00             0.00                0.00           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1               0.00            1.29          1.93             0.00
## 2               0.07            0.28          3.47             0.00
## 3               0.06            1.03          1.36             0.32
## 4               0.00            0.00          3.18             0.00
## 5               0.00            0.00          3.18             0.00
## 6               0.00            0.00          0.00             0.00
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0          0.00            0.00            0
## 2           1.59              0          0.43            0.43            0
## 3           0.51              0          1.16            0.06            0
## 4           0.31              0          0.00            0.00            0
## 5           0.31              0          0.00            0.00            0
## 6           0.00              0          0.00            0.00            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 2             0                0             0             0
## 3             0                0             0             0
## 4             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0              0
## 2              0                0             0              0
## 3              0                0             0              0
## 4              0                0             0              0
## 5              0                0             0              0
## 6              0                0             0              0
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0           0.00
## 2             0            0                    0           0.07
## 3             0            0                    0           0.00
## 4             0            0                    0           0.00
## 5             0            0                    0           0.00
## 6             0            0                    0           0.00
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0             0.00            0
## 2               0            0             0.00            0
## 3               0            0             0.06            0
## 4               0            0             0.00            0
## 5               0            0             0.00            0
## 6               0            0             0.00            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0               0.00                 0         0.00
## 2                 0               0.00                 0         0.00
## 3                 0               0.12                 0         0.06
## 4                 0               0.00                 0         0.00
## 5                 0               0.00                 0         0.00
## 6                 0               0.00                 0         0.00
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1          0.00               0                    0        0.00
## 2          0.00               0                    0        0.00
## 3          0.06               0                    0        0.01
## 4          0.00               0                    0        0.00
## 5          0.00               0                    0        0.00
## 6          0.00               0                    0        0.00
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 2         0.132             0         0.372         0.180         0.048
## 3         0.143             0         0.276         0.184         0.010
## 4         0.137             0         0.137         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 2                      5.114                        101
## 3                      9.821                        485
## 4                      3.537                         40
## 5                      3.537                         40
## 6                      3.000                         15
##   capital_run_length_total status
## 1                      278      1
## 2                     1028      1
## 3                     2259      1
## 4                      191      1
## 5                      191      1
## 6                       54      1

Dataframe Stucture

str(dfrDataset)

## 'data.frame':    4601 obs. of  58 variables:
##  $ word_freq_make            : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ word_freq_address         : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ word_freq_all             : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ word_freq_3d              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_our             : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ word_freq_over            : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ word_freq_remove          : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ word_freq_internet        : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ word_freq_order           : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ word_freq_mail            : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ word_freq_receive         : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ word_freq_will            : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ word_freq_people          : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ word_freq_report          : num  0 0.21 0 0 0 0 0 0 0 0 ...
##  $ word_freq_addresses       : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ word_freq_free            : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ word_freq_business        : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_email           : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ word_freq_you             : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ word_freq_credit          : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ word_freq_your            : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ word_freq_font            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_000             : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ word_freq_money           : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ word_freq_hp              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_hpl             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_george          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_650             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_lab             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_labs            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_telnet          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_857             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_data            : num  0 0 0 0 0 0 0 0 0.15 0 ...
##  $ word_freq_415             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_85              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_technology      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_1999            : num  0 0.07 0 0 0 0 0 0 0 0 ...
##  $ word_freq_parts           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_pm              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_direct          : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_cs              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_meeting         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_original        : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ word_freq_project         : num  0 0 0 0 0 0 0 0 0 0.06 ...
##  $ word_freq_re              : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_edu             : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_table           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_conference      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_.               : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ char_freq_..1             : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ char_freq_..2             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_..3             : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ char_freq_..4             : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ char_freq_..5             : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capital_run_length_average: num  3.76 5.11 9.82 3.54 3.54 ...
##  $ capital_run_length_longest: int  61 101 485 40 40 15 4 11 445 43 ...
##  $ capital_run_length_total  : int  278 1028 2259 191 191 54 112 49 1257 749 ...
##  $ status                    : int  1 1 1 1 1 1 1 1 1 1 ...

Dataframe Summary

lapply(dfrDataset, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1046  0.0000  4.5400 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.213   0.000  14.280 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2807  0.4200  5.1000 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06542  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3122  0.3800 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0959  0.0000  5.8800 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1142  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1053  0.0000 11.1100 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09007 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2394  0.1600 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05982 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1000  0.5417  0.8000  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09393 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05863  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0492  0.0000  4.4100 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2488  0.1000 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1426  0.0000  7.1400 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1847  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.662   2.640  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08558  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2200  0.8098  1.2700 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1212  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1016  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09427  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5495  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2654  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7673  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1248  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09892  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1029  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06475  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04705 0.00000 4.76000 
## 
## $word_freq_data
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09723  0.00000 18.18000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04784 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1054  0.0000 20.0000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09748 0.00000 7.69000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.137   0.000   6.890 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0132  0.0000  8.3300 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07863  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06483 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04367 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1323  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0461  0.0000  3.5700 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0792  0.0000 20.0000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3012  0.1100 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1798  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.005444 0.000000 2.170000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03187  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03857 0.00000 4.38500 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.065   0.139   0.188   9.752 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01698 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2691  0.3150 32.4800 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07581 0.05200 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04424  0.00000 19.83000 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.588    2.276    5.192    3.706 1102.000 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   52.17   43.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    95.0   283.3   266.0 15840.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.394   1.000   1.000

Missing Data

lapply(dfrDataset, FUN=detectNA)

## $word_freq_make
## [1] 0
## 
## $word_freq_address
## [1] 0
## 
## $word_freq_all
## [1] 0
## 
## $word_freq_3d
## [1] 0
## 
## $word_freq_our
## [1] 0
## 
## $word_freq_over
## [1] 0
## 
## $word_freq_remove
## [1] 0
## 
## $word_freq_internet
## [1] 0
## 
## $word_freq_order
## [1] 0
## 
## $word_freq_mail
## [1] 0
## 
## $word_freq_receive
## [1] 0
## 
## $word_freq_will
## [1] 0
## 
## $word_freq_people
## [1] 0
## 
## $word_freq_report
## [1] 0
## 
## $word_freq_addresses
## [1] 0
## 
## $word_freq_free
## [1] 0
## 
## $word_freq_business
## [1] 0
## 
## $word_freq_email
## [1] 0
## 
## $word_freq_you
## [1] 0
## 
## $word_freq_credit
## [1] 0
## 
## $word_freq_your
## [1] 0
## 
## $word_freq_font
## [1] 0
## 
## $word_freq_000
## [1] 0
## 
## $word_freq_money
## [1] 0
## 
## $word_freq_hp
## [1] 0
## 
## $word_freq_hpl
## [1] 0
## 
## $word_freq_george
## [1] 0
## 
## $word_freq_650
## [1] 0
## 
## $word_freq_lab
## [1] 0
## 
## $word_freq_labs
## [1] 0
## 
## $word_freq_telnet
## [1] 0
## 
## $word_freq_857
## [1] 0
## 
## $word_freq_data
## [1] 0
## 
## $word_freq_415
## [1] 0
## 
## $word_freq_85
## [1] 0
## 
## $word_freq_technology
## [1] 0
## 
## $word_freq_1999
## [1] 0
## 
## $word_freq_parts
## [1] 0
## 
## $word_freq_pm
## [1] 0
## 
## $word_freq_direct
## [1] 0
## 
## $word_freq_cs
## [1] 0
## 
## $word_freq_meeting
## [1] 0
## 
## $word_freq_original
## [1] 0
## 
## $word_freq_project
## [1] 0
## 
## $word_freq_re
## [1] 0
## 
## $word_freq_edu
## [1] 0
## 
## $word_freq_table
## [1] 0
## 
## $word_freq_conference
## [1] 0
## 
## $char_freq_.
## [1] 0
## 
## $char_freq_..1
## [1] 0
## 
## $char_freq_..2
## [1] 0
## 
## $char_freq_..3
## [1] 0
## 
## $char_freq_..4
## [1] 0
## 
## $char_freq_..5
## [1] 0
## 
## $capital_run_length_average
## [1] 0
## 
## $capital_run_length_longest
## [1] 0
## 
## $capital_run_length_total
## [1] 0
## 
## $status
## [1] 0

Check output

dfrStatusFreq <- summarise(group_by(dfrDataset, status), count=n())

ggplot(dfrStatusFreq, aes(x=status, y=count)) +
    geom_bar(stat="identity", aes(fill=count)) +
    labs(title="Status Frequency Distribution") +
    labs(x="Status") +
    labs(y="Counts")

Find Corelations

vcnCorsData <- abs(sapply(colnames(dfrDataset), detectCor)) 
summary(vcnCorsData)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002525 0.148800 0.253300 0.276900 0.354700 1.000000

Show Corelations

vcnCorsData

##             word_freq_make          word_freq_address 
##                 0.24069974                 0.29750940 
##              word_freq_all               word_freq_3d 
##                 0.33283147                 0.09077776 
##              word_freq_our             word_freq_over 
##                 0.40913946                 0.31864550 
##           word_freq_remove         word_freq_internet 
##                 0.51877779                 0.34379623 
##            word_freq_order             word_freq_mail 
##                 0.30073703                 0.29682394 
##          word_freq_receive             word_freq_will 
##                 0.35496682                 0.14847653 
##           word_freq_people           word_freq_report 
##                 0.21287588                 0.14977533 
##        word_freq_addresses             word_freq_free 
##                 0.26515743                 0.50416922 
##         word_freq_business            word_freq_email 
##                 0.35290749                 0.29909391 
##              word_freq_you           word_freq_credit 
##                 0.36110406                 0.32418657 
##             word_freq_your             word_freq_font 
##                 0.50159062                 0.13797471 
##              word_freq_000            word_freq_money 
##                 0.42580256                 0.47215455 
##               word_freq_hp              word_freq_hpl 
##                 0.39981558                 0.34188069 
##           word_freq_george              word_freq_650 
##                 0.35393063                 0.22619064 
##              word_freq_lab             word_freq_labs 
##                 0.22068802                 0.24580530 
##           word_freq_telnet              word_freq_857 
##                 0.20467400                 0.16983798 
##             word_freq_data              word_freq_415 
##                 0.15756347                 0.15802818 
##               word_freq_85       word_freq_technology 
##                 0.21413087                 0.16680254 
##             word_freq_1999            word_freq_parts 
##                 0.26070752                 0.00252536 
##               word_freq_pm           word_freq_direct 
##                 0.14721389                 0.02813193 
##               word_freq_cs          word_freq_meeting 
##                 0.14453750                 0.19574176 
##         word_freq_original          word_freq_project 
##                 0.10781412                 0.14453744 
##               word_freq_re              word_freq_edu 
##                 0.07176763                 0.19702549 
##            word_freq_table       word_freq_conference 
##                 0.02266674                 0.13903044 
##                char_freq_.              char_freq_..1 
##                 0.05683530                 0.03263555 
##              char_freq_..2              char_freq_..3 
##                 0.11122690                 0.59785363 
##              char_freq_..4              char_freq_..5 
##                 0.56563314                 0.26668614 
## capital_run_length_average capital_run_length_longest 
##                 0.48794983                 0.51515693 
##   capital_run_length_total                     status 
##                 0.44397367                 1.00000000

Plot Corelations

corrgram(dfrDataset)

High Corelations

vcnCorsData[vcnCorsData>0.6]

## status 
##      1

Create Column spam

dfrDataset <- mutate(dfrDataset, Result= ifelse(dfrDataset$status >0,'Spam','Not Spam'))
dfrDataset$Result <- as.factor(dfrDataset$Result)
table(dfrDataset$Result)

## 
## Not Spam     Spam 
##     2788     1813

Dataset Split

set.seed(707)
vctTrnRecs <- createDataPartition(y=dfrDataset$status, p=0.7, list=FALSE)
dfrTrnData <- dfrDataset[vctTrnRecs,]
dfrTstData <- dfrDataset[-vctTrnRecs,]

Training Dataset RowCount & ColCount

dim(dfrTrnData)

## [1] 3221   59

Testing Dataset RowCount & ColCount

dim(dfrTstData)

## [1] 1380   59

Training Dataset Head

head(dfrTrnData)

##    word_freq_make word_freq_address word_freq_all word_freq_3d
## 2            0.21              0.28          0.50            0
## 3            0.06              0.00          0.71            0
## 4            0.00              0.00          0.00            0
## 10           0.06              0.12          0.77            0
## 11           0.00              0.00          0.00            0
## 12           0.00              0.00          0.25            0
##    word_freq_our word_freq_over word_freq_remove word_freq_internet
## 2           0.14           0.28             0.21               0.07
## 3           1.23           0.19             0.19               0.12
## 4           0.63           0.00             0.31               0.63
## 10          0.19           0.32             0.38               0.00
## 11          0.00           0.00             0.96               0.00
## 12          0.38           0.25             0.25               0.00
##    word_freq_order word_freq_mail word_freq_receive word_freq_will
## 2             0.00           0.94              0.21           0.79
## 3             0.64           0.25              0.38           0.45
## 4             0.31           0.63              0.31           0.31
## 10            0.06           0.00              0.00           0.64
## 11            0.00           1.92              0.96           0.00
## 12            0.00           0.00              0.12           0.12
##    word_freq_people word_freq_report word_freq_addresses word_freq_free
## 2              0.65             0.21                0.14           0.14
## 3              0.12             0.00                1.75           0.06
## 4              0.31             0.00                0.00           0.31
## 10             0.25             0.00                0.12           0.00
## 11             0.00             0.00                0.00           0.00
## 12             0.12             0.00                0.00           0.00
##    word_freq_business word_freq_email word_freq_you word_freq_credit
## 2                0.07            0.28          3.47             0.00
## 3                0.06            1.03          1.36             0.32
## 4                0.00            0.00          3.18             0.00
## 10               0.00            0.12          1.67             0.06
## 11               0.00            0.96          3.84             0.00
## 12               0.00            0.00          1.16             0.00
##    word_freq_your word_freq_font word_freq_000 word_freq_money
## 2            1.59              0          0.43            0.43
## 3            0.51              0          1.16            0.06
## 4            0.31              0          0.00            0.00
## 10           0.71              0          0.19            0.00
## 11           0.96              0          0.00            0.00
## 12           0.77              0          0.00            0.00
##    word_freq_hp word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 2             0             0                0             0             0
## 3             0             0                0             0             0
## 4             0             0                0             0             0
## 10            0             0                0             0             0
## 11            0             0                0             0             0
## 12            0             0                0             0             0
##    word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 2               0                0             0              0
## 3               0                0             0              0
## 4               0                0             0              0
## 10              0                0             0              0
## 11              0                0             0              0
## 12              0                0             0              0
##    word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 2              0            0                    0           0.07
## 3              0            0                    0           0.00
## 4              0            0                    0           0.00
## 10             0            0                    0           0.00
## 11             0            0                    0           0.00
## 12             0            0                    0           0.00
##    word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 2                0            0             0.00            0
## 3                0            0             0.06            0
## 4                0            0             0.00            0
## 10               0            0             0.00            0
## 11               0            0             0.96            0
## 12               0            0             0.00            0
##    word_freq_meeting word_freq_original word_freq_project word_freq_re
## 2                  0               0.00              0.00         0.00
## 3                  0               0.12              0.00         0.06
## 4                  0               0.00              0.00         0.00
## 10                 0               0.00              0.06         0.00
## 11                 0               0.00              0.00         0.00
## 12                 0               0.00              0.00         0.00
##    word_freq_edu word_freq_table word_freq_conference char_freq_.
## 2           0.00               0                    0       0.000
## 3           0.06               0                    0       0.010
## 4           0.00               0                    0       0.000
## 10          0.00               0                    0       0.040
## 11          0.00               0                    0       0.000
## 12          0.00               0                    0       0.022
##    char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 2          0.132             0         0.372         0.180         0.048
## 3          0.143             0         0.276         0.184         0.010
## 4          0.137             0         0.137         0.000         0.000
## 10         0.030             0         0.244         0.081         0.000
## 11         0.000             0         0.462         0.000         0.000
## 12         0.044             0         0.663         0.000         0.000
##    capital_run_length_average capital_run_length_longest
## 2                       5.114                        101
## 3                       9.821                        485
## 4                       3.537                         40
## 10                      1.729                         43
## 11                      1.312                          6
## 12                      1.243                         11
##    capital_run_length_total status Result
## 2                      1028      1   Spam
## 3                      2259      1   Spam
## 4                       191      1   Spam
## 10                      749      1   Spam
## 11                       21      1   Spam
## 12                      184      1   Spam

Testing Dataset Head

head(dfrTstData)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
## 7           0.00              0.00          0.00            0
## 8           0.00              0.00          0.00            0
## 9           0.15              0.00          0.46            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32              0             0.00               0.00
## 5          0.63              0             0.31               0.63
## 6          1.85              0             0.00               1.85
## 7          1.92              0             0.00               0.00
## 8          1.88              0             0.00               1.88
## 9          0.61              0             0.30               0.00
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
## 7            0.00           0.64              0.96           1.28
## 8            0.00           0.00              0.00           0.00
## 9            0.92           0.76              0.76           0.92
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00                0                   0           0.32
## 5             0.31                0                   0           0.31
## 6             0.00                0                   0           0.00
## 7             0.00                0                   0           0.96
## 8             0.00                0                   0           0.00
## 9             0.00                0                   0           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1                  0            1.29          1.93             0.00
## 5                  0            0.00          3.18             0.00
## 6                  0            0.00          0.00             0.00
## 7                  0            0.32          3.85             0.00
## 8                  0            0.00          0.00             0.00
## 9                  0            0.15          1.23             3.53
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0             0            0.00            0
## 5           0.31              0             0            0.00            0
## 6           0.00              0             0            0.00            0
## 7           0.64              0             0            0.00            0
## 8           0.00              0             0            0.00            0
## 9           2.00              0             0            0.15            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
## 7             0                0             0             0
## 8             0                0             0             0
## 9             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0           0.00
## 5              0                0             0           0.00
## 6              0                0             0           0.00
## 7              0                0             0           0.00
## 8              0                0             0           0.00
## 9              0                0             0           0.15
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0              0
## 5             0            0                    0              0
## 6             0            0                    0              0
## 7             0            0                    0              0
## 8             0            0                    0              0
## 9             0            0                    0              0
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0                0            0
## 5               0            0                0            0
## 6               0            0                0            0
## 7               0            0                0            0
## 8               0            0                0            0
## 9               0            0                0            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0                0.0                 0            0
## 5                 0                0.0                 0            0
## 6                 0                0.0                 0            0
## 7                 0                0.0                 0            0
## 8                 0                0.0                 0            0
## 9                 0                0.3                 0            0
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1             0               0                    0           0
## 5             0               0                    0           0
## 6             0               0                    0           0
## 7             0               0                    0           0
## 8             0               0                    0           0
## 9             0               0                    0           0
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
## 7         0.054             0         0.164         0.054         0.000
## 8         0.206             0         0.000         0.000         0.000
## 9         0.271             0         0.181         0.203         0.022
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 5                      3.537                         40
## 6                      3.000                         15
## 7                      1.671                          4
## 8                      2.450                         11
## 9                      9.744                        445
##   capital_run_length_total status Result
## 1                      278      1   Spam
## 5                      191      1   Spam
## 6                       54      1   Spam
## 7                      112      1   Spam
## 8                       49      1   Spam
## 9                     1257      1   Spam

Training Dataset Summary

lapply(dfrTrnData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1001  0.0000  4.5400 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1901  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.279   0.400   5.100 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05219  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3027  0.3600 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09372 0.00000 2.63000 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1146  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.112   0.000  11.110 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08978 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2288  0.1300 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05888 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1000  0.5349  0.7700  7.6900 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09223 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05904  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04824 0.00000 4.41000 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2464  0.1000 16.6600 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.138   0.000   7.140 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1818  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.290   1.671   2.660  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08912  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2000  0.7921  1.2700 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1223  0.0000 15.4300 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1051  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09648  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5416  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2673  0.0000 10.8600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7712  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1322  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09695  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1046  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06896  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04833 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1035  0.0000 18.1800 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04945 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1092  0.0000 20.0000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1037  0.0000  7.6900 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1381  0.0000  5.0500 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01352 0.00000 8.33000 
## 
## $word_freq_pm
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07996 0.00000 9.75000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06304 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04513 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1367  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04738 0.00000 3.57000 
## 
## $word_freq_project
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08829  0.00000 20.00000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2993  0.1200 20.0000 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1828  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.004713 0.000000 1.910000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03038  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04054 0.00000 4.38500 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0660  0.1399  0.1890  5.2770 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01506 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2741  0.3110 32.4800 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07403 0.04800 5.30000 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03887  0.00000 13.13000 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.564    2.250    5.280    3.706 1102.000 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   14.00   49.71   43.00 2204.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      35      94     282     268   15840 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3971  1.0000  1.0000 
## 
## $Result
## Not Spam     Spam 
##     1942     1279

Testing Dataset Summary

lapply(dfrTstData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.115   0.000   4.000 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2666  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2844  0.4500  4.5400 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09632  0.00000 40.13000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3346  0.4325  7.1400 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.101   0.000   5.880 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1133  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08964 0.00000 4.62000 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09074 0.00000 3.33000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2642  0.2200 11.1100 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06204 0.00000 2.00000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1200  0.5576  0.8500  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09789 0.00000 5.55000 
## 
## $word_freq_report
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05767 0.00000 5.55000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05145 0.00000 2.31000 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2546  0.1125 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1532  0.0000  4.8700 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1917  0.0000  4.5100 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.330   1.641   2.590  14.280 
## 
## $word_freq_credit
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07731 0.00000 6.25000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2700  0.8509  1.2800 10.7100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1186  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0937  0.0000  3.3800 
## 
## $word_freq_money
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0891  0.0000  5.9800 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5679  0.0000 20.0000 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2609  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7582  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1078  0.0000  4.7600 
## 
## $word_freq_lab
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1035  0.0000 10.0000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09869 0.00000 4.76000 
## 
## $word_freq_telnet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05493 0.00000 4.76000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04405 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08264 0.00000 8.33000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04407 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09651 0.00000 4.76000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08303 0.00000 4.76000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1344  0.0000  6.8900 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01246 0.00000 7.40000 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07551  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06901 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04025 0.00000 4.75000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1221  0.0000  7.6900 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04311 0.00000 3.44000 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05797 0.00000 4.54000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3056  0.0800 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1729  0.0000  9.5200 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.007152 0.000000 2.170000 
## 
## $word_freq_conference
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03534 0.00000 8.33000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03398 0.00000 4.18700 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0600  0.1371  0.1772  9.7520 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02145 0.00000 2.77700 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2573  0.3302  5.8440 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07996 0.06225 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05676  0.00000 19.83000 
## 
## $capital_run_length_average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.630   2.318   4.984   3.706 443.700 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   57.92   45.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    95.5   286.3   259.2 10060.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.387   1.000   1.000 
## 
## $Result
## Not Spam     Spam 
##      846      534

Random Forest Approach

Create Model - Random Forest (Default)

## set seed
set.seed(707)
# mtry
myMtry=sqrt(ncol(dfrTrnData)-2) 
myNtrees=500
# start time
vctProcStrt <- proc.time()
#Proc.time gives statistics of the process.

# random forest (default)
mdlRndForDef <- randomForest(Result~.-status, data=dfrTrnData, 
                             mtry=myMtry, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 21.72"

View Model - Default Random Forest

mdlRndForDef

## 
## Call:
##  randomForest(formula = Result ~ . - status, data = dfrTrnData,      mtry = myMtry, ntree = myNtrees) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 4.84%
## Confusion matrix:
##          Not Spam Spam class.error
## Not Spam     1883   59  0.03038105
## Spam           97 1182  0.07584050

View Model Summary - Default Random Forest

summary(mdlRndForDef)

##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted       3221   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           6442   matrix numeric  
## oob.times       3221   -none- numeric  
## classes            2   -none- character
## importance        57   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               3221   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

Prediction - Test Data - Random Forest (Default)

vctRndForDef <- predict(mdlRndForDef, newdata=dfrTstData)
cmxRndForDef  <- confusionMatrix(vctRndForDef, dfrTstData$Result)
cmxRndForDef

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      807   28
##   Spam           39  506
##                                           
##                Accuracy : 0.9514          
##                  95% CI : (0.9387, 0.9622)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8981          
##  Mcnemar's Test P-Value : 0.2218          
##                                           
##             Sensitivity : 0.9539          
##             Specificity : 0.9476          
##          Pos Pred Value : 0.9665          
##          Neg Pred Value : 0.9284          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5848          
##    Detection Prevalence : 0.6051          
##       Balanced Accuracy : 0.9507          
##                                           
##        'Positive' Class : Not Spam        
##

Create Model - Random Forest (RFM)

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest (default)
myControl <- trainControl(method="cv", number=10, repeats=3) 
myMetric <- "Accuracy"   
myMtry <- sqrt(ncol(dfrTrnData)-2) 
myNtrees <- 500  
myTuneGrid <-  expand.grid(.mtry=myMtry)   
mdlRndForRfm <- train(Result~.-status, data=dfrTrnData, method="rf",
                        verbose=F, metric=myMetric, trControl=myControl,
                        tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 190.11"

View Model - Random Forest (RFM)

mdlRndForRfm

## Random Forest 
## 
## 3221 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2898, 2899, 2900, 2898, 2899, 2899, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9515719  0.8982131
## 
## Tuning parameter 'mtry' was held constant at a value of 7.549834

View Model Summary - Random Forest (RFM)

summary(mdlRndForRfm)

##                 Length Class      Mode     
## call               6   -none-     call     
## type               1   -none-     character
## predicted       3221   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           6442   matrix     numeric  
## oob.times       3221   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3221   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              2   -none-     list

Prediction - Test Data - Random Forest (RFM)

vctRndForRfm <- predict(mdlRndForRfm, newdata=dfrTstData)
cmxRndForRfm  <- confusionMatrix(vctRndForRfm, dfrTstData$Result)
cmxRndForRfm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      807   26
##   Spam           39  508
##                                           
##                Accuracy : 0.9529          
##                  95% CI : (0.9404, 0.9635)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9012          
##  Mcnemar's Test P-Value : 0.1366          
##                                           
##             Sensitivity : 0.9539          
##             Specificity : 0.9513          
##          Pos Pred Value : 0.9688          
##          Neg Pred Value : 0.9287          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5848          
##    Detection Prevalence : 0.6036          
##       Balanced Accuracy : 0.9526          
##                                           
##        'Positive' Class : Not Spam        
##

Create Model - Random Forest (GBM)- General bagging method.

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest (default)
myControl <- trainControl(method="repeatedcv", number=10, repeats=3)
myMetric <- "Accuracy"
myMtry <- sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForGbm <- train(Result~.-status, data=dfrTrnData, method="gbm",
                 verbose=F, metric=myMetric, trControl=myControl)

                
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 202.63"

View Model - Random Forest (GBM)

mdlRndForGbm

## Stochastic Gradient Boosting 
## 
## 3221 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2898, 2899, 2900, 2898, 2899, 2899, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9119364  0.8120901
##   1                  100      0.9305600  0.8530141
##   1                  150      0.9349085  0.8625545
##   2                   50      0.9303539  0.8527418
##   2                  100      0.9398768  0.8733088
##   2                  150      0.9440131  0.8821706
##   3                   50      0.9347011  0.8622468
##   3                  100      0.9437057  0.8814933
##   3                  150      0.9484635  0.8917012
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

View Model Summary - Random Forest (GBM)

summary(mdlRndForGbm)

##                                                   var     rel.inf
## char_freq_..3                           char_freq_..3 25.13448479
## char_freq_..4                           char_freq_..4 17.73569393
## word_freq_remove                     word_freq_remove 10.29166207
## word_freq_free                         word_freq_free  8.03319338
## word_freq_hp                             word_freq_hp  7.32712662
## capital_run_length_average capital_run_length_average  5.39872005
## word_freq_your                         word_freq_your  5.17722583
## capital_run_length_longest capital_run_length_longest  3.06771805
## capital_run_length_total     capital_run_length_total  2.37637288
## word_freq_edu                           word_freq_edu  2.36886986
## word_freq_george                     word_freq_george  2.23737649
## word_freq_our                           word_freq_our  1.91405678
## word_freq_money                       word_freq_money  1.86852638
## word_freq_1999                         word_freq_1999  1.19302329
## word_freq_000                           word_freq_000  0.89483913
## word_freq_you                           word_freq_you  0.56772121
## word_freq_meeting                   word_freq_meeting  0.56184568
## word_freq_re                             word_freq_re  0.49338930
## word_freq_internet                 word_freq_internet  0.45674727
## char_freq_.                               char_freq_.  0.38814204
## word_freq_will                         word_freq_will  0.35496666
## word_freq_receive                   word_freq_receive  0.31982874
## word_freq_650                           word_freq_650  0.30196695
## char_freq_..1                           char_freq_..1  0.28366668
## word_freq_font                         word_freq_font  0.18729687
## word_freq_email                       word_freq_email  0.18116648
## word_freq_business                 word_freq_business  0.14833434
## word_freq_hpl                           word_freq_hpl  0.14274872
## word_freq_technology             word_freq_technology  0.10444119
## word_freq_over                         word_freq_over  0.09742430
## word_freq_report                     word_freq_report  0.09059711
## word_freq_all                           word_freq_all  0.07754641
## word_freq_3d                             word_freq_3d  0.06375843
## word_freq_credit                     word_freq_credit  0.03340249
## word_freq_project                   word_freq_project  0.03264960
## word_freq_make                         word_freq_make  0.02808895
## word_freq_pm                             word_freq_pm  0.02398858
## word_freq_conference             word_freq_conference  0.01519494
## word_freq_mail                         word_freq_mail  0.01326804
## word_freq_parts                       word_freq_parts  0.01292947
## word_freq_address                   word_freq_address  0.00000000
## word_freq_order                       word_freq_order  0.00000000
## word_freq_people                     word_freq_people  0.00000000
## word_freq_addresses               word_freq_addresses  0.00000000
## word_freq_lab                           word_freq_lab  0.00000000
## word_freq_labs                         word_freq_labs  0.00000000
## word_freq_telnet                     word_freq_telnet  0.00000000
## word_freq_857                           word_freq_857  0.00000000
## word_freq_data                         word_freq_data  0.00000000
## word_freq_415                           word_freq_415  0.00000000
## word_freq_85                             word_freq_85  0.00000000
## word_freq_direct                     word_freq_direct  0.00000000
## word_freq_cs                             word_freq_cs  0.00000000
## word_freq_original                 word_freq_original  0.00000000
## word_freq_table                       word_freq_table  0.00000000
## char_freq_..2                           char_freq_..2  0.00000000
## char_freq_..5                           char_freq_..5  0.00000000

Prediction - Test Data - Random Forest (GBM)

vctRndForGbm <- predict(mdlRndForGbm, newdata=dfrTstData)
cmxRndForGbm  <- confusionMatrix(vctRndForGbm, dfrTstData$Result)
cmxRndForGbm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      804   29
##   Spam           42  505
##                                           
##                Accuracy : 0.9486          
##                  95% CI : (0.9355, 0.9596)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.892           
##  Mcnemar's Test P-Value : 0.1544          
##                                           
##             Sensitivity : 0.9504          
##             Specificity : 0.9457          
##          Pos Pred Value : 0.9652          
##          Neg Pred Value : 0.9232          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5826          
##    Detection Prevalence : 0.6036          
##       Balanced Accuracy : 0.9480          
##                                           
##        'Positive' Class : Not Spam        
##

Create Model - Random Forest (OOB)

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest (default)
myControl <- trainControl(method="oob", number=10, repeats=3)
myMetric <- "Accuracy"
myMtry <- sqrt(ncol(dfrTrnData)-2) #sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForOob <- train(Result~.-status, data=dfrTrnData, 
                    verbose=F, metric=myMetric, trControl=myControl, 
                    tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 39.72"

View Model - Random Formbest (OOB)

mdlRndForOob

## Random Forest 
## 
## 3221 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9531201  0.9015534
## 
## Tuning parameter 'mtry' was held constant at a value of 7.549834

View Model Summary - Random Forest (OOB)

summary(mdlRndForOob)

##                 Length Class      Mode     
## call               6   -none-     call     
## type               1   -none-     character
## predicted       3221   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           6442   matrix     numeric  
## oob.times       3221   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3221   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              2   -none-     list

Prediction - Test Data - Random Forest (OOB)

vctRndForOob <- predict(mdlRndForOob, newdata=dfrTstData)
cmxRndForOob <- confusionMatrix(vctRndForOob, dfrTstData$Result)
cmxRndForOob

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      809   27
##   Spam           37  507
##                                           
##                Accuracy : 0.9536          
##                  95% CI : (0.9412, 0.9641)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9026          
##  Mcnemar's Test P-Value : 0.2606          
##                                           
##             Sensitivity : 0.9563          
##             Specificity : 0.9494          
##          Pos Pred Value : 0.9677          
##          Neg Pred Value : 0.9320          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5862          
##    Detection Prevalence : 0.6058          
##       Balanced Accuracy : 0.9529          
##                                           
##        'Positive' Class : Not Spam        
##

Conclusion
From all the above models executed we can see that Default Random Forest gave us an accruacy of 0.9507 , Random Forest Method gave us an accruacy of 0.9529 , General bagging method gave us an accruacy of 0.9486 and Random Forest (OOB) gave us an accruacy of 0.9543. Hence we go with Out of bag model for the prediction.

Wind Up

print("Wind Up")

## [1] "Wind Up"

Random Forest Assignment

Sonia Mathew

September 12, 2017