Assignment on Random Forest

Introduction to Random Forest

The steps to predict using Random Forest:
* Step 1
* Step 2
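
As a rough illustration of the idea behind the method (not the assignment's own code): a random forest trains each tree on a bootstrap sample of the rows and, for classification, predicts the class that receives the most tree votes. The sketch below uses base R only. The vote matrix and the names `vctBootRows`, `mtxVotes` and `majorityVote` are invented for illustration; real votes would come from fitted trees.

```r
# Illustration only: bootstrap sampling and majority voting in base R.
set.seed(1)
intRows <- 10
# Each tree sees a bootstrap sample: row indices drawn with replacement.
vctBootRows <- sample(intRows, size = intRows, replace = TRUE)

# Rows = e-mails, columns = votes from three hypothetical trees.
mtxVotes <- matrix(c("Spam",     "Spam",     "Not Spam",
                     "Not Spam", "Not Spam", "Spam",
                     "Spam",     "Spam",     "Spam"),
                   nrow = 3, byrow = TRUE)

# The forest's prediction is the most frequent vote in each row.
majorityVote <- function(votes) names(which.max(table(votes)))
vctForestPred <- apply(mtxVotes, 1, majorityVote)
vctForestPred
```

An odd number of trees is used here so that a two-class vote cannot tie.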

Problem Definition
The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

So our problem statement is:
1. Create a model to predict whether a mail is spam or not.
2. Test the model on the test dataset.
3. Check the accuracy.

Data Location
This file: ‘spambase.DOCUMENTATION’ at the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html

Data Description

Number of Attributes: 58 (57 continuous, 1 nominal class label)

The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
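
To make the attribute definitions above concrete, here is a minimal sketch that re-derives a word frequency and the three capital-run measures from a raw string. The helpers `wordFreq` and `capitalRuns` are hypothetical and not part of the dataset's tooling. Case-insensitive word matching and returning 0 for a text without capital letters are assumptions, since the documentation does not specify either.

```r
# Percentage of words in `text` equal to `word`. A word is a run of
# alphanumeric characters bounded by non-alphanumeric characters.
wordFreq <- function(text, word) {
  tokens <- unlist(strsplit(text, "[^[:alnum:]]+"))
  tokens <- tokens[nzchar(tokens)]              # drop empty tokens
  if (length(tokens) == 0) return(0)
  # Assumption: matching ignores case.
  100 * sum(tolower(tokens) == tolower(word)) / length(tokens)
}

# Average, longest and total length of uninterrupted capital-letter runs.
capitalRuns <- function(text) {
  runs <- regmatches(text, gregexpr("[[:upper:]]+", text))[[1]]
  lens <- nchar(runs)
  # Assumption: a text without capitals yields zeros.
  if (length(lens) == 0) return(c(average = 0, longest = 0, total = 0))
  c(average = mean(lens), longest = max(lens), total = sum(lens))
}

chrMail <- "FREE money! Claim your free PRIZE now."
wordFreq(chrMail, "free")   # 2 matches out of 7 words
capitalRuns(chrMail)        # runs are FREE, C and PRIZE
```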

Setup

Load Libs

library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("caret")
library(caret)
#install.packages("randomForest")
library(randomForest)
#install.packages("gbm")
library(gbm)
#install.packages("corrgram")
library(corrgram)
#install.packages("corrplot")
library(corrplot)

Functions

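# Count the missing (NA) values in a vector or column.
detectNA <- function(inp) {
  sum(is.na(inp))
}
# Spearman correlation between column x of the global dfrDataset and 'status'.
detectCor <- function(x) {
  cor(as.numeric(dfrDataset[, x]), 
    as.numeric(dfrDataset$status), 
    method="spearman")
}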

Load Dataset

setwd("C:/Users/SarveshKumar/Desktop/R/machine learning/assignment/random forest")
dfrDataset <- read.csv("./spambase.csv", header=TRUE, stringsAsFactors=TRUE)
head(dfrDataset)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 2           0.21              0.28          0.50            0
## 3           0.06              0.00          0.71            0
## 4           0.00              0.00          0.00            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32           0.00             0.00               0.00
## 2          0.14           0.28             0.21               0.07
## 3          1.23           0.19             0.19               0.12
## 4          0.63           0.00             0.31               0.63
## 5          0.63           0.00             0.31               0.63
## 6          1.85           0.00             0.00               1.85
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 2            0.00           0.94              0.21           0.79
## 3            0.64           0.25              0.38           0.45
## 4            0.31           0.63              0.31           0.31
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00             0.00                0.00           0.32
## 2             0.65             0.21                0.14           0.14
## 3             0.12             0.00                1.75           0.06
## 4             0.31             0.00                0.00           0.31
## 5             0.31             0.00                0.00           0.31
## 6             0.00             0.00                0.00           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1               0.00            1.29          1.93             0.00
## 2               0.07            0.28          3.47             0.00
## 3               0.06            1.03          1.36             0.32
## 4               0.00            0.00          3.18             0.00
## 5               0.00            0.00          3.18             0.00
## 6               0.00            0.00          0.00             0.00
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0          0.00            0.00            0
## 2           1.59              0          0.43            0.43            0
## 3           0.51              0          1.16            0.06            0
## 4           0.31              0          0.00            0.00            0
## 5           0.31              0          0.00            0.00            0
## 6           0.00              0          0.00            0.00            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 2             0                0             0             0
## 3             0                0             0             0
## 4             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0              0
## 2              0                0             0              0
## 3              0                0             0              0
## 4              0                0             0              0
## 5              0                0             0              0
## 6              0                0             0              0
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0           0.00
## 2             0            0                    0           0.07
## 3             0            0                    0           0.00
## 4             0            0                    0           0.00
## 5             0            0                    0           0.00
## 6             0            0                    0           0.00
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0             0.00            0
## 2               0            0             0.00            0
## 3               0            0             0.06            0
## 4               0            0             0.00            0
## 5               0            0             0.00            0
## 6               0            0             0.00            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0               0.00                 0         0.00
## 2                 0               0.00                 0         0.00
## 3                 0               0.12                 0         0.06
## 4                 0               0.00                 0         0.00
## 5                 0               0.00                 0         0.00
## 6                 0               0.00                 0         0.00
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1          0.00               0                    0        0.00
## 2          0.00               0                    0        0.00
## 3          0.06               0                    0        0.01
## 4          0.00               0                    0        0.00
## 5          0.00               0                    0        0.00
## 6          0.00               0                    0        0.00
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 2         0.132             0         0.372         0.180         0.048
## 3         0.143             0         0.276         0.184         0.010
## 4         0.137             0         0.137         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 2                      5.114                        101
## 3                      9.821                        485
## 4                      3.537                         40
## 5                      3.537                         40
## 6                      3.000                         15
##   capital_run_length_total status
## 1                      278      1
## 2                     1028      1
## 3                     2259      1
## 4                      191      1
## 5                      191      1
## 6                       54      1

Observations
1. Data loaded successfully.
2. 4601 records found in the data set.

Dataframe Structure

str(dfrDataset)

## 'data.frame':    4601 obs. of  58 variables:
##  $ word_freq_make            : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ word_freq_address         : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ word_freq_all             : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ word_freq_3d              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_our             : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ word_freq_over            : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ word_freq_remove          : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ word_freq_internet        : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ word_freq_order           : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ word_freq_mail            : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ word_freq_receive         : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ word_freq_will            : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ word_freq_people          : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ word_freq_report          : num  0 0.21 0 0 0 0 0 0 0 0 ...
##  $ word_freq_addresses       : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ word_freq_free            : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ word_freq_business        : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_email           : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ word_freq_you             : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ word_freq_credit          : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ word_freq_your            : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ word_freq_font            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_000             : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ word_freq_money           : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ word_freq_hp              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_hpl             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_george          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_650             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_lab             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_labs            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_telnet          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_857             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_data            : num  0 0 0 0 0 0 0 0 0.15 0 ...
##  $ word_freq_415             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_85              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_technology      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_1999            : num  0 0.07 0 0 0 0 0 0 0 0 ...
##  $ word_freq_parts           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_pm              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_direct          : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_cs              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_meeting         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_original        : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ word_freq_project         : num  0 0 0 0 0 0 0 0 0 0.06 ...
##  $ word_freq_re              : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_edu             : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_table           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_conference      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_.               : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ char_freq_..1             : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ char_freq_..2             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_..3             : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ char_freq_..4             : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ char_freq_..5             : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capital_run_length_average: num  3.76 5.11 9.82 3.54 3.54 ...
##  $ capital_run_length_longest: int  61 101 485 40 40 15 4 11 445 43 ...
##  $ capital_run_length_total  : int  278 1028 2259 191 191 54 112 49 1257 749 ...
##  $ status                    : int  1 1 1 1 1 1 1 1 1 1 ...

# Random Forest handles non-numeric as well as numeric data

Observations
All 58 columns are numeric: 55 are of class num and 3 are of class int.
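
The class counts can also be read off programmatically rather than by scanning the `str` output. A minimal sketch, assuming a small made-up data frame `dfrToy` in place of `dfrDataset`:

```r
# Toy data frame standing in for dfrDataset (the values are made up).
dfrToy <- data.frame(word_freq_make = c(0, 0.21, 0.06),
                     capital_run_length_total = c(278L, 1028L, 2259L),
                     status = c(1L, 1L, 0L))

# One class name per column, then a count per class.
vctClasses <- vapply(dfrToy, function(col) class(col)[1], character(1))
table(vctClasses)
```

Run on the full dataset, the same two lines would tabulate the classes of all 58 columns.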

Dataframe Summary

lapply(dfrDataset, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1046  0.0000  4.5400 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.213   0.000  14.280 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2807  0.4200  5.1000 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06542  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3122  0.3800 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0959  0.0000  5.8800 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1142  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1053  0.0000 11.1100 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09007 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2394  0.1600 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05982 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1000  0.5417  0.8000  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09393 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05863  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0492  0.0000  4.4100 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2488  0.1000 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1426  0.0000  7.1400 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1847  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.662   2.640  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08558  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2200  0.8098  1.2700 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1212  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1016  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09427  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5495  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2654  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7673  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1248  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09892  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1029  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06475  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04705 0.00000 4.76000 
## 
## $word_freq_data
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09723  0.00000 18.18000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04784 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1054  0.0000 20.0000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09748 0.00000 7.69000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.137   0.000   6.890 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0132  0.0000  8.3300 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07863  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06483 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04367 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1323  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0461  0.0000  3.5700 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0792  0.0000 20.0000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3012  0.1100 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1798  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.005444 0.000000 2.170000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03187  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03857 0.00000 4.38500 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.065   0.139   0.188   9.752 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01698 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2691  0.3150 32.4780 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07581 0.05200 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04424  0.00000 19.82900 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.588    2.276    5.191    3.706 1102.500 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   52.17   43.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    95.0   283.3   266.0 15841.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.394   1.000   1.000

Missing Data

lapply(dfrDataset, FUN=detectNA)

## $word_freq_make
## [1] 0
## 
## $word_freq_address
## [1] 0
## 
## $word_freq_all
## [1] 0
## 
## $word_freq_3d
## [1] 0
## 
## $word_freq_our
## [1] 0
## 
## $word_freq_over
## [1] 0
## 
## $word_freq_remove
## [1] 0
## 
## $word_freq_internet
## [1] 0
## 
## $word_freq_order
## [1] 0
## 
## $word_freq_mail
## [1] 0
## 
## $word_freq_receive
## [1] 0
## 
## $word_freq_will
## [1] 0
## 
## $word_freq_people
## [1] 0
## 
## $word_freq_report
## [1] 0
## 
## $word_freq_addresses
## [1] 0
## 
## $word_freq_free
## [1] 0
## 
## $word_freq_business
## [1] 0
## 
## $word_freq_email
## [1] 0
## 
## $word_freq_you
## [1] 0
## 
## $word_freq_credit
## [1] 0
## 
## $word_freq_your
## [1] 0
## 
## $word_freq_font
## [1] 0
## 
## $word_freq_000
## [1] 0
## 
## $word_freq_money
## [1] 0
## 
## $word_freq_hp
## [1] 0
## 
## $word_freq_hpl
## [1] 0
## 
## $word_freq_george
## [1] 0
## 
## $word_freq_650
## [1] 0
## 
## $word_freq_lab
## [1] 0
## 
## $word_freq_labs
## [1] 0
## 
## $word_freq_telnet
## [1] 0
## 
## $word_freq_857
## [1] 0
## 
## $word_freq_data
## [1] 0
## 
## $word_freq_415
## [1] 0
## 
## $word_freq_85
## [1] 0
## 
## $word_freq_technology
## [1] 0
## 
## $word_freq_1999
## [1] 0
## 
## $word_freq_parts
## [1] 0
## 
## $word_freq_pm
## [1] 0
## 
## $word_freq_direct
## [1] 0
## 
## $word_freq_cs
## [1] 0
## 
## $word_freq_meeting
## [1] 0
## 
## $word_freq_original
## [1] 0
## 
## $word_freq_project
## [1] 0
## 
## $word_freq_re
## [1] 0
## 
## $word_freq_edu
## [1] 0
## 
## $word_freq_table
## [1] 0
## 
## $word_freq_conference
## [1] 0
## 
## $char_freq_.
## [1] 0
## 
## $char_freq_..1
## [1] 0
## 
## $char_freq_..2
## [1] 0
## 
## $char_freq_..3
## [1] 0
## 
## $char_freq_..4
## [1] 0
## 
## $char_freq_..5
## [1] 0
## 
## $capital_run_length_average
## [1] 0
## 
## $capital_run_length_longest
## [1] 0
## 
## $capital_run_length_total
## [1] 0
## 
## $status
## [1] 0

Observations
1. None of the 58 columns in the ‘spambase’ dataset contains an NA value.
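
With 58 columns, the per-column listing above is long. A more compact check, sketched here on a made-up data frame `dfrToy`, returns one named vector and can be filtered to the columns that actually have missing values:

```r
# Toy data with a few missing values (made up for illustration).
dfrToy <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6), c = c(NA, NA, 9))

# NA count per column as a single named vector.
vctNaCounts <- colSums(is.na(dfrToy))
vctNaCounts

# Only the columns that contain at least one NA.
vctNaCounts[vctNaCounts > 0]

# Single TRUE/FALSE answer for the whole data frame.
anyNA(dfrToy)
```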

Check output

dfrQltyFreq <- summarize(group_by(dfrDataset, status), count=n())
# bar chart of record counts per status value
ggplot(dfrQltyFreq, aes(x=status, y=count)) +
    geom_bar(stat="identity", aes(fill=count)) +
    labs(title="Status Frequency Distribution") +
    labs(x="Status") +
    labs(y="Counts")

Find Correlations

## find correlations
vcnCorsData <- abs(sapply(colnames(dfrDataset), detectCor)) #absolute value
summary(vcnCorsData)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002525 0.148801 0.253256 0.276879 0.354708 1.000000

Show Correlations

vcnCorsData

##             word_freq_make          word_freq_address 
##                 0.24069974                 0.29750940 
##              word_freq_all               word_freq_3d 
##                 0.33283147                 0.09077776 
##              word_freq_our             word_freq_over 
##                 0.40913946                 0.31864550 
##           word_freq_remove         word_freq_internet 
##                 0.51877779                 0.34379623 
##            word_freq_order             word_freq_mail 
##                 0.30073703                 0.29682394 
##          word_freq_receive             word_freq_will 
##                 0.35496682                 0.14847653 
##           word_freq_people           word_freq_report 
##                 0.21287588                 0.14977533 
##        word_freq_addresses             word_freq_free 
##                 0.26515743                 0.50416922 
##         word_freq_business            word_freq_email 
##                 0.35290749                 0.29909391 
##              word_freq_you           word_freq_credit 
##                 0.36110406                 0.32418657 
##             word_freq_your             word_freq_font 
##                 0.50159062                 0.13797471 
##              word_freq_000            word_freq_money 
##                 0.42580256                 0.47215455 
##               word_freq_hp              word_freq_hpl 
##                 0.39981558                 0.34188069 
##           word_freq_george              word_freq_650 
##                 0.35393063                 0.22619064 
##              word_freq_lab             word_freq_labs 
##                 0.22068802                 0.24580530 
##           word_freq_telnet              word_freq_857 
##                 0.20467400                 0.16983798 
##             word_freq_data              word_freq_415 
##                 0.15756347                 0.15802818 
##               word_freq_85       word_freq_technology 
##                 0.21413087                 0.16680254 
##             word_freq_1999            word_freq_parts 
##                 0.26070752                 0.00252536 
##               word_freq_pm           word_freq_direct 
##                 0.14721389                 0.02813193 
##               word_freq_cs          word_freq_meeting 
##                 0.14453750                 0.19574176 
##         word_freq_original          word_freq_project 
##                 0.10781412                 0.14453744 
##               word_freq_re              word_freq_edu 
##                 0.07176763                 0.19702549 
##            word_freq_table       word_freq_conference 
##                 0.02266674                 0.13903044 
##                char_freq_.              char_freq_..1 
##                 0.05683530                 0.03263555 
##              char_freq_..2              char_freq_..3 
##                 0.11122690                 0.59785363 
##              char_freq_..4              char_freq_..5 
##                 0.56563314                 0.26668614 
## capital_run_length_average capital_run_length_longest 
##                 0.48794983                 0.51515693 
##   capital_run_length_total                     status 
##                 0.44397367                 1.00000000

Plot Correlations

corrgram(dfrDataset)

Observations
1. Pairwise correlations have been plotted for all variables, including ‘status’.
2. With respect to the ‘status’ variable, a medium to high correlation can be seen for some variables.

More than Medium Correlations

vcnCorsData[vcnCorsData>0.4]

##              word_freq_our           word_freq_remove 
##                  0.4091395                  0.5187778 
##             word_freq_free             word_freq_your 
##                  0.5041692                  0.5015906 
##              word_freq_000            word_freq_money 
##                  0.4258026                  0.4721546 
##              char_freq_..3              char_freq_..4 
##                  0.5978536                  0.5656331 
## capital_run_length_average capital_run_length_longest 
##                  0.4879498                  0.5151569 
##   capital_run_length_total                     status 
##                  0.4439737                  1.0000000
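
Note that the filtered list includes ‘status’ itself, whose correlation with itself is trivially 1. A possible variant, sketched below on synthetic data, passes the data frame in explicitly instead of relying on a global, leaves the target out, and sorts the result. The helper `rankCors` and the toy columns `strong` and `noise` are hypothetical.

```r
# Absolute Spearman correlation of every predictor with the target column,
# sorted from strongest to weakest. The target itself is excluded.
rankCors <- function(dfr, target) {
  predictors <- setdiff(colnames(dfr), target)
  cors <- sapply(predictors, function(col) {
    cor(dfr[[col]], dfr[[target]], method = "spearman")
  })
  sort(abs(cors), decreasing = TRUE)
}

# Synthetic data: 'strong' follows the target, 'noise' is unrelated to it.
set.seed(42)
status <- rep(c(0, 1), each = 50)
dfrToy <- data.frame(strong = status + rnorm(100, sd = 0.3),
                     noise  = rnorm(100),
                     status = status)

vcnToyCors <- rankCors(dfrToy, "status")
vcnToyCors[vcnToyCors > 0.4]
```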

Create Column IsSpam

dfrDataset <- mutate(dfrDataset, IsSpam= ifelse(dfrDataset$status ==0,'Not Spam',
                                'Spam'))
dfrDataset$IsSpam <- as.factor(dfrDataset$IsSpam)
table(dfrDataset$IsSpam)

## 
## Not Spam     Spam 
##     2788     1813

#By now you should have checked for correlation, removed columns if required, and checked for data imbalance.
#Also remove columns with many NA values.

Observations
1. New column ‘IsSpam’ added from the ‘status’ variable using an if-else condition.
2. Not Spam = 2788 and Spam = 1813 entries found.
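
To judge the class imbalance mentioned above, the two counts can be turned into percentages. This small sketch only reuses the counts reported by `table(dfrDataset$IsSpam)`:

```r
# Class counts as reported above.
vctCounts <- c("Not Spam" = 2788, "Spam" = 1813)

# Share of each class in percent, rounded to one decimal place.
vctShare <- round(100 * vctCounts / sum(vctCounts), 1)
vctShare
```

Roughly 61% of the records are non-spam and 39% are spam, which matches the mean of 0.394 shown for ‘status’ in the summary.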

Dataset Split

set.seed(707)
vctTrnRecs <- createDataPartition(y=dfrDataset$status, p=0.8, list=FALSE)
dfrTrnData <- dfrDataset[vctTrnRecs,]
dfrTstData <- dfrDataset[-vctTrnRecs,]

Observations
1. Split dataset into training & test datasets.
2. 80% of the data is now in the training set and 20% is in the test set, on which the trained model's predictions will be checked.
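
The split above partitions on the integer ‘status’ column. As far as the caret documentation describes it, `createDataPartition` samples within each class when `y` is a factor and within percentile groups when `y` is numeric, so passing the factor ‘IsSpam’ may be the more direct way to keep class proportions equal in both sets. The base-R sketch below shows the underlying idea of a stratified split. `stratifiedSplit` is a hypothetical helper and not a caret function, and the labels are made up.

```r
# Draw a share p of the row indices from each class separately.
stratifiedSplit <- function(labels, p = 0.8) {
  idxByClass <- split(seq_along(labels), labels)
  trainIdx <- lapply(idxByClass, function(idx) {
    idx[sample.int(length(idx), size = floor(p * length(idx)))]
  })
  sort(unlist(trainIdx, use.names = FALSE))
}

set.seed(707)
# Toy labels: 60 non-spam and 40 spam records.
fctToy <- factor(rep(c("Not Spam", "Spam"), times = c(60, 40)))
vctTrn <- stratifiedSplit(fctToy)

table(fctToy[vctTrn])    # class counts in the training part
table(fctToy[-vctTrn])   # class counts in the test part
```

Both parts keep the 60/40 class ratio of the toy labels, which is the property a stratified split is meant to preserve.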

Training Dataset RowCount & ColCount

dim(dfrTrnData)

## [1] 3681   59

Observations
1. The training data set has 3681 records, which is 80% of the data. Its 59 columns are the original 58 plus ‘IsSpam’.

Testing Dataset RowCount & ColCount

dim(dfrTstData)

## [1] 920  59

Observations
1. The test data set has 920 records, which is 20% of the data.
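
As a quick sanity check on the two `dim()` results, the row counts should add up to the 4601 records loaded earlier, and the training share should be close to 80%. The variable names below are chosen for illustration only:

```r
# Row counts taken from the dim() output above.
intTrnRows <- 3681
intTstRows <- 920

# Total number of records across both parts.
intAllRows <- intTrnRows + intTstRows
intAllRows

# Training share in percent.
numTrnShare <- 100 * intTrnRows / intAllRows
round(numTrnShare, 2)
```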

Training Dataset Head

head(dfrTrnData)

##    word_freq_make word_freq_address word_freq_all word_freq_3d
## 2            0.21              0.28          0.50            0
## 3            0.06              0.00          0.71            0
## 4            0.00              0.00          0.00            0
## 10           0.06              0.12          0.77            0
## 11           0.00              0.00          0.00            0
## 12           0.00              0.00          0.25            0
##    word_freq_our word_freq_over word_freq_remove word_freq_internet
## 2           0.14           0.28             0.21               0.07
## 3           1.23           0.19             0.19               0.12
## 4           0.63           0.00             0.31               0.63
## 10          0.19           0.32             0.38               0.00
## 11          0.00           0.00             0.96               0.00
## 12          0.38           0.25             0.25               0.00
##    word_freq_order word_freq_mail word_freq_receive word_freq_will
## 2             0.00           0.94              0.21           0.79
## 3             0.64           0.25              0.38           0.45
## 4             0.31           0.63              0.31           0.31
## 10            0.06           0.00              0.00           0.64
## 11            0.00           1.92              0.96           0.00
## 12            0.00           0.00              0.12           0.12
##    word_freq_people word_freq_report word_freq_addresses word_freq_free
## 2              0.65             0.21                0.14           0.14
## 3              0.12             0.00                1.75           0.06
## 4              0.31             0.00                0.00           0.31
## 10             0.25             0.00                0.12           0.00
## 11             0.00             0.00                0.00           0.00
## 12             0.12             0.00                0.00           0.00
##    word_freq_business word_freq_email word_freq_you word_freq_credit
## 2                0.07            0.28          3.47             0.00
## 3                0.06            1.03          1.36             0.32
## 4                0.00            0.00          3.18             0.00
## 10               0.00            0.12          1.67             0.06
## 11               0.00            0.96          3.84             0.00
## 12               0.00            0.00          1.16             0.00
##    word_freq_your word_freq_font word_freq_000 word_freq_money
## 2            1.59              0          0.43            0.43
## 3            0.51              0          1.16            0.06
## 4            0.31              0          0.00            0.00
## 10           0.71              0          0.19            0.00
## 11           0.96              0          0.00            0.00
## 12           0.77              0          0.00            0.00
##    word_freq_hp word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 2             0             0                0             0             0
## 3             0             0                0             0             0
## 4             0             0                0             0             0
## 10            0             0                0             0             0
## 11            0             0                0             0             0
## 12            0             0                0             0             0
##    word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 2               0                0             0              0
## 3               0                0             0              0
## 4               0                0             0              0
## 10              0                0             0              0
## 11              0                0             0              0
## 12              0                0             0              0
##    word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 2              0            0                    0           0.07
## 3              0            0                    0           0.00
## 4              0            0                    0           0.00
## 10             0            0                    0           0.00
## 11             0            0                    0           0.00
## 12             0            0                    0           0.00
##    word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 2                0            0             0.00            0
## 3                0            0             0.06            0
## 4                0            0             0.00            0
## 10               0            0             0.00            0
## 11               0            0             0.96            0
## 12               0            0             0.00            0
##    word_freq_meeting word_freq_original word_freq_project word_freq_re
## 2                  0               0.00              0.00         0.00
## 3                  0               0.12              0.00         0.06
## 4                  0               0.00              0.00         0.00
## 10                 0               0.00              0.06         0.00
## 11                 0               0.00              0.00         0.00
## 12                 0               0.00              0.00         0.00
##    word_freq_edu word_freq_table word_freq_conference char_freq_.
## 2           0.00               0                    0       0.000
## 3           0.06               0                    0       0.010
## 4           0.00               0                    0       0.000
## 10          0.00               0                    0       0.040
## 11          0.00               0                    0       0.000
## 12          0.00               0                    0       0.022
##    char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 2          0.132             0         0.372         0.180         0.048
## 3          0.143             0         0.276         0.184         0.010
## 4          0.137             0         0.137         0.000         0.000
## 10         0.030             0         0.244         0.081         0.000
## 11         0.000             0         0.462         0.000         0.000
## 12         0.044             0         0.663         0.000         0.000
##    capital_run_length_average capital_run_length_longest
## 2                       5.114                        101
## 3                       9.821                        485
## 4                       3.537                         40
## 10                      1.729                         43
## 11                      1.312                          6
## 12                      1.243                         11
##    capital_run_length_total status IsSpam
## 2                      1028      1   Spam
## 3                      2259      1   Spam
## 4                       191      1   Spam
## 10                      749      1   Spam
## 11                       21      1   Spam
## 12                      184      1   Spam

Testing Dataset Head

head(dfrTstData)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
## 7           0.00              0.00          0.00            0
## 8           0.00              0.00          0.00            0
## 9           0.15              0.00          0.46            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32              0             0.00               0.00
## 5          0.63              0             0.31               0.63
## 6          1.85              0             0.00               1.85
## 7          1.92              0             0.00               0.00
## 8          1.88              0             0.00               1.88
## 9          0.61              0             0.30               0.00
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
## 7            0.00           0.64              0.96           1.28
## 8            0.00           0.00              0.00           0.00
## 9            0.92           0.76              0.76           0.92
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00                0                   0           0.32
## 5             0.31                0                   0           0.31
## 6             0.00                0                   0           0.00
## 7             0.00                0                   0           0.96
## 8             0.00                0                   0           0.00
## 9             0.00                0                   0           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1                  0            1.29          1.93             0.00
## 5                  0            0.00          3.18             0.00
## 6                  0            0.00          0.00             0.00
## 7                  0            0.32          3.85             0.00
## 8                  0            0.00          0.00             0.00
## 9                  0            0.15          1.23             3.53
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0             0            0.00            0
## 5           0.31              0             0            0.00            0
## 6           0.00              0             0            0.00            0
## 7           0.64              0             0            0.00            0
## 8           0.00              0             0            0.00            0
## 9           2.00              0             0            0.15            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
## 7             0                0             0             0
## 8             0                0             0             0
## 9             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0           0.00
## 5              0                0             0           0.00
## 6              0                0             0           0.00
## 7              0                0             0           0.00
## 8              0                0             0           0.00
## 9              0                0             0           0.15
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0              0
## 5             0            0                    0              0
## 6             0            0                    0              0
## 7             0            0                    0              0
## 8             0            0                    0              0
## 9             0            0                    0              0
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0                0            0
## 5               0            0                0            0
## 6               0            0                0            0
## 7               0            0                0            0
## 8               0            0                0            0
## 9               0            0                0            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0                0.0                 0            0
## 5                 0                0.0                 0            0
## 6                 0                0.0                 0            0
## 7                 0                0.0                 0            0
## 8                 0                0.0                 0            0
## 9                 0                0.3                 0            0
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1             0               0                    0           0
## 5             0               0                    0           0
## 6             0               0                    0           0
## 7             0               0                    0           0
## 8             0               0                    0           0
## 9             0               0                    0           0
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
## 7         0.054             0         0.164         0.054         0.000
## 8         0.206             0         0.000         0.000         0.000
## 9         0.271             0         0.181         0.203         0.022
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 5                      3.537                         40
## 6                      3.000                         15
## 7                      1.671                          4
## 8                      2.450                         11
## 9                      9.744                        445
##   capital_run_length_total status IsSpam
## 1                      278      1   Spam
## 5                      191      1   Spam
## 6                       54      1   Spam
## 7                      112      1   Spam
## 8                       49      1   Spam
## 9                     1257      1   Spam

Training Dataset Summary

lapply(dfrTrnData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09885 0.00000 4.54000 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2027  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.276   0.400   5.100 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07702  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3066  0.3700 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09361 0.00000 3.57000 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1128  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1111  0.0000 11.1100 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09092 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2315  0.1400 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06053 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.090   0.533   0.780   7.690 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09264 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05711  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0495  0.0000  4.4100 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2504  0.1000 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.14    0.00    7.14 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1867  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.681   2.670  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08879  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2000  0.8052  1.2800 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1237  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1015  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09806  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5447  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2772  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7443  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1324  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09591  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1049  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06735  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04803 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1003  0.0000 18.1800 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04888 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.111   0.000  20.000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1027  0.0000  7.6900 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1399  0.0000  6.8900 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01477 0.00000 8.33000 
## 
## $word_freq_pm
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0794  0.0000  9.7500 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06296 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04437 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1302  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04856 0.00000 3.57000 
## 
## $word_freq_project
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08274  0.00000 20.00000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3028  0.1200 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1815  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.005186 0.000000 2.170000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.02867  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0403  0.0000  4.3850 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0650  0.1398  0.1890  9.7520 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01525 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2693  0.3110 32.4780 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07428 0.05000 5.30000 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03956  0.00000 13.12900 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.576    2.250    5.180    3.697 1102.500 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   14.00   52.23   43.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    93.0   282.7   266.0 15841.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3953  1.0000  1.0000 
## 
## $IsSpam
## Not Spam     Spam 
##     2226     1455

Testing Dataset Summary

lapply(dfrTstData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1274  0.0000  4.0000 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2542  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2994  0.5000  4.5400 
## 
## $word_freq_3d
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01904 0.00000 7.18000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3349  0.4300  7.1400 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1051  0.0000  5.8800 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1199  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08196 0.00000 4.00000 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08664 0.00000 3.33000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2709  0.2200 11.1100 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.00000 2.00000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1650  0.5763  0.8625  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09909 0.00000 2.94000 
## 
## $word_freq_report
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06469 0.00000 5.55000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04803 0.00000 2.31000 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2427  0.1100 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1529  0.0000  4.8700 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.177   0.000   4.160 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.585   2.500  14.000 
## 
## $word_freq_credit
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07271 0.00000 6.25000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.270   0.828   1.260   8.000 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1112  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1022  0.0000  3.3800 
## 
## $word_freq_money
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07909 0.00000 4.41000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5688  0.0000 20.0000 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2181  0.0000  7.6900 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.8594  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09478 0.00000 4.76000 
## 
## $word_freq_lab
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1109  0.0000 10.0000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09458 0.00000 4.76000 
## 
## $word_freq_telnet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05438 0.00000 4.76000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04312 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08496 0.00000 8.33000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04366 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08305 0.00000 4.76000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07655 0.00000 4.76000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.125   0.000   3.700 
## 
## $word_freq_parts
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.006913 0.000000 1.560000 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07554  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07235 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04086 0.00000 4.75000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1408  0.0000  7.6900 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03627 0.00000 1.69000 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06501 0.00000 4.54000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.295   0.070  16.660 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1731  0.0000  9.0900 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.006478 0.000000 2.120000 
## 
## $word_freq_conference
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04469 0.00000 8.33000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03168 0.00000 3.67200 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0675  0.1360  0.1782  4.2710 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0239  0.0000  2.7770 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2681  0.3352  5.8280 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08193 0.06125 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06296  0.00000 19.82900 
## 
## $capital_run_length_average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.625   2.333   5.236   3.796 443.666 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   16.00   51.94   44.00 1325.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    37.0   101.5   285.5   259.2  9088.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3891  1.0000  1.0000 
## 
## $IsSpam
## Not Spam     Spam 
##      562      358

Random Forest Approach

Default Method

myCtrl <- trainControl(method="cv", number=10, repeats=3)  # repeats is only used when method="repeatedcv"

Increasing the number of iterations tends to increase accuracy, but the model then takes longer to build. train() requires the caret package; the random forest method requires the randomForest package.

m1 <- train(predictor~., data=dataFrame, method="rf",
            verbose=FALSE, trControl=myCtrl)

Boost Method

cvCtrl <- trainControl(method="repeatedcv", number=10, repeats=3)
m2 <- train(predictor~., data=dataFrame, method="gbm", verbose=FALSE, trControl=cvCtrl)

Custom Algorithm … notice method is not mentioned here

# when method is omitted, train() falls back to its default, which is "rf"
myCtrl <- trainControl(method="oob", number=10, repeats=3)
m3 <- train(predictor~., data=dataFrame, tuneGrid=data.frame(mtry=10), trControl=myCtrl)
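
To make the resampling settings in the trainControl() calls above more concrete, here is a rough base R sketch of what 10 folds repeated 3 times means for a training set of 3681 records. The names intFolds, intRepeats and lstFolds are hypothetical, and plain random fold assignment is assumed; caret builds its own folds internally and may assign them differently.

```r
set.seed(707)
intRecs    <- 3681   # number of training records
intFolds   <- 10     # corresponds to number=10
intRepeats <- 3      # corresponds to repeats=3
# For each repeat, give every record a fold label from 1 to 10 in random order
lstFolds <- lapply(seq_len(intRepeats), function(r) {
  sample(rep(seq_len(intFolds), length.out = intRecs))
})
# Fold sizes for the first repeat
table(lstFolds[[1]])
# Total number of models fitted during resampling
intFolds * intRepeats
```

Each model is trained on nine folds and checked on the remaining one, so more folds or more repeats mean more models to fit and a longer run time.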

We could also try one of these if required

Support Vector Machines Model

myCtrl <- trainControl(method="cv", number=10, repeats=3)
# caret has no plain "svm" method; "svmRadial" is one of its SVM variants
o1 <- train(predictor~., data=dataFrame, method="svmRadial", verbose=FALSE, trControl=myCtrl)

KNN Model

myCtrl <- trainControl(method="cv", number=10, repeats=3)
o2 <- train(predictor~., data=dataFrame, method="knn", verbose=FALSE, trControl=myCtrl)

Bagged Model

myCtrl <- trainControl(method="cv", number=10, repeats=3)
# "treebag" is bagged CART; the generic "bag" method also needs a bagControl specification
o3 <- train(predictor~., data=dataFrame, method="treebag", verbose=FALSE, trControl=myCtrl)

Create Model - Random Forest (Default)

## set seed
set.seed(707)
# mtry
myMtry=sqrt(ncol(dfrTrnData)-2)  # square root of the number of predictor columns; minus 2 because status and IsSpam are not predictors
myNtrees=500
# start time
vctProcStrt <- proc.time()
# proc.time() returns the CPU and elapsed time used by the R process so far.

# random forest (default)
#mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData)
mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData, 
                             mtry=myMtry, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 10.92"

Observations
1. Creating the model by the default method took about 10 seconds: 10.92 in the run shown above, and 10.06 and 9.75 seconds in two other runs.
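
The timing code above subtracts element [1] of proc.time(), which is the user CPU time rather than the wall-clock time. The sketch below, with hypothetical names vctDemoStrt, vctDemoEnds and vctDemoDiff and an arbitrary workload, shows how the named elements can be read instead of numeric positions.

```r
vctDemoStrt <- proc.time()
# Arbitrary workload standing in for model creation
vctDemoSorted <- sort(runif(1e6))
vctDemoEnds <- proc.time()
# Element-wise difference between the two readings
vctDemoDiff <- vctDemoEnds - vctDemoStrt
names(vctDemoDiff)[1:3]
print(paste("User CPU seconds ...", vctDemoDiff["user.self"]))
print(paste("Elapsed seconds ...", vctDemoDiff["elapsed"]))
```

The two figures can differ noticeably when the machine is busy, so it helps to say which one is being reported.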

View Model - Default Random Forest

mdlRndForDef

## 
## Call:
##  randomForest(formula = IsSpam ~ . - status, data = dfrTrnData,      mtry = myMtry, ntree = myNtrees) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 4.62%
## Confusion matrix:
##          Not Spam Spam class.error
## Not Spam     2162   64  0.02875112
## Spam          106 1349  0.07285223

# As a rule of thumb, accuracy above 70% is considered good.

Observations
1. The OOB estimate of the error rate is only 4.62%, which means a high accuracy of around 95.38% was achieved.
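
The figures in the model printout can be checked by hand from the confusion matrix. The sketch below retypes the counts from the output above into a hypothetical matrix cmxDemoOob (rows are actual classes, columns are predicted classes) and uses base R only. The note on mtry is an interpretation: the reported value of 8 is consistent with the square root of 57 being rounded to the nearest whole number.

```r
# Counts copied from the OOB confusion matrix printed above
cmxDemoOob <- matrix(c(2162, 106, 64, 1349), nrow = 2,
                     dimnames = list(Actual = c("Not Spam", "Spam"),
                                     Predicted = c("Not Spam", "Spam")))
# Overall OOB error: share of records off the diagonal
dblOobErr <- 1 - sum(diag(cmxDemoOob)) / sum(cmxDemoOob)
round(100 * dblOobErr, 2)          # 4.62
# Per-class error: misclassified records divided by the actual class size
vctClsErr <- 1 - diag(cmxDemoOob) / rowSums(cmxDemoOob)
round(vctClsErr, 8)
# mtry as computed in the model creation code: 59 columns minus 2
dblMtry <- sqrt(59 - 2)
round(dblMtry)                     # 8
```

The class errors show that spam is misclassified more often (about 7.3%) than non-spam (about 2.9%), which the single overall error rate hides.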

View Model Summary - Default Random Forest

summary(mdlRndForDef)

##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted       3681   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           7362   matrix numeric  
## oob.times       3681   -none- numeric  
## classes            2   -none- character
## importance        57   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               3681   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

#mdlRndForDef[3]
# summary() prints the attributes of the model and gives the length of each one.
# To see the values stored in an attribute,
# type, for example, mdlRndForDef$ntree[1] in the console.

Prediction - Test Data - Random Forest (Default)

vctRndForDef <- predict(mdlRndForDef, newdata=dfrTstData)
cmxRndForDef  <- confusionMatrix(vctRndForDef, dfrTstData$IsSpam)
cmxRndForDef

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      534   20
##   Spam           28  338
##                                           
##                Accuracy : 0.9478          
##                  95% CI : (0.9314, 0.9613)
##     No Information Rate : 0.6109          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8907          
##  Mcnemar's Test P-Value : 0.3123          
##                                           
##             Sensitivity : 0.9502          
##             Specificity : 0.9441          
##          Pos Pred Value : 0.9639          
##          Neg Pred Value : 0.9235          
##              Prevalence : 0.6109          
##          Detection Rate : 0.5804          
##    Detection Prevalence : 0.6022          
##       Balanced Accuracy : 0.9472          
##                                           
##        'Positive' Class : Not Spam        
##

# Look at Accuracy and its 95% CI. The highest accuracy is 75.
# P-Value [Acc > NIR] < 0.05, hence your model is good.

Observations
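1. The confusion matrix shows only 20 spam e-mails predicted as not spam and 28 non-spam e-mails predicted as spam (48 errors out of 920), which is a low error rate.
2. P-Value [Acc > NIR] is below 2e-16, far under 0.05, so the accuracy is significantly better than the no-information rate of 61.09%. McNemar's test p-value of 0.31 is above 0.05, so there is no evidence that one type of error is more frequent than the other.
3. Accuracy on the test data is also very high at around 94.78% (balanced accuracy 94.72%), close to the 95.38% OOB accuracy on the train data. Hence the model is good.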
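
The statistics in the output can be reproduced from the four counts of the confusion matrix. The following is a sketch in base R, assuming the counts are copied by hand from the output; mcnemar.test() comes from the stats package, and the variable names are illustrative.

```r
# test confusion matrix from the output (rows = prediction, columns = reference)
mtxTst <- matrix(c(534, 28, 20, 338), nrow = 2,
                 dimnames = list(Prediction = c("Not Spam", "Spam"),
                                 Reference = c("Not Spam", "Spam")))
numAcc <- sum(diag(mtxTst)) / sum(mtxTst)   # overall accuracy
numSen <- mtxTst[1, 1] / sum(mtxTst[, 1])   # sensitivity, positive class = Not Spam
numSpe <- mtxTst[2, 2] / sum(mtxTst[, 2])   # specificity
numBal <- (numSen + numSpe) / 2             # balanced accuracy
# McNemar's test compares the two off-diagonal error counts (20 vs 28)
numMcn <- mcnemar.test(mtxTst)$p.value
print(round(c(Accuracy = numAcc, Sensitivity = numSen, Specificity = numSpe,
              Balanced = numBal, McNemar = numMcn), 4))
```

Accuracy and balanced accuracy are different quantities, so it helps to state which one is being quoted when comparing test and train results.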

Create Model - Random Forest (RFM)

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest via caret train()
myControl <- trainControl(method="cv", number=10, repeats=3)  # change these params to try to increase accuracy: instead of "cv" you can use "repeatedcv", number can be more than 10 and repeats more than 3 (repeats is only used by "repeatedcv")
myMetric <- "Accuracy"   # metric used to compare models
myMtry <- sqrt(ncol(dfrTrnData)-2) # number of variables tried at each split; can be changed
myNtrees <- 500  # increase this to try to increase accuracy; change only one thing at a time
myTuneGrid <-  expand.grid(.mtry=myMtry)  # this is how mtry is passed to the train method
mdlRndForRfm <- train(IsSpam~.-status, data=dfrTrnData, method="rf",
                        verbose=F, metric=myMetric, trControl=myControl,
                        tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 94.19"

Observations
1. It took 94.19 seconds to create the model by the Random Forest (RFM) method, which is much higher than the default method.
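
The timing code above prints element [1] of proc.time(), which is user CPU time rather than wall-clock time. Below is a sketch of reading the elapsed time as well; the sum over a sequence is only a placeholder workload standing in for the model fit, and vctProcDiff and vctSysTime are illustrative names.

```r
# start time
vctProcStrt <- proc.time()
# placeholder workload standing in for the model fit
numDummy <- sum(sqrt(seq_len(1e6)))
# end time
vctProcEnds <- proc.time()
vctProcDiff <- vctProcEnds - vctProcStrt
# element 1 is user CPU time, "elapsed" is wall-clock time
print(paste("User CPU ...", vctProcDiff[["user.self"]]))
print(paste("Elapsed ...", vctProcDiff[["elapsed"]]))
# system.time() wraps the same measurement in a single call
vctSysTime <- system.time(sum(sqrt(seq_len(1e6))))
print(vctSysTime[["elapsed"]])
```

Timings vary from run to run, so small differences between models are better compared over several runs.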

View Model - Random Forest (RFM)

mdlRndForRfm

## Random Forest 
## 
## 3681 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9521943  0.8994474
## 
## Tuning parameter 'mtry' was held constant at a value of 7.549834

Observations
1. A very high cross-validated accuracy of around 95.22% was seen, which is slightly less than the default method, whose OOB accuracy was 95.38%.
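
The value 7.549834 reported for mtry is sqrt(57), which is not a whole number. A hedged sketch of two alternatives follows: rounding down to an integer, or giving train() several candidate values so that resampling can choose between them. The count of 57 predictors is inferred from the output, and the candidate values 5, 10 and 15 are arbitrary picks for illustration.

```r
# number of predictors, inferred from sqrt(57) = 7.549834 in the output
numPredictors <- 57
# integer version of the usual sqrt(p) rule for classification
myMtry <- floor(sqrt(numPredictors))
print(myMtry)
# several candidate values; train() would evaluate one model per row
myTuneGrid <- expand.grid(.mtry = c(5, myMtry, 10, 15))
print(myTuneGrid)
```

With more than one row in the grid, the printed model would list an accuracy per mtry value instead of reporting that mtry was held constant.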

View Model Summary - Random Forest (RFM)

summary(mdlRndForRfm)

##                 Length Class      Mode     
## call               6   -none-     call     
## type               1   -none-     character
## predicted       3681   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           7362   matrix     numeric  
## oob.times       3681   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3681   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              2   -none-     list

Prediction - Test Data - Random Forest (RFM)

vctRndForRfm <- predict(mdlRndForRfm, newdata=dfrTstData)
cmxRndForRfm  <- confusionMatrix(vctRndForRfm, dfrTstData$IsSpam)
cmxRndForRfm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      534   21
##   Spam           28  337
##                                           
##                Accuracy : 0.9467          
##                  95% CI : (0.9302, 0.9603)
##     No Information Rate : 0.6109          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8884          
##  Mcnemar's Test P-Value : 0.3914          
##                                           
##             Sensitivity : 0.9502          
##             Specificity : 0.9413          
##          Pos Pred Value : 0.9622          
##          Neg Pred Value : 0.9233          
##              Prevalence : 0.6109          
##          Detection Rate : 0.5804          
##    Detection Prevalence : 0.6033          
##       Balanced Accuracy : 0.9458          
##                                           
##        'Positive' Class : Not Spam        
##

Observations
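1. The confusion matrix shows only 21 spam e-mails predicted as not spam and 28 non-spam e-mails predicted as spam (49 errors out of 920), which is a low error rate.
2. P-Value [Acc > NIR] is below 2e-16, far under 0.05, so the accuracy is significantly better than the no-information rate of 61.09%. McNemar's test p-value of 0.39 is above 0.05, so there is no evidence that one type of error is more frequent than the other.
3. Accuracy on the test data is also very high at around 94.67% (balanced accuracy 94.58%), close to the 95.22% cross-validated accuracy on the train data. Hence the model is good.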

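Create Model - Random Forest (GBM)

GBM is the gradient boosting method (method="gbm" in caret, reported in the output as Stochastic Gradient Boosting). Like random forest it creates a group of decision trees, but the trees are built one after another, each one correcting the errors of the previous ones, instead of independently on bootstrap samples.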

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# gradient boosting via caret train(), method="gbm"
myControl <- trainControl(method="repeatedcv", number=10, repeats=3)
myMetric <- "Accuracy"
# myMtry, myNtrees and myTuneGrid are random forest settings;
# they are not passed to train() here, so caret uses its default gbm grid
myMtry <- sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForGbm <- train(IsSpam~.-status, data=dfrTrnData, method="gbm",
                 verbose=F, metric=myMetric, trControl=myControl)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 105.14"

Observations
1. It took 105.14 seconds to create the model by the GBM method, which is much higher than the default method.

View Model - Random Forest (GBM)

mdlRndForGbm

## Stochastic Gradient Boosting 
## 
## 3681 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9125264  0.8130164
##   1                  100      0.9307222  0.8532164
##   1                  150      0.9367051  0.8663083
##   2                   50      0.9312693  0.8546294
##   2                  100      0.9405110  0.8744321
##   2                  150      0.9432274  0.8803238
##   3                   50      0.9363428  0.8654668
##   3                  100      0.9449460  0.8839671
##   3                  150      0.9471209  0.8886790
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Observations
1. A very high accuracy of 94.71% was seen for the best combination (n.trees = 150, interaction.depth = 3), but it is less than the 95.38% of the default method.
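
The selection of the final values can be followed by hand from the resampling table. The sketch below copies the table into a data frame and picks the row with the largest accuracy. It then builds a wider grid that could be supplied as tuneGrid, using the four parameter names shown in the output. The wider values are illustrative guesses only, suggested by the fact that accuracy is still rising at the edge of the default grid.

```r
# resampling results copied from the printed model
dfrGbmRes <- data.frame(
  interaction.depth = rep(1:3, each = 3),
  n.trees = rep(c(50, 100, 150), times = 3),
  Accuracy = c(0.9125264, 0.9307222, 0.9367051,
               0.9312693, 0.9405110, 0.9432274,
               0.9363428, 0.9449460, 0.9471209))
# row with the largest accuracy, the selection rule named in the output
dfrBest <- dfrGbmRes[which.max(dfrGbmRes$Accuracy), ]
print(dfrBest)
# a wider grid that could be passed as tuneGrid (values are illustrative only)
dfrGbmGrid <- expand.grid(n.trees = c(150, 300, 500),
                          interaction.depth = c(3, 5),
                          shrinkage = 0.1,
                          n.minobsinnode = 10)
print(nrow(dfrGbmGrid))
```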

View Model Summary - Random Forest (GBM)

summary(mdlRndForGbm)

##                                                   var     rel.inf
## char_freq_..3                           char_freq_..3 21.71328125
## char_freq_..4                           char_freq_..4 19.17273719
## word_freq_remove                     word_freq_remove 11.66807935
## word_freq_hp                             word_freq_hp  9.17395240
## word_freq_free                         word_freq_free  7.94587755
## capital_run_length_longest capital_run_length_longest  5.19101659
## capital_run_length_average capital_run_length_average  4.55363849
## word_freq_your                         word_freq_your  3.03793007
## word_freq_our                           word_freq_our  2.49322457
## word_freq_money                       word_freq_money  2.43499996
## word_freq_george                     word_freq_george  2.39065969
## word_freq_edu                           word_freq_edu  1.74627237
## word_freq_1999                         word_freq_1999  1.28704265
## capital_run_length_total     capital_run_length_total  1.13584861
## word_freq_meeting                   word_freq_meeting  0.61307150
## word_freq_you                           word_freq_you  0.59182671
## word_freq_000                           word_freq_000  0.56908488
## word_freq_receive                   word_freq_receive  0.51898504
## word_freq_re                             word_freq_re  0.44280485
## word_freq_650                           word_freq_650  0.40545253
## char_freq_.                               char_freq_.  0.40499187
## word_freq_business                 word_freq_business  0.35117130
## word_freq_hpl                           word_freq_hpl  0.33315065
## word_freq_will                         word_freq_will  0.31637881
## word_freq_internet                 word_freq_internet  0.29897659
## word_freq_email                       word_freq_email  0.23535307
## word_freq_technology             word_freq_technology  0.18272811
## char_freq_..1                           char_freq_..1  0.15921637
## word_freq_over                         word_freq_over  0.13902156
## word_freq_font                         word_freq_font  0.12361794
## word_freq_mail                         word_freq_mail  0.05960259
## word_freq_report                     word_freq_report  0.05752083
## word_freq_pm                             word_freq_pm  0.05072411
## word_freq_project                   word_freq_project  0.03704862
## word_freq_credit                     word_freq_credit  0.03502091
## word_freq_order                       word_freq_order  0.03055389
## word_freq_conference             word_freq_conference  0.02965122
## word_freq_3d                             word_freq_3d  0.02071226
## word_freq_original                 word_freq_original  0.01781143
## word_freq_address                   word_freq_address  0.01701609
## word_freq_make                         word_freq_make  0.01394553
## word_freq_all                           word_freq_all  0.00000000
## word_freq_people                     word_freq_people  0.00000000
## word_freq_addresses               word_freq_addresses  0.00000000
## word_freq_lab                           word_freq_lab  0.00000000
## word_freq_labs                         word_freq_labs  0.00000000
## word_freq_telnet                     word_freq_telnet  0.00000000
## word_freq_857                           word_freq_857  0.00000000
## word_freq_data                         word_freq_data  0.00000000
## word_freq_415                           word_freq_415  0.00000000
## word_freq_85                             word_freq_85  0.00000000
## word_freq_parts                       word_freq_parts  0.00000000
## word_freq_direct                     word_freq_direct  0.00000000
## word_freq_cs                             word_freq_cs  0.00000000
## word_freq_table                       word_freq_table  0.00000000
## char_freq_..2                           char_freq_..2  0.00000000
## char_freq_..5                           char_freq_..5  0.00000000
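
The rel.inf column is a relative influence that sums to 100 over all variables, so a running total shows how much of it is concentrated in the first few rows. A minimal sketch, with the top five rows copied by hand from the table above; cum.inf is an illustrative column name.

```r
# top five rows of the relative influence table, copied from the output
dfrRelInf <- data.frame(
  var = c("char_freq_..3", "char_freq_..4", "word_freq_remove",
          "word_freq_hp", "word_freq_free"),
  rel.inf = c(21.71328125, 19.17273719, 11.66807935, 9.17395240, 7.94587755))
# running total of relative influence
dfrRelInf$cum.inf <- cumsum(dfrRelInf$rel.inf)
print(dfrRelInf)
```

The last value of cum.inf gives the share of the total influence carried by these five variables.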

Prediction - Test Data - Random Forest (GBM)

vctRndForGbm <- predict(mdlRndForGbm, newdata=dfrTstData)
cmxRndForGbm  <- confusionMatrix(vctRndForGbm, dfrTstData$IsSpam)
cmxRndForGbm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      532   19
##   Spam           30  339
##                                           
##                Accuracy : 0.9467          
##                  95% CI : (0.9302, 0.9603)
##     No Information Rate : 0.6109          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8886          
##  Mcnemar's Test P-Value : 0.1531          
##                                           
##             Sensitivity : 0.9466          
##             Specificity : 0.9469          
##          Pos Pred Value : 0.9655          
##          Neg Pred Value : 0.9187          
##              Prevalence : 0.6109          
##          Detection Rate : 0.5783          
##    Detection Prevalence : 0.5989          
##       Balanced Accuracy : 0.9468          
##                                           
##        'Positive' Class : Not Spam        
##

Observations
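1. The confusion matrix shows only 19 spam e-mails predicted as not spam and 30 non-spam e-mails predicted as spam (49 errors out of 920), which is a low error rate.
2. P-Value [Acc > NIR] is below 2e-16, far under 0.05, so the accuracy is significantly better than the no-information rate of 61.09%. McNemar's test p-value of 0.15 is above 0.05, so there is no evidence that one type of error is more frequent than the other.
3. Accuracy on the test data is also very high at around 94.67% (balanced accuracy 94.68%), close to the 94.71% cross-validated accuracy on the train data. Hence the model is good.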
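
The P-Value [Acc > NIR] line asks whether the accuracy is better than always predicting the larger class. A sketch of an equivalent check with binom.test() from the stats package is given below. It assumes the reported value corresponds to a one-sided exact binomial test against the no-information rate, and the counts are copied from the GBM confusion matrix.

```r
# counts copied from the GBM test confusion matrix
numCorrect <- 532 + 339                 # correctly classified test e-mails
numTotal <- 532 + 19 + 30 + 339         # all test e-mails
numNir <- (532 + 30) / numTotal         # share of the larger class (Not Spam)
# one-sided test: is the accuracy greater than the no-information rate?
objBin <- binom.test(numCorrect, numTotal, p = numNir, alternative = "greater")
print(round(numNir, 4))
print(objBin$p.value < 2e-16)
```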

Create Model - Random Forest (OOB)

## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
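# random forest via caret train() with out-of-bag (OOB) estimation; method is not set below, so train() uses its default "rf"
myControl <- trainControl(method="oob", number=10, repeats=3)  # number and repeats are not used by "oob"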
myMetric <- "Accuracy"
myMtry <- 10 #sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForOob <- train(IsSpam~.-status, data=dfrTrnData, 
                    verbose=F, metric=myMetric, trControl=myControl, 
                    tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 20.48"

Observations
1. It took 20.48 seconds to create the model by the Random Forest (OOB) method, which is about twice the time of the default method.

View Model - Random Forest (OOB)

mdlRndForOob

## Random Forest 
## 
## 3681 samples
##   58 predictor
##    2 classes: 'Not Spam', 'Spam' 
## 
## No pre-processing
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9538169  0.9030031
## 
## Tuning parameter 'mtry' was held constant at a value of 10

Observations
1. A very high OOB accuracy of 95.38% was seen, which is essentially the same as the default method (95.38%).
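
The OOB estimate is available without a separate resampling step because each tree is grown on a bootstrap sample, which leaves some rows out. The sketch below draws one bootstrap sample in base R and measures the share of rows left out. The row count 3681 is taken from the output, the seed matches the one used above, and the names vctInBag and numOobFrac are illustrative.

```r
set.seed(707)
numRows <- 3681                        # training rows, as reported in the output
# one bootstrap sample: draw row indices with replacement
vctInBag <- sample(numRows, numRows, replace = TRUE)
# rows never drawn are out-of-bag for this tree
numOobFrac <- 1 - length(unique(vctInBag)) / numRows
print(round(numOobFrac, 3))
# theoretical limit of the out-of-bag share for large samples
print(round(exp(-1), 3))
```

Each row is therefore out-of-bag for roughly a third of the trees, and those trees supply its OOB prediction.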

View Model Summary - Random Forest (OOB)

summary(mdlRndForOob)

##                 Length Class      Mode     
## call               6   -none-     call     
## type               1   -none-     character
## predicted       3681   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           7362   matrix     numeric  
## oob.times       3681   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3681   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              2   -none-     list

Prediction - Test Data - Random Forest (OOB)

vctRndForOob <- predict(mdlRndForOob, newdata=dfrTstData)
cmxRndForOob <- confusionMatrix(vctRndForOob, dfrTstData$IsSpam)
cmxRndForOob

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam Spam
##   Not Spam      534   21
##   Spam           28  337
##                                           
##                Accuracy : 0.9467          
##                  95% CI : (0.9302, 0.9603)
##     No Information Rate : 0.6109          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8884          
##  Mcnemar's Test P-Value : 0.3914          
##                                           
##             Sensitivity : 0.9502          
##             Specificity : 0.9413          
##          Pos Pred Value : 0.9622          
##          Neg Pred Value : 0.9233          
##              Prevalence : 0.6109          
##          Detection Rate : 0.5804          
##    Detection Prevalence : 0.6033          
##       Balanced Accuracy : 0.9458          
##                                           
##        'Positive' Class : Not Spam        
##

Observations
1. The confusion matrix shows only 21 spam e-mails predicted as not spam and 28 non-spam e-mails predicted as spam (49 errors out of 920), which is a low error rate.
2. P-Value [Acc > NIR] is below 2e-16, far under 0.05, so the accuracy is significantly better than the no-information rate of 61.09%. McNemar's test p-value of 0.39 is above 0.05, so there is no evidence that one type of error is more frequent than the other.
3. Accuracy on the test data is also very high at around 94.67% (balanced accuracy 94.58%), close to the 95.38% OOB accuracy on the train data. Hence the model is good.
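
The four models can be put side by side using the test accuracy and the model creation time printed in each section. This is a sketch only; the values are copied by hand from the outputs above, and dfrCompare is an illustrative name.

```r
# test accuracy and printed creation time for each model, copied from the outputs
dfrCompare <- data.frame(
  Model = c("Default", "RFM", "GBM", "OOB"),
  TestAccuracy = c(0.9478, 0.9467, 0.9467, 0.9467),
  Seconds = c(10.06, 94.19, 105.14, 20.48))
# sort by accuracy (descending), then by time (ascending)
dfrCompare <- dfrCompare[order(-dfrCompare$TestAccuracy, dfrCompare$Seconds), ]
print(dfrCompare)
```

The test accuracies differ by about one e-mail in 920, so the creation time is the main practical difference between the models in this comparison.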

End of Project

Assignment on Random Forest

Shubhendu Awasthi

September 12, 2017
