RandomForest

Title: SPAM E-mail Database
Sources:

Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835
Generated: June-July 1999

Past Usage:

Hewlett-Packard Internal-only Technical Report. External forthcoming.
Determine whether a given email is spam or not.
~7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.
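
The zero-false-positive trade-off can be illustrated with a small sketch. The scores and labels below are invented, and the thresholding rule is only an assumption about how such a filter might be tuned, not the method used in the Hewlett-Packard report: the cut-off is placed just above the highest-scoring good mail, and whatever spam falls below it passes through.

```r
# Toy classifier scores (higher = more spam-like); values are invented.
scores <- c(0.05, 0.20, 0.35, 0.60, 0.40, 0.70, 0.85, 0.95)
labels <- c(0,    0,    0,    0,    1,    1,    1,    1)  # 1 = spam, 0 = non-spam

# Flag a message only if it scores above every non-spam message.
threshold <- max(scores[labels == 0])
flagged   <- scores > threshold

false_positives <- sum(flagged & labels == 0)   # good mail marked as spam
spam_missed     <- mean(!flagged[labels == 1])  # share of spam passing the filter

false_positives
spam_missed
```

In this toy case one of the four spam messages scores below the best-scoring good mail, so a quarter of the spam gets through.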

Relevant Information: The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
```
For background on spam:
Cranor, Lorrie F., LaMacchia, Brian A.  Spam! 
Communications of the ACM, 41(8):74-83, 1998.
```
Number of Instances: 4601 (1813 Spam = 39.4%)
Number of Attributes: 58 (57 continuous, 1 nominal class label)
Attribute Information: The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
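
The attribute definitions above can be sketched on a single toy message. The message text, the helper names `word_freq` and `char_freq`, and the choice of case-insensitive word matching are assumptions made for illustration; the original extraction code is not part of this file.

```r
# Toy message, invented for illustration.
msg <- "FREE money!!! Call NOW for free credit"

# A "word" is a maximal run of alphanumeric characters.
words <- regmatches(msg, gregexpr("[[:alnum:]]+", msg))[[1]]

# Percentage of words equal to w (case-insensitive matching is an assumption).
word_freq <- function(w) 100 * sum(tolower(words) == w) / length(words)

# Percentage of all characters equal to ch.
chars <- strsplit(msg, "")[[1]]
char_freq <- function(ch) 100 * sum(chars == ch) / length(chars)

# Lengths of uninterrupted runs of capital letters.
runs <- nchar(regmatches(msg, gregexpr("[[:upper:]]+", msg))[[1]])

word_freq("free")   # word_freq_WORD
char_freq("!")      # char_freq_CHAR
mean(runs)          # capital_run_length_average
max(runs)           # capital_run_length_longest
sum(runs)           # capital_run_length_total
```

Here the capital runs are "FREE", "C" and "NOW", so the total run length equals the number of capital letters in the message, as the definition states.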

Missing Attribute Values: None
Class Distribution: Spam 1813 (39.4%) Non-Spam 2788 (60.6%)

Load Libs

library(plyr)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
#install.packages("caret")
library(caret)

## Loading required package: lattice

#install.packages("randomForest")
library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

#install.packages("gbm")
library(gbm)

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: splines

## Loading required package: parallel

## Loaded gbm 2.1.3

#install.packages("corrgram")
library(corrgram)

## 
## Attaching package: 'corrgram'

## The following object is masked from 'package:plyr':
## 
##     baseball

Functions

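# Count the missing values in a vector (one data frame column).
detectNA <- function(inp) {
  sum(is.na(inp))
}
# Spearman correlation between column x of the global dfrDataset and status.
detectCor <- function(x) {
  cor(as.numeric(dfrDataset[, x]), 
    as.numeric(dfrDataset$status), 
    method="spearman")
}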
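
detectCor reads dfrDataset from the global environment, so it only works once that object exists. A possible variant, sketched here under the hypothetical name `detectCorIn`, takes the data frame and target column as arguments; the toy data is invented.

```r
# Hypothetical variant of detectCor: data frame and target are explicit arguments.
detectCorIn <- function(dfr, x, target = "status") {
  cor(as.numeric(dfr[[x]]), as.numeric(dfr[[target]]), method = "spearman")
}

# Toy data, invented for illustration.
dfrToy <- data.frame(
  word_freq_free = c(0.0, 0.1, 0.9, 1.5),
  word_freq_hp   = c(2.0, 1.0, 0.5, 0.0),
  status         = c(0, 0, 1, 1)
)

vcnToyCors <- sapply(c("word_freq_free", "word_freq_hp"),
                     function(x) detectCorIn(dfrToy, x))
vcnToyCors
```

Passing the data explicitly makes the helper reusable on the training and test subsets without reassigning a global.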

Load Dataset

dfrDataset <- read.csv("C:/firstproject/spambase.csv", header=TRUE, stringsAsFactors=TRUE)
head(dfrDataset)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 2           0.21              0.28          0.50            0
## 3           0.06              0.00          0.71            0
## 4           0.00              0.00          0.00            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32           0.00             0.00               0.00
## 2          0.14           0.28             0.21               0.07
## 3          1.23           0.19             0.19               0.12
## 4          0.63           0.00             0.31               0.63
## 5          0.63           0.00             0.31               0.63
## 6          1.85           0.00             0.00               1.85
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 2            0.00           0.94              0.21           0.79
## 3            0.64           0.25              0.38           0.45
## 4            0.31           0.63              0.31           0.31
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00             0.00                0.00           0.32
## 2             0.65             0.21                0.14           0.14
## 3             0.12             0.00                1.75           0.06
## 4             0.31             0.00                0.00           0.31
## 5             0.31             0.00                0.00           0.31
## 6             0.00             0.00                0.00           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1               0.00            1.29          1.93             0.00
## 2               0.07            0.28          3.47             0.00
## 3               0.06            1.03          1.36             0.32
## 4               0.00            0.00          3.18             0.00
## 5               0.00            0.00          3.18             0.00
## 6               0.00            0.00          0.00             0.00
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0          0.00            0.00            0
## 2           1.59              0          0.43            0.43            0
## 3           0.51              0          1.16            0.06            0
## 4           0.31              0          0.00            0.00            0
## 5           0.31              0          0.00            0.00            0
## 6           0.00              0          0.00            0.00            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 2             0                0             0             0
## 3             0                0             0             0
## 4             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0              0
## 2              0                0             0              0
## 3              0                0             0              0
## 4              0                0             0              0
## 5              0                0             0              0
## 6              0                0             0              0
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0           0.00
## 2             0            0                    0           0.07
## 3             0            0                    0           0.00
## 4             0            0                    0           0.00
## 5             0            0                    0           0.00
## 6             0            0                    0           0.00
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0             0.00            0
## 2               0            0             0.00            0
## 3               0            0             0.06            0
## 4               0            0             0.00            0
## 5               0            0             0.00            0
## 6               0            0             0.00            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0               0.00                 0         0.00
## 2                 0               0.00                 0         0.00
## 3                 0               0.12                 0         0.06
## 4                 0               0.00                 0         0.00
## 5                 0               0.00                 0         0.00
## 6                 0               0.00                 0         0.00
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1          0.00               0                    0        0.00
## 2          0.00               0                    0        0.00
## 3          0.06               0                    0        0.01
## 4          0.00               0                    0        0.00
## 5          0.00               0                    0        0.00
## 6          0.00               0                    0        0.00
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 2         0.132             0         0.372         0.180         0.048
## 3         0.143             0         0.276         0.184         0.010
## 4         0.137             0         0.137         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 2                      5.114                        101
## 3                      9.821                        485
## 4                      3.537                         40
## 5                      3.537                         40
## 6                      3.000                         15
##   capital_run_length_total status
## 1                      278      1
## 2                     1028      1
## 3                     2259      1
## 4                      191      1
## 5                      191      1
## 6                       54      1

Dataframe Structure

str(dfrDataset)

## 'data.frame':    4601 obs. of  58 variables:
##  $ word_freq_make            : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ word_freq_address         : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ word_freq_all             : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ word_freq_3d              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_our             : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ word_freq_over            : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ word_freq_remove          : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ word_freq_internet        : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ word_freq_order           : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ word_freq_mail            : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ word_freq_receive         : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ word_freq_will            : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ word_freq_people          : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ word_freq_report          : num  0 0.21 0 0 0 0 0 0 0 0 ...
##  $ word_freq_addresses       : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ word_freq_free            : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ word_freq_business        : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_email           : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ word_freq_you             : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ word_freq_credit          : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ word_freq_your            : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ word_freq_font            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_000             : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ word_freq_money           : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ word_freq_hp              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_hpl             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_george          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_650             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_lab             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_labs            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_telnet          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_857             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_data            : num  0 0 0 0 0 0 0 0 0.15 0 ...
##  $ word_freq_415             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_85              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_technology      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_1999            : num  0 0.07 0 0 0 0 0 0 0 0 ...
##  $ word_freq_parts           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_pm              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_direct          : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_cs              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_meeting         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_original        : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ word_freq_project         : num  0 0 0 0 0 0 0 0 0 0.06 ...
##  $ word_freq_re              : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_edu             : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ word_freq_table           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ word_freq_conference      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_.               : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ char_freq_..1             : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ char_freq_..2             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ char_freq_..3             : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ char_freq_..4             : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ char_freq_..5             : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capital_run_length_average: num  3.76 5.11 9.82 3.54 3.54 ...
##  $ capital_run_length_longest: int  61 101 485 40 40 15 4 11 445 43 ...
##  $ capital_run_length_total  : int  278 1028 2259 191 191 54 112 49 1257 749 ...
##  $ status                    : int  1 1 1 1 1 1 1 1 1 1 ...

Dataframe Summary

lapply(dfrDataset, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1046  0.0000  4.5400 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.213   0.000  14.280 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2807  0.4200  5.1000 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06542  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3122  0.3800 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0959  0.0000  5.8800 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1142  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1053  0.0000 11.1100 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09007 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2394  0.1600 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05982 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1000  0.5417  0.8000  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09393 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05863  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0492  0.0000  4.4100 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2488  0.1000 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1426  0.0000  7.1400 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1847  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.310   1.662   2.640  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08558  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2200  0.8098  1.2700 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1212  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1016  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09427  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5495  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2654  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7673  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1248  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09892  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1029  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06475  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04705 0.00000 4.76000 
## 
## $word_freq_data
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09723  0.00000 18.18000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04784 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1054  0.0000 20.0000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09748 0.00000 7.69000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.137   0.000   6.890 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0132  0.0000  8.3300 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07863  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06483 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04367 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1323  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0461  0.0000  3.5700 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0792  0.0000 20.0000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3012  0.1100 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1798  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.005444 0.000000 2.170000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03187  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03857 0.00000 4.38500 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.065   0.139   0.188   9.752 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01698 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2691  0.3150 32.4800 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07581 0.05200 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04424  0.00000 19.83000 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.588    2.276    5.192    3.706 1102.000 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   52.17   43.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    95.0   283.3   266.0 15840.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.394   1.000   1.000
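
The per-column list above is long to scan. As a sketch, the same six statistics can be collected into one matrix with a row per column; `dfrToy` and `mtxSummary` are illustrative names and the data is invented.

```r
# Toy data frame standing in for dfrDataset.
dfrToy <- data.frame(a = c(0, 0, 1, 3), b = c(2, 4, 6, 8))

# sapply() stacks the six summary statistics column-wise; t() puts one variable per row.
mtxSummary <- t(sapply(dfrToy, summary))
mtxSummary
```

This relies on every column being numeric, which holds for the 58 original columns of this dataset.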

Missing Data

lapply(dfrDataset, FUN=detectNA)

## $word_freq_make
## [1] 0
## 
## $word_freq_address
## [1] 0
## 
## $word_freq_all
## [1] 0
## 
## $word_freq_3d
## [1] 0
## 
## $word_freq_our
## [1] 0
## 
## $word_freq_over
## [1] 0
## 
## $word_freq_remove
## [1] 0
## 
## $word_freq_internet
## [1] 0
## 
## $word_freq_order
## [1] 0
## 
## $word_freq_mail
## [1] 0
## 
## $word_freq_receive
## [1] 0
## 
## $word_freq_will
## [1] 0
## 
## $word_freq_people
## [1] 0
## 
## $word_freq_report
## [1] 0
## 
## $word_freq_addresses
## [1] 0
## 
## $word_freq_free
## [1] 0
## 
## $word_freq_business
## [1] 0
## 
## $word_freq_email
## [1] 0
## 
## $word_freq_you
## [1] 0
## 
## $word_freq_credit
## [1] 0
## 
## $word_freq_your
## [1] 0
## 
## $word_freq_font
## [1] 0
## 
## $word_freq_000
## [1] 0
## 
## $word_freq_money
## [1] 0
## 
## $word_freq_hp
## [1] 0
## 
## $word_freq_hpl
## [1] 0
## 
## $word_freq_george
## [1] 0
## 
## $word_freq_650
## [1] 0
## 
## $word_freq_lab
## [1] 0
## 
## $word_freq_labs
## [1] 0
## 
## $word_freq_telnet
## [1] 0
## 
## $word_freq_857
## [1] 0
## 
## $word_freq_data
## [1] 0
## 
## $word_freq_415
## [1] 0
## 
## $word_freq_85
## [1] 0
## 
## $word_freq_technology
## [1] 0
## 
## $word_freq_1999
## [1] 0
## 
## $word_freq_parts
## [1] 0
## 
## $word_freq_pm
## [1] 0
## 
## $word_freq_direct
## [1] 0
## 
## $word_freq_cs
## [1] 0
## 
## $word_freq_meeting
## [1] 0
## 
## $word_freq_original
## [1] 0
## 
## $word_freq_project
## [1] 0
## 
## $word_freq_re
## [1] 0
## 
## $word_freq_edu
## [1] 0
## 
## $word_freq_table
## [1] 0
## 
## $word_freq_conference
## [1] 0
## 
## $char_freq_.
## [1] 0
## 
## $char_freq_..1
## [1] 0
## 
## $char_freq_..2
## [1] 0
## 
## $char_freq_..3
## [1] 0
## 
## $char_freq_..4
## [1] 0
## 
## $char_freq_..5
## [1] 0
## 
## $capital_run_length_average
## [1] 0
## 
## $capital_run_length_longest
## [1] 0
## 
## $capital_run_length_total
## [1] 0
## 
## $status
## [1] 0
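
A more compact check, sketched on invented toy data, counts missing values for all columns at once and prints only the columns that have any. With no missing values, as in this dataset, the filtered vector would be empty.

```r
# Toy data frame with a few missing values, invented for illustration.
dfrToy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# is.na() on a data frame returns a logical matrix; colSums() counts TRUE per column.
vcnMissing <- colSums(is.na(dfrToy))

# Keep only the columns that actually contain missing values.
vcnMissing[vcnMissing > 0]
any(vcnMissing > 0)
```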

Check output

dfrstatus <- summarise(group_by(dfrDataset, status), count=n())
# bar chart of message counts per status (0 = non-spam, 1 = spam)
ggplot(dfrstatus, aes(x=status, y=count)) +
    geom_bar(stat="identity", aes(fill=count)) +
    labs(title="status Frequency Distribution") +
    labs(x="status") +
    labs(y="Counts")
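
The same class balance can be checked numerically. This sketch rebuilds a status vector from the class counts quoted in the dataset description (2788 non-spam, 1813 spam) rather than reading the file.

```r
# Status vector rebuilt from the documented class counts.
status <- rep(c(0, 1), times = c(2788, 1813))

tblStatus <- table(status)
vcnPct <- round(100 * prop.table(tblStatus), 1)

tblStatus
vcnPct   # percentages per class: 60.6 (non-spam) and 39.4 (spam)
```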

Find Correlations

## find correlations
vcnCorsData <- abs(sapply(colnames(dfrDataset), detectCor))
summary(vcnCorsData)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002525 0.148800 0.253300 0.276900 0.354700 1.000000

Show Correlations

vcnCorsData

##             word_freq_make          word_freq_address 
##                 0.24069974                 0.29750940 
##              word_freq_all               word_freq_3d 
##                 0.33283147                 0.09077776 
##              word_freq_our             word_freq_over 
##                 0.40913946                 0.31864550 
##           word_freq_remove         word_freq_internet 
##                 0.51877779                 0.34379623 
##            word_freq_order             word_freq_mail 
##                 0.30073703                 0.29682394 
##          word_freq_receive             word_freq_will 
##                 0.35496682                 0.14847653 
##           word_freq_people           word_freq_report 
##                 0.21287588                 0.14977533 
##        word_freq_addresses             word_freq_free 
##                 0.26515743                 0.50416922 
##         word_freq_business            word_freq_email 
##                 0.35290749                 0.29909391 
##              word_freq_you           word_freq_credit 
##                 0.36110406                 0.32418657 
##             word_freq_your             word_freq_font 
##                 0.50159062                 0.13797471 
##              word_freq_000            word_freq_money 
##                 0.42580256                 0.47215455 
##               word_freq_hp              word_freq_hpl 
##                 0.39981558                 0.34188069 
##           word_freq_george              word_freq_650 
##                 0.35393063                 0.22619064 
##              word_freq_lab             word_freq_labs 
##                 0.22068802                 0.24580530 
##           word_freq_telnet              word_freq_857 
##                 0.20467400                 0.16983798 
##             word_freq_data              word_freq_415 
##                 0.15756347                 0.15802818 
##               word_freq_85       word_freq_technology 
##                 0.21413087                 0.16680254 
##             word_freq_1999            word_freq_parts 
##                 0.26070752                 0.00252536 
##               word_freq_pm           word_freq_direct 
##                 0.14721389                 0.02813193 
##               word_freq_cs          word_freq_meeting 
##                 0.14453750                 0.19574176 
##         word_freq_original          word_freq_project 
##                 0.10781412                 0.14453744 
##               word_freq_re              word_freq_edu 
##                 0.07176763                 0.19702549 
##            word_freq_table       word_freq_conference 
##                 0.02266674                 0.13903044 
##                char_freq_.              char_freq_..1 
##                 0.05683530                 0.03263555 
##              char_freq_..2              char_freq_..3 
##                 0.11122690                 0.59785363 
##              char_freq_..4              char_freq_..5 
##                 0.56563314                 0.26668614 
## capital_run_length_average capital_run_length_longest 
##                 0.48794983                 0.51515693 
##   capital_run_length_total                     status 
##                 0.44397367                 1.00000000
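
To read the strongest predictors off this vector, it helps to drop status (which correlates perfectly with itself) and sort. The sketch below uses a handful of values copied from the output above; `vcnCors` is an illustrative stand-in for vcnCorsData.

```r
# A few absolute Spearman correlations copied from the output above.
vcnCors <- c(
  "word_freq_remove"           = 0.51877779,
  "word_freq_parts"            = 0.00252536,
  "char_freq_..3"              = 0.59785363,
  "char_freq_..4"              = 0.56563314,
  "capital_run_length_longest" = 0.51515693,
  "status"                     = 1.00000000
)

# Remove the target itself, then rank predictors from strongest to weakest.
vcnPredictors <- vcnCors[names(vcnCors) != "status"]
vcnRanked <- sort(vcnPredictors, decreasing = TRUE)
head(vcnRanked, 3)
```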

Plot Correlations

corrgram(dfrDataset)

High Correlations

vcnCorsData[vcnCorsData>0.8]

## status 
##      1
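
Only status exceeds 0.8 here because vcnCorsData holds each column's correlation with the target. Strong correlation between pairs of predictors is a separate question. One way to look for it, sketched in base R on invented toy data, is to scan the upper triangle of the full correlation matrix.

```r
# Toy predictors: x2 is almost a copy of x1, x3 is unrelated. Values are invented.
dfrToy <- data.frame(
  x1 = 1:8,
  x2 = c(1, 2, 3, 4, 5, 6, 8, 7),
  x3 = c(4, 7, 1, 6, 2, 8, 3, 5)
)

mtxCor <- abs(cor(dfrToy, method = "spearman"))

# upper.tri() skips the diagonal and counts each pair once.
idx <- which(mtxCor > 0.8 & upper.tri(mtxCor), arr.ind = TRUE)
dfrPairs <- data.frame(
  var1 = rownames(mtxCor)[idx[, "row"]],
  var2 = colnames(mtxCor)[idx[, "col"]],
  rho  = mtxCor[idx],
  stringsAsFactors = FALSE
)
dfrPairs
```

Random forests tolerate correlated predictors reasonably well, so this is a diagnostic rather than a required step.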

Create Column Result

dfrDataset <- mutate(dfrDataset, Result= ifelse(dfrDataset$status < 1,'Not Spam','spam'))
dfrDataset$Result <- as.factor(dfrDataset$Result)
table(dfrDataset$Result)

## 
## Not Spam     spam 
##     2788     1813
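
An alternative way to build the same labels, sketched on an invented status vector, is to call factor() with explicit levels and labels. This fixes the level order regardless of which class appears first in the data.

```r
# Toy status vector, invented for illustration.
status <- c(1, 1, 0, 0, 0)

# Map 0 -> "Not Spam" and 1 -> "spam" in one step.
Result <- factor(status, levels = c(0, 1), labels = c("Not Spam", "spam"))
table(Result)
```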

Dataset Split

set.seed(707)
vctTrnRecs <- createDataPartition(y=dfrDataset$status, p=0.7, list=FALSE)
dfrTrnData <- dfrDataset[vctTrnRecs, ]
dfrTstData <- dfrDataset[-vctTrnRecs, ]
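
The idea behind a stratified 70/30 split can be sketched in base R: sample 70% of the row indices within each class so both subsets keep the class ratio. This is an illustration on invented toy data, not a reimplementation of createDataPartition. The split above passes the numeric status column, which caret groups by percentiles rather than by class label.

```r
set.seed(707)

# Toy data: 60 non-spam and 40 spam rows, invented for illustration.
dfrToy <- data.frame(
  x      = seq_len(100),
  Result = factor(rep(c("Not Spam", "spam"), times = c(60, 40)))
)

# Row indices grouped by class, then 70% sampled from each group.
lstIdx <- split(seq_len(nrow(dfrToy)), dfrToy$Result)
vctTrn <- unlist(lapply(lstIdx, function(i) sample(i, size = round(0.7 * length(i)))))

dfrTrn <- dfrToy[vctTrn, ]
dfrTst <- dfrToy[-vctTrn, ]

table(dfrTrn$Result)
table(dfrTst$Result)
```

Both subsets keep the 60/40 class ratio of the toy data, which is the property a stratified split is meant to preserve.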

Training Dataset RowCount & ColCount

dim(dfrTrnData)

## [1] 3221   59

Training Dataset Head

head(dfrTrnData)

##    word_freq_make word_freq_address word_freq_all word_freq_3d
## 2            0.21              0.28          0.50            0
## 3            0.06              0.00          0.71            0
## 4            0.00              0.00          0.00            0
## 10           0.06              0.12          0.77            0
## 11           0.00              0.00          0.00            0
## 12           0.00              0.00          0.25            0
##    word_freq_our word_freq_over word_freq_remove word_freq_internet
## 2           0.14           0.28             0.21               0.07
## 3           1.23           0.19             0.19               0.12
## 4           0.63           0.00             0.31               0.63
## 10          0.19           0.32             0.38               0.00
## 11          0.00           0.00             0.96               0.00
## 12          0.38           0.25             0.25               0.00
##    word_freq_order word_freq_mail word_freq_receive word_freq_will
## 2             0.00           0.94              0.21           0.79
## 3             0.64           0.25              0.38           0.45
## 4             0.31           0.63              0.31           0.31
## 10            0.06           0.00              0.00           0.64
## 11            0.00           1.92              0.96           0.00
## 12            0.00           0.00              0.12           0.12
##    word_freq_people word_freq_report word_freq_addresses word_freq_free
## 2              0.65             0.21                0.14           0.14
## 3              0.12             0.00                1.75           0.06
## 4              0.31             0.00                0.00           0.31
## 10             0.25             0.00                0.12           0.00
## 11             0.00             0.00                0.00           0.00
## 12             0.12             0.00                0.00           0.00
##    word_freq_business word_freq_email word_freq_you word_freq_credit
## 2                0.07            0.28          3.47             0.00
## 3                0.06            1.03          1.36             0.32
## 4                0.00            0.00          3.18             0.00
## 10               0.00            0.12          1.67             0.06
## 11               0.00            0.96          3.84             0.00
## 12               0.00            0.00          1.16             0.00
##    word_freq_your word_freq_font word_freq_000 word_freq_money
## 2            1.59              0          0.43            0.43
## 3            0.51              0          1.16            0.06
## 4            0.31              0          0.00            0.00
## 10           0.71              0          0.19            0.00
## 11           0.96              0          0.00            0.00
## 12           0.77              0          0.00            0.00
##    word_freq_hp word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 2             0             0                0             0             0
## 3             0             0                0             0             0
## 4             0             0                0             0             0
## 10            0             0                0             0             0
## 11            0             0                0             0             0
## 12            0             0                0             0             0
##    word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 2               0                0             0              0
## 3               0                0             0              0
## 4               0                0             0              0
## 10              0                0             0              0
## 11              0                0             0              0
## 12              0                0             0              0
##    word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 2              0            0                    0           0.07
## 3              0            0                    0           0.00
## 4              0            0                    0           0.00
## 10             0            0                    0           0.00
## 11             0            0                    0           0.00
## 12             0            0                    0           0.00
##    word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 2                0            0             0.00            0
## 3                0            0             0.06            0
## 4                0            0             0.00            0
## 10               0            0             0.00            0
## 11               0            0             0.96            0
## 12               0            0             0.00            0
##    word_freq_meeting word_freq_original word_freq_project word_freq_re
## 2                  0               0.00              0.00         0.00
## 3                  0               0.12              0.00         0.06
## 4                  0               0.00              0.00         0.00
## 10                 0               0.00              0.06         0.00
## 11                 0               0.00              0.00         0.00
## 12                 0               0.00              0.00         0.00
##    word_freq_edu word_freq_table word_freq_conference char_freq_.
## 2           0.00               0                    0       0.000
## 3           0.06               0                    0       0.010
## 4           0.00               0                    0       0.000
## 10          0.00               0                    0       0.040
## 11          0.00               0                    0       0.000
## 12          0.00               0                    0       0.022
##    char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 2          0.132             0         0.372         0.180         0.048
## 3          0.143             0         0.276         0.184         0.010
## 4          0.137             0         0.137         0.000         0.000
## 10         0.030             0         0.244         0.081         0.000
## 11         0.000             0         0.462         0.000         0.000
## 12         0.044             0         0.663         0.000         0.000
##    capital_run_length_average capital_run_length_longest
## 2                       5.114                        101
## 3                       9.821                        485
## 4                       3.537                         40
## 10                      1.729                         43
## 11                      1.312                          6
## 12                      1.243                         11
##    capital_run_length_total status Result
## 2                      1028      1   spam
## 3                      2259      1   spam
## 4                       191      1   spam
## 10                      749      1   spam
## 11                       21      1   spam
## 12                      184      1   spam

Testing Dataset Head

head(dfrTstData)

##   word_freq_make word_freq_address word_freq_all word_freq_3d
## 1           0.00              0.64          0.64            0
## 5           0.00              0.00          0.00            0
## 6           0.00              0.00          0.00            0
## 7           0.00              0.00          0.00            0
## 8           0.00              0.00          0.00            0
## 9           0.15              0.00          0.46            0
##   word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1          0.32              0             0.00               0.00
## 5          0.63              0             0.31               0.63
## 6          1.85              0             0.00               1.85
## 7          1.92              0             0.00               0.00
## 8          1.88              0             0.00               1.88
## 9          0.61              0             0.30               0.00
##   word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1            0.00           0.00              0.00           0.64
## 5            0.31           0.63              0.31           0.31
## 6            0.00           0.00              0.00           0.00
## 7            0.00           0.64              0.96           1.28
## 8            0.00           0.00              0.00           0.00
## 9            0.92           0.76              0.76           0.92
##   word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1             0.00                0                   0           0.32
## 5             0.31                0                   0           0.31
## 6             0.00                0                   0           0.00
## 7             0.00                0                   0           0.96
## 8             0.00                0                   0           0.00
## 9             0.00                0                   0           0.00
##   word_freq_business word_freq_email word_freq_you word_freq_credit
## 1                  0            1.29          1.93             0.00
## 5                  0            0.00          3.18             0.00
## 6                  0            0.00          0.00             0.00
## 7                  0            0.32          3.85             0.00
## 8                  0            0.00          0.00             0.00
## 9                  0            0.15          1.23             3.53
##   word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1           0.96              0             0            0.00            0
## 5           0.31              0             0            0.00            0
## 6           0.00              0             0            0.00            0
## 7           0.64              0             0            0.00            0
## 8           0.00              0             0            0.00            0
## 9           2.00              0             0            0.15            0
##   word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1             0                0             0             0
## 5             0                0             0             0
## 6             0                0             0             0
## 7             0                0             0             0
## 8             0                0             0             0
## 9             0                0             0             0
##   word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1              0                0             0           0.00
## 5              0                0             0           0.00
## 6              0                0             0           0.00
## 7              0                0             0           0.00
## 8              0                0             0           0.00
## 9              0                0             0           0.15
##   word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1             0            0                    0              0
## 5             0            0                    0              0
## 6             0            0                    0              0
## 7             0            0                    0              0
## 8             0            0                    0              0
## 9             0            0                    0              0
##   word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1               0            0                0            0
## 5               0            0                0            0
## 6               0            0                0            0
## 7               0            0                0            0
## 8               0            0                0            0
## 9               0            0                0            0
##   word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1                 0                0.0                 0            0
## 5                 0                0.0                 0            0
## 6                 0                0.0                 0            0
## 7                 0                0.0                 0            0
## 8                 0                0.0                 0            0
## 9                 0                0.3                 0            0
##   word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1             0               0                    0           0
## 5             0               0                    0           0
## 6             0               0                    0           0
## 7             0               0                    0           0
## 8             0               0                    0           0
## 9             0               0                    0           0
##   char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1         0.000             0         0.778         0.000         0.000
## 5         0.135             0         0.135         0.000         0.000
## 6         0.223             0         0.000         0.000         0.000
## 7         0.054             0         0.164         0.054         0.000
## 8         0.206             0         0.000         0.000         0.000
## 9         0.271             0         0.181         0.203         0.022
##   capital_run_length_average capital_run_length_longest
## 1                      3.756                         61
## 5                      3.537                         40
## 6                      3.000                         15
## 7                      1.671                          4
## 8                      2.450                         11
## 9                      9.744                        445
##   capital_run_length_total status Result
## 1                      278      1   spam
## 5                      191      1   spam
## 6                       54      1   spam
## 7                      112      1   spam
## 8                       49      1   spam
## 9                     1257      1   spam

Training Dataset Summary

lapply(dfrTrnData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1001  0.0000  4.5400 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1901  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.279   0.400   5.100 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05219  0.00000 42.81000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3027  0.3600 10.0000 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09372 0.00000 2.63000 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1146  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.112   0.000  11.110 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08978 0.00000 5.26000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2288  0.1300 18.1800 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05888 0.00000 2.61000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1000  0.5349  0.7700  7.6900 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09223 0.00000 5.55000 
## 
## $word_freq_report
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05904  0.00000 10.00000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04824 0.00000 4.41000 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2464  0.1000 16.6600 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.138   0.000   7.140 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1818  0.0000  9.0900 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.290   1.671   2.660  18.750 
## 
## $word_freq_credit
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08912  0.00000 18.18000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2000  0.7921  1.2700 11.1100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1223  0.0000 15.4300 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1051  0.0000  5.4500 
## 
## $word_freq_money
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09648  0.00000 12.50000 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5416  0.0000 20.8300 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2673  0.0000 10.8600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7712  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1322  0.0000  9.0900 
## 
## $word_freq_lab
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09695  0.00000 14.28000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1046  0.0000  5.8800 
## 
## $word_freq_telnet
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.06896  0.00000 12.50000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04833 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1035  0.0000 18.1800 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04945 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1092  0.0000 20.0000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1037  0.0000  7.6900 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1381  0.0000  5.0500 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01352 0.00000 8.33000 
## 
## $word_freq_pm
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07996 0.00000 9.75000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06304 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04513 0.00000 7.14000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1367  0.0000 14.2800 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04738 0.00000 3.57000 
## 
## $word_freq_project
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.08829  0.00000 20.00000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2993  0.1200 20.0000 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1828  0.0000 22.0500 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.004713 0.000000 1.910000 
## 
## $word_freq_conference
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03038  0.00000 10.00000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04054 0.00000 4.38500 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0660  0.1399  0.1890  5.2770 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01506 0.00000 4.08100 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2741  0.3110 32.4800 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07403 0.04800 5.30000 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.03887  0.00000 13.13000 
## 
## $capital_run_length_average
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.564    2.250    5.280    3.706 1102.000 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   14.00   49.71   43.00 2204.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      35      94     282     268   15840 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3971  1.0000  1.0000 
## 
## $Result
## Not Spam     spam 
##     1942     1279

Testing Dataset Summary

lapply(dfrTstData, FUN=summary)

## $word_freq_make
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.115   0.000   4.000 
## 
## $word_freq_address
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2666  0.0000 14.2800 
## 
## $word_freq_all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2844  0.4500  4.5400 
## 
## $word_freq_3d
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09632  0.00000 40.13000 
## 
## $word_freq_our
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3346  0.4325  7.1400 
## 
## $word_freq_over
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.101   0.000   5.880 
## 
## $word_freq_remove
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1133  0.0000  7.2700 
## 
## $word_freq_internet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08964 0.00000 4.62000 
## 
## $word_freq_order
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09074 0.00000 3.33000 
## 
## $word_freq_mail
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2642  0.2200 11.1100 
## 
## $word_freq_receive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06204 0.00000 2.00000 
## 
## $word_freq_will
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1200  0.5576  0.8500  9.6700 
## 
## $word_freq_people
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09789 0.00000 5.55000 
## 
## $word_freq_report
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05767 0.00000 5.55000 
## 
## $word_freq_addresses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05145 0.00000 2.31000 
## 
## $word_freq_free
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2546  0.1125 20.0000 
## 
## $word_freq_business
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1532  0.0000  4.8700 
## 
## $word_freq_email
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1917  0.0000  4.5100 
## 
## $word_freq_you
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.330   1.641   2.590  14.280 
## 
## $word_freq_credit
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07731 0.00000 6.25000 
## 
## $word_freq_your
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2700  0.8509  1.2800 10.7100 
## 
## $word_freq_font
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1186  0.0000 17.1000 
## 
## $word_freq_000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0937  0.0000  3.3800 
## 
## $word_freq_money
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0891  0.0000  5.9800 
## 
## $word_freq_hp
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5679  0.0000 20.0000 
## 
## $word_freq_hpl
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2609  0.0000 16.6600 
## 
## $word_freq_george
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7582  0.0000 33.3300 
## 
## $word_freq_650
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1078  0.0000  4.7600 
## 
## $word_freq_lab
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1035  0.0000 10.0000 
## 
## $word_freq_labs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09869 0.00000 4.76000 
## 
## $word_freq_telnet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05493 0.00000 4.76000 
## 
## $word_freq_857
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04405 0.00000 4.76000 
## 
## $word_freq_data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08264 0.00000 8.33000 
## 
## $word_freq_415
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04407 0.00000 4.76000 
## 
## $word_freq_85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09651 0.00000 4.76000 
## 
## $word_freq_technology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.08303 0.00000 4.76000 
## 
## $word_freq_1999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1344  0.0000  6.8900 
## 
## $word_freq_parts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01246 0.00000 7.40000 
## 
## $word_freq_pm
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.07551  0.00000 11.11000 
## 
## $word_freq_direct
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06901 0.00000 4.76000 
## 
## $word_freq_cs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04025 0.00000 4.75000 
## 
## $word_freq_meeting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1221  0.0000  7.6900 
## 
## $word_freq_original
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04311 0.00000 3.44000 
## 
## $word_freq_project
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05797 0.00000 4.54000 
## 
## $word_freq_re
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3056  0.0800 21.4200 
## 
## $word_freq_edu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1729  0.0000  9.5200 
## 
## $word_freq_table
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.007152 0.000000 2.170000 
## 
## $word_freq_conference
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03534 0.00000 8.33000 
## 
## $char_freq_.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03398 0.00000 4.18700 
## 
## $char_freq_..1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0600  0.1371  0.1772  9.7520 
## 
## $char_freq_..2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02145 0.00000 2.77700 
## 
## $char_freq_..3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2573  0.3302  5.8440 
## 
## $char_freq_..4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.07996 0.06225 6.00300 
## 
## $char_freq_..5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.05676  0.00000 19.83000 
## 
## $capital_run_length_average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.630   2.318   4.984   3.706 443.700 
## 
## $capital_run_length_longest
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   57.92   45.00 9989.00 
## 
## $capital_run_length_total
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    35.0    95.5   286.3   259.2 10060.0 
## 
## $status
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.387   1.000   1.000 
## 
## $Result
## Not Spam     spam 
##      846      534
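
The two `$Result` tables can be turned into class proportions to confirm that the training and testing partitions have a similar spam share. This is a small sketch built from the counts printed above; the object names are hypothetical.

```r
# Class counts copied from the $Result summaries above
cntTrn <- c("Not Spam" = 1942, "spam" = 1279)
cntTst <- c("Not Spam" = 846,  "spam" = 534)

# Convert counts to proportions within each partition
prpTrn <- prop.table(cntTrn)
prpTst <- prop.table(cntTst)

# Side-by-side comparison of the class balance
print(round(rbind(train = prpTrn, test = prpTst), 4))
```

The spam proportions agree with the means of the numeric `status` column in the two summaries, as expected when `Result` is a relabelled copy of `status`.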

Create Model - Random Forest (Default)

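# set seed for reproducibility
set.seed(707)
# mtry: square root of the 57 predictors (status and Result excluded);
# sqrt(57) is about 7.55, and the model output reports 8 variables per split
myMtry <- sqrt(ncol(dfrTrnData) - 2)
myNtrees <- 500
# start time
vctProcStrt <- proc.time()
# random forest with explicit mtry and ntree; status is excluded because
# it is the numeric copy of the Result label
mdlRndForDef <- randomForest(Result ~ . - status, data = dfrTrnData,
                             mtry = myMtry, ntree = myNtrees)
# end time
vctProcEnds <- proc.time()
# print user CPU seconds (element 1 of proc.time)
print(paste("Model Created ...", vctProcEnds[1] - vctProcStrt[1]))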

## [1] "Model Created ... 26.7"
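
The timing above takes element 1 of `proc.time()`, which is user CPU time rather than wall-clock time. The sketch below shows how the elements are named and how elapsed time could be read instead; the workload is a stand-in, not the model fit, and the variable names are assumptions.

```r
# Record the start time
tmStrt <- proc.time()

# Stand-in workload (hypothetical; replace with the model fit)
x <- sum(sqrt(seq_len(1e6)))

# Difference of two proc.time() values keeps the element names
tmDiff <- proc.time() - tmStrt

# Element 1 is user CPU time, element 3 is elapsed (wall-clock) time
print(names(tmDiff)[1:3])
print(tmDiff[["elapsed"]])
```

User CPU time and elapsed time can differ noticeably when the machine is busy or when code runs in parallel, so it helps to state which one a report quotes.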

View Model - Default Random Forest

mdlRndForDef

## 
## Call:
##  randomForest(formula = Result ~ . - status, data = dfrTrnData,      mtry = myMtry, ntree = myNtrees) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 4.75%
## Confusion matrix:
##          Not Spam spam class.error
## Not Spam     1887   55  0.02832132
## spam           98 1181  0.07662236
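
The OOB error rate and the `class.error` column follow directly from the confusion matrix counts. As a worked check, the sketch below rebuilds the matrix by hand (rows are the actual classes, columns the OOB predictions) and recomputes both; the object names are illustrative.

```r
# OOB confusion matrix copied from the model printout
cmxOob <- matrix(c(1887,   55,
                     98, 1181),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual    = c("Not Spam", "spam"),
                                 predicted = c("Not Spam", "spam")))

# Per-class error: share of each actual class that was misclassified
clsErr <- 1 - diag(cmxOob) / rowSums(cmxOob)

# Overall OOB error: share of all rows that were misclassified
oobErr <- 1 - sum(diag(cmxOob)) / sum(cmxOob)

print(round(clsErr, 8))
print(round(100 * oobErr, 2))
```

Note that the spam class has the higher error: 98 of 1279 spam messages are labelled Not Spam, against 55 of 1942 legitimate messages labelled spam.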

View Model Summary - Default Random Forest

summary(mdlRndForDef)

##                 Length Class  Mode     
## call               5   -none- call     
## type               1   -none- character
## predicted       3221   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           6442   matrix numeric  
## oob.times       3221   -none- numeric  
## classes            2   -none- character
## importance        57   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               3221   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
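
The lengths in this summary can be read as flattened matrix sizes. The following sketch reproduces three of them from the model settings shown above (500 trees, 3221 training rows, 2 classes); the names are hypothetical.

```r
nTrees   <- 500    # ntree used for the forest
nRows    <- 3221   # training rows
nClasses <- 2      # Not Spam, spam

# err.rate: one row per tree, columns for OOB plus each class
lenErrRate <- nTrees * (nClasses + 1)

# votes: one row per training case, one column per class
lenVotes <- nRows * nClasses

# confusion: class-by-class counts plus the class.error column
lenConfusion <- nClasses * (nClasses + 1)

print(c(err.rate = lenErrRate, votes = lenVotes, confusion = lenConfusion))
```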

Prediction - Test Data - Random Forest (Default)

vctRndForDef <- predict(mdlRndForDef, newdata=dfrTstData)
cmxRndForDef  <- confusionMatrix(vctRndForDef, dfrTstData$Result)
cmxRndForDef

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam spam
##   Not Spam      808   28
##   spam           38  506
##                                           
##                Accuracy : 0.9522          
##                  95% CI : (0.9396, 0.9628)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8995          
##  Mcnemar's Test P-Value : 0.2679          
##                                           
##             Sensitivity : 0.9551          
##             Specificity : 0.9476          
##          Pos Pred Value : 0.9665          
##          Neg Pred Value : 0.9301          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5855          
##    Detection Prevalence : 0.6058          
##       Balanced Accuracy : 0.9513          
##                                           
##        'Positive' Class : Not Spam        
##
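
The statistics reported by `confusionMatrix` can be recomputed from the four cell counts, which makes it easier to see what each one measures. The sketch below treats Not Spam as the positive class, as the output does, so sensitivity here is the share of legitimate mail that is kept; the object names are illustrative.

```r
# Test-set confusion matrix copied from the output above
cmxTst <- matrix(c(808,  28,
                    38, 506),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Prediction = c("Not Spam", "spam"),
                                 Reference  = c("Not Spam", "spam")))
n <- sum(cmxTst)

# Overall accuracy
accuracy <- sum(diag(cmxTst)) / n

# Positive class is "Not Spam"
sensitivity <- cmxTst["Not Spam", "Not Spam"] / sum(cmxTst[, "Not Spam"])
specificity <- cmxTst["spam", "spam"] / sum(cmxTst[, "spam"])
posPredVal  <- cmxTst["Not Spam", "Not Spam"] / sum(cmxTst["Not Spam", ])
negPredVal  <- cmxTst["spam", "spam"] / sum(cmxTst["spam", ])

# Cohen's kappa: agreement beyond what the marginals give by chance
expAcc    <- sum(rowSums(cmxTst) * colSums(cmxTst)) / n^2
kappaStat <- (accuracy - expAcc) / (1 - expAcc)

print(round(c(accuracy = accuracy, kappa = kappaStat,
              sensitivity = sensitivity, specificity = specificity,
              ppv = posPredVal, npv = negPredVal), 4))
```

For a spam filter the costly mistake is good mail marked as spam, which is the 38 in the lower-left cell; that corresponds to 1 minus the sensitivity under this choice of positive class.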

Create Model - Random Forest (RFM)

# set seed for reproducibility
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest via caret::train with 10-fold cross-validation
# (repeats applies only to method="repeatedcv", so it is omitted here)
myControl <- trainControl(method="cv", number=10)
myMetric <- "Accuracy"
# mtry: square root of the 57 predictors, held constant in the tuning grid
myMtry <- sqrt(ncol(dfrTrnData)-2)
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForRfm <- train(Result~.-status, data=dfrTrnData, method="rf",
                      verbose=FALSE, metric=myMetric, trControl=myControl,
                      tuneGrid=myTuneGrid)
# end time
vctProcEnds <- proc.time()
# print user CPU seconds (element 1 of proc.time)
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 182.2"

View Model - Random Forest (RFM)

mdlRndForRfm

## Random Forest 
## 
## 3221 samples
##   58 predictor
##    2 classes: 'Not Spam', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2898, 2899, 2900, 2898, 2899, 2899, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9537468  0.9028854
## 
## Tuning parameter 'mtry' was held constant at a value of 7.549834
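
The tuning value reported here is fractional (7.549834) because `sqrt(57)` is passed through unchanged, whereas `mtry` is a count of variables. One way to make the grid explicit is to round before building it, as in the sketch below. This is an assumption about how one might tidy the grid, not a description of what `train` does internally with a fractional value; `myTuneGridInt` is a hypothetical name.

```r
# Number of predictors after excluding status and Result
nPred <- 57

# Square-root rule gives a fractional value
mtryRaw <- sqrt(nPred)

# Round to a whole number of variables before building the grid
myTuneGridInt <- expand.grid(.mtry = round(mtryRaw))

print(mtryRaw)
print(myTuneGridInt)
```

Rounding gives 8, which matches the "No. of variables tried at each split" reported for the first model, so the two fits would then be configured identically.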

View Model Summary - Random Forest (RFM)

summary(mdlRndForRfm)

##                 Length Class      Mode     
## call               5   -none-     call     
## type               1   -none-     character
## predicted       3221   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           6442   matrix     numeric  
## oob.times       3221   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3221   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              1   -none-     list

Prediction - Test Data - Random Forest (RFM)

vctRndForRfm <- predict(mdlRndForRfm, newdata=dfrTstData)
cmxRndForRfm  <- confusionMatrix(vctRndForRfm, dfrTstData$Result)
cmxRndForRfm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam spam
##   Not Spam      807   30
##   spam           39  504
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.9371, 0.9609)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8949          
##  Mcnemar's Test P-Value : 0.3355          
##                                           
##             Sensitivity : 0.9539          
##             Specificity : 0.9438          
##          Pos Pred Value : 0.9642          
##          Neg Pred Value : 0.9282          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5848          
##    Detection Prevalence : 0.6065          
##       Balanced Accuracy : 0.9489          
##                                           
##        'Positive' Class : Not Spam        
##
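
The statistics printed above follow directly from the four cell counts. As a rough check, the sketch below rebuilds them in base R with 'Not Spam' as the positive class; the object names (cmx, tp, fn, fp, tn, metrics) are illustrative only:

```r
# Test-set counts copied from the confusion matrix above
# (filled column-wise: rows = prediction, columns = reference)
cmx <- matrix(c(807, 39, 30, 504), nrow = 2,
              dimnames = list(Prediction = c("Not Spam", "spam"),
                              Reference  = c("Not Spam", "spam")))

tp <- cmx["Not Spam", "Not Spam"]  # good mail kept
fn <- cmx["spam", "Not Spam"]      # good mail flagged as spam
fp <- cmx["Not Spam", "spam"]      # spam let through
tn <- cmx["spam", "spam"]          # spam caught

metrics <- c(Accuracy    = (tp + tn) / sum(cmx),
             Sensitivity = tp / (tp + fn),
             Specificity = tn / (tn + fp),
             PosPredVal  = tp / (tp + fp),
             NegPredVal  = tn / (tn + fn))
print(round(metrics, 4))
```

Because 'Not Spam' is the positive class, the 39 good e-mails flagged as spam count against sensitivity, which is the error the dataset notes describe as very undesirable.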

Create Model - Gradient Boosting (GBM)

# set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# stochastic gradient boosting over caret's default tuning grid
myControl <- trainControl(method="repeatedcv", number=10, repeats=3)
myMetric <- "Accuracy"
# mtry and ntree are random forest parameters and do not apply to gbm
#myMtry <- sqrt(ncol(dfrTrnData)-2)
#myNtrees <- 500
#myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForGbm <- train(Result~.-status, data=dfrTrnData, method="gbm",
                 verbose=FALSE, metric=myMetric, trControl=myControl)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 184.11"

View Model - Gradient Boosting (GBM)

mdlRndForGbm

## Stochastic Gradient Boosting 
## 
## 3221 samples
##   58 predictor
##    2 classes: 'Not Spam', 'spam' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2898, 2899, 2900, 2898, 2899, 2899, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9119364  0.8120901
##   1                  100      0.9305600  0.8530141
##   1                  150      0.9349085  0.8625545
##   2                   50      0.9303539  0.8527418
##   2                  100      0.9398768  0.8733088
##   2                  150      0.9437035  0.8814997
##   3                   50      0.9347011  0.8622468
##   3                  100      0.9437057  0.8814933
##   3                  150      0.9485670  0.8919259
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
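
The nine rows in the table above are the combinations of three tree counts and three interaction depths, and the final model is the row with the highest resampled accuracy. A small base-R sketch of that selection, with the accuracies copied from the table; gbmGrid and best are illustrative names:

```r
# Tuning grid as reported above; n.trees varies fastest, matching the table order
gbmGrid <- expand.grid(n.trees = c(50, 100, 150),
                       interaction.depth = 1:3,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

# Resampled accuracies copied from the printed results, in the same row order
gbmGrid$Accuracy <- c(0.9119364, 0.9305600, 0.9349085,
                      0.9303539, 0.9398768, 0.9437035,
                      0.9347011, 0.9437057, 0.9485670)

# Pick the combination with the largest accuracy
best <- gbmGrid[which.max(gbmGrid$Accuracy), ]
print(best)
```

Accuracy is still rising at the largest values tried, so a wider grid passed through tuneGrid may be worth testing.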

View Model Summary - Gradient Boosting (GBM)

summary(mdlRndForGbm)

##                                                   var     rel.inf
## char_freq_..3                           char_freq_..3 25.13448479
## char_freq_..4                           char_freq_..4 17.73569393
## word_freq_remove                     word_freq_remove 10.29166207
## word_freq_free                         word_freq_free  8.03319338
## word_freq_hp                             word_freq_hp  7.32712662
## capital_run_length_average capital_run_length_average  5.39872005
## word_freq_your                         word_freq_your  5.17722583
## capital_run_length_longest capital_run_length_longest  3.06771805
## capital_run_length_total     capital_run_length_total  2.37637288
## word_freq_edu                           word_freq_edu  2.36886986
## word_freq_george                     word_freq_george  2.23737649
## word_freq_our                           word_freq_our  1.91405678
## word_freq_money                       word_freq_money  1.86852638
## word_freq_1999                         word_freq_1999  1.19302329
## word_freq_000                           word_freq_000  0.89483913
## word_freq_you                           word_freq_you  0.56772121
## word_freq_meeting                   word_freq_meeting  0.56184568
## word_freq_re                             word_freq_re  0.49338930
## word_freq_internet                 word_freq_internet  0.45674727
## char_freq_.                               char_freq_.  0.38814204
## word_freq_will                         word_freq_will  0.35496666
## word_freq_receive                   word_freq_receive  0.31982874
## word_freq_650                           word_freq_650  0.30196695
## char_freq_..1                           char_freq_..1  0.28366668
## word_freq_font                         word_freq_font  0.18729687
## word_freq_email                       word_freq_email  0.18116648
## word_freq_business                 word_freq_business  0.14833434
## word_freq_hpl                           word_freq_hpl  0.14274872
## word_freq_technology             word_freq_technology  0.10444119
## word_freq_over                         word_freq_over  0.09742430
## word_freq_report                     word_freq_report  0.09059711
## word_freq_all                           word_freq_all  0.07754641
## word_freq_3d                             word_freq_3d  0.06375843
## word_freq_credit                     word_freq_credit  0.03340249
## word_freq_project                   word_freq_project  0.03264960
## word_freq_make                         word_freq_make  0.02808895
## word_freq_pm                             word_freq_pm  0.02398858
## word_freq_conference             word_freq_conference  0.01519494
## word_freq_mail                         word_freq_mail  0.01326804
## word_freq_parts                       word_freq_parts  0.01292947
## word_freq_address                   word_freq_address  0.00000000
## word_freq_order                       word_freq_order  0.00000000
## word_freq_people                     word_freq_people  0.00000000
## word_freq_addresses               word_freq_addresses  0.00000000
## word_freq_lab                           word_freq_lab  0.00000000
## word_freq_labs                         word_freq_labs  0.00000000
## word_freq_telnet                     word_freq_telnet  0.00000000
## word_freq_857                           word_freq_857  0.00000000
## word_freq_data                         word_freq_data  0.00000000
## word_freq_415                           word_freq_415  0.00000000
## word_freq_85                             word_freq_85  0.00000000
## word_freq_direct                     word_freq_direct  0.00000000
## word_freq_cs                             word_freq_cs  0.00000000
## word_freq_original                 word_freq_original  0.00000000
## word_freq_table                       word_freq_table  0.00000000
## char_freq_..2                           char_freq_..2  0.00000000
## char_freq_..5                           char_freq_..5  0.00000000
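
Relative influence is expressed as a percentage, so a running total shows how concentrated the model is in a few predictors. A minimal sketch using the five largest values from the table above; relInf is an illustrative name:

```r
# Five largest relative-influence values copied from the summary above
relInf <- c("char_freq_..3"    = 25.13448479,
            "char_freq_..4"    = 17.73569393,
            "word_freq_remove" = 10.29166207,
            "word_freq_free"   =  8.03319338,
            "word_freq_hp"     =  7.32712662)

# Running share of total influence explained by the top predictors
print(round(cumsum(relInf), 2))
```

The top five predictors account for roughly two thirds of the total influence, while many of the word frequencies at the bottom of the table contribute nothing.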

Prediction - Test Data - Gradient Boosting (GBM)

vctRndForGbm <- predict(mdlRndForGbm, newdata=dfrTstData)
cmxRndForGbm  <- confusionMatrix(vctRndForGbm, dfrTstData$Result)
cmxRndForGbm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam spam
##   Not Spam      804   29
##   spam           42  505
##                                           
##                Accuracy : 0.9486          
##                  95% CI : (0.9355, 0.9596)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.892           
##  Mcnemar's Test P-Value : 0.1544          
##                                           
##             Sensitivity : 0.9504          
##             Specificity : 0.9457          
##          Pos Pred Value : 0.9652          
##          Neg Pred Value : 0.9232          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5826          
##    Detection Prevalence : 0.6036          
##       Balanced Accuracy : 0.9480          
##                                           
##        'Positive' Class : Not Spam        
##
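
The McNemar p-value above compares the two kinds of error (29 spam let through against 42 good e-mails flagged). It can be reproduced from the off-diagonal counts with base R's mcnemar.test; cmxGbm and res are illustrative names:

```r
# Test-set counts copied from the confusion matrix above
# (filled column-wise: rows = prediction, columns = reference)
cmxGbm <- matrix(c(804, 42, 29, 505), nrow = 2,
                 dimnames = list(Prediction = c("Not Spam", "spam"),
                                 Reference  = c("Not Spam", "spam")))

# McNemar's chi-squared test; the continuity correction is on by default
res <- mcnemar.test(cmxGbm)
print(res)
```

A p-value this large gives no strong evidence that the model errs more in one direction than the other.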

Create Model - Random Forest (OOB)

# set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest scored by out-of-bag error; number and repeats do not apply to "oob"
myControl <- trainControl(method="oob")
myMetric <- "Accuracy"
myMtry <- 10 #sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
# method="rf" is train()'s default, stated here for clarity
mdlRndForoob <- train(Result~.-status, data=dfrTrnData, method="rf",
                    verbose=FALSE, metric=myMetric, trControl=myControl, 
                    tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))

## [1] "Model Created ... 31.34"

View Model - Random Forest (OOB)

mdlRndForoob

## Random Forest 
## 
## 3221 samples
##   58 predictor
##    2 classes: 'Not Spam', 'spam' 
## 
## No pre-processing
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9509469  0.8971414
## 
## Tuning parameter 'mtry' was held constant at a value of 10
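
Out-of-bag scoring is possible because each tree is grown on a bootstrap sample, which leaves roughly a third of the rows unused by that tree and available for validation. A minimal sketch of that fraction for the 3221 training samples reported above; the variable names are illustrative:

```r
nTrain <- 3221

# Probability that a given row is never drawn in nTrain draws with replacement
pOob <- (1 - 1 / nTrain)^nTrain

# Empirical check with one simulated bootstrap sample
set.seed(707)
inBag   <- sample(nTrain, nTrain, replace = TRUE)
fracOob <- 1 - length(unique(inBag)) / nTrain

print(round(c(theoretical = pOob, simulated = fracOob), 4))
```

This is why the OOB run needs no separate resampling loop and finishes much faster than the cross-validated runs.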

View Model Summary - Random Forest (OOB)

summary(mdlRndForoob)

##                 Length Class      Mode     
## call               6   -none-     call     
## type               1   -none-     character
## predicted       3221   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           6442   matrix     numeric  
## oob.times       3221   -none-     numeric  
## classes            2   -none-     character
## importance        57   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               3221   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames            57   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              2   -none-     list

Prediction - Test Data - Random Forest (OOB)

vctRndForoob <- predict(mdlRndForoob, newdata=dfrTstData)
cmxRndForoob <- confusionMatrix(vctRndForoob, dfrTstData$Result)
cmxRndForoob

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Not Spam spam
##   Not Spam      807   28
##   spam           39  506
##                                           
##                Accuracy : 0.9514          
##                  95% CI : (0.9387, 0.9622)
##     No Information Rate : 0.613           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8981          
##  Mcnemar's Test P-Value : 0.2218          
##                                           
##             Sensitivity : 0.9539          
##             Specificity : 0.9476          
##          Pos Pred Value : 0.9665          
##          Neg Pred Value : 0.9284          
##              Prevalence : 0.6130          
##          Detection Rate : 0.5848          
##    Detection Prevalence : 0.6051          
##       Balanced Accuracy : 0.9507          
##                                           
##        'Positive' Class : Not Spam        
##
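
With all three test-set confusion matrices available, accuracy and Cohen's kappa can be placed side by side. The sketch below recomputes both from the printed counts in base R; counts, summarise and cmp are illustrative names:

```r
# Cell counts copied from the three confusion matrices above
# (column-wise order: rows = prediction, columns = reference)
counts <- list(RFM = c(807, 39, 30, 504),
               GBM = c(804, 42, 29, 505),
               OOB = c(807, 39, 28, 506))

summarise <- function(v) {
  m  <- matrix(v, nrow = 2)
  n  <- sum(m)
  po <- sum(diag(m)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(m) * colSums(m)) / n^2    # agreement expected by chance
  c(Accuracy = po, Kappa = (po - pe) / (1 - pe))
}

cmp <- t(sapply(counts, summarise))
print(round(cmp, 4))
```

The three test accuracies differ by less than half a percentage point, well inside each model's reported 95% confidence interval, so the ranking should not be over-read.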

Thank You

print("Thank You")

## [1] "Thank You"

RandomForest

Rishabh Sabarwal

12 September 2017